This report outlines a simple solution to the universal academic problem of remote, i.e., off-campus, access to on-campus (IP restricted) Web-based resources. This solution costs little; requires no special client software; works with a variety of authentication methods; allows fine-grained control over what services can be accessed; and offers both reasonable security and speed.
Note: This paper covers work done as of the summer of 2000; for a glimpse into work done since then, see, e.g., my 2001 portal paper.
Just two decades ago, access to vital university information resources was largely a matter of physical proximity: You had to be nearby, or make frequent trips, in order to do serious academic writing and research. In particular, to gain access to serials, papers, monographs, and other core research materials you had to present credentials in person at the library--usually in the form of a university ID.
As information resources have made their way onto the Internet, and in particular onto the Web, however, physical proximity has taken on an increasingly secondary role. To reach Web-based information resources today all one needs is a computer with a Web browser and some sort of Internet feed.
The increasingly secondary role played by physical proximity has created a critical problem for university communities and for their core information resources. If physical presence is no longer necessary, i.e., if someone can "knock at the door" without actually being at the door, how can we tell whether to let him or her in? How can we tell if the person knocking is a member in good standing of our community? How can we grant him or her access without accidentally letting anyone else in?
If you belong to a university MIS, computer support, or library reference department, you have struggled with these sorts of questions. Librarians, in particular, will have struggled through scenarios like the following:
Adjunct medical-school professor takes up a practice in a university-affiliated clinic. He turns on his new office computer, logs in via the clinic's Internet service provider (ISP), then goes to his university's BioMed web site, where he expects to find the Web version of Medline (to which his university subscribes). Instead of seeing a list of available databases he finds himself bounced to a username/password screen that he has never seen before--and for which he has no username or password.
Professor leaves on sabbatical to do research on eighteenth-century English scientific terminology. She settles into her cabin in upstate Maine--only to discover that when she tries to access the online Oxford English Dictionary, which she has been accustomed to using for complex search and retrieval operations from her campus office, the server on which the OED resides rejects her connection, offering instead a cryptic sign-in screen.
Computer support staff personnel report that the number of students, staff, and faculty coming in via the modem pool at night has tripled in the last two years, and that the university must either expand its dial-in facilities, or else farm out this service to another ISP. Preliminary figures indicate that it would be significantly less expensive to farm the service out. Users revolt, however, when they are told that key library databases will no longer be accessible from off-campus under this arrangement. The support staff revolts when they hear that the only way to make these databases available is to outfit all remote machines with special client software and plug-ins. Professors revolt when they realize that the client software and plug-ins cannot easily be installed on other universities' machines, or on kiosks available to them while they are away on sabbatical or at conferences.
In efforts to overcome these sorts of web-access problems, many organizations are setting up certificate authorities (CAs), which may be used to issue special public/private key-pairs to individual users--key-pairs that must be carried around, at least in part, on cards or floppy disks, or else (worse yet) generated from scratch on every machine the user expects to connect from. These key-pairs can be used to authenticate servers and users (i.e., help verify that they are who they claim to be) and, in combination with other software, to determine whether users may be granted access to a particular resource.
While these and other such efforts may well prove to be the best long-term strategies, they currently require that a whole new family of support software be put into place involving, among other things, maintenance of public and private keys, certificate revocation lists, and retrofitting of existing software systems for new network session and public/private key authentication protocols (known, collectively, as a PKI or public-key infrastructure). In the vast majority of cases a PKI is overkill, since all we are really looking for in an immediate solution to the web-access problem is a way to determine whether somebody belongs to our community--and thus whether he or she should be allowed access to the same sorts of Web resources that are freely accessible from campus. Although it is certainly possible to use a public/private key infrastructure to fill this need (and although this technology is likely to become viable over the next three to five years), right now it is better to fill the off-campus authentication niche with something simpler and easier to support.
Another strategy that libraries, in particular, are using to solve their web-access problems is to piggyback on separate cross-institutional user/password databases run by a vendor, by an institutional consortium, or by a state library system. What they are doing, in essence, is having patrons use a common password to access a pool of shared online resources spanning multiple institutions and/or multiple vendors (e.g., the UK's nationwide ATHENS, and the Missouri Education and Research Libraries [MERLIN], databases; various efforts are underway in the USA to leverage protocols and standards like LDAP, PKI, and Kerberos to do the same sorts of things).
Although this strategy is appropriate in many instances, in others it is once again overkill, requiring that a whole new infrastructure be put in place, not only for the participating institutions themselves, but also for the various vendors whose information resources are being incorporated into the system. Worse yet, the authentication methods used in such contexts are often weak, depending on simple passwords passed as plain or "clear" text over various unsecured networks. Even in cases where passwords are properly secured, though, the systems are hard to set up and fully maintain, and require a lot of knowledge, time, and coordination on the part of local systems people to deploy effectively.
Fortunately, many institutions already have the tools they need to authenticate users without establishing new cross-institutional databases, directories, or public/private key infrastructures. These tools include both standard and proprietary programmer's libraries for authenticating users via the institutional online patron access catalog (i.e., the OPAC, which typically uses people's PIN numbers), or via locally deployed authentication services like Kerberos (which is now part of Windows 2000, and which has been entrenched in many leading universities since the early 90s). The trick to any solid, flexible, near-term solution to the web-access problem is to leverage this existing infrastructure, and, if possible, to leverage the hardware that implements it as well (e.g., leveraging an existing OPAC with its database of usernames and PIN numbers; or leveraging a fully populated local Kerberos database, with its keyserver and feeds to local administrative systems).
Unfortunately, integrating existing authentication systems (in particular, Kerberos) into generic commercial Web browsers has proven an arduous task. With the advent of Windows 2000, which uses Kerberos V, Kerberos may yet achieve the status of a generic web-authentication protocol. But Microsoft seems to be doing everything it can to force people into using its proprietary extensions to Kerberos, defeating the whole purpose of the open standard. So at least for the near future, integration of Kerberos into standard web browsers seems unlikely, necessitating (for those who want to deploy it for web use) in-house development of Kerberos plug-ins for every major browser, server, and operating system--plug-ins that must be installed on every machine (as, e.g., with CMU's Minotaur and Project Mandarin Inc.'s SideCar, which appears no longer to be available). Even if one cuts down the range of supported browsers and servers to the major ones (i.e., Netscape, Internet Explorer, IIS, Apache, and maybe iPlanet), we are still talking about a huge distribution, maintenance, and support job. We are also talking about a system that will create often insuperable problems for people using kiosks or public machines at other institutions. (And we may, depending on how widely the software is supported, be leaving users running increasingly popular operating systems like Linux out in the cold.)
One can, of course, tack Kerberos support onto just the server end of the equation, and leave the clients out (one can do this by piggybacking Kerberos authentication on the standard basic Web authentication methods). For some servers, most notably Apache, doing this is easy, requiring only that one load and configure its optional proxy module. Such setups, however, force clients (i.e., people's Web browsers) to pass IDs and passwords over the network in the clear--readable even by the simplest password-sniffing software. Nevertheless, at least several universities have gone this route (fairly well-known examples include NWU; cf. CWU's hybrid, SSL-based system).
It is also possible to set up the remote server(s) to accept Kerberos tickets or other authentication information passed on as so-called "cookies" (i.e., as session information sent over as part of the hypertext transfer protocol [HTTP] and stored locally as small information blocks by users' browsers). Such systems, however, typically require special server plug-ins or modifications (as, e.g., does the one used at UCDavis). And many servers already make extensive use of cookies for something else (e.g., for session tracking, and for their own proprietary authentication systems). On the client end, also, many library systems people configure their public workstations and browsers so that they do not allow storage of any local information, including cookies. And many users simply don't like cookies, and hence restrict their use, or disable them via their browser's preferences menu. Cookies therefore pose some knotty, though not insuperable, problems when used for authentication. More will be said on this topic below.
In sum, then, any clean, workable solution to the web-access problem should, at least at this stage in the evolution of HTTP, tread very carefully when it comes to cookies. More importantly, though, it must not pass any IDs or passwords in the clear over the network--as many universities, amazingly, are currently doing. Such a solution must, however, still manage to leverage existing authentication equipment and methods. And it must not require special browser plug-ins or add-on packages. (Server plug-ins are less of an issue than browser plug-ins, because there are far fewer servers than clients, and because the people who administer servers can generally be assumed to possess a higher level of knowledge than the average user.)
Yet another problem that a clean, workable solution to the web-access problem must solve is that of disinterested vendors. When a library licenses a database from a vendor, that vendor typically sets limits on the number of simultaneous users who can access their database from any one licensing institution. In such cases, unauthorized users cause problems only for the licensing institution (where legitimate users may find themselves bumped by illegitimate ones, which is of little direct consequence to the vendor). So there is not very much incentive for the vendor itself to expend resources beefing up authentication.
It is true that in some cases the vendor may simply license on the basis of enrollment or overall user community size. In this scenario, licensing costs are tied directly to the size of the potential user base. Even here we still find vendors that resist beefing up their authentication systems, though, because they vastly prefer simple, well-established, and easily maintained methods based on the user's IP address (an identifier that gets encoded into every chunk of information sent out over standard TCP/IP-based networks). Unfortunately, such authentication systems cannot distinguish a legitimate off-campus user from an off-campus stranger, since both the legitimate user and the stranger are coming in via unaffiliated Internet service providers, through non-University networks. So such systems simply exclude all off-campus users, legitimate or not.
Because of such factors, any complete solution to the web-access problem must circumvent the vendors. It should, in other words, require no action on their parts (such as, e.g., augmenting their authentication scheme to cover digital certificates or to utilize new cross-institutional passwords; cf. CMU's Shelob, which would require vendors to install additional encrypting server plug-ins). Although one hopes that eventually a viable standard will emerge for cross-institutional authentication, one that the vendors can easily subscribe to and implement, the bottom line is that no such system yet exists, and most vendors are therefore reluctant to do much more than IP-restrict their pages.
One way of circumventing disinterested vendors that has come increasingly to the fore is the HTTP proxy server. What an HTTP proxy server does, in this instance, is make off-campus web clients look like on-campus ones by allowing the off-campus clients to route their Web traffic through it. Because the proxy sits on campus, and occupies an on-campus IP address, the vendors' machines will generally allow connections that are routed through it. In fact, depending on how the proxy is set up, the vendor's systems usually can't even tell that there is a proxy. Rather, the proxy looks to them just like a regular on-campus web client.
There are, in fact, a few vendors that try to deny proxy connections (e.g., EBSCO, as of 1999). They do this so that someone on campus can't just set up an open proxy, and then let the whole world access the vendor's database(s) through it. To use a proxy with such vendors' systems, one normally has to notify the vendor that the proxy is officially sanctioned, and that it isn't just a rogue proxy run by, say, a student. Even though it is possible to configure proxies so that the vendor's software can't tell they are there (e.g., by blocking HTTP headers like trace and x-forwarded-for), the fact is that even without them, simple traffic analysis (e.g., of the User-Agent headers) will typically identify the proxy as such. And some vendors, as noted above, will block access from the proxy unless contacted and specifically requested to accept traffic from that machine.
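The header blocking mentioned above might look roughly like the following. This is only a minimal sketch: the function name, the set of header names (Via and Forwarded are added here as assumptions, alongside the x-forwarded-for header named in the text), and the example hosts are all illustrative, not taken from any particular proxy product.

```python
# Hypothetical set of request headers that reveal an intermediary.
REVEALING_HEADERS = {"via", "x-forwarded-for", "forwarded"}


def strip_proxy_headers(headers):
    """Return a copy of `headers` without proxy-revealing entries.

    Header names are compared case-insensitively.
    """
    return {name: value for name, value in headers.items()
            if name.lower() not in REVEALING_HEADERS}


outgoing = strip_proxy_headers({
    "Host": "vendor.example.com",
    "User-Agent": "Mozilla/4.0",
    "X-Forwarded-For": "203.0.113.7",
    "Via": "1.0 proxy.example.edu",
})
```

Note that the User-Agent header passes through untouched, which is exactly the kind of residue that lets traffic analysis spot a proxy anyway.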
Even for proxies that pass muster with all the relevant vendors there is still the issue of authentication. For a standard proxy server to work securely, some form of encrypted password or keyed authentication must be enabled between off-campus web clients and the proxy. The proxy, that is, must allow the user to identify himself or herself via a non-clear-text link. And it must not treat the source network (i.e., IP) address of the client machine as an integral authentication tool, since the whole idea of setting up a proxy is to deal with machines coming in via unknown, off-campus networks, from unknown, off-campus IP addresses. Nor may the proxy use the usual plain-text username/password authentication methods used by most standard proxy servers, since these may be intercepted ("sniffed") by intermediaries, or in some cases, by network peers (e.g., by the graduate student sitting in the next office down the hall). Unless people's user names and passwords are considered to be of little value, and the resources being proxied of no account, the proxy server must be outfitted with some means of sending and receiving encrypted authentication information.
It turns out that some proxy servers can be outfitted with specially built packages that facilitate encrypted client/proxy authentication. Encrypted proxy authentication, however, also requires specially built client software and/or plug-ins to work. Special client software and/or plug-ins pose a problem because they create a need for added client-side support. And they ultimately reduce the accessibility of the proxy.
To solve this problem, one might suppose that proxies could just transmit and receive the necessary authentication information using HTTP redirects and encrypted HTTP cookies. Use of HTTP cookies in place of standard proxy authentication, however, raises some knotty problems on the client end (no current HTTP clients know how to send cookies specifically to a proxy, rather than to, say, an origin webserver). On the server end, cookies are problematic because they were never designed to be used with proxies. Although a new PCookie (proxy-cookie) standard is in the works, trying to use cookies in this way can, for at least the next several years, only lead to grief for programmers and support personnel.
Even without cookies, standard proxies are problematic because they require changes in browser settings that are not always possible. It turns out that most large ISPs run caching proxies that, for performance reasons, keep temporary copies on hand of Web pages that their customers fetch. This creates a situation where, if customers keep requesting the same pages (as typically happens with popular sites), these customers don't have to fetch those same pages over and over again from possibly slow or (in network terms) distant machines. They can, rather, just look at the temporary copies of those pages that are sitting in their ISP's proxy-cache. All of this happens, unbeknownst to the user, via the ISP's caching proxy.
So why does this make standard proxies problematic as a means of solving the web-access problem? Because a user who wants to take advantage of, say, an on-campus library proxy (i.e., one not officially sanctioned by the local ISP) must manually reset his or her browser to point away from the ISP's caching proxy, and toward the library's proxy server. In cases where this is actually possible, it reduces performance. Often, however, it is not possible, because many ISPs run firewalls that force customers to use their proxy and no other. If the user just barrels on ahead and resets the browser to point to the new proxy anyway, his or her browser may no longer work--as they have found at the Penn Library. Even for browsers running on on-campus machines, standard proxies are problematic if library databases fall into more than one authentication domain (as is the case, e.g., at the UCSB Library).
It might be added that in many cases (e.g., at public cluster machines) the whole issue of proxying is moot, because users simply are not allowed to go in and change the browsers' basic settings. In other words, the browsers themselves enforce use of a specific proxy (or of none) because they do not permit modification of their basic settings and "preferences."
A final problem with standard proxy servers is that they do not require users to authenticate to every remote database being used, nor can their authentications be easily configured to time out. Rather, they ask the user to authenticate just once, after which the user receives unlimited access to all web-based resources. Although in some situations a single sign-on no-timeout system like this might seem an advantage, there are times when it is a decided disadvantage. Suppose, for example, that someone using a single sign-on scheme accesses a resource via a proxy using a kiosk at another institution. Assuming the kiosk's browser can be so modified, and assuming that the local ISP will permit use of an alternate proxy server setting in the first place, doing this will create a number of potentially serious maintenance problems and security holes. For example, if, when finished, the user walks away from the kiosk without restoring the old proxy settings, then the browser remains in a state where anyone who uses it has access to every proxied resource that the user had access to. And, of course, the very fact that the proxy required a username and password means that the next time the machine boots or someone restarts the browser a prompt will suddenly appear asking for username/password information that local users presumably won't have--effectively disabling the kiosk until a knowledgeable user, or a systems person, can come in and reset the altered proxy settings.
Although it is possible to do some proxy configuration automatically using something called a PAC (i.e., a proxy auto-configuration) file, this whole mode of operation remains problematic for kiosks and other machines not maintained by the institution that runs the proxy service--again, because doing so requires making alterations to basic browser settings that often cannot, and typically should not, be changed (see, for example, the complex set of directions that patrons of MERLIN libraries must follow).
For a working example of a proxy server of this type (i.e., one that uses basic HTTP authentication, with clear-text passwords), see the University of Wisconsin Library's proxy service.
The bottom line for standard proxy servers is that they require changes to browser settings that many kiosk setups and ISPs simply do not allow. To make matters worse, they make authentication difficult, and, worst of all, create serious security problems by passing people's passwords around in the clear over the network.
The only way to run a proxy service that anyone can use with any ISP or kiosk is to run a reverse- or pass-through proxy (when outfitted with a cache, one also hears such proxies called accelerators). A pass-through proxy is a proxy that masquerades as the server it is proxying for, such that the proxy appears to hold a mirror image of whatever is on the remote server, i.e., on the server being proxied.
The process works like this: An outside client (presumably a web browser) requests a page from the pass-through proxy. The proxy first prompts the user for whatever authentication tokens it needs (usually a username and password). After clearing the username and password with some local authentication service (e.g., a local campus Kerberos key server), the pass-through proxy then fetches the requested page from the remote server. Finally, it sends the requested information back to the client. Throughout this process, the client never talks directly to the remote server; and the remote server never talks directly to the client. For all the client knows, the proxy is just a regular webserver. For all the remote server knows, the proxy is just a web browser on someone's desktop.
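The sequence just described can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the code of any actual proxy: `authenticate` and `fetch_from_origin` are hypothetical callables standing in for the local authentication service and the remote server, and the status codes are reduced to the bare minimum.

```python
def handle_request(path, credentials, authenticate, fetch_from_origin):
    """Serve one client request in the manner of a pass-through proxy.

    `authenticate` and `fetch_from_origin` are hypothetical stand-ins for
    the campus authentication service and the vendor's server.
    """
    # Step 1: no credentials yet, so prompt the user for them.
    if credentials is None:
        return 401, "authentication required"
    username, password = credentials
    # Step 2: clear the credentials with the local authentication service.
    if not authenticate(username, password):
        return 401, "authentication failed"
    # Step 3: fetch the requested page from the remote server.
    body = fetch_from_origin(path)
    # Step 4: hand the page back; the client never contacts the origin.
    return 200, body
```

Because both collaborators are passed in, the same flow would work whether the authentication service is Kerberos, an OPAC, or something else.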
It's kind of like the proverbial manager who reports on work "he" has been doing--when really all he is doing is passing on information gleaned from other members of his team. Just as our manager presents himself as the author of the information, so also the pass-through proxy can make itself look like the source of pages it fetches from the site it mirrors.
Because pass-through proxies look exactly like origin (i.e., "normal") web servers, they can be used in conjunction with other proxies, or through a firewall--just like any other webserver. They can also send and receive cookies, which, as noted above, is difficult at best with non-PCookie-enabled proxies--but is perfectly acceptable with origin web servers.
Pass-through proxies also require no set-up on the client end. And they do a complete end-run around vendors, who never need to know of their existence.
In a sense, pass-through proxies are a lot like the increasingly popular portal suites being vended by firms like Securant, Netegrity, and Dascom (the latter acquired in 1999 by IBM). These portals provide single sign-on access to multiple independent web servers and general networked services. The disadvantage to portals is that although they can be configured for various sign-on modes (in this way solving the single sign-on problem noted above), they tend to assume that special plug-ins will be installed on the remote servers being proxied. Portals can also be very expensive (typical figures in late 1999/early 2000 were twenty dollars per user, often with a base license fee of several thousand dollars--more for installation and/or support), and they typically require special maintenance procedures, pre-existing LDAP or SQL databases, and sometimes even specific vendors' products such as Microsoft's Proxy Server (so, e.g., for WorldSecure's Web 4.0). A lot of provisos are also necessary when using a portal from off-site (see, e.g., the Security Issues documentation provided by the University of Michigan, giving detailed instructions on how to alter browsers so as to prevent unauthorized access after being used to access the UMich private web space).
For many institutions, therefore, a simple pass-through proxy offers decisive advantages over both standard proxies and large-scale portals. And even in institutions that use portals it may prove convenient to set up a pass-through proxy to work together with the portal, so that the portal can treat proxied library databases as a single authentication domain. A portal and a pass-through proxy, in other words, may complement, rather than compete with, each other.
One serious obstacle to using a pass-through proxy for Web access is that the easiest implementation route is to set up a new proxy server for every remote server being proxied. In this respect, the proxy server is quite unlike our proverbial manager above, who can take credit for any number of team members' work. Pass-through proxies assume, rather, a one-to-one relationship. Every remote resource must have its own distinct proxy server. Doing things this way obviates the need for elaborate HTTP header and content rewriting (in technical terms, you only have to remap FQDNs, not URL paths). The disadvantage to doing things this way is that, for sites with significant IP-restricted information resources, one has to run what might seem like a daunting number of distinct proxy servers.
Fortunately, most webservers today allow port-based virtual hosting. That is, they allow one to set up distinct webservers on the same physical machine that differ only in the port number given in the URL (e.g., http://servername.cis.brown.edu:1443/, where 1443 is the port number). Port-based virtual hosting can be accomplished without adding new machine names, and without requiring extra (e.g., "Host") headers to be exchanged between client browsers and the server. So, even in cases where there are hundreds of vendors with servers that must be proxied, port-based virtual hosting makes it easy to set up enough proxy servers to cover them all. (What we have done here at Brown University is to set up simple scripts that automate the creation of virtual hosts.)
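A script that automates the creation of port-based virtual hosts could be as small as the following. This is a hedged sketch rather than a copy of Brown's scripts: it emits stanzas using the `Listen`, `VirtualHost`, `ProxyPass`, and `ProxyPassReverse` directives of present-day Apache, and the function names, the starting port, and the vendor hostnames are all assumptions made for illustration.

```python
def virtual_host_stanza(port, origin_url):
    """Build one port-based virtual host that proxies a single origin."""
    return "\n".join([
        f"Listen {port}",
        f"<VirtualHost *:{port}>",
        f"    ProxyPass / {origin_url}",
        f"    ProxyPassReverse / {origin_url}",
        "</VirtualHost>",
    ])


def build_config(origins, first_port=1443):
    """Assign consecutive ports, starting at `first_port`, to each origin."""
    return "\n\n".join(
        virtual_host_stanza(first_port + offset, origin)
        for offset, origin in enumerate(origins)
    )


config = build_config([
    "http://vendor-a.example.com/",   # hypothetical vendor hosts
    "http://vendor-b.example.com/",
])
print(config)
```

Each origin gets its own port on the same machine name, so adding a vendor means adding one entry to the list rather than a new hostname.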
In fact, the most serious problem with using a pass-through proxy to solve the web-access problem is that links branching off of pages fetched from the mirrored site will often lead users out of the pass-through proxy's document space, and back to the original server. The only way to avoid this problem is to insert a parsing module on the pass-through proxy that rewrites pages sent back to the user so that they contain no reference to the server of origin--i.e., so that links back to the server of origin are replaced by links to the pass-through proxy server (Brown's system for doing this uses a simple Perl module; cf., however, the more elaborate system worked out by Barrett and Maglio, WBI; note also the Perl HTML::Parser module, which may be used for this purpose, albeit with some performance penalty).
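The rewriting step can be illustrated with a deliberately small sketch. It is written in Python rather than Perl, it is not Brown's module, and it assumes that the origin server runs on its default port and that a plain text substitution (rather than a full HTML parse) is acceptable; the host names are hypothetical.

```python
import re


def rewrite_links(html, origin_host, proxy_host):
    """Replace absolute references to `origin_host` with `proxy_host`.

    The lookahead keeps longer names that merely begin with the origin's
    name (e.g. vendor.example.com.other.net) from being rewritten.
    Assumes the origin uses its default port.
    """
    pattern = re.compile(
        r"//" + re.escape(origin_host) + r"(?![\w.-])", re.IGNORECASE)
    return pattern.sub(lambda match: "//" + proxy_host, html)


page = '<a href="http://vendor.example.com/search">Search</a>'
rewritten = rewrite_links(page, "vendor.example.com", "proxy.example.edu:1443")
```

Relative links such as `href="/search"` need no rewriting at all, since the browser already resolves them against the proxy's own address.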
In late 1997 and early 1998, Brown University's Scholarly Technology Group (STG) implemented a trial rewriting pass-through proxy system that worked along the above-described lines. To review some of what has been said above, the basic constraints this system was to operate under were that it
In the "nice, but not necessary" category were the following constraints:
Although we expected commercial software to become available soon to take over the role of our pass-through proxy, nothing available at the time fit the bill. As a result, we ended up having little choice but to use free software (i.e., Apache) and modify its native proxy module locally to suit our needs. Ironically, support for Apache turned out to be more readily available than for most commercial software, mainly due to the sheer number of people who are using it, so objections raised against it initially were quickly overcome. (In late 1999, in fact, we moved all our core web operations from NCSA and Netscape's webservers to Apache, and installed our Apache-based pass-through proxy on the same physical machine as our central webserver. Commercial software that we expected to become available has also been slow in coming and less functional than our internally written system.)
Although our system ended up reasonably robust and functional, it did not end up completely transparent to the user--due mainly to the problems with buggy Web clients, firewalled ISPs, and disinterested vendors.
To get around the lack of complete transparency for users, we provided extensive and well-placed "in case you're off campus" documentation--and a series of template bounce pages that local webmasters could use to redirect users who found themselves locked out of a resource when coming in from off-campus. Users have found the system, as a whole, trivial to use and easy to set up and negotiate.
Compare the necessarily more elaborate system developed by the United States Navy Virtual Library (D-Lib magazine, March 1997). A system more closely resembling STG's, but that used encapsulated URLs (URLs within URLs), was also developed by the University of Virginia. This system was called mIm ("man in the middle"). mIm used URL rewriting rather than virtual hosting to achieve its proxying, which turns out to be less resource intensive than reverse proxying. Although mIm ran over a clear-text HTTP link, it could just as easily have been run over SSL. It did not, however, handle cookies. URL-rewriting systems like mIm also, in general, require far more return-trip page rewriting than is needed with a pass-through proxy server, since they must rewrite not only headers and explicit server references, but also (to drop into HTML-ese for a moment) certain file-paths in HTML attributes that take URLs as their values (SRC="/", CODEBASE="/", for example).
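The difference between the two addressing schemes can be made concrete with a short sketch. The exact URL layout used by mIm is not given here, so the gateway prefix below is purely an assumption, as are the host names and function names; the point is only to contrast a URL carried inside another URL with a URL whose host part alone is remapped.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encapsulate(target_url, gateway="http://gateway.example.edu/fetch/"):
    """URL-rewriting style: the whole target URL rides inside the path."""
    return gateway + quote(target_url, safe="")


def remap_host(target_url, proxy_netloc="proxy.example.edu:1443"):
    """Pass-through style: only the host part changes; the path is kept."""
    parts = urlsplit(target_url)
    return urlunsplit(
        (parts.scheme, proxy_netloc, parts.path, parts.query, parts.fragment))


target = "http://vendor.example.com/db/search?q=oed"
inside = encapsulate(target)
remapped = remap_host(target)
```

With encapsulation every path in a returned page has to be re-wrapped, whereas with host remapping a root-relative path like `/db/search` is already correct as it stands.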
Brown's original implementation of the rewriting pass-through proxy (written by the Brown University Scholarly Technology Group [STG]) proved fast, flexible, extensible, fairly robust, and, once the initial overhead of developing it was past, fairly low-cost. The only serious, systematic problems we found during testing and deployment were:
Problems 1-2 above were only rarely seen during 1998. Our vendors at that point used little JavaScript, and almost no Java (for more information on the problem of proxying Java applets see UCI's proxy faq). As for problem 3, only a few of our vendors used domain-restricted cookies initially. As of 1999 this number had grown to about a half dozen (see the University of Wisconsin's list). In early 1999 we put in place a workaround that simply stripped domain restrictions from set-cookie headers passed back to browsers by the pass-through proxy server. This workaround served its purpose well, but was obviously not a viable long-term solution to the problem.
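A workaround of this kind might be sketched as follows. This is an illustration of the idea rather than the code we ran: the function name and the sample cookie are hypothetical, and the sketch assumes that cookie values contain no semicolons of their own.

```python
def strip_cookie_domain(set_cookie_value):
    """Drop the Domain attribute from one Set-Cookie header value.

    The first segment (name=value) is always kept, even if the cookie
    itself happens to be named "domain".
    """
    name_value, *attributes = [
        segment.strip() for segment in set_cookie_value.split(";")]
    kept = [attribute for attribute in attributes
            if attribute and not attribute.lower().startswith("domain=")]
    return "; ".join([name_value] + kept)


cleaned = strip_cookie_domain(
    "session=abc123; Domain=.vendor.example.com; Path=/")
```

Without a domain restriction the browser associates the cookie with the host that delivered it, which from the browser's point of view is the pass-through proxy.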
Problem (4) above ([a], the extra passwords) became a serious nuisance in only a few instances, particularly with MDConsult (a medical research service). The technical details here are as follows: Neither Netscape nor Internet Explorer reacts usefully to a 407 error code returned by a proxy server that mimics an origin webserver. Error code 407 is what prompts the browser to send proxy authentication tokens, so these browsers, in effect, cannot supply proxy authentication to our rewriting pass-through proxy server (which, as noted above, looks like an origin webserver). To get around this problem, we had to use normal (i.e., basic HTTP) authentication instead. Unfortunately, we could not also arrange for basic HTTP authentication tokens to be sent through the proxy to remote servers that needed them, not only because the proxy was already using basic authentication, but also because blind forwarding of such tokens might, under certain circumstances, have compromised Brown user IDs and passwords.
The last problem above (5), namely that the pass-through proxy server required more open file descriptors than most operating systems allowed, was easily overcome by simply reconfiguring the operating system on which we were running it to allow more open file descriptors.
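As a rough illustration of the per-process side of this limit, the following Python sketch raises a Unix process's soft open-file limit up to its hard limit. It does not touch the system-wide ceiling, which is set through operating-system configuration as described above; the default of 4096 and the function name are assumptions.

```python
import resource

def raise_fd_limit(wanted=4096):
    """Raise this process's soft open-file limit toward `wanted`, capped by the hard limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        wanted = min(wanted, hard)
    if soft == resource.RLIM_INFINITY or soft >= wanted:
        return soft  # already high enough; nothing to do
    resource.setrlimit(resource.RLIMIT_NOFILE, (wanted, hard))
    return wanted
```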
Some other minor aches and pains we experienced included the problem of cached credentials and browser lifetime. If a Brown user walked away from a remote kiosk without exiting the browser after having used the pass-through proxy, the next user at that kiosk was then able to access Brown-only resources. (Note that our initial pass-through proxy versions required users to authenticate separately for each remote server, so the potential damage here was minimal; many IS professionals on campus, however, objected to the repeated logins this necessitated, and advocated shifting to some sort of single sign-on system--which we later did implement). We also found requiring users to enter Brown user IDs and passwords via standard Web-authentication forms to be unpalatable, especially since doing this required SSL-encrypting the entire session, which cut transmission speeds in half for slow modem connections.
Finally, we found it inconvenient that, in order to get into the pass-through proxy's document hierarchy, one needed to come in via a specific entry point. One could not, for example, go directly to Brown's IP-restricted Oxford English Dictionary page and expect to get redirected automatically. One needed, rather, to come in via a page with links to the OED (and to other such resources)--pages that used URLs that would take users through the pass-through proxy.
It is worth noting that at first we did, in fact, attempt to set up a system that would automatically redirect users to the proxy where needed using a feature introduced by Netscape called a proxy auto-configuration (PAC) file (docs). We found the PAC file, however, to be problematic because of bugs in Internet Explorer 3.02-4 (which is supposed to be able to use PAC files, but doesn't always work properly). We also found, as time went on, that users made many mistakes configuring PAC files, and that PAC files, misconfigured or not, introduced various performance and access problems. E.g., we found that some ISPs used firewalls that rendered browsers set to use our PAC file nonfunctional.
Our ultimate solution to this difficulty was to provide simple "in case you're off campus" documentation that explained everything in pretty concrete terms and that offered appropriate proxy entry points. This provided faster, easier, and better-targeted access to the reverse proxy than was possible with PAC files. We still do document use of a PAC file for those who want to configure one (assuming their ISPs will let them). But we find that our simple directions give the vast majority of users enough information to use our pass-through proxy effectively, and with little support.
In September of 1999 we found that the rewriting proxy, which had been deployed on a dual-processor Intel machine running Linux, did not allow a sufficient number of open file descriptors to service proxy needs (à la problem [5] above). Some sort of upgrade was therefore in order.
Although we had originally hoped that commercial vendors would, by this time, come up with shrink-wrapped solutions that filled the role of our pass-through proxy, this expectation was not fully met. A few portal vendors offered products that might have fit the bill. But most portals were expensive, required significant maintenance, and assumed things like an LDAP infrastructure (which we only began to deploy later, in the summer of 2000). So despite our original view of the proxy as a near-term solution, the service looked as if it would be in production long enough to make it worthwhile to rewrite the code, improve it, and move it to a centrally located, monitored Solaris machine (Solaris being one of the officially supported production operating systems for Brown's central IT wing).
As noted above, Brown's original implementation had required use of a locally modified Apache proxy module. These local changes to the module needed to be recoded and updated every time there was a new release of Apache, and they had therefore created a lot of maintenance headaches. As part of the re-implementation process all of this locally written code was factored out and rewritten as an independent Perl (mod_perl) module requiring no changes to, or recompilations of, the Apache source tree. We also wrote a special authentication module that communicated with our Kerberos-based keyserver (which houses user ID and password information for the entire campus). These new modules were coded in such a way that they dropped into most mod_perl-equipped Apache webservers, regardless of their operating system.
As part of the rewriting process, we inserted a workaround for the passworded-page problem (i.e., the inability of the old proxy to forward basic HTTP authentication headers to remote servers [problem (4)]). What the workaround did was store the user ID and password temporarily within the return URLs shown to the client. When the client tried to reach the ID/password-equipped return URLs, the ID and password were stripped out and inserted as HTTP headers before the request was passed on to the remote server. This workaround was admittedly a bit of a kludge (and was noted as such when given as an option to users). Nevertheless it provided a last-resort access method that proved useful in situations where a vendor introduced new passworded systems or changed old systems without adequately informing our Library reference staff.
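The general shape of this workaround can be sketched in Python as follows. The report does not say how the credentials were actually encoded in the return URLs, so the "/~auth/" path marker, the base64 encoding, and both function names are hypothetical.

```python
import base64

# Hypothetical marker for credential-carrying return URLs; the encoding used
# by the real proxy is not documented here.
AUTH_MARKER = "/~auth/"

def embed_credentials(path, user, password):
    """Build a return URL path that temporarily carries the user's credentials."""
    raw = ("%s:%s" % (user, password)).encode("utf-8")
    # URL-safe base64 never contains "/", so the token is one path segment.
    # Note that this is an encoding, not encryption.
    return AUTH_MARKER + base64.urlsafe_b64encode(raw).decode("ascii") + path

def extract_credentials(path):
    """Strip embedded credentials; return (clean_path, extra_headers)."""
    if not path.startswith(AUTH_MARKER):
        return path, {}
    token, _, rest = path[len(AUTH_MARKER):].partition("/")
    userpass = base64.urlsafe_b64decode(token.encode("ascii"))
    # Basic authentication uses the standard base64 alphabet, so re-encode.
    header = "Basic " + base64.b64encode(userpass).decode("ascii")
    return "/" + rest, {"Authorization": header}
```

The proxy would call something like embed_credentials when rewriting links on the way back to the client, and extract_credentials on each incoming request before forwarding it to the remote server.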
In the process of rewriting the proxy, we also dramatically improved the intelligence with which the proxy forwarded headers to the remote servers and back again to the client. The list of HTTP client headers that were processed, and if need be altered, was expanded to include not only the actual requested URL and any authentication headers sent out, but also the following:
- Accept-charset
  - Must be rewritten to favor ISO-8859-x and UTF-8, to facilitate Perl processing
- Accept-encoding
  - Must be stripped (again to facilitate Perl processing)
- Host
  - Must be rewritten to name the remote host being proxied, and not the pass-through proxy server
- Referer
  - Must be rewritten to point at the remote host being proxied, not the pass-through proxy server
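The client-header rules listed above can be sketched in Python roughly as follows. This is an illustration rather than the mod_perl module itself: the proxy host name, the exact Accept-charset value, and the simple host substitution used for Referer are all assumptions.

```python
# Hypothetical name of the pass-through proxy server.
PROXY_HOST = "proxy.example.edu"

def rewrite_request_headers(headers, remote_host):
    """Return a copy of the client's headers adjusted for the remote server."""
    out = {}
    for name, value in headers.items():
        key = name.lower()
        if key == "accept-encoding":
            continue  # stripped before the request is forwarded
        if key == "accept-charset":
            # Assumed preference string favoring ISO-8859-x and UTF-8.
            value = "iso-8859-1, utf-8;q=0.9, *;q=0.1"
        elif key == "host":
            value = remote_host
        elif key == "referer":
            # Assumes the proxy and remote host share the same URL paths.
            value = value.replace(PROXY_HOST, remote_host, 1)
        out[name] = value
    return out
```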
Salient examples of proxy-alterable server headers (some of which, unfortunately, are vendor specific) include:
- Accept-ranges
  - Must be set to none if proxy filtering alters document content in any way
- Connection
  - Must be stripped (the proxy inserts a connection header of its own)
- Content-base, Content-location, Location, Refresh, Uri
  - Must be rewritten to point to the pass-through proxy server
- Content-length
  - Must be recalculated to reflect new filtered content lengths
- Set-cookie
  - Must be stripped of domain information
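A comparable Python sketch for the server-header rules above follows. Again this is illustrative only: the proxy host name and function signature are assumptions, and a plain dictionary cannot represent repeated headers such as multiple Set-cookie lines, which a real proxy must handle.

```python
import re

PROXY_HOST = "proxy.example.edu"  # hypothetical proxy host name
URL_HEADERS = {"content-base", "content-location", "location", "refresh", "uri"}

def rewrite_response_headers(headers, remote_host, body, filtered):
    """Adjust response headers for the client; `body` is the filtered content as bytes."""
    out = {}
    for name, value in headers.items():
        key = name.lower()
        if key == "connection":
            continue  # the proxy supplies its own Connection header
        if key in URL_HEADERS:
            # Point URLs at the proxy instead of the remote host.
            value = value.replace("//" + remote_host, "//" + PROXY_HOST)
        elif key == "set-cookie":
            value = re.sub(r";\s*domain=[^;]*", "", value, flags=re.IGNORECASE)
        elif key == "content-length":
            value = str(len(body))  # recalculated after filtering
        elif key == "accept-ranges" and filtered:
            value = "none"
        out[name] = value
    return out
```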
The only disadvantage to the revised pass-through proxy was that it could not easily take advantage of Apache's native caching facilities (part of the standard Apache proxy module). One workaround for this problem would be to install a caching proxy front-end (perhaps as part of a larger portal system). It is not clear, though, whether this would be an advantage, except for remote hosts that vend a lot of images and static HTML pages.
Overall, the revised pass-through proxy server was a big win, both in terms of its intelligence and in terms of its portability and maintainability. We noted, with some pride, that since its first implementation at Brown, a number of other schools appeared to have implemented pass-through proxying systems (note, in particular, UCI's pass-through proxy).
In efforts to remedy the remaining problems with the system, Brown deployed (in the summer of 2000) a third major revision to its pass-through proxy server. This revision built upon the mod_perl-based foundation established during the previous revision cycle, and rectified all but one of the major outstanding complaints from both users and staff (the one problem remaining is that the system does not yet proxy remote hosts that insist on using SSL/https connections--which we don't believe to be a viable way to vend standard databases in the first place, mainly because it slows down the connection speed so dramatically, and with no benefit to the user).
Major changes folded into our third revision of the pass-through proxy server include:
Because the old proxy server, as a side effect of using SSL, ended up converting http URLs on remote hosts into local https URLs, it often broke vendor code that assembled URLs using JavaScript on the fly. Typically JavaScript code that does on-the-fly URL assembly assumes an http:// prefix on all URLs and, sadly, doesn't bother using JavaScript's link.protocol property (as was typical of Chadwyck-Healey databases, at least as of the summer of 2000). By replacing the previous pass-through proxy server's basic-HTTP-auth scheme with a cookie-based system (change 1 above), and by abandoning SSL (change 2), we managed to reduce to almost nil the impact of this programming error on our users by making the URL prefixes (i.e., http://) identical on both the pass-through proxy and the remote hosts that it mirrors.
Replacing the previous pass-through proxy server's basic HTTP authentication scheme with a cookie-based system also allowed us to proxy remote hosts that, in addition to checking source IP addresses, also required passwords (which before had been possible only via an ugly workaround that embedded user names and passwords temporarily within URLs). Because our pass-through proxy no longer needed to trap, and use, embedded basic authentication tokens, it was free to pass them back and forth unaltered between clients and remote hosts. Remote hosts were therefore free to request, and use, basic authentication.
A final benefit to replacing the previous pass-through proxy server's basic HTTP authentication scheme with a cookie-based system (change 1) is that it gave us a way of maintaining authentication state, and hence also a way of implementing a single sign-on system. We skirted the potential security problems associated with single sign-on systems by imposing a timeout on sessions. By default this timeout is 20 minutes. Users, however, may request up to an hour. With specialized client software (created at Stanford University), the sign-on system can also be integrated with Brown's general cluster login facilities, so that users on campus workstations don't ever even have to look at a web login form. As of this writing, Brown has not yet decided whether to deploy this extra client software.
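The timeout policy just described (20 minutes by default, up to an hour on request) might be modeled as in the Python sketch below. The class and its names are hypothetical, and the sketch assumes the timeout is measured from sign-on rather than from the last request, which the text does not specify.

```python
import time

DEFAULT_TIMEOUT = 20 * 60  # default session lifetime, in seconds
MAX_TIMEOUT = 60 * 60      # users may request up to an hour

class Session:
    """Authentication state kept on behalf of one signed-on user."""

    def __init__(self, user, requested=None, now=None):
        self.user = user
        wanted = DEFAULT_TIMEOUT if requested is None else requested
        # Clamp the requested lifetime to the permitted range.
        self.timeout = max(0, min(wanted, MAX_TIMEOUT))
        self.started = time.time() if now is None else now

    def is_valid(self, now=None):
        """True while the session has not yet timed out."""
        now = time.time() if now is None else now
        return now - self.started < self.timeout
```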
Change (3) above solved the longstanding problem of cookies with domain restrictions, which clients have no way of forwarding correctly through the pass-through proxy (which lies in the .brown.edu domain) to the remote hosts (which generally lie in other domains). By installing a session manager that handles forwarding of domain-restricted cookies on behalf of the user, all cookies now get forwarded to the right place. Vendor systems that rely on cookies to maintain state therefore now work as expected (with no clumsy workarounds, like the one implemented in late 1999 for the second version of the pass-through proxy).
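In outline, a session manager of this kind keeps a server-side cookie store for each signed-on user and replays the right cookies to each remote host. The Python sketch below illustrates the idea under simplifying assumptions: the class and method names are hypothetical, and cookie paths, expiry, and the secure flag are ignored.

```python
class CookieJar:
    """Per-session store of cookies that remote hosts have tried to set."""

    def __init__(self):
        self._cookies = []  # list of (scope, name, value) tuples

    def store(self, remote_host, name, value, domain=None):
        # A cookie with no domain attribute belongs only to the host that set it.
        scope = (domain or remote_host).lower()
        # Replace any earlier cookie with the same scope and name.
        self._cookies = [c for c in self._cookies if (c[0], c[1]) != (scope, name)]
        self._cookies.append((scope, name, value))

    def header_for(self, remote_host):
        """Build the Cookie header value to send to `remote_host`, or None."""
        host = remote_host.lower()
        pairs = []
        for scope, name, value in self._cookies:
            if scope.startswith("."):
                matches = host.endswith(scope) or host == scope[1:]
            else:
                matches = host == scope
            if matches:
                pairs.append("%s=%s" % (name, value))
        return "; ".join(pairs) or None
```

Because the store lives on the proxy, a cookie set by one vendor host for its whole domain can be forwarded to a sibling host in that domain, even though the browser itself only ever talks to a host in .brown.edu.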
Addition of an administrative interface module (4) eased the day-to-day overhead of maintaining the proxy, and allowed us to offload responsibility for such maintenance onto Library staff who formerly had to contact Brown's systems group, who would then have to manually edit the text files that contained the lists of proxied hosts. Addition of an auto-discovery module (5) allowed the system itself to discover, and add, some new IP-restricted vendor machines on its own, thereby easing further the day-to-day maintenance burdens.
At various points in this study I have emphasized the near-term nature of our solution to the problem of IP-restricted Web-based resources. The reason for this is that our solution is basically a workaround for a problem whose fundamental terms (how we authenticate and authorize people for use of networked resources) are in the midst of dramatic change.
Whether we like it or not, issues of authentication, authorization, and access control touch on virtually every aspect of life in a networked computing environment, from access to services like development and alumni-relations databases, to tracking purchase orders, to updating student records and using library resources; from collaborative research and editing to file sharing, backups, license servers, institutional directories, and public/private key repositories. Ideally, every platform, from the desktop Macintosh to the IBM server in the air-conditioned room, should be able to use the same languages and protocols to handle authentication and authorization. This ideal remains somewhere off in the future.
For now the notion of a common environment in which diverse vendors' products all speak a standard authentication language is at best a pipe dream. Vendors have been slow to agree on appropriate standards. Some of them, for instance, have invested heavily in a public/private key infrastructure that doesn't yet exist--and that may prove problematic in the future. Others, in particular Microsoft, support use of standard shared-key systems like Kerberos, but then deliberately alter their implementations of them in such a way as to make them only partially interoperable with implementations done by other vendors. Some of the portal manufacturers are trying to set up standards, both official and via their proprietary APIs, but so far little progress has been made industry-wide. The situation thus remains fluid and not entirely tractable.
For the near term, therefore, the most sensible way of implementing a simple solution to a limited problem like off-campus access to source-IP authenticated campus Web resources is to pick a cheap, solid near-term solution, to make it work, then to wait until the landscape changes and see if another solution presents itself.
Brown's solution, namely a pass-through proxy, has served us well because it requires no special client-side software; it necessitates no basic changes to client browser configurations; and it doesn't depend on a particular authentication method. It offers fine-grained control over what services can be accessed; it's cheap to run, easy to use, and reasonably fast; and it's secure. Although it only handles IP-authenticated resources, and requires a fair amount of maintenance, we have found it to be well suited to our modest needs during this transitional time.
The components used to implement the second revision of Brown's current pass-through proxy are as follows:
Note that RSAREF (one of the components mentioned above) is for nonprofit use only. US businesses can't use it royalty-free. Outside the US, RSAREF may, in some cases, not be needed. Nevertheless, the very fact that one is using secure encrypted transmission technology may violate other laws in your country. Bottom line: Be careful.
The latest (third) revision of Brown's pass-through proxy server dispenses with RSAREF, OpenSSL, and Kerberos above, instead leveraging a cookie-based authentication system initially deployed in the spring of 2000 and used now campus wide. This system uses a central web ticket server that itself incorporates RSAREF, OpenSSL, and Kerberos. The pass-through proxy, which delegates user authentication to the campus-wide authentication system, now no longer needs to run any of these systems. Rather it requires only two mod_perl-based authentication modules and a few extra configuration lines in its configuration file. The central web ticket server, it might be emphasized, is actually not part of the pass-through proxy system per se. It is an independent service that we are merely leveraging to simplify proxy deployment and maintenance.
The new, revision 3, component list is as follows:
The latest revision of our pass-through proxy server packs in about three times the lines of code that the previous one had, due mainly to the new session-management and resource auto-discovery modules that were added. These modules are again implemented using mod_perl. They also require the presence of an SQL database (which is accessed via Perl's DBI library). In our case, that SQL database is MySQL. Considerable rewriting of the existing pass-through module was also required to integrate it with the new session and auto-discovery modules and to add hooks for the new authentication system.
Installation is now much simpler, mainly because of the new authentication system, but also partly because the main Apache configuration file now uses so-called <Perl> sections (blocks of Perl code that adjust Apache parameters, on the fly, to work on the local host).
Although we are still running the pass-through proxy under Apache 1.3.9 + mod_perl 1.21 + mod_ssl 2.4.0 (as in the last revision), this latest revision has also been tested under Linux with Apache 1.3.11-12, mod_perl 1.22-24, and mod_ssl 2.4.8.
For those interested in the details of how the various parts of Brown's pass-through proxy system interact in practice, the following diagrams, and their accompanying text, will take you, step by step, through a typical proxy-mediated HTTP transaction.
The first illustration below depicts a typical transaction mediated by version 2 of the proxy service, deployed in late 1999. Note in particular the fact that it relies on both OpenSSL and Kerberos facilities, which must be installed on the local host:
In contrast to previous versions of the pass-through proxy server, which require that both OpenSSL and Kerberos facilities exist on the local host, the current version offloads functions served by these facilities onto the local campus ticket server, which handles authentication chores on behalf of the pass-through proxy. The current pass-through proxy also has two new modules, one for session management, and the other for auto-discovery of new IP-restricted resources. Addition of these modules complicates transactions somewhat, although performance is not affected in any major way:
Since the release of the first version of this report in 1998, a commercial reverse proxy server, EZproxy, has appeared on the market. EZproxy currently runs under Linux and NT, and has begun to be noticed by the library community (see, e.g., the notice of it given in ALON). EZproxy has some advantages over Brown's pass-through proxy system, in particular that it runs as a stand-alone program that does not require that Perl, a database, or Apache be installed and configured. We have not evaluated this product here at Brown.
For further reading on the topic of authentication, specifically as it pertains to libraries, see Steve Hunt's excellent list, Remote User Authentication in Libraries.
Richard Goerwitz