Remote Access to Web Databases/Files that Require Access Authorization, TCU URL Re-writer Solution
Kerry Bouchard: K.Bouchard@tcu.edu
The program discussed at the conference is fully documented, along with downloadable script files, at:
http://lib.tcu.edu/www/staff/bouchard/cgi_logon/cgi_logon.htm. This version of the program uses the LYNX WWW browser in conjunction with DCL CGI scripts and the VMS implementation of the CERN web server, so it is highly "non-portable". I hope to have a new, Java-based version of the program ready to announce in May (1999). The new version is being tested on an NT box running IIS, but should be compatible with other operating systems and web browsers that can be made to support Java servlets. The new version should also be able to handle authorization for sites that require the exchange of Cookies to maintain state. Links to the new program will be posted on the Web page above, and I will send announcements to Web4Lib and ATLAS-L.
Validation Scenarios Handled/Not Handled by Lynx URL Re-writer Program
The program in use at TCU currently handles the following validation scenarios:
- Vendor validates by IP address (i.e., works like a regular proxy – vendor’s server will see access coming from the IP address of the machine where the URL re-writer program is running, so they grant access).
- Vendor uses HTTP-based Username/Password authorization – essentially the same as IP recognition, except that the URL re-writer program will send the Username and Password to the remote site for each link. (For details, see the HTTP protocol specification regarding access authorization, listed on the web page at the top of this handout). Username and Password are stored in the script, which will be in a directory that grants execute-only (not read) access, so user never sees them.
- Vendor has a form that you fill out to gain access; if their system likes your username and password (or authorization number, or whatever they ask for), they return a temporary session ID in the URLs of the page returned, which is then used for gaining access until the session id expires. Form-based validation is one of the validation options for OCLC First Search and OCLC ECO and a variation of it is used by the UMI Ovid system. As with OCLC’s WebScript program, it may be possible to script past opening screens and go directly to the screen you want your users to start out with – for example, TCU’s script for Periodicals Abstracts sends our Username and Password to the opening form, then uses the session ID returned to construct a URL that goes directly to the Periodicals Abstracts search screen, bypassing some confusing opening screens.
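For the HTTP-based Username/Password scenario above, the credentials travel in an Authorization request header that the re-writer adds to every request it makes to the vendor. The following is a minimal Python sketch of how that header is built under the HTTP Basic scheme; it is not the DCL code used at TCU, and the function name and sample credentials are hypothetical.

```python
import base64


def basic_auth_header(username: str, password: str) -> dict:
    """Build the Authorization header for HTTP Basic authentication.

    The credentials are joined with a colon and base64-encoded, as the
    HTTP access authorization scheme describes.
    """
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Hypothetical institutional credentials, kept only on the server side.
headers = basic_auth_header("libuser", "secret")
```

Because the header is constructed on the machine running the re-writer, the user's browser never receives the username or password.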
The Lynx Proxy program currently does not handle the following scenarios:
- Same as the last one above, except the vendor returns the temporary session ID in a cookie instead of embedded in URLs. Since Lynx doesn’t store cookies, there is currently no way to do this.
- Vendor uses forms that contain input fields longer than 2048 bytes – VMS implementation of the CERN server cannot pass values longer than 2048 bytes to a CGI script.
- Vendor wants to establish a meaningful personal relationship with everyone who visits their site – typically by assigning the library an institutional username and password that patrons are supposed to use to fill in a form to set up an individual username and password. Library presumably sets up a way for people to obtain the institutional name and password ("but promise not to tell your friends in Austin…"), and then each user sets up their own account. I don’t see any good way to enforce security of the institutional name and password in this scenario, and of course it raises privacy issues, since the vendor can individually track what each user does each time they visit.
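To make the cookie scenario above concrete: a re-writer that handled it would have to remember each Set-Cookie value the vendor sends and return it in a Cookie header on later requests. The sketch below shows only that bookkeeping, using Python's standard http.cookies module. It is an illustration of the requirement, not a description of the planned Java version, and the function name and cookie names are hypothetical.

```python
from http.cookies import SimpleCookie


def cookie_header_from_response(set_cookie_values):
    """Turn the vendor's Set-Cookie header values into a Cookie header value.

    Attributes such as path or expires are parsed but not sent back;
    only the name=value pairs are returned to the vendor.
    """
    jar = SimpleCookie()
    for value in set_cookie_values:
        jar.load(value)
    return "; ".join(f"{name}={morsel.value}" for name, morsel in jar.items())


# Hypothetical headers from a vendor's login response.
cookie_value = cookie_header_from_response(["SESSION=abc123; path=/", "lang=en"])
```

A real implementation would also need to keep a separate set of cookies for each user session, which this sketch does not attempt.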
How a URL Rewriter Differs From a Proxy
From a user’s standpoint, the difference between a URL re-writer and a proxy is that the user does not need to change browser settings to start going through a proxy: they simply click on a URL that points to the URL re-writer program. From a systems standpoint, handling the form-based (and possibly even the HTTP-based Username/Password) scenarios above may not be doable with a commercial proxy server, since access authorization was not what proxies were invented for.
What it Does
The re-writer script fetches HTML pages and other data on behalf of the user and sends the data onto the user’s browser. When fetching HTML data, the program re-writes links on the fly, as in the following example:
<a href="/dir1/dir2/afile.html">
is converted from "relative" to "absolute" form:
<a href="http://www.vendor.com/dir1/dir2/afile.html">
and then the URL of the re-writer program is prepended to the URL for the remote resource:
<a href="http://lib.tcu.edu/htbin/Proxy.pp?http://www.vendor.com/dir1/dir2/afile.html">
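The two rewriting steps in this example can be sketched in a few lines of Python. This is a simplified illustration rather than the DCL/Lynx code: it assumes anchor tags with double-quoted href attributes, and the function and pattern names are hypothetical. The re-writer address is the one from the example above.

```python
import re
from urllib.parse import urljoin

# Address of the re-writer program, taken from the example above.
REWRITER_URL = "http://lib.tcu.edu/htbin/Proxy.pp?"

# Matches the href attribute of an <a> tag (double-quoted values only).
HREF_PATTERN = re.compile(r'(<a\s[^>]*?href=")([^"]*)(")', re.IGNORECASE)


def rewrite_links(html: str, page_url: str, rewriter_url: str = REWRITER_URL) -> str:
    """Make each link absolute, then prepend the re-writer's URL to it."""

    def replace(match):
        # Step 1: resolve the link against the URL of the page it came from.
        absolute = urljoin(page_url, match.group(2))
        # Step 2: route the link back through the re-writer.
        return match.group(1) + rewriter_url + absolute + match.group(3)

    return HREF_PATTERN.sub(replace, html)


rewritten = rewrite_links('<a href="/dir1/dir2/afile.html">',
                          "http://www.vendor.com/index.html")
```

A production version would also have to cope with single-quoted or unquoted attributes, form actions, frames, and non-HTTP links such as mailto:, which is where much of the parsing complexity comes from.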
If the vendor site requires a Username/Password (HTTP-based), then the proxy program sends these along with the request for each page. If the vendor site uses forms to start a session, then the script for that site contains extra statements to fill in the form with the correct parameters and pass the user the page the vendor returns after the form has been successfully processed.
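For the form-based case, the site-specific part of a script comes down to two steps: find the session ID in the page the vendor returns after the form is submitted, then build a URL that uses it. The Python sketch below assumes a hypothetical vendor that embeds the ID as a "sessionid" query parameter. The parameter names, paths, and function names are illustrative only and do not describe any particular vendor's system.

```python
import re
from typing import Optional
from urllib.parse import urlencode

# Hypothetical: the vendor embeds the ID as "sessionid=..." in its links.
SESSION_PATTERN = re.compile(r"[?&;]sessionid=([A-Za-z0-9]+)")


def extract_session_id(html: str) -> Optional[str]:
    """Return the first session ID found in the page, or None if absent."""
    match = SESSION_PATTERN.search(html)
    return match.group(1) if match else None


def build_start_url(base_url: str, session_id: str, database: str) -> str:
    """Build a URL that goes directly to the desired search screen."""
    query = urlencode({"sessionid": session_id, "db": database})
    return f"{base_url}?{query}"


# Hypothetical page returned after the login form was accepted.
page = '<a href="/cgi/menu?sessionid=abc123&db=x">Main menu</a>'
session_id = extract_session_id(page)
start_url = build_start_url("http://www.vendor.com/cgi/search", session_id, "pa")
```

If no session ID is found, the login was presumably rejected, and the script should report that rather than pass a broken link on to the user.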
Advantages/Disadvantages of This Approach (or of TCU’s Implementation)
Advantages:
- Works with validation scheme the vendor already has in place (obviously you need to check the license to be sure that remote access is legal). No special setup is required from vendor.
- As compared to a regular proxy, no setup is required from the user; unless they use View Source to examine the URLs in the pages they get back, they won’t know they’re going through a proxy.
- Can handle form-based validation, like OCLC WebScript, provided vendor isn’t doing something else it can’t handle (requiring Cookies).
- TCU’s implementation uses a separate path for proxying images versus HTML pages; since most sites do not put any IP or other validation in front of the graphics that they use on their menus, there is no need to proxy access to images, and this reduces the load on the server running the URL re-writer.
- Should work with browsers behind firewalls, or that are already set up to use a proxy (since the URL re-writer is not a proxy in the sense that Web browsers and servers use the term).
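One way to read the separate handling of images mentioned in the list above is that image references are made absolute but are not routed through the re-writer, so the user's browser fetches graphics straight from the vendor. The Python sketch below shows that idea under the assumption of double-quoted src attributes. It is an interpretation rather than TCU's actual code, and the names are hypothetical.

```python
import re
from urllib.parse import urljoin

# Matches the src attribute of an <img> tag (double-quoted values only).
IMG_SRC_PATTERN = re.compile(r'(<img\s[^>]*?src=")([^"]*)(")', re.IGNORECASE)


def absolutize_images(html: str, page_url: str) -> str:
    """Point image references at the vendor's server, bypassing the re-writer."""

    def replace(match):
        absolute = urljoin(page_url, match.group(2))
        return match.group(1) + absolute + match.group(3)

    return IMG_SRC_PATTERN.sub(replace, html)


converted = absolutize_images('<img src="images/logo.gif">',
                              "http://www.vendor.com/dir1/menu.html")
```

This only works for sites that leave their graphics unprotected; images that sit behind the vendor's validation would still have to be fetched by the re-writer.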
Disadvantages:
- Is a complex solution to a simple problem, unlike the solution described by Tim Kambitsch of using Referrer fields. If vendors would all agree to start using that solution, there would be no need for proxies or URL-rewriters as far as I can tell.
- Like regular proxy solutions, places a load on the library’s server that would otherwise not be there.
- Like regular proxy solutions, creates an additional failure point – even if vendor’s system is working fine, the proxy server could be out of whack. Probably worse with the URL-rewriter than with a true proxy, since the URL-rewriter could fail in ways (getting confused about how to parse HTML statements it finds) that would not be an issue with a regular proxy.
- The DCL scripts for the Lynx-based program at TCU have become overly complex in some cases, and can be tedious to diagnose when something stops working. The Java-based version of the program should alleviate this somewhat.
- Current implementation can’t handle Cookies (turns out that a lot of sites that send Cookies will still work if you don’t accept them, but some vendor systems won’t work without them), and has other limitations inherent in using DCL as a CGI scripting language.