Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Web Techniques Column 32 (Dec 1998)

Ahh, cookies. Not one of my favorite subjects. Cookies, I've decided, are mostly evil. Every site I've visited seems to want to send me cookies (as MSIE is happy to report). For the most part, they want to do things like target advertising at me, or even see how often I've been to their particular site.

And more and more, cookies are being used to identify me as a particular person to access customized features (like my.yahoo.com or my online banking service). And that's annoying. Why? Because I use a half-dozen browsers (if you count agents made from Perl scripts) in a day, and cookies are associated with a particular browser, not a particular person.

So, what to do? I need a cookie jar that holds all my cookies. I thought a bit about it, and came up with an interesting approach: what if I directed all my web requests through a proxy server that could extract cookies from the responses, and add the right cookies to future requests? Aha! That would solve it.

Now, I already wrote a Perl proxy server for this column a few years ago, and I could have wired my cookie jar into that server and kept it running. But some of the browsers I use in a day are the ones at client sites where I'm teaching or consulting, and it's not always possible to change their proxy settings in a way that would work.

So I took a different approach. A ``poor man's proxy'', handled entirely as a CGI script! I'd simply invoke the CGI script, extending the URL of the script to include the URL of the page I wanted fetched, and the program would fetch the page, handling the incoming and outgoing cookies. Of course, the URLs on that page would also have to refer to the script, so there's some rewriting to be done on the HTML pages.

In practice, I found this script usable, albeit a bit slow. But it's not meant for full-time use: only for when I'm accessing those silly cookie sites. My guess is that reloading the LWP library on every request is what slows it down, and that could easily be eliminated by using mod_perl inside Apache. (I'll have to try that soon, because my ISP just installed mod_perl on the webserver that handles my website.)

The cookie jar program is presented in [listing one, below].

Lines 1 through 3 start all my non-trivial programs, enabling taint checks, warnings, compile time restrictions, and disabling output buffering.

Lines 5 through 11 set up an exception handler. Inside a BEGIN block (meaning that it will be executed as soon as it has been compiled successfully), we establish a handler for all fatal errors (including compilation errors!). This handler prints a 404 status, to let the browser know we broke somehow. The Perl error that triggered the fatal exception is printed as the content of the message; most browsers will display that along with noting that the request failed. Finally, the subroutine exits with a normal exit status, keeping the webserver from giving us the dreaded ``500 server error'' we see only too often while debugging.
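
With this handler in place, a broken request comes back looking something like this (the exact error text varies, and the script path shown here is made up):

        Status: 404 Not Found
        Content-type: text/plain

        Perl error was missing PATH_INFO at /home/merlyn/cgi/jar line 24.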

Lines 13 and 14 pull in two modules from the LWP library, including the cookie-handling routines.

Line 16 defines the path to the persistent storage for the cookies. Since the script is ephemeral, but we want the cookies to stick around, this has to be a place out on the disk. Note that every user of this script shares the same cookie jar, and that the jar must be writable by the user ID that executes CGI scripts (often the nobody user).

Lines 18 and 19 set up a user agent, which is like a browser. The $ua object represents our connection to the rest of the web, and we talk to it to fetch URLs.

Line 21 creates the memory-based cookie jar, initialized from the contents of the on-disk cookie jar. This object is smart, and will save itself back to the disk (probably updated) when the script exits.

Lines 23 through 27 reconstruct the URL that we're proxying by examining the PATH_INFO and QUERY_STRING. This is the part of the URL in the original request that appears after the script name, and will be the URL that we'll eventually want to fetch. (I used a similar technique in my go script in this column a few months back.)
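
To make that concrete, here's how a hypothetical request decomposes (assuming the server hands the extra path through untouched):

        request:      http://www.stonehenge.com/cgi/jar/http://www.perl.org/search?q=llama
        PATH_INFO:    /http://www.perl.org/search
        QUERY_STRING: q=llama
        rebuilt URL:  http://www.perl.org/search?q=llama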

Line 30 extracts the web name of this script by taking the SCRIPT_URI and backing out of it any part attributable to PATH_INFO. This looks messy; I wish there were a variable that held exactly the script URI, but apparently not. We'll need this name when we rewrite the URLs in the HTML being passed through.
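
Continuing the hypothetical request above, the subtraction in line 30 works out like this:

        SCRIPT_URI:   http://www.stonehenge.com/cgi/jar/http://www.perl.org/search
        PATH_INFO:                                      /http://www.perl.org/search
        script name:  http://www.stonehenge.com/cgi/jar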

Line 32 creates a request for us to be able to fetch the real page from the URL we've extracted from our invocation. Note how the method (GET or POST) is extracted automatically from whichever method was being used to invoke this script. That way, both GET and POST work correctly on forms.

If the incoming request has content (usually only a POST), we'll grab it in lines 34 through 38 and make it part of the proxied request. One would hope that we don't see a few gigabytes here, but there's not much you can do about it at this level.

Lines 40 through 42 pass the content type for this content through to the request.

Lines 44 through 46 take all the miscellaneous incoming headers and pass them through into the proxied request. The CGI protocol puts these into environment variables whose names begin with HTTP_, followed by the original header names, so we have to reverse the process. The only header I can't pass through is the Host header, because that one was used to connect to my web server, and is wrong for the proxied request.
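
For example (a typical sampling; your browser's set will differ), the loop turns

        HTTP_USER_AGENT, HTTP_REFERER, HTTP_ACCEPT_LANGUAGE

back into User-Agent, Referer, and Accept-Language headers on the proxied request. (Handily, LWP's header code treats an underscore in a header name as a hyphen, so the names come out right without any extra work.)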

Line 48 examines the request to see which cookies, if any, are appropriate for this request. If one or more cookies qualify, they are added to the request.

Line 49 is the core of this script, fetching the desired page. The result ends up in an HTTP::Response object in $response.

Next, we've got some cleaning up to do. The LWP package puts some extra stuff into the response that will get in our way if handed through blindly. Lines 52 through 56 blast away any response header that begins with X-Meta- or is exactly Content-Length. The former capture META ITEM=VALUE constructs from the head of the HTML content, and the latter is certain to change if we do any URL rewriting.

Lines 58 through 61 deal with the only URL that can appear in the headers (as opposed to the body of a response): the URL of a redirect. So, here we're seeing the first taste of rewriting URLs to point back at the script. The Location header gets the script name in front of it. Without this, a redirect sends the browser to a direct fetch, instead of fetching through this pseudo-proxy.
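
So a redirect that originally said

        Location: http://www.perl.org/new/

goes back to the browser as

        Location: http://www.stonehenge.com/cgi/jar/http://www.perl.org/new/

(using the installation name from the example near the end of this article; the target path is made up).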

Lines 63 and 64 dump the headers, including the required blank line, to STDOUT. For the most part, these then become the headers returned to the browser.

Now to deal with that URL rewriting. If the content type isn't HTML, line 72 just dumps the content (such as for an image or a text file). However, if the content is HTML, we execute lines 67 to 70.

Line 67 performs a ``lightweight AUTOLOAD''. All the data after the end marker in line 75 is ignored unless we hit this line. However, if we need to parse the HTML, it's time to pull in an additional package MyFilter, defined beginning in line 76. This filter is a subclass of HTML::Parser, and can accurately determine what parts of the HTML are URLs to be rewritten.
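
If you haven't seen the eval-the-DATA trick before, here's a minimal stand-alone sketch of it (the package name and messages are invented for illustration):

        #!/usr/bin/perl
        use strict;
        print "main program compiled; DATA not yet parsed\n";
        if (@ARGV) {                       # stand-in for the "is it HTML?" test
          eval join "", <DATA>; die if $@; # compile the extra package only now
          Extra::greet();                  # and only now can we call it
        }
        __END__
        {
          package Extra;
          sub greet { print "loaded on demand\n" }
        }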

Lines 68 to 70 take the content of the HTML message and feed it to the filter, using the protocol described for any HTML::Parser. The only unusual thing is that the new method is being passed two parameters, squirreled away in the constructor method (described momentarily).

Lines 76 to 140 define this filter. I could stick this into another file and it'd be perfectly happy, but the DATA part of the program is good enough.

Lines 77 through 83 declare the package, and pull in items that are needed for this class, including establishing this class as a subclass of the HTML::Parser class.

Lines 85 through 97 were stolen from the LWP library's HTML::Element class. The linkElements variable is the most complete mapping I've seen of tags to the attributes likely to contain URLs, so I've borrowed the code here (with proper attribution).

But the format of that data structure does not make for quick searching, so lines 99 to 105 transform it into a simple flat hash, just once when the data is loaded.
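
After that, deciding whether a tag/attribute pair is interesting takes just one hash lookup:

        $tag_attr{"a href"}      # true
        $tag_attr{"img lowsrc"}  # true
        $tag_attr{"p align"}     # false: not a URL-bearing pair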

Lines 107 to 113 define the constructor for an object of class MyFilter. The superclass constructor gets called first, and then we save away the two parameters into additional instance variables.

Lines 115 through 120 create ``null'' translations for most of the things that can occur in HTML. Since I'm creating something similar to what HTML::Filter does, I just looked there for the right code to borrow. If you want to make your filter recognize more complicated things (like removing inline images or fixing broken tables), those are the hooks that you'll need to alter.

Lines 122 to 139 handle the only kind of HTML construct we have to mangle: a start tag. Line 129 recognizes a combination of a particular tag and attribute that we're interested in. If it's a good combination, it's time to rewrite the value.

Line 130 takes the URL attribute, and rewrites it so that it's absolute. To do this, we have to know the URL that requested the page. Handily, that is kept in an instance variable.

Lines 131 through 133 next look at the protocol. If it's HTTP, we can (and must) send this request back through this script. That's achieved by hacking the script name onto the front of the URL (just like we did with the Location header earlier).
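
Following one attribute all the way through: if the page came from http://www.perl.org/ and contains a (made-up) link like

        <a href="about.html">

the filter first absolutizes the URL to http://www.perl.org/about.html, then prefixes the script name, yielding

        <a href="http://www.stonehenge.com/cgi/jar/http://www.perl.org/about.html">

A mailto: or ftp: URL would be made absolute but left unprefixed, since this script can proxy only HTTP.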

And that's it! Whew! A large program to accomplish a simple cookie jar.

So, to use this, just stick it into your CGI area, and invoke it by appending the desired starting URL to the script's own URL. For example, if I placed this into my CGI area under the name jar, I'd invoke it as

        http://www.stonehenge.com/cgi/jar/http://www.perl.org/

to get the home page for The Perl Institute. When the page comes back, I notice that all the URLs have been rewritten to point back through the script. So it must be working.

I hope you enjoyed this ``poor man's proxy'', and can now see how to keep all your cookies safe in the cookie jar. Enjoy!

Listings

        =1=     #!/home/merlyn/bin/perl -Tw
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     BEGIN {
        =6=       $SIG{__DIE__} = sub {
        =7=         print "Status: 404 Not Found\nContent-type: text/plain\n\n";
        =8=         print "Perl error was $_[0]\n";
        =9=         exit 0;
        =10=      }
        =11=    }
        =12=    
        =13=    use LWP;
        =14=    use HTTP::Cookies;
        =15=    
        =16=    my $JARFILENAME = "/tmp/merlyn.cookie.jar";
        =17=    
        =18=    my $ua = LWP::UserAgent->new;
        =19=    $ua->env_proxy;
        =20=    
        =21=    my $jar = HTTP::Cookies->new(File => $JARFILENAME, AutoSave => 1);
        =22=    
        =23=    my $url = $ENV{PATH_INFO};
        =24=    die "missing PATH_INFO" unless defined $url;
        =25=    die "malformed PATH_INFO: $url" unless $url =~ s/^\///;
        =26=    my $query = $ENV{QUERY_STRING};
        =27=    $url .= "?$query" if defined $query and length $query;
        =28=    
        =29=    ## it'd be nice if there were a nicer way to do this
        =30=    my $script_name = substr($ENV{SCRIPT_URI}, 0, -length $ENV{PATH_INFO});
        =31=    
        =32=    my $req = HTTP::Request->new($ENV{REQUEST_METHOD} => $url);
        =33=    
        =34=    if ($ENV{CONTENT_LENGTH}) {
        =35=      my $buf;
        =36=      read(STDIN, $buf, $ENV{CONTENT_LENGTH});
        =37=      $req->content($buf);
        =38=    }
        =39=    
        =40=    if ($ENV{CONTENT_TYPE}) {
        =41=      $req->content_type($ENV{CONTENT_TYPE});
        =42=    }
        =43=    
        =44=    for (sort grep !/^HOST$/, map /^HTTP_(.*)/, keys %ENV) {
        =45=      $req->header($_, $ENV{"HTTP_$_"});
        =46=    }
        =47=    
        =48=    $jar->add_cookie_header($req);
        =49=    my $response = $ua->simple_request($req);
        =50=    $jar->extract_cookies($response);
        =51=    
        =52=    $response->scan(sub {
        =53=                      my ($h, $v) = @_;
        =54=                      $response->remove_header($h) if
        =55=                        $h =~ /^(X-Meta-|Content-Length$)/i or $v =~ /\n/;
        =56=                    });
        =57=    
        =58=    if ($response->is_redirect) {
        =59=      my $loc = $response->header("Location");
        =60=      $response->header(Location => "$script_name/$loc");
        =61=    }
        =62=    
        =63=    print $response->headers_as_string;
        =64=    print "\n";
        =65=    
        =66=    if ($response->content_type eq "text/html") {
        =67=      eval join "", <DATA>; die if $@;
        =68=      my $p = MyFilter->new($url, $script_name);
        =69=      $p->parse($response->content);
        =70=      $p->eof;
        =71=    } else {
        =72=      print $response->content;
        =73=    }
        =74=    
        =75=    __END__
        =76=    {                               # begin embedded package
        =77=      package MyFilter;
        =78=      require HTML::Parser;
        =79=      require HTML::Entities;
        =80=      require URI::URL;
        =81=      
        =82=      use vars qw(@ISA);
        =83=      @ISA = qw(HTML::Parser);
        =84=      
        =85=      my %linkElements =            # from HTML::Element.pm
        =86=        (
        =87=         body   => 'background',
        =88=         base   => 'href',
        =89=         a      => 'href',
        =90=         img    => [qw(src lowsrc usemap)], # lowsrc is a Netscape invention
        =91=         form   => 'action',
        =92=         input  => 'src',
        =93=         'link'  => 'href',         # need quoting since link is a perl builtin
        =94=         frame  => 'src',
        =95=         applet => 'codebase',
        =96=         area   => 'href',
        =97=        );
        =98=      
        =99=      my %tag_attr;
        =100=     for my $tag (keys %linkElements) {
        =101=       my $tagval = $linkElements{$tag};
        =102=       for my $attr (ref $tagval ? @$tagval : $tagval) {
        =103=         $tag_attr{"$tag $attr"}++;
        =104=       }
        =105=     }
        =106=     
        =107=     sub new {
        =108=       my $pack = shift;
        =109=       my $self = $pack->SUPER::new();
        =110=       $self->{Url} = shift;
        =111=       $self->{ScriptName} = shift;
        =112=       $self;
        =113=     }
        =114=     
        =115=     ## some items stolen from HTML::Filter
        =116=     sub output { print $_[1]; }
        =117=     sub declaration { $_[0]->output("<!$_[1]>") }
        =118=     sub comment { $_[0]->output("<!--$_[1]-->") }
        =119=     sub text { $_[0]->output($_[1]) }
        =120=     sub end { $_[0]->output("</$_[1]>") }
        =121=     
        =122=     sub start {
        =123=       my $self = shift;
        =124=       my ($tag, $attr, $attrseq, $origtext) = @_;
        =125=       $self->output("<$tag");
        =126=       for (keys %$attr) {
        =127=         $self->output(" $_=\"");
        =128=         my $val = $attr->{$_};
        =129=         if ($tag_attr{"$tag $_"}) { # needs rewrite
        =130=           $val = URI::URL::url($val)->abs($self->{Url},1); # make absolute
        =131=           if ($val->scheme eq "http") {
        =132=             $val = $self->{ScriptName} . "/$val"; # force return to us
        =133=           }
        =134=         }
        =135=         $self->output(HTML::Entities::encode($val, '<>&"'));
        =136=         $self->output('"');
        =137=       }
        =138=       $self->output(">");
        =139=     }
        =140=   }                               # end embedded package

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.