Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.
Download this listing!

Web Techniques Column 11 (February 1997)

Maybe Gisle Aas should work for Nike, because he really keeps on ``just doing it.'' In this case, I'm referring to a really neat addition to the already all-singing all-dancing LWP library (found in the CPAN), namely the addition of HTTP::Daemon in version 5.004.

With HTTP::Daemon, it's possible to whip up a ``mini http server'' in a dozen or so lines of code. And, as I was playing with the new toys for a community service project I'm working on, it occurred to me that a neat demonstration would be to write a proxy server. Well, more than a proxy server -- an anonymizing proxy server!

The essence of an anonymous proxy is that you tell your browser that all requests should go through that proxy, which strips all hints that the request came from you in particular, and then handles the actual request. The real web server being contacted has no clue where the original request came from, although they can obviously see that it was from your anon proxy host.

And, without further delay, I present my hastily whipped-up anon proxy server in Listing 1 [below].

Lines 1 and 2 begin nearly every program I write. Lines 3 and 4 make life a little easier for a tainted program that I'm debugging.

Lines 6 through 8 are self-explanatory, but I would caution you against using this program as-is. It's mostly a proof-of-concept, but there may be bugs (I didn't shake-test this thing very well). Also, as you see, every new connection causes a new process launch -- OK for lightweight applications, but pretty expensive in the long run. (In a future column, I'll show how to create a bundle of non-forking processes to handle multiple connections in a round-robin fashion, ala Apache or modern NCSA httpd... from Perl!)

Lines 13 and 14 establish the location of this proxy server. You may adjust to fit. If you pick a fixed address and you kill the server, you may not be able to reuse that port address for up to 30 seconds. If you put ``0'' for the port, the system picks the port for you and you won't have this problem -- really handy when you are stopping and starting the server many times (while debugging, for example). The downside is that you then have to get the address from the first logging message (see later). Oh well.

Lines 16 through 20 are really handy for logging messages to a logfilel from a process that may be forking children (perhaps illegal in Tennessee, so be careful). The argument list is massaged to include a datestamp and a process ID in front of each line. The result is a single string. To figure this one out (one of my famous one-liners), read it from right to left: take the argument list (@_); make one string out of it (with join); split that string on newlines; take each piece of that list and prefix it with the current time and processID; and finally, create a single string out of that again. Yeah, there are probably better ways to do that, but this one works for now. (In fact, I thought of two better ways while writing this paragraph. Perhaps in a future column...)

Lines 22 and 23 cause all warns and dies to use the prefix function, thus putting the identifying string in front of the error message. Line 24 causes all exited children to be waited for. Nice.

Lines 26 through 37 create a ``user agent'' in $AGENT, suitable for performing the ``client-side'' aspect of the proxy server. This user agent is like the standard LWP::UserAgent, but has a hardwired ``agent'' type, and supports the proxy variables. That is, this proxy server will properly respect another proxy setting, so you could chain these if you wished.

There's also another difference between this agent and a standard agent: redirect messages from the servers this agent contacts will be passed through to the clients of this server, not handled internally. This is important: without this, a redirect would be handled transparently by LWP, and the client of this server would be looking at a page with a different real URL than what it asked for. That's a real mess when you are following relative URLs.

The difference between this agent and a standard user agent is created through the use of a subclass. The class ``MyAgent'' is established, with line 30 setting its superclass to LWP::UserAgent. The ``new'' method in line 32 is actually found in LWP::UserAgent, but the resulting object is still of the MyAgent class. Both the ``agent'' and ``env_proxy'' methods also come from the LWP:UserAgent class.

So, what's the point of all this? It's the subroutine definition in line 37. When an LWP::UserAgent (or its subclasses) want to know whether a redirect should be handled internally or passed back to the user, it sends itself a ``redirect_ok'' message. In the LWP::UserAgent class, this returns 1 for GETs and 0 for POSTs, but we've overridden that method here for a ``MyAgent'' as 0 for everything. Thus, an object of class MyAgent will never handle a redirect internally.

Yeah, that's the object way to do it. But sometimes, I object to the extra complexity.

Lines 39 to 48 form the ``main'' code. I put it in a block so that I could have my top-level variables still have limited scope. I've been hacking larger and larger programs lately, and have (re-)discovered how global variables are nearly inherently evil. So, I try to minimize them, and it seems to be simplifying my life again.

Line 40 pulls in the HTTP::Daemon class, new in LWP version 5.004, and without which this code would have been three times as long. Lines 42 and 43 create the top-level daemon at the indicated address. The $master object represents the listener at the port.

Line 46 is a standard ``wait for a connection and handle it'' loop. Each new connection to the port causes the ``accept'' method to return a new IO handle into $slave, representing a connection to a particular remote (or local) process. The &handle_connection function then takes care of that particular client.

Speaking of which, lines 50 through 66 define &handle_connection. The first (and only) argument is shifted off into the local $connection variable in line 51. This is the $slave from the main code.

Line 53 triggers a fork, creating a new process for each new connection. In the parent process, $pid is some number, and therefore true. In the child process, $pid is 0, and therefore false. If the fork fails, $pid is undef, and thus also false. I decided that if the fork fails, the parent (and only) process would handle the request would handle the connection (where are your kids when you need them?).

Line 59 is pretty intense... it actually causes the entire client message to be read, parsed, and stored as a HTTP::Request object, which ends up in $request. If the client talks gibberish, the $request is undef instead, which is tested in line 60.

Line 61 handles the good request, using my &fetch_request routine (defined slightly later). This always results in an HTTP::Response object, even if the request is denied for some reason. The response is returned to the client in line 62, and then the client connection is closed in line 63.

Line 65 exits the child process. If the fork originally failed, then $pid is undef, and we don't want to exit there, thus the conditional.

Lines 68 through 86 define the &fetch_request routine, which expects an HTTP::Request object as its only parameter, and returns an HTTP::Response, indicated by those cleverly placed comments in the first two lines.

Line 71 pulls in the HTTP::Response class, needed because we are constructing a response from scratch in a few cases.

Line 73 extracts the URL from the request. This is an object of class URI::URL, so we can do cool things to it. However, it prints nicely (because of an overloaded ``as_string'' method), so its use in the double-quoted string on line 74 still looks clean.

Line 75 extracts the scheme from the URL (like http or gopher). If this value isn't one of the legal types, I don't want the proxy server to try to fetch it. In particular, a type of ``file:'' allows a client to fetch files on this machine, so that'd be a really bad one to permit.

The failure is indicated by returning back an HTTP::Response that has a 403-type error code in lines 76 through 78. This routine constructs a 403 message with a content of the scheme type. Notice the use of @{[$url->scheme]} inside the double-quoted string, which is a nice way to get a function call into a string. Perhaps it would have been easier to simply have three arguments to the content method, thus requiring a little less ``showing off''. Oh well.

Lines 79 through 82 similarly return an error status message if the host and port of the requested URL is relative to this host and port. This happens because of one of two reasons: (1) either the host/port was not specified (only if someone had accidentally used this as a direct server instead of a proxy server), or (2) the server has looped back onto itself somehow. Neither one is a good thing, so we disallow both with the same obscure error message.

If we made it past all the sanity checks, it's time to fetch the actual stuff in line 84. Note that the return value for this subroutine can come from any of lines 84, 82, or 78.

Wow. By the time we're down to line 88, we have a validated request that should actually be fetched using the cool user agent created at the top of the file (remember that up there?). This request ends up in $request in line 89. And as a reminder that I'm using a global variable, I comment that fact in line 91.

Lines 93, 95, 97, and 99 are commented out here. If you want to see the headers before and after anonymizing, you can put them in. I had them in while debugging, but they get really annoying after a while, not to mention that they slow down the processing a bit.

Line 93 in particular shows the headers of the request before the tattletale ``User-Agent'', ``From'', ``Referer'', and ``Cookie'' headers are removed. These headers are zapped in line 94. Now, these headers were the only ones that I saw Netscape Navigator spitting out that might give away information about who I am or where I'm coming from... if you find more headers that should be killed, let me know, and I'll update y'all in a future column.

Line 96 actually fetches the requested web transaction, returning the response object in $response. Now, if this fails, we still get back a response object, but one that says that we didn't get what we wanted.

Line 98 cleans out the inherently evil ``Set-Cookie'' header. Now, since the requests don't have cookies passed back to the server, this really doesn't make any difference, but what the heck, I'm already attacking -- I might as well be completely agressive.

And there you have it. Set this thing up on a lightly loaded machine, fire it up, point your web browser at it as a proxy, and surf away, knowing that the only thing that the ultimate server gets to know is that an anonymous request came from your lightly loaded machine.

You can even chain them.. point one at another. You could probably even tell the useragent to use a different relay anonymous proxy at each request. Wow. The mind boggles. Have fun, and always practice safe surfing (and just say ``No!'' to cookies!).

Listing 1

        =1=     #!/home/merlyn/bin/perl -Tw
        =2=     use strict;
        =3=     $ENV{PATH} = join ":", qw(/usr/ucb /bin /usr/bin);
        =4=     $|++;
        =5=     
        =6=     ## Copyright (c) 1996 by Randal L. Schwartz
        =7=     ## This program is free software; you can redistribute it
        =8=     ## and/or modify it under the same terms as Perl itself.
        =9=     
        =10=    ## Anonymous HTTP proxy (handles http:, gopher:, ftp:)
        =11=    ## requires LWP 5.04 or later
        =12=    
        =13=    my $HOST = "localhost";
        =14=    my $PORT = "8008";
        =15=    
        =16=    sub prefix {
        =17=      my $now = localtime;
        =18=    
        =19=      join "", map { "[$now] [${$}] $_\n" } split /\n/, join "", @_;
        =20=    }
        =21=    
        =22=    $SIG{__WARN__} = sub { warn prefix @_ };
        =23=    $SIG{__DIE__} = sub { die prefix @_ };
        =24=    $SIG{CLD} = $SIG{CHLD} = sub { wait; };
        =25=    
        =26=    my $AGENT;                      # global user agent (for efficiency)
        =27=    BEGIN {
        =28=      use LWP::UserAgent;
        =29=    
        =30=      @MyAgent::ISA = qw(LWP::UserAgent); # set inheritance
        =31=    
        =32=      $AGENT = MyAgent->new;
        =33=      $AGENT->agent("anon/0.07");
        =34=      $AGENT->env_proxy;
        =35=    }
        =36=    
        =37=    sub MyAgent::redirect_ok { 0 } # redirects should pass through
        =38=    
        =39=    {                               ### MAIN ###
        =40=      use HTTP::Daemon;
        =41=    
        =42=      my $master = new HTTP::Daemon
        =43=        LocalAddr => $HOST, LocalPort => $PORT;
        =44=      warn "set your proxy to <URL:", $master->url, ">";
        =45=      my $slave;
        =46=      &handle_connection($slave) while $slave = $master->accept;
        =47=      exit 0;
        =48=    }                               ### END MAIN ###
        =49=    
        =50=    sub handle_connection {
        =51=      my $connection = shift;       # HTTP::Daemon::ClientConn
        =52=    
        =53=      my $pid = fork;
        =54=      if ($pid) {                   # spawn OK, and I'm the parent
        =55=        close $connection;
        =56=        return;
        =57=      }
        =58=      ## spawn failed, or I'm a good child
        =59=      my $request = $connection->get_request;
        =60=      if (defined($request)) {
        =61=        my $response = &fetch_request($request);
        =62=        $connection->send_response($response);
        =63=        close $connection;
        =64=      }
        =65=      exit 0 if defined $pid;       # exit if I'm a good child with a good parent
        =66=    }
        =67=    
        =68=    sub fetch_request {
        =69=      my $request = shift;          # HTTP::Request
        =70=    
        =71=      use HTTP::Response;
        =72=    
        =73=      my $url = $request->url;
        =74=      warn "fetching $url";
        =75=      if ($url->scheme !~ /^(http|gopher|ftp)$/) {
        =76=        my $res = HTTP::Response->new(403, "Forbidden");
        =77=        $res->content("bad scheme: @{[$url->scheme]}\n");
        =78=        $res;
        =79=      } elsif (not $url->rel->netloc) {
        =80=        my $res = HTTP::Response->new(403, "Forbidden");
        =81=        $res->content("relative URL not permitted\n");
        =82=        $res;
        =83=      } else {
        =84=        &fetch_validated_request($request);
        =85=      }
        =86=    }
        =87=    
        =88=    sub fetch_validated_request { # return HTTP::Response
        =89=      my $request = shift;  # HTTP::Request
        =90=    
        =91=      ## uses global $AGENT
        =92=    
        =93=      ## warn "orig request: <<<", $request->headers_as_string, ">>>";
        =94=      $request->remove_header(qw(User-Agent From Referer Cookie));
        =95=      ## warn "anon request: <<<", $request->headers_as_string, ">>>";
        =96=      my $response = $AGENT->request($request);
        =97=      ## warn "orig response: <<<", $response->headers_as_string, ">>>";
        =98=      $response->remove_header(qw(Set-Cookie));
        =99=      ## warn "anon response: <<<", $response->headers_as_string, ">>>";
        =100=     $response;
        =101=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.