Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Download this listing!

Web Techniques Column 25 (May 1998)

If your website is like any of the 40 million pages out there, you've got links to other places on your page. They're usually some places that you found relevant, or maybe you just published your bookmarks as part of your site.

So, along I come, looking at your site, and I notice this really neat link to some interesting page. I click on it, and move on. However, because of the way the web works, you have no indication that I found that link interesting, and you may even start wondering if anyone finds the links in your carefully crafted web page useful.

Wouldn't it be nice to somehow track when someone leaves your site, and what route they took to exit? Well, it's actually pretty easy to do, provided you are willing to take a CGI hit on each followed link. Essentially, you'll need to change all outbound links to a CGI invocation.

For example, instead of providing:

        <A HREF="http://www.perltraining.com">Get trained!</A>

as a link, you could change this to:

        <A HREF="/cgi/go/http://www.perltraining.com">Get trained!</A>

Now, although it looks like a URL down toward the end, it's really just data that shows up in the PATH_INFO environment variable of the /cgi/go CGI script on the webserver.

(That may seem like a lot of work to change all your existing links to match this new requirement, but stay tuned... I have a programmatic solution.)

So, if someone follows this link, the CGI script is invoked, and we get the URL as a parameter. For this to work, the script must note the invocation, and then redirect the browser off to the real URL.

That's not tough either. In fact, the program ended up being shorter than I had expected, and is found in [listing one, below].

Line 1 of this program turns on taint checks (a good idea for CGI scripts), and both compile-time and run-time warnings (generally a good idea for all programs).

Line 2 turns on all the compiler restrictions, requiring all variables to be declared, disabling symbolic references, and removing ``bareword'' interpretation. As always, this is a good idea for all programs over a dozen lines or so.

Line 3 disables output buffering. Because this program forks later, it's important that all output happens as it is requested.

Line 5 is the only configuration parameter for this program. The $GO_LOG variable holds the full pathname to the logfile that will record the date and time of the clickthrough, as well as the source page and destination link.

Lines 7 through 15 form an eval block. This is a block of code that is being executed for a return value (to be stored in $result). However, at various places through the block, a die will cause an immediate exit, which we can check for later. This is a handy technique to evaluate a value with potential error conditions.

Line 8 grabs the PATH_INFO environment variable, and stores it into the block-local variable $res. If for some reason this environment variable isn't defined, we abort immediately via the die.

Line 9 tries to remove the initial slash from the value originally in the PATH_INFO variable. When you invoke

        /cgi/go/http://www.perltraining.com

the value of PATH_INFO is "/http://www.perltraining.com". That initial slash is annoyingly in the way, so let's just get rid of it. If no PATH_INFO was provided, then this also fails, preventing a simple invocation of /cgi/go from messing things up.

Line 10 fetches the QUERY_STRING, which is the part of the original URL after the question mark, if any. For example, if we see:

        /cgi/go/http://www.altavista.digital.com?q=Randal%20Schwartz

then we have to capture the part beginning with q=..., and this shows up in the QUERY_STRING variable, thankfully unmodified. If there's something out there, then we'll need to tack it on to the end of the URL that we'll be redirecting with eventually.

If there's a query string, line 11 detects it, and the query string gets appended (preceded by a question mark) in line 12.

The result has been computed into $res, and this gets returned from the block in line 14. That value will end up in $result from the assignment in line 7. Presuming, of course, that there has been no die operator executed along the way.

But if either of the die operators had been executed, the $@ variable is now non-empty, and line 16 detects this. If we've hit an error, we'll just have the CGI script return an HTTP 404 error, which the browser will interpret as ``I guess I'm not gonna get to go here''. This is handled in lines 17 and 18.

If we make it to line 20, we have a good solid URL to redirect this client toward, so that's handled with a simple Location header. And once we have a good redirect, it's time to note that fact. Otherwise, this was just an interesting pointless exercise.

Because we want the script to return quickly, we'll fork a background process to do the actual logging. Line 21 attempts to fork a process, storing the child pid (or 0) into $pid.

If $pid is not defined, then the fork failed. In this case, we'll make the browser wait a second or two while we write to the log file in the foreground instead. It's simple enough to do this by faking a $pid of 0 for a $pid of undefined, handled in line 22.

If we're the parent process, $pid is non-zero in line 23, and it's time to bail. If we're the kid, closing down STDOUT (in line 25) permits the web server to know that we won't be talking to the client any more.

Lines 26 through 33 write the data to the logfile. Line 26 opens the logfile in append mode.

Line 27 waits for an exclusive lock on the logfile. Only one CGI writer will be allowed to proceed past this line at any given time.

Line 28 resynchronizes the internal buffer with the actual file, and positions the file pointer at the end. This is important when flocking, because if we were blocking waiting for another script to finish, the file has now changed and we'll get data overwrite unless we seek (which forces a re-read).

Lines 29 through 32 write a single information record (as a line of text) to the log file. This line is tab-delimited, consisting of three fields: the current local date and time, the URL to which we are redirecting the client, and the referring page (if known).

Line 33 closes the logfile, releasing the lock at the same time. It's not strictly necessary to close the filehandle at this point, since we have reached the end of the program. However, if someone comes along later to maintain this code, and adds other steps later, it's a good idea to release the lock as quick as possible.

So, installing that little puppy as /cgi/go will start logging the invocations that invoke the script. And we can write various data-reduction Perl scripts on the log file, and will probably want to rotate logfiles and so on.

But that's not the end of what needs to be done. For this script to be invoked, we'll have to adjust the URL in the original HTML that we serve to the client. Sure, the text is easily adjusted in your favorite text editor, but there's a simpler way, since we have Perl at our disposal. Let's just have a Perl program that wanders through the HTML documents and edits them directly! Such a program has already been written, and is presented in [listing two, below].

Lines 1 through 3 of this program have been described above. Line 5 pulls in the File::Find library, which I'm using to recursively locate all the HTML files in my web server's directory.

If a command line argument is provided (or even multiple arguments), they're treated as the HTML files to edit. However, if no arguments are provided, lines 7 through 11 locate all the HTML files in my website's directories.

Line 8 invokes the find routine (defined by FIle::Find), passing it an anonymous subroutine reference as the action subroutine, and starting from the top-level directory of my website files.

Line 12 gets rid of the ending delimiter on a read. This means that each read will read the entire file in one fell swoop.

Line 13 turns on ``in-place editing mode''. When files are read from ARGV (or the empty filehandle, as seen in line 14), they are also opened for updating, such that any print operator to the default filehandle replaces the previous contents. The suffix chosen here is ~, which makes the backup files for the replaced files look like they were edited by GNU Emacs. You can chose a more appropriate string if necessary.

Lines 14 through 26 walk through each file in @ARGV. The entire file's contents ends up in the $_ variable.

Lines 15 through 24 are actually a single subsitute operator. However, notice that line 24 includes the suffix letters ``egi''. This means that case is insensitive on the match, and the match is global across all of $_, but most importantly, that the replacement side of the substitute is Perl code, not a simple variable-interpolated string.

Each time the pattern in line 15 is seen throughout $_, the code block in lines 16 through 23 are executed, and the last expression evaluated in that block becomes the replacement string. Line 16 creates three local variables, assigning two of them with the values of $1 and $2 (reflecting the current match).

Line 17 notes if this URL is worth replacing, by seeing if it begins with http:, and ensuring that it isn't a reference to the clickthru tracker already. If it's a valid-to-replace link, then we note that by printing a message to STDOUT, and create the new replacement string from the old URL prefixed by /cgi/go/ (in lines 18 and 19).

if the URL is not worth replacing, it's important that the replacement string be the same as the original string, handled in line 21. Regardless of whether or not a replacement has been selected, line 23 returns the string that must replace the original string.

Line 25 sends the updated string to the new file, but only if we're in inplace edit mode. (While I was testing, I commented line 13 out, and thus I was treated to merely the messages from line 19 telling me what it would have done had I let it.)

So, there you have it. A way of tracking ``clickthroughs'', or where people are going after they get bored with your site. Actually, I just installed this on my website, and it'll be interesting to see where people actually go. Until next time, enjoy!

Listings

        =0=     ###### listing 1 ######
        =1=     #!/home/merlyn/bin/perl -Tw
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     my $GO_LOG = "/home/merlyn/Web/golog";
        =6=     
        =7=     my $result = eval {
        =8=       die unless defined (my $res = $ENV{PATH_INFO});
        =9=       die unless $res =~ s/^\///;
        =10=      my $query = $ENV{QUERY_STRING};
        =11=      if (defined $query and length $query) {
        =12=        $res .= "?$query";
        =13=      }
        =14=      $res;
        =15=    };
        =16=    if ($@) {
        =17=      print "Status: 404 Not Found\n\n";
        =18=      exit 0;
        =19=    }
        =20=    print "Location: $result\n\n";
        =21=    my $pid = fork;
        =22=    $pid = 0 unless defined $pid;   # be the kid if fork failed
        =23=    exit 0 if $pid;
        =24=    ## child...
        =25=    close(STDOUT);
        =26=    open GOLOG, ">>$GO_LOG" or die "Cannot open $GO_LOG: $!";
        =27=    flock(GOLOG,2);                 # wait for exclusive
        =28=    seek GOLOG, 0, 2;               # seek to end, refresh buffers
        =29=    print GOLOG join("\t",
        =30=                     scalar localtime,
        =31=                     $result,
        =32=                     ($ENV{HTTP_REFERER} || "[unknown]")), "\n";
        =33=    close GOLOG;
        =0=     ###### listing 2 ######
        =1=     #!/home/merlyn/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use File::Find;
        =6=     
        =7=     unless (@ARGV) {
        =8=       find sub {
        =9=         push @ARGV, $File::Find::name if /\.html/;
        =10=      }, "/home/merlyn/Html/";
        =11=    }
        =12=    undef $/;
        =13=    $^I = "~";
        =14=    while (<>) {
        =15=      s{(href="(.*?)")}{
        =16=        my ($old,$url,$new) = ($1,$2);
        =17=        if ($url =~ /^http:(?!.*cgi\/go)/) {
        =18=          $new = qq{href="/cgi/go/$url"};
        =19=          print STDOUT "$ARGV: changing $old to $new\n";
        =20=        } else {
        =21=          $new = $old;
        =22=        }
        =23=        $new;
        =24=      }egi;
        =25=      print if defined $^I;
        =26=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.

Worldwide training and consulting by Perl experts

Copyright Notice

Web Techniques Column 25 (May 1998)

Listings