Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Web Techniques Column 26 (Jun 1998)

Last month's column could have been titled ``Where did they go?'', because I explored tracking the outbound links from my site to the interesting URLs I had provided on my pages. In this month's column, I'm looking at ``Where did they come from?''.

In particular, much of the web's content is found these days not through interesting URLs posted on other sites, but by users typing search queries into the big indexing engines like Altavista, Lycos, and Infoseek. If you're maintaining a ``referer log'', you may have noticed that the query string typed by the user sometimes shows up when that user follows a search-results link to your page. This happens because the indexer's search page is often a GET form, and the parameters of the search are therefore encoded into the URL of the search-results page.
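
For example, a search for ``perl training'' at a hypothetical engine might leave a referer like this (the hostname and the q field are invented for illustration):

        http://search.example.com/cgi-bin/query?q=perl+training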

And having noticed that, I decided to write a program that would go through my referer log and extract just the search strings. This is more than idle curiosity; it tells me exactly what people were looking for when they ended up at my pages, and what I should provide more of if I want my site to be popular. Especially if I'm selling ads or wanting to be famous.

The ``referer log'' (available via a configuration parameter in most popular web servers) is merely a record of the HTTP Referer header (yes, it's spelled that way for historical reasons), which frequently points to the URL of the page from which the request was made. The referer is not necessarily supported by all browsers, and will be misleading for a bookmarked entry. But for the majority of hits, the referer can give valuable information (as you can see by running this program over your own log).
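
For Apache, for instance, either of these httpd.conf lines produces a suitable log (the filename is merely a suggestion):

    # Apache 1.x with mod_log_referer compiled in:
    RefererLog logs/referer_log

    # or, using mod_log_config:
    CustomLog logs/referer_log "%{Referer}i -> %U"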

The program to extract the search strings from the referer log is given in [listing 1, below].

Line 1 contains the path to Perl, along with the command-line switches that enable ``taint'' mode and warnings. Taint mode doesn't make much sense here, but I turned it on in case I decide later to make this a CGI script. Warnings are useful, but they can occasionally get in the way.

Line 2 turns on the compiler restrictions useful for all programs greater than ten lines or so. This includes disabling soft references (almost always a good idea), turning off ``Perl poetry mode'', and (most importantly) requiring all non-package variables to be declared. Variables will thus need to be introduced with an appropriate my directive.

Line 3 unbuffers STDOUT, causing all output to happen at the time it is printed, not when the STDIO buffer fills up. This is handy because it lets me see the output nearly immediately for a large log file, rather than having to wait until program exit time for the automatic buffer flush.

Line 5 pulls in the URI::URL module from the LWP library. This library is the all-singing, all-dancing, everything-you-wanted library to handle nearly all web-ish stuff in Perl, and can be found in the CPAN at [insert location here]. Of course, if you're doing anything with Perl and the web, you've probably already got this installed. We need this library to pull apart the referer URL.
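
As a quick illustration of the URI::URL pieces we lean on here (the URL itself is hypothetical):

    use URI::URL;

    my $url = url "http://search.example.com/cgi-bin/query?q=perl+training";
    print $url->scheme, "\n";             # "http"
    print $url->host, "\n";               # "search.example.com"
    my %form = eval { $url->query_form }; # (q => "perl training")
                                          # the eval guards against URLs
                                          # that have no query form at all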

Line 7 defines the result hash, %count, which will ultimately hold a hash of hashes: counts of how many times each query string was used at a particular search engine. Initially it needs to be empty, so we set it to the empty list (which becomes the empty hash).

Lines 8 through 53 define the data-gathering loop. For each line in the referer log, we'll go through this loop once, with the line in $_. The data will be taken either from standard input, or from the list of files specified on the command line.

Line 9 pulls out the referer information from the line. For a standard RefererLog-style log, this'll look like:

        there -> here

And since we're only interested in there, it's simple enough to pull out all the whitespace-separated items and grab the first one, kept here in $ref. If you have a different logfile format, you'll have to adjust this line to pull out the field you need.

Line 10 turns the referer string in $ref into a URI::URL object, using the subroutine url defined in that module. If $ref is empty or not a valid URL, the object may be malformed, but that'll be caught in the next step.

Line 11 verifies that we have a valid http: URL. The scheme method on the URL object returns either a string or undef. If it's not defined, the or operator (two vertical bars) selects the empty string as an alternative, preventing the use of an undef value in the comparison, which would trigger a warning under -w. If the URL is not an http URL, we skip it.

Line 12 extracts the portion of the URL after the ? as a query form, if that's at all possible. The eval block protects this program from an exception in the query_form method, which dies if there isn't a valid form. The result of the eval initializes a new hash, %form; the keys of this hash are the query field names, and the corresponding values are the field values.

Lines 13 through 39 create a value for @search_fields, specifying, for a particular search-engine host, which form fields we guess hold the search string. This list can take several kinds of values (illustrated after the list):

  1. If the list is empty, then we ignore this particular search engine. (Either it's not a search engine, or we can't find anything useful to note as a search string.)

  2. If the list consists only of uppercase words, then all fields of the query will be dumped (this is used for the catchall entry at the end).

  3. In the common case, if the list consists of one or more lowercase words, these represent form fields of interest, probably with the search string that brought the client here.
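
In terms of the listing below, the three kinds of entries would look something like this (hypothetical hostnames):

    elsif (/\bnot-a-search\.example\.com$/) { () }        # kind 1: ignore this engine
    elsif (/\bmystery\.example\.com$/)      { "UNKNOWN" } # kind 2: dump all the fields
    elsif (/\bsearch\.example\.com$/)       { "q" }       # kind 3: likely query fields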

To construct this list, I started with a very small list, and ran it over my referer log of a few months. For every search engine that got dumped as unknown, I figured out which of the fields looked like a search string, and added them in. I also got a bit of help from Teratogen on IRC (known in ``real life'' as Anthony Nemmer of EdelSys Consulting), who had apparently tackled a similar problem before, and identified a significantly larger portion of the list from his own data.

The list is incomplete, and evolves over time, so the names here are merely a good cross-section. Also, some search engines don't use a GET method to go from the search page to the results, so their parameters won't show up in the URL. But as you can see, a good number of the popular ones (Altavista, Excite, Hotbot, Infoseek, Lycos, Search.com, and Webcrawler) do.

Line 14 extracts the hostname from the referer URL, and makes it lowercase. (We could have made all the comparisons case insensitive, but this alternative was much faster).

Lines 15 through 38 form a long if..elsif..elsif..else structure. Note that it begins with if 0, which will always be false, but permits all the remaining cases to be symmetrical. This is nice because it allows me to swap the order of the checking trivially (by exchanging lines in a text editor) or even sorting them if I wish.

The hostname is compared with each of the regular expressions in turn. Note that some of the patterns look only for a particular hostname portion, while others anchor a complete domain suffix at the end of the string. In particular, I found many different hosts with altavista in their names, and they all seemed to use the same query field, so writing that test loosely made sense.

Note that the patterns are tested in the order presented. I found some form being used at edit.my.yahoo.com that was nothing like the query form at yahoo.com (and friends), so I placed a special blocking entry ahead of the main Yahoo entry, saying ``don't bother with this one, it's not the same''. Otherwise, the ordering of this list is somewhat arbitrary; for efficiency, the most likely candidates should probably be placed first.

The multiway if statement is within a do block, meaning that the last expression evaluated becomes the return value. If you don't like the structure requiring the chain of elsif chunks, you can write the switch as a bare block, exiting early with last, like so:

    my @search_fields = "UNKNOWN";
    {
      local $_ = lc $url->host;
      (@search_fields = "q"), last if /\baltavista\b/;
      (@search_fields = qw(s search)), last if /\bnetfind\.aol\.com$/;
      ...;
      (@search_fields = "p"), last if /\byahoo\b/;
    }

But I didn't like the number of times I'd have to say @search_fields, and went with the do-block structure instead. Another alternative might be to call a subroutine, like:

    my @search_fields = &map_to_engine($url);
    sub map_to_engine {
      local $_ = lc shift->host;
      return "q" if /\baltavista\b/;
      return qw(s search) if /\bnetfind\.aol\.com$/;
      ...;
      return "p" if /\byahoo\b/;
      return "UNKNOWN";
    }

And in fact, to some, that may look cleaner than what I wrote. It's your choice, however. After all, the Perl motto is ``There's More Than One Way To Do It.''

In line 40, we check the result of that multiway test. If @search_fields is empty, that's the signal that this line is noise, and we skip it. Otherwise, line 41 translates the list into a hash for fast lookup: the map operator takes the elements of @search_fields, interposes a 1 after each element, and the resulting key/value list becomes the %wanted hash, with the original elements as its keys.

Line 42 scans the form fields from %form, keeping only those whose names match the keys of %wanted in a case-insensitive manner, accomplished by lowercasing $_ before doing the lookup. Thus @show_fields will be a list of all the form fields of interest, if any.
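
To see those two steps on made-up data:

    my %wanted = map { $_, 1 } qw(s search);
    # %wanted is now (s => 1, search => 1)
    my @show_fields = grep { $wanted{lc $_} } qw(Search lang);
    # @show_fields is now ("Search"): it matched case-insensitively,
    # while "lang" was discarded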

If @show_fields has one or more elements, we found a known search site along with an interesting field (hopefully a search string). In that case, we'll save the search string for later dumping. Lines 44 through 46 store the information into a hash-of-hashrefs, with the first level keyed by host, and the second level keyed by the particular search string used at that host. The value is a count: for the most part, just an increment from undef to 1, but when the same search string shows up more than once, we'll record the multiple hits.
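
So after a run over some invented data, %count might hold something like:

    %count = (
      "search.example.com" => { "perl training" => 3, "llama book" => 1 },
      "webcrawler.com"     => { "camel" => 1 },
    );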

On the other hand, if @show_fields is empty, we were either looking at a referer URL that had a form from an unknown site, or somehow one of the known sites didn't have the proper field. In that case, we'll dump out the entire form immediately, so that you can consider it manually to locate a search string for a future run. That's handled in lines 48 through 51, which simply dump the %form variable preceded by the search host.

Lines 55 through 63 dump the search string hash-of-hashrefs. Each of the hostnames ends up in $host in line 55. (If you don't have a relatively modern version of Perl, the for my syntax will not work. Upgrade now, because it's free and less buggy than the version you're running.)

Line 56 extracts the hashref value from the top-level hash, which is then dereferenced in line 57 to get the individual search-text items into $text. Lines 58 through 62 print the hostname, the search text, and the number of times each item was found (if more than once).
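
With the invented data above, the output would look like:

        search.example.com: llama book
        search.example.com: perl training (3 times)
        webcrawler.com: camel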

And there you have it. To use this program, adjust the ``referer field'' parsing line according to the format of your referer log, and then pass the name of the log on the command line to this program. You could even wrap this up into a nightly job, and with a little work, generate an HTML output file that creates links back to the search engines in question! (Sounds like an interesting additional project if I've got another hour or two; a rough sketch follows.) Enjoy!
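
Here's a minimal, untested sketch of that HTML idea, reusing the %count structure built above. The /search path and the q parameter are plain guesses (a real version would need a per-engine URL recipe), and the displayed text really ought to be HTML-escaped as well:

    use URI::Escape; # for uri_escape

    print "<ul>\n";
    for my $host (sort keys %count) {
      my $hostinfo = $count{$host};
      for my $text (sort keys %$hostinfo) {
        # guess at a search URL; each engine really wants its own form
        my $link = "http://$host/search?q=" . uri_escape($text);
        print qq{<li><a href="$link">$host: $text</a></li>\n};
      }
    }
    print "</ul>\n";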

Listings

        =1=     #!/usr/bin/perl -Tw
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use URI::URL;
        =6=     
        =7=     my %count = ();
        =8=     while (<>) {
        =9=       my ($ref) = split; ## may require adjustment
        =10=      my $url = url $ref;
        =11=      next unless ($url->scheme || "") eq "http";
        =12=      next unless my %form = eval { $url->query_form };
        =13=      my @search_fields = do {
        =14=        local $_ = lc $url->host;
        =15=        if (0) { () }
        =16=        elsif (/\baltavista\b/) { "q" }
        =17=        elsif (/\bnetfind\.aol\.com$/) { qw(s search) }
        =18=        elsif (/\baskjeeves\.com$/) { "ask" }
        =19=        elsif (/\bdejanews\.com$/) { () }
        =20=        elsif (/\bdigiweb\.com$/) { "string" }
        =21=        elsif (/\bdogpile\.com$/) { "q" }
        =22=        elsif (/\bexcite\.com$/) { qw(s search) }
        =23=        elsif (/\bhotbot\.com$/) { "mt" }
        =24=        elsif (/\binference\.com$/) { "query" }
        =25=        elsif (/\binfoseek\.com$/) { qw(oq qt) }
        =26=        elsif (/\blooksmart\.com$/) { "key" }
        =27=        elsif (/\blycos\b/) { "query" }
        =28=        elsif (/\bmckinley\.com$/) { "search" }
        =29=        elsif (/\bmetacrawler\b/) { "general" }
        =30=        elsif (/\bnlsearch\.com$/) { "qr" }
        =31=        elsif (/\bprodigy\.net$/) { "query" }
        =32=        elsif (/\bsearch\.com$/) { qw(oldquery query) }
        =33=        elsif (/\bsenrigan\.ascii\.co\.jp$/) { "word" }
        =34=        elsif (/\bswitchboard\.com$/) { "sp" }
        =35=        elsif (/\bwebcrawler\.com$/) { qw(search searchtext text) }
        =36=        elsif (/\bedit\.my\.yahoo\.com$/) { () } ## must come before yahoo.com
        =37=        elsif (/\byahoo\b/) { "p" }
        =38=        else { "UNKNOWN" }
        =39=      };
        =40=      next unless @search_fields;
        =41=      my %wanted = map { $_, 1 } @search_fields;
        =42=      my @show_fields = grep { $wanted{lc $_} } keys %form;
        =43=      if (@show_fields) {
        =44=        for (@show_fields) {
        =45=          $count{$url->host}{$form{$_}}++;
        =46=        }
        =47=      } else {
        =48=        print $url->host, "\n";
        =49=        for (sort keys %form) {
        =50=          print "?? $_ => $form{$_}\n";
        =51=        }
        =52=      }
        =53=    }
        =54=    
        =55=    for my $host (sort keys %count) {
        =56=      my $hostinfo = $count{$host};
        =57=      for my $text (sort keys %$hostinfo) {
        =58=        my $times = $hostinfo->{$text};
        =59=        print "$host: $text";
        =60=        print " ($times times)" if $times > 1;
        =61=        print "\n";
        =62=      }
        =63=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.