Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Web Techniques Column 58 (Feb 2001)

[suggested title: Visualizing your traffic flow]

The navigation elements of your site -- how a user gets from one place to another -- are an integral part of the user experience. Or at least, they should be. Jakob Nielsen (www.useit.com) argues that the screen space taken up by navigation should be minimal, but effective.

How can you tell if your site navigation is effective? And as you start tweaking it, how can you tell if the changes you're making are useful to people? And how can you tell what people want to get to most often from a page or a given cluster of pages? And are people being drawn to your most important pages? When hits are coming from search engines or external links, are they ending up on the pages you thought they were?

Now, I'm not a usability expert by any means. I'm a Perl hacker. But the first step in figuring out the answers to some of these questions is gathering data. Much of that data has to come from a referer log, which records, for each hit, the URL the browser reports as the page it came from.

Apache doesn't include referer logging right out of the box, but you can enable it with one or two directives, and then sit back and wait a month while the statistics pour in. In my case, I left Apache's default logging alone, and installed logging of lots of interesting things directly to a DBI-accessed database using mod_perl (see my April 2000 column).

The other thing that logs don't capture by default is outbound links. When a browser follows a link from my site to a place outside my website, I normally get no indication. However, thanks to the code I wrote for this column in May 1998, I've rewritten every outbound link to be prefixed with /cgi/go/, as in:

  More info at <a href="/cgi/go/http://www.perl.org">Perl Mongers site</a>!

which causes a quick hit to my server (carefully logged again) before they get redirected (by a simple mod_perl handler) to the site of interest.
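The redirect side of that trick is tiny. Here's a minimal sketch (not the actual May 1998 handler) of the core logic: peel the real URL off the path after /cgi/go/, log it however you like, then emit a Location header. The `outbound_target` helper name is my own invention.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not the real handler): given the PATH_INFO that a
# /cgi/go/ hit produces, recover the outbound URL that follows the prefix.
sub outbound_target {
  my $path_info = shift;        # e.g. "/http://www.perl.org"
  $path_info =~ s{^/}{};        # drop the leading slash left by PATH_INFO
  return $path_info;
}

# A CGI-flavored redirect would log the hit, then print the header:
my $target = outbound_target("/http://www.perl.org");
print "Location: $target\r\n\r\n";
```

Under mod_perl you'd do the equivalent by setting the Location header on the request object and returning a REDIRECT status, rather than printing raw headers.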

With this amount of info in hand, it should be no problem to recognize trends. Just stare at all the referer links, look for all the /cgi/go/ links to see outbound traffic, and wade through thousands of hits taking careful notes. Ugh. No. Isn't this what computers are for? (Not to completely rule that out, because you might notice some interesting patterns. Or maybe you're very bored, or very meticulous.)

So, the next step I tried was some simple data analysis: counting the hits from each source address to each destination address, and sorting by frequency. This gave me a bit more insight, but I quickly realized that I wanted ``categories'' of hits. For example, how many incoming hits from other sites were directed at a specific archived column, as opposed to the top-level table of contents for the columns?
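In case you want to try that same first pass, the counting-and-sorting step is just a hash keyed on the pair. A toy sketch, with made-up hit data:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Minimal sketch of the counting step: tally (referer, url) pairs and
# sort them by frequency.  The @hits data here is invented for illustration.
my @hits = (
  [ "http://perl.com", "/pt"        ],
  [ "/pt",             "/wt-column" ],
  [ "http://perl.com", "/pt"        ],
);

my %count;
$count{"$_->[0] $_->[1]"}++ for @hits;

# busiest pairs first
my @by_freq = sort { $count{$b} <=> $count{$a} } keys %count;
print "$_ => $count{$_}\n" for @by_freq;
```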

Even then, while I was dealing with fewer numbers, I was still looking at about 15 categorized site areas, and counts for nearly every pair of them. I needed to rely on some parallel mental processing. And that's where visualization comes in handy.

Someone in the newsgroup comp.lang.perl.misc recently mentioned Graphviz (at http://www.research.att.com/sw/tools/graphviz/) for a task similar to what I was attempting. I grabbed it, started compiling it, figured out it needed a lot more things to be installed first (drat!), grabbed those, compiled those, and after an hour or so, could start generating some really nifty little graphs with minimal input (although I never did get PNG output to work).

Although there's an interface to Graphviz in the CPAN, it doesn't allow the precise specification of attributes that I needed for my chart. So I have my program write raw input directly to dot, which was pretty simple to do.

A couple of warnings before I go on.

One, I spent a lot of time tweaking the output. I think I must've scanned the last week's data from my website some 200 times before I got something close to friendly. And it wasn't the layout that needed tweaking (Graphviz does an amazing job with layout right from the start); it was how best to represent the rate of traffic between nodes, and which categories of nodes I needed for the best info about my particular site layout.

Two, I'm not a graphics dude. I move ones and zeroes around, and write and teach about my experiences of that. I can occasionally make interesting pictures, but I have like next-to-nothing in my brain about good graphic sense. For example, the left side of my brain wanted red to be hot (lots of traffic), and blue to be cool (very little traffic). When I showed this graph to some others, they said ``no, you should vary the brightness of the lines as well''. Well, I tried that, and it didn't say as much to me as the colors did. So, your mileage may vary, big time. In fact, if you take this program and adapt it to your site and get better results, I'd like to hear from you so I can get better educated.

And the result of a typical run is in [figure one... editors, please get the transparent GIF from http://www.stonehenge.com/WT58-1.gif, and try it on different backgrounds]. Each labeled oval area represents a page or category of pages (more on that in a moment). The arrows represent traffic from that area to another area, with red being the hot links and blue being the not-so-hot links. As I stared at this, I was surprised to see the significant traffic from my Perl Training information pages to my column archives, as well as the hits coming in from perl.com to my training pages. Knowing this, I can make it easier for people to navigate in this direction. Besides that, it's a pretty picture.

The code to generate this picture is in [listing one, below].

Line 1 turns on warnings, and line 2 disables output buffering. Standard stuff for most of my programs.

Line 4 pulls in the DBI module (found in the CPAN) so that I can access my webserver logs. Line 5 brings in URI (from LWP in the CPAN) so that I can make a given URL canonical and pull out its parts.

Line 6 pulls in Memoize (in the CPAN) to cache the results of calling string_to_location automatically, so that we don't have to keep running the algorithm repeatedly for a given URL. (The program works fine without this step, but a lot slower.)
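If you haven't used Memoize before, a toy demonstration shows the payoff; the $calls counter proves the underlying routine runs only once per distinct argument (and calling context), no matter how often it's invoked:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Memoize;

# Demonstration only: a counter reveals how often the real body runs.
my $calls = 0;
sub slow_lookup {
  my $url = shift;
  $calls++;
  return lc $url;           # stand-in for the real categorizing work
}
memoize('slow_lookup');     # wrap the sub with a transparent cache

slow_lookup("http://www.PERL.org") for 1 .. 1000;
print "calls: $calls\n";    # the body ran just once
```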

Lines 10 through 14 deceptively make up the ``configuration'' section. I say ``deceptively'' because much of the rest of the program is very site specific, but I started out wanting it to be generic, so you can see the archaeology there. At least it's a place for constants worth tweaking. I've got the DBI info here, the output location for the resulting illegally-created GIF (don't tell Unisys), and the number of days to analyze.

Lines 18 to 49 pull the referer data out of the database, making it look harder than it needs to, perhaps. The real work is down in lines 45 and 46, where I take each source or destination URL and turn them into a categorized location (all individual columns become one common string, for example), and then count the pairings of sources and destinations. Lines 23 to 40 define the SQL to pull the data out of my database (as shown in my earlier column). Some of this was incrementally tweaked as I discovered things I didn't care to count. And the rest is straightforward DBI, so I won't describe that here.

Lines 52 and 53 perform a necessary, but odd-looking, sequence of steps. We want the output of this program to pass through the dot program (part of Graphviz), and that output to end up in the GIF location. So, first we open our own STDOUT to a temporary location that can be renamed to the GIF location, and then we open our own STDOUT again as a pipe into the dot program. This makes the output of dot head into the temporary GIF file, and our output head into dot. Follow that? Later, in lines 89 and 90, we close STDOUT (waiting for dot to exit), and rename the temporary file to the real name. The purpose of the temporary file is to ensure that any independent fetches of the GIF (like from web accesses) will see a consistent, finished GIF, and not some empty file or intermediate product.
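Here's that two-open trick in isolation, piped through cat instead of dot so it runs anywhere; the file names are made up for the demo. The child process inherits our first STDOUT (the temporary file) as its standard output before the pipe replaces ours:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $out = "demo.out";       # hypothetical final name
my $tmp = "$out~";          # temporary, renamed into place when done

open STDOUT, ">", $tmp      or die "Cannot create $tmp: $!";
# cat inherits the tmp file as ITS stdout; our stdout becomes the pipe
open STDOUT, "|-", "cat"    or die "Cannot fork: $!";
print "hello through the pipe\n";
close STDOUT                or die "Something wrong with waiting: $!";
rename $tmp, $out           or die "Cannot rename: $!";
```

Swap "cat" for "dot -Tgif" and $out for the GIF path, and you have lines 52-53 and 89-90 of the listing.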

Lines 55 and 56 compute the greatest number of hits for a given source/destination pair. We need this to scale the output properly.

Lines 58 to 72 begin the input to dot. The two commented-out lines (with a leading double-slash) were parts of some of the various experiments I was having with the output format. In fact, the edge definition is totally useless now that I'm no longer adding labels to the edges. I originally had the hit count as a label so I could see if my ``red means hot'' algorithm was working to my satisfaction.

Lines 74 to 83 generate the dot code for each arrow in the output, representing traffic from one categorized location to another. Note that I don't have to sort the data, because the layout is determined entirely by the relative weights of the edges generated. For each arrow, I come up with a ratio of hits on that arrow compared to the maximum hits on any arrow, and skip over any arrow carrying less than 1% of that maximum (the picture was too cluttered without that).

And then I came up with this idea of ``heat'', and tweaked the power in line 78 to change the ``gamma'' of the heat. The problem is that I wanted an exponentially increasing number of hits to be represented by roughly equal steps of color change. Since $ratio goes from 0 to 1, I wanted another 0 to 1 value that was logarithmically proportioned. I finally settled on 0.5 for the power factor for my particular site, which seems to give a nice meaningful range of reds to purples to blues. Again, your data may require tweaking here. Experiment!
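Boiled down to a function (my naming, not the listing's), the mapping from traffic ratio to dot hue looks like this:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of the ratio-to-color mapping: a square-root "gamma" boosts the
# low end, then the heat is scaled into a hue between 0.70 (blue) and
# 1.00 (red).  The 0.5 power is the value the column settled on.
sub heat_to_hue {
  my $ratio = shift;              # 0 .. 1: hits relative to the busiest edge
  my $heat  = $ratio ** 0.5;      # gamma-style curve; tweak for your data
  return sprintf "%.2f", $heat * 0.30 + 0.70;
}

print heat_to_hue($_), "\n" for 0.01, 0.25, 1;
```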

Finally, I needed to tell dot two things about each edge. The weight factor defines a relative importance, which dot honors by trying to make the heaviest edges the shortest and straightest in the output. Again, much tweaking here. And the color is a 3-number tuple, representing hue, saturation, and brightness. By varying the hue from 0.70 to 1.00, I get blue through red as I wanted. Again, lots of experimentation was needed here. But that all winds up in a single line to dot for each arrow, printed in line 82, looking something like:

  "http://something" -> "/merlyn" [weight=89.57, color="0.97,0.95,1.00"];

The final wonder of this program is the categorization subroutine, from line 94 to the end of the program. First, I make each URL canonical to reduce them all to a common form. Then, if it's an HTTP URL within my site, I see if it's an outbound link (starting with /cgi/go/), and if so, restart the process with the URL that follows the prefix.

Otherwise, it's down through a series of regular-expression matches, returning the right entry for each match. The ordering is very important: specific exceptions must always be tried before the more general items.
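A trimmed-down illustration of why the ordering matters: with just two of the listing's patterns, swapping them would send every WebTechniques hit, column or not, into the general /wt bucket.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Two patterns from the real categorizer, plus a catch-all.  The /col\d\d
# test MUST come first, because the general /wt pattern also matches it.
sub categorize {
  local $_ = shift;
  return "/wt-column" if m{^/merlyn/WebTechniques/col\d\d};
  return "/wt"        if m{^/merlyn/WebTechniques};
  return "/(other)";
}

print categorize("/merlyn/WebTechniques/col58.html"), "\n";
print categorize("/merlyn/WebTechniques/"), "\n";
```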

Lines 127 to 132 handle links to and from other sites. I break out three special sites because of the observed traffic (http://www.geek-girl.com is the number-one inbound referer, for example), but then drop everything else into the http://something category.

And there are a few unknown and other entries scattered through this logic, because I was mostly too lazy to figure out what those last few percent really were. Close enough.

And there you have it: a program that actually works for my site to generate traffic pictures based on referer information. It takes about 15 seconds for the SQL query on a week's worth of data, and about two seconds to draw the GIF. I run it from a cron job automatically every four hours or so to keep the picture piping hot.

One idea that I was wrestling with as I hit the deadline for this month's column was using dot's ability to create imagemap entries for the nodes. With that, I could easily stare at this map, then click on each item of interest as needed (with proxies like Google for items like the generic offsite Internet). Or, even without traffic data, create a nice little clickable site table-of-contents, changing dynamically as content becomes available. But, I've run out of time and room, so that'll go back into the todo pile. So, until next time, enjoy!

Listings

        =1=     #!/usr/bin/perl -w
        =2=     $|++;
        =3=     
        =4=     use DBI;
        =5=     use URI;
        =6=     use Memoize; memoize('string_to_location');
        =7=     
        =8=     ## CONFIG ##
        =9=     
        =10=    my $DSN = 'dbi:mysql:httpd_logs';
        =11=    my $DB_AUTH = 'username:passwd';
        =12=    my $OUTPUT = "/home/merlyn/Html/sitemap.gif";
        =13=    my $OUTPUT_TMP = "$OUTPUT~";
        =14=    my $DAY = 7;
        =15=    
        =16=    ## END CONFIG ##
        =17=    
        =18=    ## database phase
        =19=    my $dbh = DBI->connect($DSN, (split ':', $DB_AUTH), { RaiseError => 1 });
        =20=    $dbh->do("SET OPTION SQL_BIG_TABLES = 1");
        =21=    
        =22=    my $sth = $dbh->prepare(qq(
        =23=    select Referer, Url
        =24=    from requests
        =25=    where When > date_sub(now(), interval $DAY day)
        =26=    and (
        =27=      Url not like '/%/%'
        =28=      or Url like '/perltraining/%'
        =29=      or Url like '/merlyn/%'
        =30=      or Url like '/cgi/%'
        =31=      or Url like '/perl/%'
        =32=      or Url like '/icons/%'
        =33=      or Url like '/books/%'
        =34=    )
        =35=    and Url not like '%.jpg'
        =36=    and Url not like '%.gif'
        =37=    and Url not like '/perl/bigword%'
        =38=    and Host not like '%.stonehenge.%'
        =39=    and Vhost = 'web.stonehenge.com'
        =40=    and Referer is not null
        =41=    ));
        =42=    $sth->execute();
        =43=    my %count;
        =44=    while (my ($referer, $url) = $sth->fetchrow_array) {
        =45=      $_ = string_to_location($_) for $referer, $url;
        =46=      ++$count{"$referer $url"};
        =47=    }
        =48=    $dbh->disconnect();
        =49=    ## end database phase
        =50=    
        =51=    ## set up output, yes must do these in this order
        =52=    open STDOUT, ">$OUTPUT_TMP" or die "Cannot create $OUTPUT_TMP: $!";
        =53=    open STDOUT, "|/usr/local/bin/dot -Tgif" or die "Cannot fork: $!";
        =54=    
        =55=    my $max = 0;
        =56=    $max < $_ and $max = $_ for values %count;
        =57=    
        =58=    print <<'END';
        =59=    digraph d {
        =60=      ranksep = 0.5; nodesep = 0.1;
        =61=      node [
        =62=        style=invis, width=0.1, height=0.5,
        =63=        fontname="helvetica", fontsize=12,
        =64=      ];
        =65=      edge [
        =66=        // arrowsize=0.5,
        =67=        fontname="helvetica",fontsize=10,
        =68=      ];
        =69=      // mention these so they usually end up near the top
        =70=    "http://something";
        =71=      // "/merlyn";
        =72=    END
        =73=    
        =74=    for (keys %count) {
        =75=      my $count = $count{$_};
        =76=      my $ratio = $count / $max;
        =77=      next if $ratio < 0.01;
        =78=      my $heat = $ratio ** 0.5;
        =79=      my $weight = sprintf "%.2f", $heat * 99 + 1;
        =80=      my $color = sprintf q{"%.2f,%.2f,%.2f"}, $heat*0.30+0.70, 0.95, 1;
        =81=      my ($src, $dst) = split;
        =82=      print qq{  "$src" -> "$dst" [weight=$weight, color=$color];\n};
        =83=    }
        =84=    
        =85=    print <<'END';
        =86=    }
        =87=    END
        =88=    
        =89=    close STDOUT or die "Something wrong with waiting: $!";
        =90=    rename $OUTPUT_TMP, $OUTPUT or die "Cannot rename: $!";
        =91=    
        =92=    ## end of program, start of subroutines
        =93=    
        =94=    sub string_to_location {
        =95=      my $uri = URI->new_abs(shift, "http://www.stonehenge.com/")->canonical;
        =96=      {
        =97=        if ($uri->scheme eq 'http') {
        =98=          return "unknown" unless defined $uri->host;
        =99=          if ($uri->host =~ /^(w3|www|web)\.stonehenge\.com$/i) {
        =100=           if ($uri->path_query =~ /^\/cgi\/go\/(.*)/s) {
        =101=             ## outbound link
        =102=             $uri = URI->new_abs("$1", "http://www.stonehenge.com"); redo;
        =103=           }
        =104=           if ($uri->path =~ /^(.*\/)index\.html/s) {
        =105=             $uri->path("$1");
        =106=           }
        =107=           for ($uri->path) {
        =108=             return "/wt-column" if m{^/merlyn/WebTechniques/col\d\d};
        =109=             return "/wt" if m{^/merlyn/WebTechniques};
        =110=             return "/ur-column" if m{^/merlyn/UnixReview/col\d\d};
        =111=             return "/ur" if m{^/merlyn/UnixReview};
        =112=             return "/lm-column" if m{^/merlyn/LinuxMag/col\d\d};
        =113=             return "/lm" if m{^/merlyn/LinuxMag};
        =114=             return "/pt-page" if m{^/perltraining/.*html};
        =115=             return "/pt" if m{^/perltraining};
        =116=             return "/pictures" if m{^/merlyn/Pictures/};
        =117=             return "/merlyn-other" if m{^/merlyn/.+};
        =118=             return "/merlyn" if m{^/merlyn/};
        =119=             return "/books" if m{^/books/};
        =120=             return "/cgi/amazon" if m{^/cgi/amazon};
        =121=             return "/cgi/wtsearch" if m{^/cgi/wtsearch};
        =122=             return "/cgi" if m{^/(?:cgi|perl)/};
        =123=             return "/" if $_ eq "/";
        =124=             return "/(other)";
        =125=           }
        =126=         }
        =127=         for ($uri->host) {
        =128=           return "http://geek-girl" if /geek-girl/;
        =129=           return "http://perl.com" if /perl\.com/;
        =130=           return "http://perl.org" if /(pm|perl)\.org/;
        =131=         }
        =132=         return "http://something";
        =133=       }
        =134=       return "unknown";
        =135=     }
        =136=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.