Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Web Techniques Column 37 (May 1999)

One of the wonderful things about Dejanews (at www.dejanews.com) is that it has a long memory, allowing me to think about how things have changed over the past few years by looking at what people have been saying.

The other day, I got to thinking, ``has the number of job postings for Perl gone up, down, or stayed relatively constant compared to the overall posting rate?'' So I started doing a few queries by hand for Perl and other languages using various date ranges over the newsgroup misc.jobs.offered, which is where all job postings are supposed to go. After doing about a dozen of these, I got bored, knowing that I was typing nearly the same thing over and over again. Then it dawned on me -- write a program!

So, I reverse-engineered the interface (see below), and started dumping numbers. After staring at the numbers for a while, I thought about importing them into a spreadsheet to get a graph, then remembered that I hadn't yet played with GIFgraph (from the CPAN at www.cpan.org and dozens of mirrors around the world). So, I added the graphing right into the program.

The resulting picture is in [Figure 1]. It quickly shows a comparison of some popular languages, and their hit counts going back four years. Note that Perl jobs are relatively steady at about 200 hits per day, and that Java surpassed Cobol just a few months back.

And the program that produced this output is in [listing one, below].

Lines 1 through 3 begin nearly every lengthy Perl program I write, enabling warnings, providing compiling restrictions, and disabling output buffering.

Lines 5 and 6 bring in the two modules that I'll be using. GIFgraph creates simple but useful graphs, using the GD module for the actual drawing. You don't need to use GD explicitly in your program; GIFgraph does that directly. Date::Calc provides some nice date calculations, and doesn't import anything by default, so I've listed explicitly the functions I'm using in this program.

All three of these modules are found in the CPAN. So fetch them if you must. As I've said many times before, use the CPAN stuff when you can, because there's no point in reinventing perfectly decent wheels.

Lines 8 through 18 provide the configuration constants -- things I'd be likely to tweak between runs of the program.

Line 10 gives a list of all the queries I'll be handing to Dejanews. These will be plotted separately on the graph. Note that for Perl I've used the single query perl|perl5, which looks for either of the words. I've found that a lot of recruiters (and pointy-haired bosses) think that perl5 is a language. (Hint: it's not.) So, to keep the graph fair, I've given both common ``spellings'' of Perl.

Lines 12 and 13 define two arrays for the lower and upper bounds of our month-by-month search for jobs. Each array is in Year-Month-Day order, because many of the Date::Calc routines generate and expect data in such an order.

Through experimentation, I found that Dejanews seems pretty well populated all the way back to March of 1995. I also found that queries for anything in the last 45 days don't give an exact hit count, but rather an approximate hit count (which seems to be way off). So for this program, I'll simply go from the start of good data up to no later than a month ago. The calculation for ``a month ago'' results from taking Today (found in Date::Calc), and subtracting a month (using Add_Delta_YMD, also from Date::Calc). This calculation works right even at month's end and near a year boundary, things I would have to think through carefully if I were doing this myself.

Lines 15 and 16 define the output GIF location and the location of the data memory that caches the results of prior runs. If you're gonna play with this program, you'll most certainly want to change these values. The output GIF is located in my web-server's doc tree, so I don't even need to move it into place.

Line 20 defines a global data structure @plotdata. This data is in the format to be handed to GIFgraph in line 53, and consists of a list of listrefs. The first listref points to a list of labels, while the remaining listrefs point to lists of data to be plotted. See the docs for GIFgraph for further clarification.

Line 22 sets the cycles in motion, by creating a @from array (again in year-month-day order). This array tracks the lower bound of each particular month that we're querying.

Lines 24 and 25 compute the end of the month that begins in @from, putting the result in @to. The $days parameter is also used later in this loop.

Line 26 gives the exit condition of this loop. If the end of the month is later than the upper bounds we computed originally, we bail (and generate the graph below). According to the Date::Calc doc, turning both dates into a compressed version means that we can compare them using a simple comparison. So, that's what's happening here. For speed, I might move the Compress of @UPPER outside the loop, since that value doesn't change, but that's clearly not the bottleneck in this program.
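To see why a packed date allows a simple numeric comparison, here's a minimal sketch. The helper pack_date is hypothetical and uses a plain decimal packing, which is not necessarily how Compress encodes its result; the only point is that a single year-major number sorts in calendar order.

```perl
use strict;
use warnings;

# Hypothetical helper: pack (year, month, day) into one integer, year first.
# This is an illustration only, not the encoding that Compress uses.
sub pack_date {
  my ($y, $m, $d) = @_;
  return $y * 10_000 + $m * 100 + $d;
}

my @to    = (1999, 3, 31);   # end of the month being queried
my @upper = (1999, 4, 12);   # upper bound for the whole run

# One numeric comparison now orders the two dates.
my $past_upper = pack_date(@to) >= pack_date(@upper) ? "yes" : "no";
print "past upper bound: $past_upper\n";
```

Comparing the three fields one at a time would work too, but packing them first keeps the exit test down to a single expression.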

Lines 27 and 28 turn both of the date arrays into single strings that Dejanews can understand as a date. The to_dejadate subroutine is defined later.

Line 29 creates the X-axis labels. In this case, we want the year and month as a nicely formatted string. Each new label is pushed onto the end of the array referenced by $plotdata[0]. Thus, $plotdata[0][0] is the first month label, $plotdata[0][1] is the second month label, $plotdata[0][2] is the third month label, and so on.

Lines 30 through 38 are the meat of the routine. For each of the languages (the query strings provided in @LANGS), we'll ask the count_hits routine to tell us how many messages contain that language in the given date range. The value returned can be undef if something went wrong (like Dejanews was too busy, or gave us data that wasn't in an expected format). Luckily, GIFgraph understands an undef value in the value list to mean ``data unavailable'', and does the right thing.

Note that the hit count is scaled by the number of days in the month, so that longer months don't get slightly higher peaks unfairly, everything else being equal. It's also easier for me to think of 200 hits a day than to grok 6000 hits a month, but maybe I'm just odd.

The data is stored in successive arrays referenced by elements of @plotdata. For example, the hits for the first language for the first month are kept in $plotdata[1][0], while the hits for the second language for the same month are in $plotdata[2][0]. These data values parallel the X-labels we pushed in earlier.
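Here's a small sketch of how that list of listrefs fills up, using only core Perl. The month totals and the three-language subset are made up for illustration; the real program gets its counts from count_hits.

```perl
use strict;
use warnings;

my @plotdata;
my @langs = qw(cobol java perl);   # illustrative subset

# Made-up data: label, days in the month, and one monthly total per language.
my @months = (
  [ "1995 03", 31, [ 310, 62, 155 ] ],
  [ "1995 04", 30, [ 300, 90, 150 ] ],
);

for my $month (@months) {
  my ($label, $days, $counts) = @$month;
  push @{ $plotdata[0] }, $label;                # X-axis label
  my $id = 0;
  for my $hits (@$counts) {
    push @{ $plotdata[++$id] }, $hits / $days;   # scaled to hits per day
  }
}

print "$langs[1] in $plotdata[0][0]: $plotdata[2][0] per day\n";
```

Row 0 holds the labels, and row N holds the series for the Nth language, so every row ends up with one entry per month.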

Line 39 advances the lower bound by adding a single month to the date. Line 40 sends us back to the beginning of the loop (in this case, a naked block) in line 24, to start all over again until the condition in line 26 is satisfied.
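If the naked-block-with-redo idiom is unfamiliar, this stripped-down sketch shows the same control flow with a simple counter in place of the dates.

```perl
use strict;
use warnings;

my $n = 1;
my @seen;
{
  last if $n > 5;   # exit condition, checked at the top of each pass
  push @seen, $n;
  $n++;             # advance, like adding a month to @from
  redo;             # jump back to the top of the block
}
print "@seen\n";
```

A naked block is a loop that runs once, so last leaves it and redo restarts it. That puts the exit test wherever it's convenient, in this case after some setup at the top.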

Once we're out of the loop, it's time to plot the data in @plotdata. We start by creating a graph object in line 43. I'm fixing the width and height of the graph to 640 and 480, which gives just about the right amount of room for this particular dataset. The default is 400 by 300, which came out too crowded.

Lines 44 through 51 set some of the graphing parameters. While the defaults are pretty good, I wanted to add some tweaking here, mostly to show how it's done. Line 45 sets the year-month labels to be vertical instead of horizontal. Line 46 gives the chart a title. Line 47 changes the data plot colors to a nice two-brightness rainbow. See the documentation for GIFgraph::colour for details on color names.

Line 52 provides the dataplot legend, giving a name to each of the lines being drawn. Without this, we wouldn't know which line was Java and which line was Cobol. It's important that this ordering be the same as the data pushed earlier into @plotdata.

Line 53 is the main operation. The graph object will generate a GIF file into the location specified by GIFOUT. And we're done!

So, now for the subroutines. Lines 57 to 60 take the three parameters passed in, and convert them to a month name, day number, and year number. The month name comes from Yet Another Date::Calc routine.
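For readers without Date::Calc handy, here's a rough core-Perl sketch of the same conversion. The name to_datestring and the @MONTHS table are my own stand-ins for illustration, assuming English month names are what's wanted.

```perl
use strict;
use warnings;

# Hypothetical lookup table standing in for Month_to_Text.
my @MONTHS = qw(
  January February March April May June
  July August September October November December
);

# Turn (year, month, day) into "Month day year".
sub to_datestring {
  my ($y, $m, $d) = @_;
  return join " ", $MONTHS[$m - 1], $d, $y;
}

print to_datestring(1995, 3, 1), "\n";
```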

Lines 62 through 81 give this program a memory from one run to the next. I found that each query takes Dejanews about 2 to 10 seconds to respond, and that would have been prohibitively long to test, at nearly 100 separate queries per data year. But with a cache, only the new information needs to be fetched, and I can feel confident in running this program once a day or once a week, getting all the info directly from the cache in most cases.

The cache is basically a tied hash, using the dbmopen call in line 66 to provide the connection. Line 69 constructs the key for the hash by joining the query string, start date, and end date, and constructing a single key string delimited by $SEPARATOR (here, a control A).

Line 72 determines if we already have a good result for this particular combination of query, start, and end date. If so, we'll extract the first element of splitting that result by the delimiter. (The same delimiter is used in both the key and value, although that wasn't strictly necessary.) The value also has a second element of the timestamp when I created the data, allowing me to write little maintenance scripts to preen out unused data (not illustrated here, because they were all pretty ugly).

If the data can't be found in the cache, it's time to ask Dejanews for a real query, in line 74. If that comes back undef, we pass that along to the caller, but otherwise we'll put the data into the cache so that a future run of this program won't have to ask Dejanews again.
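The caching logic can be tried out on its own with a minimal sketch like this one. An ordinary hash stands in for the dbmopen-tied hash, and slow_lookup is a hypothetical stand-in for the real network query.

```perl
use strict;
use warnings;

my $SEPARATOR = "\001";
my %cache;          # stand-in for the dbmopen-tied hash
my $fetches = 0;

# Hypothetical stand-in for the expensive query.
sub slow_lookup { $fetches++; return 42 }

sub cached_lookup {
  my $tag = join $SEPARATOR, @_;
  if (my $entry = $cache{$tag}) {
    my ($count) = split /$SEPARATOR/, $entry;   # value is count, then timestamp
    return $count;
  }
  my $count = slow_lookup(@_);
  return undef unless defined $count;           # failures are not cached
  $cache{$tag} = join $SEPARATOR, $count, time;
  return $count;
}

my @args = ("perl|perl5", "March 1 1995", "March 31 1995");
cached_lookup(@args) for 1 .. 3;
print "fetches: $fetches\n";
```

Because the stored value always carries the separator and a timestamp, even a cached count of 0 is a true string, so a plain truth test is enough to detect a cache hit.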

Lines 83 to 118 handle the Dejanews query. I got the names of the fields by a little reverse engineering. I went to the Dejanews search page, did a ``view source'' in my browser, stared at the form for a while, created a query that populated all the fields, and then slowly reduced it one field at a time until I determined the simplest possible set. Now, if Dejanews changes their form layout tomorrow, my program breaks, but them's the, uh, breaks.

Lines 90 to 96 construct a user agent using parts of the LWP module, also found in the CPAN. Note that I'm initializing the user agent, and bringing in the LWP module parts only when needed. This keeps me from doing expensive initialization when a particular run can be satisfied entirely from the cache.
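The load-only-when-needed pattern looks like this in isolation. This is a sketch under assumptions: Data::Dumper stands in for the heavier LWP pieces so that it runs with core Perl alone, and get_helper and init_count are hypothetical names.

```perl
use strict;
use warnings;

{
  my $helper;      # private to this block, persists between calls
  my $inits = 0;

  sub get_helper {
    unless ($helper) {
      require Data::Dumper;              # loaded only on first use
      $helper = Data::Dumper->new([]);
      $inits++;
    }
    return $helper;
  }

  sub init_count { $inits }
}

get_helper() for 1 .. 3;
print "initialized ", init_count(), " time(s)\n";
```

A run that never calls get_helper never pays for the require or the constructor.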

Lines 97 to 103 adjust the query form elements of the fetched URL so that it emulates my entering those fields into the original form.

Line 106 sets up a query as an HTTP::Request object, again, using parts of LWP. Line 108 uses that object to do the actual contact with Dejanews. The response is stored into a local $_ implicitly by the foreach construct.

Lines 109 to 115 figure out if the response is useful, by looking for the desired hit count string. If we get a number, we'll return it. If it matched nothing, we return 0. If it's in a format we don't expect, then undef gets passed back.
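The three-way decision can be exercised without touching the network. In this sketch the regular expressions are the ones from the listing, but parse_hits is a hypothetical wrapper and the sample response strings are invented, since the real page text isn't reproduced here.

```perl
use strict;
use warnings;

# Classify a response body: a count, 0 for "no matches", undef if unrecognized.
sub parse_hits {
  my ($text) = @_;
  return $1 if $text =~ /Messages.*of exactly.*?(\d+)/;
  return 0  if $text =~ /did not match any/;
  return undef;
}

# Invented response snippets, for illustration only.
my @samples = (
  "Messages 1-25 of exactly 6123 matches",
  "Your query did not match any articles",
  "Server too busy, try again later",
);

for my $sample (@samples) {
  my $hits = parse_hits($sample);
  print defined $hits ? "$hits\n" : "undef\n";
}
```

Keeping 0 and undef distinct matters: a 0 gets cached and plotted as a real data point, while undef leaves a gap and gets retried on the next run.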

Well, there you have it. I can now run this program once a week from a cron job, and have an always up-to-date graph showing relative numbers of job postings for these languages. Until next time, enjoy!

Listing One

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     $|++;
        =4=     
        =5=     use GIFgraph::lines;
        =6=     use Date::Calc qw(Today Days_in_Month Compress Add_Delta_YMD Month_to_Text);
        =7=     
        =8=     ## begin configuration
        =9=     
        =10=    my @LANGS = sort qw(java perl|perl5 html cgi cobol fortran python tcl);
        =11=    
        =12=    my (@LOWER) = (1995,3,1);       # beginning of useful data
        =13=    my (@UPPER) = Add_Delta_YMD(Today,0,-1,0); # last month
        =14=    
        =15=    use constant GIFOUT => "/home/merlyn/Html/x.gif";
        =16=    use constant MEMORY => "/home/merlyn/.dejajobscache";
        =17=    
        =18=    ## end configuration
        =19=    
        =20=    my @plotdata = ();
        =21=    
        =22=    my (@from) = @LOWER;
        =23=    {
        =24=      my $days = Days_in_Month(@from[0,1]);
        =25=      my (@to) = (@from[0,1],$days);
        =26=      last if Compress(@to) >= Compress(@UPPER);
        =27=      my $fromdate = to_dejadate(@from);
        =28=      my $todate = to_dejadate(@to);
        =29=      push @{$plotdata[0]}, sprintf("%04d %02d", @from);
        =30=      my $id = 0;
        =31=      for my $lang (@LANGS) {
        =32=        my $hits = count_hits($lang, $fromdate, $todate);
        =33=        if (defined $hits) {
        =34=          print "$fromdate $todate $lang $hits\n";
        =35=          $hits /= $days;
        =36=        }
        =37=        push @{$plotdata[++$id]}, $hits;
        =38=      }
        =39=      @from = Add_Delta_YMD(@from, 0, 1, 0);
        =40=      redo;
        =41=    }
        =42=    
        =43=    my $graph = GIFgraph::lines->new(640,480);
        =44=    $graph->set(
        =45=                x_labels_vertical => 1,
        =46=                title => 'Keyword hits per day in misc.jobs.offered from Dejanews',
        =47=                dclrs => [qw(
        =48=                            lred lorange lyellow lgreen lblue lpurple
        =49=                            dred  orange dyellow dgreen dblue dpurple
        =50=                            )],
        =51=               );
        =52=    $graph->set_legend(@LANGS);
        =53=    $graph->plot_to_gif(GIFOUT, \@plotdata);
        =54=    
        =55=    ## subroutines
        =56=    
        =57=    sub to_dejadate {
        =58=      my($y,$m,$d) = @_;
        =59=      join " ", Month_to_Text($m), $d, $y;
        =60=    }
        =61=    
        =62=    BEGIN {
        =63=      my %HIT_CACHE;
        =64=      my $SEPARATOR = "\001";
        =65=    
        =66=      dbmopen(%HIT_CACHE, MEMORY, 0666);
        =67=    
        =68=      sub count_hits {
        =69=        my $tag = join $SEPARATOR, @_;
        =70=        my $response;
        =71=    
        =72=        if ($response = $HIT_CACHE{$tag}) {
        =73=          (split $SEPARATOR, $response)[0];
        =74=        } elsif (defined ($response = count_hits_from_deja(@_))) {
        =75=          $HIT_CACHE{$tag} = join $SEPARATOR, $response, time;
        =76=          $response;
        =77=        } else {
        =78=          undef;
        =79=        }
        =80=      }
        =81=    }
        =82=    
        =83=    BEGIN {
        =84=      my $ua;
        =85=      my $uri;
        =86=    
        =87=      sub count_hits_from_deja {
        =88=        my ($query,$fromdate,$todate) = @_;
        =89=        
        =90=        unless ($ua) {
        =91=          require LWP::UserAgent;
        =92=          require URI;
        =93=    
        =94=          $ua = LWP::UserAgent->new;
        =95=          $uri = URI->new('http://www.dejanews.com/[ST_rn=ps]/dnquery.xp');
        =96=        }
        =97=        $uri->query_form(
        =98=                         ST => "PS",        # hidden
        =99=                         QRY => $query,
        =100=                        "groups" => "misc.jobs.offered",
        =101=                        "fromdate" => $fromdate,
        =102=                        "todate" => $todate,
        =103=                       );
        =104=   
        =105=       require HTTP::Request;
        =106=       my $req = HTTP::Request->new('GET',$uri);
        =107=       
        =108=       for ($ua->request($req)->as_string) {
        =109=         if (/Messages.*of exactly.*?(\d+)/) {
        =110=           return "$1";
        =111=         } elsif (/did not match any/) {
        =112=           return 0;
        =113=         } else {
        =114=           return undef;
        =115=         }
        =116=       }
        =117=     }
        =118=   }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.