Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.


Web Techniques Column 12 (April 1997)

I often read ``How do I make my web-pages searchable?'' in the WWW newsgroups. While there are many fancy packages available to WAIS-index or GLIMPSE-index or ``excite'' your web server, what if you just have a dozen or so scripts that have, say, come from past Web Techniques columns, and you want to make them available and searchable?

Well, I had that problem the other day when someone asked me ``which of your columns talked about flock()?'' I couldn't recall. So, I solved this problem once and for all with a script (talk about overkill!).

The idea for this script comes in part from a similar script written by my associate Joseph Hall <joseph@5sigma.com> as part of our CGI course training materials. (At one point, we were discussing making all the matched characters bold, which his script did not do. He said that would take some thought; I said I could do it in a few lines, but ended up discovering a bug in Perl that won't be fixed until 5.004, so now I have to do something that takes a few more lines as a workaround.)

The resulting web-search CGI program can be seen in Listing 1 [below].

Line 1 starts most of the programs I write these days, enabling ``taint'' mode with -T (to prevent outside data from unknowingly being used to my disadvantage) and -w to tell me what I've done stupidly.

Line 2 also forces me to declare all my variables as lexical variables (declared with my) rather than package variables. This is a good practice, and it speeds up the program slightly, because lexical variables are accessed a bit faster than package variables.

Line 3 disables output buffering, not strictly necessary in this program, but handy nonetheless.

Line 6 takes care of the PATH. I need to set this in order to perform the glob later in the program, and it also covers any other child-process execs.

Line 8 is a flag that I'm using in the death handler. Initially, its value is 0, but will be set to 1 as soon as the program has sent out an HTTP header. (Sending out the header twice is messy, although not completely painful.)

Lines 11 through 15 are my lightweight entity encoder, stolen from the last few programs I've written for this column. Line 12 puts the argument into a local $_. Line 13 replaces all HTML-nasty characters with their entity-ized equivalents. Line 14 returns this hacked string as the result.
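
To see what the encoder buys us, here's a minimal standalone sketch of that same substitution, run against a made-up chunk of text:

        my $text = 'if ($a < $b) { print "&" }';   # made-up sample text
        $text =~ s/["<&>"]/"&#".ord($&).";"/ge;    # the same substitution as line 13
        print "$text\n";
        # prints: if ($a &#60; $b) { print &#34;&#38;&#34; }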

Lines 18 through 25 define a ``death'' handler, somewhat like the ones from the past few columns, but with yet another twist. This time, the printing of the HTTP header (``Content-type:...'') depends on whether or not $header_printed is set. Walking slowly through this, lines 19 and 20 fetch the argument, line 21 converts the HTML-dangerous characters to their HTML-safe equivalents, and lines 22 and 23 print the message, optionally preceded by the HTTP header. Line 24 exits the program.

Line 27 pulls in the CGI module, importing the standard set of methods as directly callable functions so that I can finish the code with minimal hassle.

The next three lines (29 to 31) define path locations for this script to search. $DIR is the Unix path to the directory in question. $URL is its equivalent URL location. It's not trivial for the script to go from one to the other; hence, we need to tell it explicitly. The third line sets $FILEPAT, which is going to be used as input to a glob operator. This string therefore can contain file glob characters (such as * in this example).

Now it's time to get to some good stuff. Line 33 prints the HTTP header and lets the death handler know that the header has been printed by incrementing the $header_printed scalar variable to a non-zero value. Lines 34 through 36 print a common top of the page.

The next few lines display the actual search form. I made the design decision that the search form appears even after the result of a successful search. This way, the surfer can try a different query while looking at the results of the old query.

Line 38 displays a horizontal line and the beginning of the form. The start_form doesn't have any visual component to it -- it's just there so the browser knows which fields belong to this form when it transmits them.

Line 39 shows a textfield named ``search''. As a nifty feature, CGI.pm will initially load this as an empty string, but on subsequent executions of the same form, the previous search string is automatically used.

Line 40 creates a checkbox named ``regex''. This box will be initially unchecked, indicating that the string in ``search'' is to be taken literally. However, if the user checks this box (or leaves it checked from a previous invocation), then the ``search'' string will be interpreted as a Perl regular expression. I'll explain this further in a moment. Line 41 similarly creates a checkbox named ``ignore'' to choose case-insensitive versus case-sensitive matching. Initially, the search is case sensitive.

Line 42 displays the Submit button. (Well, technically, it invokes a subroutine to invoke a method to create HTML that is then printed to standard out, retransmitted by the Web Server to the client's browser which then displays the submit button. But isn't it much easier to say the first?)

Line 43 marks the end of the form (not visible to the user) and a horizontal line (very visible to the user).

Line 45 looks for the existence of a parameter called ``search'', storing it into a lexical variable named $searchstring. The first time this script is called, this string will be empty. Subsequent invocations will probably have something here. This is in fact tested in line 46. Notice we test for either an undefined value (undef) or an empty string. The former is tested because it means this is the first time the script has been invoked -- the latter because it doesn't make sense to search for nothing, no matter how hard the user wants to do it.

If this is a valid search (second or later time around, and something to search for), then we need to set up for the search. Line 47 puts us in the right directory, and line 48 looks for all the proper files. The filenames are stored into the @ARGV array, just right for scanning with the diamond (<>) operator.
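
If you haven't played with this trick before, here's a minimal sketch of the idea (the directory and file pattern here are just made up for illustration):

        chdir "/some/dir" or die "Cannot chdir: $!";   # some made-up directory
        @ARGV = glob "*.txt";      # prime @ARGV with the matching filenames
        while (<>) {               # the diamond operator reads each file in turn
          print "$ARGV:$.: $_";    # current filename, line number, and the line
          close ARGV if eof;       # reset $. at each file boundary
        }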

Lines 49 through 51 handle the steps of turning a string into a regular expression, or rather, making a string act like a string and a regular expression act like a regular expression. If the user wanted regular expressions, these statements are skipped, leaving $searchstring exactly the way it was. If the user did not select a regular expression interpretation, then $searchstring is hacked to put backslashes in front of all non-alphanumeric characters. Specifically, if the search string is ``[box]'', Perl would treat this as a character class matching any single ``b'', ``o'', or ``x'' character. After the quotemeta() function (a built-in), the string becomes ``\[box\]'', which will match a literal left bracket, the letters ``box'', and then a literal right bracket... much closer to what the user wanted for a literal string match.
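
Here's a tiny demonstration of the difference (the search string and sample text are made up):

        my $searchstring = "[box]";
        print "class match\n"   if "b" =~ /$searchstring/;   # [box] acts as a character class
        $searchstring = quotemeta $searchstring;              # now "\[box\]"
        print "literal match\n" if "see [box] here" =~ /$searchstring/;  # matches the literal text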

Line 52 further adjusts things by creating an $ignore variable, set to either an empty string or the string ``(?i)''. It just so happens that inserting (?i) into a regular expression causes that regular expression to be evaluated as case-insensitive. That's too easy.
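
A quick sketch of the effect (with made-up data):

        my $ignore = "(?i)";       # or "" for a case-sensitive search
        my $searchstring = "flock";
        print "hit\n" if "Uses FLOCK() here" =~ /$ignore$searchstring/;   # prints "hit"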

Lines 53 and 54 spit out some HTML ahead of the search results. Note that I'm jumping into pre-formatted mode here, causing multiple spaces to retain their spaceness, and linefeeds to be feeding more lines. On most browsers, this also switches to a constant-width font, making some of the lines even line up.

Line 55 creates a $per_file counter, allowing me to keep track of how many matching lines I've seen in a particular file. Through micro-months of painstaking research, Stonehenge Laboratories has determined that there's no need to show more than five matching lines from any one file. So I have to keep track of this.

Lines 56 to 80 form a standard sort of diamond loop, reading a line at a time from the file named by $ARGV into the $_ variable. Lines 57 to 60 handle the necessary bookkeeping at the end of each file to keep $. set correctly (this comes directly from the manpages for Perl). Line 61 tosses that mostly useless newline at the end of the line away from $_.

Lines 62 through 71 track the hits in a particular line. First, a per-line hit counter is reset to zero. Then, line 63 attempts to replace the regular expression with an empty string. This replacement either succeeds or fails. If it succeeds, we have a valid occurrence of the regular expression built from $ignore and $searchstring, and it has now been removed from the string. Now, recall that $searchstring is either the raw search string entered by the user, or one that has been hacked to have all special characters preceded by a backslash, causing it to act like an ordinary string. And $ignore is either an empty string, or ``(?i)''.

What's that /o for on the end of the replacement? Well, try leaving it off, and you'll see. It causes the regular expression to be compiled just once, the first time we hit this statement, and never again, regardless of the values of $searchstring and $ignore (perfectly OK here, because they have only one value per invocation of this program). Leaving it off causes the regular expression to be re-interpreted every time around this loop. Bad. Very Bad. And noticeably slower if you are parsing more than five lines.
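
If you want to see the compile-once behavior for yourself, here's a minimal sketch (the pattern and strings are made up):

        my $pat = "cat";
        for my $line ("cat nap", "dog day") {
          print "match: $line\n" if $line =~ /$pat/o;   # compiled once, as /cat/
          $pat = "dog";          # too late: /o has already frozen the pattern
        }
        # prints only "match: cat nap", even though $pat later becomes "dog"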

OK, back to the body. If the replacement is successful, we need to tell the user. If this is the first hit on the line, lines 64 through 67 print a prefix for the line consisting of the name of the file as a link (with the filename itself as the link text), and then the line number. Line 68 prints the part of the string ahead of the regular expression match as regular text, followed by the part that matched as a bold string (using the ``b'' function from CGI.pm).

Line 69 causes the left part of the string to be discarded, by setting the whole string to be the part of the string after the regular expression match. This is a good way to walk a string looking for matches.
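
The same walk, boiled down to a standalone sketch against a made-up line of text:

        my $line = "one fish two fish";
        while ($line =~ s/fish//) {
          print $`, "<B>", $&, "</B>";   # text before the match, then the match in bold
          $line = $';                    # keep only the text after the match
        }
        print $line, "\n";               # whatever is left after the last match
        # prints: one <B>fish</B> two <B>fish</B>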

Line 70 ensures that we process only the first five matches on any one line. This keeps the program from working too hard, but it also prevents an infinite loop if the regular expression can match a null string. Consider what happens when someone asks to match [a-z]*, and you'll see what I mean: no forward progress once we finally bump up against a non-letter.
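
To see the danger, here's a sketch (again with made-up data); without the cap, the substitution ``succeeds'' forever without consuming anything:

        my $line = "12345";              # contains no letters at all
        my $hits = 0;
        while ($line =~ s/[a-z]*//) {    # [a-z]* happily matches the empty string
          last if ++$hits >= 5;          # without this cap, the loop would never end
        }
        print "stopped after $hits empty matches\n";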

Outside the loop, we need to finish off the line if we have started printing it, by printing the remaining $_ followed by a newline (important, because we're in PRE-/PRE territory). If the per-file hit count reaches five, we bail out of this particular file by closing the ARGV handle, just as we did at the top of the loop.

Line 81 closes down the search output, and line 84 prints the end of the HTML data. That's it!

As you can see, a simple program gives my users a chance to search the listings for the keywords they want, pointing them directly at the file. Pretty cool. And it even does it rather efficiently. See ya next time and I hope you find what you are looking for....

Listing 1

        =1=     #!/home/merlyn/bin/perl -Tw
        =2=     use strict;
        =3=     $| = 1;
        =4=     
        =5=     ## set the path
        =6=     $ENV{"PATH"} = "/usr/local/bin:/usr/ucb:/bin:/usr/bin";
        =7=     
        =8=     my $header_printed = 0;         # so the death handler knows
        =9=     
        =10=    ## return $_[0] encoded for HTML entities
        =11=    sub ent {
        =12=      local $_ = shift;
        =13=      $_ =~ s/["<&>"]/"&#".ord($&).";"/ge;  # entity escape
        =14=      $_;
        =15=    }
        =16=    
        =17=    ## death handler
        =18=    $SIG{"__DIE__"} = $SIG{"__WARN__"} = sub {
        =19=      my $why = shift;
        =20=      chomp $why;
        =21=      $why = ent $why;
        =22=      print "Content-type: text/html\n\n" unless $header_printed++;
        =23=      print "ERROR: $why\n";
        =24=      exit 0;
        =25=    };
        =26=    
        =27=    use CGI qw/:standard/;
        =28=    
        =29=    my $DIR = "/home/merlyn/Html/merlyn/WebTechniques";
        =30=    my $URL = "http://www.stonehenge.com/merlyn/WebTechniques";
        =31=    my $FILEPAT = "*.listing.txt";
        =32=    
        =33=    print header; $header_printed++;
        =34=    print start_html("-title" => "Search WebTechniques Perl Scripts");
        =35=    print h1("Search WebTechniques Perl Scripts");
        =36=    print "Search the <A HREF=\"$URL/\">Perl WebTechniques programs</A>",
        =37=      " by submitting this form:\n";
        =38=    print hr, start_form;
        =39=    print p, "Search for: ", textfield("-name" => "search");
        =40=    print p, checkbox("-name" => "regex", "-label" => "Use Regular Expressions");
        =41=    print p, checkbox("-name" => "ignore", "-label" => "Ignore case");
        =42=    print p, submit;
        =43=    print end_form, hr;
        =44=    
        =45=    my $searchstring = param("search"); # the search item
        =46=    if (defined $searchstring and length $searchstring) {
        =47=      chdir $DIR or die "Cannot chdir $DIR: $!";
        =48=      @ARGV = glob $FILEPAT;        # get matching filenames for <>
        =49=      unless (param("regex")) {     # if ordinary string...
        =50=        $searchstring = quotemeta $searchstring; # make ordinary.
        =51=      }
        =52=      my $ignore = param("ignore") ? "(?i)" : ""; # make case insensitive
        =53=      print "<P>Follow the link to get the full listing:\n";
        =54=      print "<PRE>\n";
        =55=      my $per_file = 0;             # how many hits this file?
        =56=      while (<>) {
        =57=        if (eof) {
        =58=          close ARGV;               # resets $.
        =59=          $per_file = 0;
        =60=        }
        =61=        chomp;
        =62=        my $per_line = 0;           # how many hits this line?
        =63=        while (s/$ignore$searchstring//o) {
        =64=          print
        =65=            '<A HREF="',ent("$URL/$ARGV"),'">',
        =66=            ent($ARGV),"</A>:$.: "
        =67=              unless $per_line++;   # first time, print prefix
        =68=          print ent($`), b(ent $&);
        =69=          $_ = $';
        =70=          last if $per_line >= 5;   # only five hits max per line
        =71=        }
        =72=        if ($per_line) {            # at least one hit?
        =73=          print ent($_),"\n";       # finish line off
        =74=          if (++$per_file >= 5) {   # only five lines max per file
        =75=            print "[skipping to next file]\n";
        =76=            close ARGV;             # force EOF
        =77=            $per_file = 0;
        =78=          }
        =79=        }
        =80=      }
        =81=      print "</PRE>\n";
        =82=    }
        =83=    
        =84=    print end_html;

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.