Copyright Notice

This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.

This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.

Please read all the information in the table of contents before using this article.

Web Techniques Column 36 (Apr 1999)

Well, last month's column was a pretty heavy piece of work, clocking in at 312 lines of code, and a correspondingly large amount of descriptive text. This month, I decided to get back to basics and tackle a simple but annoying problem on your typical web page: making it load faster.

One of the things you can do to make a web page appear to load faster is to give the browser hints about the ultimate size of its images. In modern HTML, the IMG tag accepts WIDTH and HEIGHT attributes to give the pixel dimensions of the image. The browser can use these to leave a hole of the appropriate size while the rest of the HTML is still loading, and even while the image data is being fetched in a separate HTTP transaction. While this doesn't actually make the page load any faster, it seems to calm the users down a bit more, since things aren't jumping around for as long.

But, to make this work, you've got to get the actual pixel sizes into the HTML code. Doing this by hand means downloading the image into your favorite image manipulation tool, looking at the information for the picture, noting the pixel size, and then invoking your favorite text editor to hack the HTML. Bleh. No wonder it doesn't get done as often as it could.

But, thanks to the nice Image::Size module (available from the CPAN at http://www.cpan.org/CPAN.html and other places), I was able to write a program to automatically fetch the image, compute its size, and then edit that data right into the HTML! No more excuses: my web pages will now have sizes on them! The Image::Size module handles all the common image formats, such as GIF, JPEG, and PNG, as well as some that you probably won't be using on the web.

To fix an index.html and reference.html file, for example, I can now enter:

        addsize -i index.html reference.html

The -i switch here says to edit these files in place, which means that the files will be changed, saving the old versions to the original names with an appended tilde.

So, let's examine together the program presented in [listing one, below].

Line 1 turns on warnings, while line 2 enables all compiler restrictions. These selections make writing any program longer than about ten lines easier to get right the first time.

Line 4 pulls in the URI::file module, from the new URI distribution. This class (or the class it superseded) was formerly part of the huge LWP distribution (in the CPAN). Now it's a separate piece. You can still install this piece and any other former LWP pieces using the Bundle::LWP installation from CPAN.pm like so:

  $ perl -MCPAN -eshell
  cpan> install Bundle::LWP
  [lots of output]
  cpan> quit
  $

The URI::file module creates objects that represent a URI for a diskfile (usually starting with a scheme of file:). This will be used later to translate the command line arguments into an appropriate object for relative and absolute addressing.

Lines 6 through 50 create a subclass from the HTML::Filter class. Once again, the base class is a part of the LWP bundle described above. These lines are wrapped in a BEGIN block, both to localize the effects of setting the package in line 7 as well as to ensure that any needed modules are brought in and initialized before the rest of the program is parsed.

Line 7 sets the class name as a package name: MyFilter.

Line 8 pulls in the HTML::Filter module, and sets MyFilter's inheritance to include that module. If you're running a version of Perl prior to 5.005, you'll need to replace that line with:

        use HTML::Filter;
        @MyFilter::ISA = qw(HTML::Filter);

which is more work, and that's why I used base instead.

Lines 9 through 11 bring in three other useful classes. Image::Size is described above. HTML::Entities is found in the LWP bundle, and lets us provide proper escaping for the HTML attribute values. And finally, LWP::Simple gives us an easy way to fetch remote images so that their size can be corrected as well.

Lines 13 through 19 define an overridden constructor method called new. The first parameter to this class method will be the class name (package name), which I shift off the @_ array in line 14.

The second parameter will be an object of type URI (or one of its subclasses). This object will be used to construct proper absolute pathnames (or URLs) when we're given a relative URL in an image source URL. We'll shift this off in line 15, saving it for a moment in a temporary variable.

Line 16 calls the superclass (in this case, HTML::Filter) to construct the base object, passing along any other parameters (in this case, none usually). The SUPER syntax here ensures that we don't need to know the inheritance path currently established, although for this example, the path is trivial to determine upon inspection. The result is our object to be returned, saved for the moment into $self.

Line 17 saves the saved URI into an instance variable called _uri. Note that I've determined by inspection that this name is available. If I wasn't sure, I'd pick something like _MyFilter_uri_, which would be very unlikely to conflict.

Finally, the newly constructed object is returned in line 18.
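
The constructor pattern in lines 13 through 19 can be tried out without installing anything. What follows is only a minimal sketch: ToyBase and ToyFilter are made-up stand-ins for HTML::Filter and MyFilter, and the toy base class simply blesses a hash, which is an assumption about the real one rather than a description of it.

```perl
use strict;
use warnings;

# Toy stand-in for HTML::Filter: a base class with a hash-based constructor.
package ToyBase;
sub new {
    my $class = shift;
    return bless { @_ }, $class;
}

# Subclass that accepts one extra leading argument, as MyFilter does.
package ToyFilter;
our @ISA = ('ToyBase');
sub new {
    my $package = shift;
    my $uri     = shift;                      # our own extra parameter
    my $self    = $package->SUPER::new(@_);   # remaining args go to the base
    $self->{_uri} = $uri;                     # instance variable we own
    return $self;
}

package main;
my $filter = ToyFilter->new('file:///tmp/index.html', verbose => 1);
print ref($filter), " holds ", $filter->{_uri}, "\n";
```

The point to notice is that the subclass peels off only the argument it understands and hands everything else upward untouched.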

Lines 21 through 49 define the start method. This method is called automatically by the HTML::Filter class whenever a start tag is seen, such as <img src=...>. We're overriding the default method (in HTML::Filter::start), which simply dumps the tag to the output. We override it because for some tags (namely the image tag), we're gonna make some changes and decisions.

Line 22 grabs the current object into a local variable $self.

Line 23 captures the incoming parameters: the tag name, a hashref for the attributes, a listref that gives the original sequence of those attributes, and the original untouched text (for a quick passthrough if no editing is required).

Lines 24 through 26 detect a base tag, used in the HTML header to define an alternate URL for relative references. This is important to notice, because we need to fetch images according to this tag as well. If we see one of these, we'll grab the URL and stuff it into the _uri instance variable.

Lines 27 through 47 handle an IMG tag that needs to be rewritten. Lines 28 through 31 first determine if we're looking at something that needs to be hacked (must be an IMG tag, must have a SRC attribute, must not already have a WIDTH or HEIGHT attribute). If anything doesn't pass muster, the last operator breaks us out to line 48.
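
If the bare block with last looks odd, here's the same idiom on its own. This is a sketch only: needs_size is a hypothetical helper name, not something in the listing, and it returns a true or false value where the real code goes on to rewrite the tag.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the guard clauses: returns true only when a
# tag is an image with a source but without any size attributes yet.
sub needs_size {
    my ($tag, $attr) = @_;
    {
        last unless $tag eq 'img';
        last unless exists $attr->{src};
        last if exists $attr->{width};
        last if exists $attr->{height};
        return 1;            # every test passed
    }
    return 0;                # reached whenever "last" left the bare block
}

print needs_size('img', { src => 'logo.gif' }) ? "rewrite\n" : "pass through\n";
```

A bare block counts as a loop that runs once, which is why last can bail out of it without any flag variables or nested ifs.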

Lines 32 and 33 compute the URI for the SRC attribute. We'll take the given attribute value, and compute an absolute URI based on the _uri instance variable. We'll use this either to open a local file or fetch a remote URL to determine the image size.

Lines 34 through 37 compute the image size. If the source URI scheme is file, then we're looking at what was originally a relative URL, because we've made it absolute against a file URL representing this particular HTML file. Of course, when this document is ultimately fetched, it'll be an HTTP URL, but that's not relevant here.
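
To see why the scheme test works, it helps to watch the URI module resolve a couple of SRC values. In this sketch, the page path and the example.com URL are invented for illustration, and resolved_scheme is a hypothetical helper, not part of the listing.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: resolve an IMG SRC value against the page's base URI
# and report the scheme of the result.
sub resolved_scheme {
    my ($src, $base) = @_;
    return URI->new_abs($src, $base)->scheme;
}

my $page = URI->new('file:///home/me/site/index.html');   # made-up path
for my $src ('images/logo.gif', 'http://www.example.com/banner.gif') {
    my $abs = URI->new_abs($src, $page);
    print "$src => $abs (", $abs->scheme, ")\n";
}
```

A relative SRC inherits the file scheme from the page it lives in, while an SRC that is already absolute keeps its own scheme, so the one test separates local images from remote ones.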

If we're looking at a file URL, then line 36 calls the imgsize routine (imported from the Image::Size package) on the filename. Otherwise, line 37 uses the get routine (from LWP::Simple) to fetch the contents of the remote URL, and passes a reference to that scalar data to imgsize. The imgsize routine distinguishes filenames from actual data by noting that actual data is always passed as a scalar reference. The return value from one of these two calls to imgsize ends up in @xy.

If imgsize fails for any reason, the first two elements of @xy are undef, which I'll test for in line 38. If we don't get a good value, we'll bail, and just dump out the original text below.

Line 39 takes the X and Y value, and stores those as new attributes in the hash pointed to by $attr, using the hashref-slice notation here.
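
The hashref-slice notation is compact enough to deserve a tiny demonstration. This is a sketch with invented values; store_size is a hypothetical helper name, and the 'GIF' string stands in for whatever extra values might follow the first two.

```perl
use strict;
use warnings;

# Hypothetical helper: copy a (width, height) pair into an attribute hash
# through its reference, using a hash slice.
sub store_size {
    my ($attr, @xy) = @_;
    @$attr{qw(width height)} = @xy[0, 1];   # two keys set in one assignment
    return $attr;
}

my $attr = { src => 'logo.gif', alt => 'Logo' };
store_size($attr, 120, 60, 'GIF');          # anything past the first two is ignored
print "$_=$attr->{$_}\n" for sort keys %$attr;
```

The slice on the left names two keys at once, and the array slice on the right picks exactly two values to go with them.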

Lines 40 through 44 build up a new tag-attribute string. Each entry in the list pointed to by $attrseq, followed by the new width and height values, is added to the string. Note that we need to encode the HTML-significant entities from the attribute values, so we're calling encode_entities (from HTML::Entities) to handle that.
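
Here's the tag-rebuilding step pulled out into a standalone sketch. The rebuild_tag name and the sample attribute values are made up for illustration; the encode_entities call is the same one the listing uses.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Hypothetical helper: rebuild a start tag from its attribute hash, keeping
# the given attribute order and escaping each value.
sub rebuild_tag {
    my ($tag, $attr, $attrseq) = @_;
    my $text = "<$tag";
    for my $name (@$attrseq) {
        $text .= qq/ $name="/ . encode_entities($attr->{$name}) . q/"/;
    }
    return $text . ">";
}

print rebuild_tag(
    'img',
    { src => 'logo.gif', alt => 'Tom & "Jerry"', width => 120, height => 60 },
    [qw(src alt width height)],
), "\n";
```

Without the escaping, the quote marks inside the ALT text would end the attribute early and leave the browser guessing about the rest of the tag.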

Once the string is built up, we'll dump it in line 45 by calling the output method. By default, this is merely a print to the default filehandle, which is exactly where we want it to go. But we'll call the method anyway in case someone subclasses my filter routine, overriding output (it could happen, but not in this program).
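
Why bother going through a method when print would do? A toy example makes the payoff visible. ToyEmitter and ToyCollector are invented classes for this sketch and have nothing to do with how HTML::Filter is actually built inside.

```perl
use strict;
use warnings;

# Toy emitter: emit() always goes through the output method rather than
# calling print directly.
package ToyEmitter;
sub new    { my $class = shift; return bless { @_ }, $class }
sub output { my $self = shift; print @_ }
sub emit   { my ($self, $text) = @_; $self->output($text); return $self }

# A subclass can redirect everything by overriding output alone.
package ToyCollector;
our @ISA = ('ToyEmitter');
sub output { my $self = shift; $self->{buffer} .= join '', @_ }

package main;
ToyEmitter->new->emit("<p>\n");                       # printed to STDOUT
my $c = ToyCollector->new(buffer => '');
$c->emit('<img src="a.gif">')->emit('<br>');          # captured instead
print length($c->{buffer}), " characters captured\n";
```

Because emit never prints on its own, the subclass gets to decide where the text ends up without touching any of the code that generates it.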

Line 48 is selected only when the start tag needs to be output exactly as it was input. This happens nearly all the time, so this line gets called a lot.

And that defines the class MyFilter, a subclass of HTML::Filter, with specific instructions to read HTML data, look for image tags, determine the size of the corresponding images, and rewrite those items as necessary. Now all we have to do is call an object of that class.

Line 52 undefines the $/ variable. When this variable is undef, any ``line'' read operation becomes an entire file read operation. Very useful here, as you'll see a few lines down.
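
You can watch the effect of an undefined $/ without touching any real files. In this sketch, slurp is a hypothetical helper and an in-memory filehandle stands in for an HTML file. Note that it uses local to limit the change to one subroutine, whereas the listing undefines $/ for the whole program.

```perl
use strict;
use warnings;

# Hypothetical helper: read everything from a filehandle in a single read.
sub slurp {
    my ($fh) = @_;
    local $/;            # undef within this scope: no record separator
    my $data = <$fh>;
    return $data;
}

# An in-memory filehandle stands in for a real HTML file here.
my $html = "<html>\n<body>\n<img src=\"logo.gif\">\n</body>\n</html>\n";
open my $fh, '<', \$html or die "cannot open in-memory file: $!";
my $contents = slurp($fh);
close $fh;

my $lines = () = $contents =~ /\n/g;     # count the newlines we got back
print "one read returned $lines lines\n";
```

For a short program like this month's, the global undef is fine; the local form is handy when the rest of a program still wants to read line by line.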

Line 53 notes a -i option on the command line. If the option is present, we'll enable in-place editing mode, affecting the way the diamond operator in line 54 opens up a new file. The backup file extension is set to tilde by default. However, if there's an extension present after the -i parameter, then we'll use that instead.
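
The option handling in line 53 is dense, so here's the same logic stretched out into a sketch. The backup_extension name is hypothetical, and instead of setting $^I it returns the extension so you can see what would be assigned.

```perl
use strict;
use warnings;

# Hypothetical helper: peel a leading -i[ext] switch off an argument list
# and return the backup extension to use (undef if no -i was given).
sub backup_extension {
    my ($args) = @_;                     # array reference, modified in place
    return undef unless @$args and $args->[0] =~ /^-i(.*)/;
    my $ext = $1 || "~";                 # bare -i falls back to a tilde
    shift @$args;                        # drop the switch, keep the filenames
    return $ext;
}

my @args = ('-i.bak', 'index.html', 'reference.html');
my $ext  = backup_extension(\@args);
print "backup extension: $ext, files: @args\n";
```

In the real program, the value lands in $^I, and a defined $^I is what tells the diamond operator to rename each original and write the new text in its place.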

Lines 54 through 58 form a diamond loop, reading through the filenames now present in @ARGV. As each file is read, the entire contents end up in $_, and the filename is in $ARGV.

Line 55 creates a new URI object: actually, in this case a URI::file object. We'll dump out the filename in line 56 for safekeeping (or as a progress indicator).

Finally, the major work gets done in line 57. A call to the new method returns the parsing object, on which we then invoke the parse method, passing it the contents of the file, and then signal end of file by calling eof. This will result in a bunch of stuff being dumped to the currently selected filehandle (either STDOUT, or ARGVOUT if we're in in-place editing mode), and we're done!

And there you have it. A small-sized program to do a giant-sized service to your web site visitors. Until next time, enjoy!

Listing One

        =1=     #!/usr/bin/perl -w
        =2=     use strict;
        =3=     
        =4=     use URI::file;
        =5=     
        =6=     BEGIN {
        =7=       package MyFilter;
        =8=       use base qw(HTML::Filter);
        =9=       use Image::Size;
        =10=      use HTML::Entities;
        =11=      use LWP::Simple;
        =12=    
        =13=      sub new {
        =14=        my $package = shift;
        =15=        my $uri = shift;
        =16=        my $self = $package->SUPER::new(@_);
        =17=        $self->{_uri} = $uri;
        =18=        $self;
        =19=      }
        =20=    
        =21=      sub start {
        =22=        my $self = shift;
        =23=        my($tag, $attr, $attrseq, $origtext) = @_;
        =24=        if ($tag eq 'base' and exists $attr->{href}) {
        =25=          $self->{_uri} = URI->new($attr->{href});
        =26=        }
        =27=        {
        =28=          last unless $tag eq 'img';
        =29=          last unless exists $attr->{src};
        =30=          last if exists $attr->{width};
        =31=          last if exists $attr->{height};
        =32=          my $src = $attr->{src};
        =33=          my $src_uri = URI->new_abs($src, $self->{_uri});
        =34=          my @xy =
        =35=            $src_uri->scheme eq "file" ?
        =36=              imgsize($src_uri->path) :
        =37=                imgsize(\get($src_uri));
        =38=          last unless defined $xy[0];
        =39=          @$attr{qw(width height)} = @xy[0,1];
        =40=          my $tmp = "<$tag";
        =41=          for (@$attrseq, qw(width height)) {
        =42=            $tmp .= qq/ $_="/.encode_entities($attr->{$_}).q/"/;
        =43=          }
        =44=          $tmp .= ">";
        =45=          $self->output($tmp);
        =46=          return;
        =47=        }
        =48=        $self->output($origtext);
        =49=      }
        =50=    }
        =51=    
        =52=    undef $/;
        =53=    shift, $^I = ($1 || "~") if @ARGV and $ARGV[0] =~ /^-i(.*)/;
        =54=    while (<>) {
        =55=      my $file = URI::file->new_abs($ARGV);
        =56=      print STDOUT "===== $ARGV =====\n";
        =57=      MyFilter->new($file)->parse($_)->eof;
        =58=    }

Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.

Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.