Copyright Notice
This text is copyright by CMP Media, LLC, and is used with their permission. Further distribution or use is not permitted.This text has appeared in an edited form in WebTechniques magazine. However, the version you are reading here is as the author originally submitted the article for publication, not after their editors applied their creativity.
Please read all the information in the table of contents before using this article.
Download this listing! | ||
Web Techniques Column 21 (January 1998)
Most of the web surfers out there are on slow boxes attached to a 28.8KBPS modem. But as more of the world shifts to faster solutions (like X2 or K56flex or ISDN, or even T-1's and faster at company or university settings), the increasing demands on a web server can sometimes be substantial.
Many ISPs from whence you can rent webspace are on fast
connections. But suppose you're an avid reader of the
alt.binaries.pictures.stonehenge
newsgroup, and have faithfully
downloaded all the 145-part pictures in the ``Stonehenge: Dawn to Dusk''
series. After much hair-pulling to get them all decoded, you decide
to be nice to your fellow web surfers by offering them on your
website. So, you post an announcement that says ``hey, come over here
for the entire series all decoded'', and your ISPs two T-1 lines get
completely swamped by all these requests! Argh! What to do (besides
find a new ISP now that you've been kicked off)?
Well, the program in this month's column permits you to provide a web directory that is limited in bandwidth by artificially slowing down (in a fairly fair manner) the number of bytes sent over a limited period of time (such as 5 seconds). In this way, you can ensure that your Stonehenge pictures never steal more than the equivalent of, say, a single 28.8 modem, even though 24 druids are currently hitting your site simultaneously. (One ISP that I use limits general users to 20MB/day, so I could keep it down to 200 bytes/sec and be assured that I'd never exceed the max number, although this would be rediculously low.)
You could also use this script to see what your website looks like when viewed at slow speeds even though you have a direct connection to your machine. (I've seen far too many sites that have never tested themselves for low-speed connections. Oh well.)
Like all of my columns, this script isn't really ready-to-run -- you're expected to adapt and adopt the ideas for your particular application. But at least I've figured out the hard stuff.
So, I've hacked out this bandwidth-reducing CGI script, presented in [Listing One, below].
Lines 1 through 3 start nearly every program I write. (I'm using an alpha version of the next Perl release in my home directory, hence the funny path.) These lines turn on taint checking, warnings, all compile-time restrictions, and unbuffer STDOUT (very important for an NPH script).
Line 5 pulls in the CGI functions from fellow-columnist Lincoln Stein's wonderful CGI.pm module. I'm using this module mainly to print a nicely formatted header and to get the ``path info''.
Line 6 pulls in the time2str
function from the HTTP::Date module,
found in Gisle Aas' incredible LWP
library. This function takes a
timestamp and converts to a date string that is nicely compatible with
HTTP headers. LWP is not installed by default -- you'll need to get
it from the CPAN if you don't have it (at http://www.perl.org/CPAN/).
Line 7 pulls in the IO::File module, which is shipped standard with Perl since version 5.004. (If you don't have 5.004, upgrade now -- there's a longstanding CERT security warning for all prior versions, and if you're using an older version for CGI, you are vulnerable to automated bad-guy breakins on your web server.) Here, I'm using the IO::File module to create temporary filehandles with very specific properties.
Lines 9 through 18 define configuration constants. Line 11 gives the pathname of the directory in which all files will be found. Subdirectories are not permitted here, so all files to be served by a particular instance of this script must be within one flat directory.
Line 12 gives the filename of the inter-process communcation file. This file will be shared by all invocations of this script, and contains the history of recent bytes sent so that the invocations can keep from exceeding the requested bandwidth.
Line 14 defines the sample time in seconds over which the current bandwidth usage is calculated. Higher values give more accuracy over longer time, but allow for big huge bursts (which may not necessarily be desirable). Lower values give more control over a moment-by-moment basis, but waste overall potential bandwidth if it can't all be utilized in the narrow window. I found 5 seconds to be a nice compromise for my small-to-medium test files. Also note that huge numbers will radically increase the size of $BYTES_FILE as well as decrease the performance of the inter-process communication, so beware.
Line 15 defines the target bandwidth limit in bytes-per-second. To simulate a 28.8 serial link for example, set this to roughly 3000. Here, I'm using 10K bytes/second... a nice round number that made testing relative performance easy.
Line 16 defines the maximum burst that any particular send is permitted to send before yielding to the next process in line that wants to send something. I didn't get a good feeling for setting this number in my limited testing, so I set it to a nice figure for STDIO (8K bytes). Setting this number too low would be like setting $WINDOW_SECONDS too high -- too much time will be spent in inter-process yielding and communication, so beware. If you set this number to an extremely large value, a single process can use the entire effective bandewidth available, causing no further information to be sent for $WINDOW_SECONDS (from possibly the same or perhaps a different process).
These last three values will require tuning depending on your application. If you get some insight about setting these numbers for a particular need, please email me and let me know, and I'll pass the information along in a future column.
Line 20 fetches the ``path info'' -- the value after the name of the CGI
script. For example, if this script is invoked with
/cgi/nph-limit/stone1.jpg
, then $file will contain just
/stone1.jpg
.
Lines 22 through 79 define an eval
block to catch errors generated
by the main part of the code. If you've been following my column for
a while, you'll note that this is Yet Another Reinvention of the same
thing. Some day, I'll settle down on a good plan, or maybe I'll just
give in and use CGI::Carp. Any errors in this block will cause the
eval to be aborted, handled below.
Lines 23 through 26 handle the requested filename, checking it for
good characters and whether or not the file is even present. The
format of the die
message is defined by the eval-error handler
later.
Lines 27 through 36 try to determine a MIME-type from a given
filename's extension. It'd be nice if we could ask the web server,
but it's out of the loop at this point, so we need to do the mapping
ourselves. Here I'm looking at some of the common types and suffices.
If you have other types being served, add them to the appropriate
line. If none of the suffix values are right, we'll send it down as
an application/octet-stream
which is basically a MIME-type of ``I
don't have a clue here''. Most browsers ask if you want to save such a
file.
Once the MIME-type is determined, lines 37 through 41 set up the two
filehandles that we'll need. $f is opened on the data file that we
will end up sending, and $b is opened on the inter-process
communication file. The mode of O_RDWR and O_CREAT causes a
read-write filehandle to be opened on a file that may or may not
exist. This way, if someone clears /tmp
, the next invocation of
the script creates the file automatically. Line 41 ensures that
buffering is turned off for the $b file.
Lines 42 through 46 print the HTTP response header, indicating to the browser that we are about to send down some data. Note that the ``Last Modified'' value is set to the modification timestamp of the file we are sending. This will allow a good browser or proxy agent to properly cache the result.
The loop (named OUTER
) from lines 47 to 87 comprise the main
sending loop. Initially (in line 49), we flock
the file. Only one
sender script at a time is allowed to proceed from this point.
Line 50 rewinds the file, to ensure that we're reading the data from
the beginning. Line 51 reads in the data. Each line consists of a
pair of numbers. The split breaks that into two elements, which are
turned into an anonymous list and gathered together with the map
operator. So, @times is a 2-d array.
Lines 52 to 76 define an inner naked-block-loop to handle the potential pause-and-retry if too much data has already been sent.
Lines 53 through 55 parse through the @times array, throwing out information for prior transmissions that occured before the window of interest.
Lines 56 to 59 compute the number of bytes already sent in the window of interest into $sent.
Lines 60 through 61 compute the number of bytes we might be able to send in this particular invocation. If this number is negative, we're gonna be held up for a while until we get to a positive value.
If we can send stuff, lines 62 through 71 take care of that. Line 63 defines a buffer to read some of the file into. Line 64 reads that, exiting the loop if we're already at the end. Line 65 notes the bytes we're sending in this pass, while lines 66 through 68 record this new transmission in the collective memory of the various invocations. Line 69 relinquishes control, permitting another invocation to take charge if necessary. Line 70 sends the data, while line 71 pushes us back up to start all over.
If for some reason we have exceeded the current quota, lines 74 and 75
cause us to go asleep until a reasonable restart time. During this
time, we are still holding the exclusive lock on the file, since
there's no reason for another instance to try at this time -- such a
calculation would be redundant. So, at any given time, for N
simultaneous processes, N-1 will be blocked via flock
, and one will
be operating as the ``master'', ready to send the next bytes, and either
sending them or sleeping.
All that's left is the error handler in lines 81 to 90. If there was
a die
of any kind from the eval block (either user-generated, or
system generated), then $@ is non-empty, and we save it into $text.
For user-generated messages, the first line is always three digits
followed by an HTTP status message, which we strip off into $status.
If that's not the case, it was a system message, and we put in the
dreaded ``500 Server Error'' as a fall-back. The content is dumped as a
text/plain
message so that we don't have to escape HTML (a
brilliant idea I stumbled upon in the shower one day).
So, there you have it. All you have to do is configure it, install it as an NPH script somewhere, then populate the directory to which it points, then sit back and watch people s-l-o-w-l-y download those nifty Stonehenge picture files! Enjoy!
Listing One
=1= #!/home/merlyn/bin/perl -Tw =2= use strict; =3= $|++; =4= =5= use CGI qw(:cgi); =6= use HTTP::Date qw(time2str); =7= use IO::File; =8= =9= ### constants =10= =11= my $SERVED_DIR = "/home/merlyn/Web/Limit"; =12= my $BYTES_FILE = "/tmp/merlyn.web.limit"; =13= =14= my $WINDOW_SECONDS = 5; =15= my $MAX_BYTES_PER_SECOND = 10 * 1024; =16= my $MAX_BYTES_PER_WRITE = 8 * 1024; =17= =18= ### end constants =19= =20= my $file = path_info; =21= =22= eval { =23= if ($file eq "/.." or $file !~ m{ ^/[a-zA-Z0-9_\-.]+$ }x) { =24= die "403 Forbidden\nYou are not permitted to access $file\n"; =25= } elsif (! -e (my $full_file = "$SERVED_DIR$file")) { =26= die "404 Not Found\nYour requested file $file is not found\n"; =27= } else { =28= my $mime_type; =29= { =30= local $_ = $file; =31= /\.txt$/ and ($mime_type = "text/plain"), last; =32= /\.html$/ and ($mime_type = "text/html"), last; =33= /\.gif$/ and ($mime_type = "image/gif"), last; =34= /\.jpe?g$/ and ($mime_type = "image/jpeg"), last; =35= $mime_type = "application/octet-stream"; =36= } =37= my $f = new IO::File $full_file, O_RDONLY =38= or die "404 Not Found\nYour requested file $file cannot be opened\n"; =39= my $b = new IO::File $BYTES_FILE, O_RDWR | O_CREAT =40= or die "500 Internal Error\nCannot create/open $BYTES_FILE\n"; =41= autoflush $b 1; =42= print =43= header(-type => $mime_type, =44= -nph => 1, =45= -status => '200 Found', =46= "last-modified" => time2str ((stat $f)[9])); =47= OUTER: =48= { =49= flock $b, 2; # wait for exclusive lock =50= seek $b, 0, 0; =51= my @times = map [ split ], <$b>; =52= { =53= my $now = time; =54= my $then = $now - $WINDOW_SECONDS; =55= shift @times while @times and $times[0][0] <= $then; =56= my $sent = 0; =57= for (@times) { =58= $sent += $_->[1]; =59= } =60= my $can_send = $MAX_BYTES_PER_SECOND * $WINDOW_SECONDS - $sent; =61= $can_send = $MAX_BYTES_PER_WRITE if $can_send > $MAX_BYTES_PER_WRITE; =62= if ($can_send > 0) { # ok to send =63= my $buf; =64= last unless read($f, $buf, $can_send); =65= push @times, [$now, length $buf]; =66= seek $b, 0, 0; =67= truncate $b, 0; =68= print $b map "$_->[0] $_->[1]\n", @times; =69= flock $b, 8; # release lock =70= print $buf; =71= redo OUTER; =72= } =73= ## not ok to send, must sleep =74= sleep $times[0][0] - $then + 1; =75= redo; =76= } =77= } =78= } =79= }; =80= =81= if ($@) { =82= my $text = $@; =83= my $status = ($text =~ s/^(\d\d\d .*)\n//) ? $1 : "500 Server Error"; =84= print =85= header(-type => 'text/plain', =86= -nph => 1, =87= -status => $status, =88= -date => time2str), =89= $text; =90= }
Randal L. Schwartz is a renowned expert on the Perl programming language (the lifeblood of the Internet), having contributed to a dozen top-selling books on the subject, and over 200 magazine articles. Schwartz runs a Perl training and consulting company (Stonehenge Consulting Services, Inc of Portland, Oregon), and is a highly sought-after speaker for his masterful stage combination of technical skill, comedic timing, and crowd rapport. And he's a pretty good Karaoke singer, winning contests regularly.
Schwartz can be reached for comment at merlyn@stonehenge.com or +1 503 777-0095, and welcomes questions on Perl and other related topics.