4. Using Perl with Web Servers¶
Web servers frequently need some type of maintenance in order to operate at peak efficiency. This chapter looks at some maintenance tasks that can be performed by Perl programs. You see how your server keeps track of who visits your site and which web pages are accessed. You also see some ways to automatically generate a site index, a what's new document, and user feedback about a web page.
4.1. Server Log Files¶
The most useful tool to assist in understanding how and when your web site pages and applications are being accessed is the log file generated by your web server. This log file contains, among other things, which pages are being accessed, by whom, and when.
Each web server will provide some form of log file that records who and what accesses a specific HTML page or graphic. A terrific site to get an overall comparison of the major web servers can be found at http://www.webcompare.com/. From this site one can see which web servers follow the CERN/NCSA common log format which is detailed below. In addition, you can also find out which sites can customize log files, or write to multiple log files. You might also be surprised at the number of web servers there are on the market.
Understanding the contents of the server log files is a worthwhile endeavor. And in this section, you'll see several ways that the information in the log files can be manipulated. However, if you're like most people, you'll use one of the log file analyzers that you'll read about in section "Existing Log File Analyzing Programs" to do most of your work. After all, you don't want to create a program that others are giving away for free.
| Note |
| This section about server log files is one that you can read when the need arises. If you are not actively running a web server now, you won't be able to get full value from the examples. The CD-ROM that accompanies this book has a sample log file for you to experiment with, but it is very limited in size and scope. |
Nearly all of the major web servers use a common format for their log files. These log files contain information such as the IP address of the remote host, the document that was requested and a timestamp. The syntax for each line of a log file is:
site logName fullName [date:time GMToffset] "req file proto" status length
Since that line of syntax is relatively meaningless, here is a line from a real log file:

204.31.113.138 - - [03/Jul/1996:06:56:12 -0800] "GET /PowerBuilder/Compny3.htm HTTP/1.0" 200 5593

Even though the line may be split in two on the printed page, you need to remember that inside the log file it really is only one line.
Each of the eleven items listed in the above syntax and example is described in the following list.
- site - either an IP address or the symbolic name of the site making the HTTP request. In the example line the site is 204.31.113.138.
- logName - login name of the user who owns the account that is making the HTTP request. Most remote sites don't give out this information for security reasons. If this field is disabled by the host, you see a dash (-) instead of the login name.
- fullName - full name of the user who owns the account that is making the HTTP request. Most remote sites don't give out this information for security reasons. If this field is disabled by the host, you see a dash (-) instead of the full name. If your server requires a user id in order to fulfill an HTTP request, the user id will be placed in this field.
- date - date of the HTTP request. In the example line the date is 03/Jul/1996.
- time - time of the HTTP request. The time will be presented in 24-hour format. In the example line the time is 06:56:12.
- GMToffset - signed offset from Greenwich Mean Time. GMT is the international time reference. In the example line the offset is -0800, eight hours earlier than GMT.
- req - HTTP command. For WWW page requests, this field will always start with the GET command. In the example line the request is GET.
- file - path and filename of the requested file. In the example line the file is /PowerBuilder/Compny3.htm. There are three types of path/filename combinations:
- Implied Path and Filename - accesses a file in a user's home directory. For example, /~foo/ could be expanded into /user/foo/homepage.html. The /user/foo directory is the home directory for the user foo. And homepage.html is the default file name for any user's home page. Implied paths are hard to analyze because you need to know how the server is set up and because the server's setup may change.
- Relative Path and Filename - accesses a file in a directory that is specified relative to a user's home directory. For example, /~foo/cooking.html will be expanded into /user/foo/cooking.html.
- Full Path and Filename - accesses a file by explicitly stating the full directory and filename. For example, /user/foo/biking/mountain/index.html.
- proto - type of protocol used for the request. In the example line the protocol is HTTP/1.0.
- status - status code generated by the request. In the example line the status is 200. See section "Example: Looking at the Status Code" later in the chapter for more information.
- length - length of the requested document. In the example line the length is 5593 bytes.
Web servers can have many different types of log files. For example, you might see a proxy access log, or an error log. In this chapter, we'll focus on the access log - where the web server tracks every access to your web site.
4.1.1. Example: Reading a Log File¶
In this section you see a Perl script that can open a log file and iterate over the lines of the log file. It is usually unwise to read entire log files into memory because they can get quite large. A friend of mine has a log file that is over 113 Megabytes!
Regardless of the way that you'd like to process the data, you must open a log file and read it. You could read the entry into one variable for processing, or you can split the entry into its components. To read each line into a single variable, use the following code sample:
$LOGFILE = "access.log";
open(LOGFILE) or die("Could not open log file.");
foreach $line (<LOGFILE>) {
    # process $line here
}
close(LOGFILE);
| Note |
| If you don't have your own server logs, you can use the file server.log that is included on the CD-ROM that accompanies this book. |
The code snippet will open the log file for reading and will access the file one line at a time, loading the line into the $line variable. This type of processing is pretty limiting because you need to deal with the entire log entry at once.
A more popular way to read the log file is to split the contents of each entry into different variables. For example, Listing 21.1 uses the split() function and some additional processing to fill 11 variables:
| Pseudocode |
|
Turn on the warning option. Initialize $LOGFILE with the full path and name of the access log. Open the log file. Iterate over the lines of the log file. Each line gets placed, in turn, into $line. Split $line using the space character as the delimiter. Get the time value from the $date variable. Remove the date value from the $date variable avoiding the time value and the '[' character. Remove the '"' character from the beginning of the request value. Remove the end square bracket from the gmt offset value. Remove the end quote from the protocol value. close the log file. |
|
Listing 21.1-21LST01.PL - Read the Access Log and Parse Each Entry |
|
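Since the listing itself is not reproduced here, the following sketch shows one way the pseudocode above could be realized. The two sample log entries written at the top are invented so the sketch runs standalone; point $LOGFILE at your real access log instead.

```perl
#!/usr/bin/perl -w

# Create a tiny sample log so the sketch runs standalone.
# These entries are invented; use your real access log instead.
open(SAMPLE, ">access.log") or die("Could not create sample log.");
print SAMPLE ('204.31.113.138 - - [03/Jul/1996:06:56:12 -0800] ',
    '"GET /PowerBuilder/Compny3.htm HTTP/1.0" 200 5593', "\n");
print SAMPLE ('ros.algonet.se - - [09/Aug/1996:08:30:52 -0500] ',
    '"GET /~jltinche/songs/rib_supp.gif HTTP/1.0" 200 1543', "\n");
close(SAMPLE);

$LOGFILE = "access.log";
open(LOGFILE) or die("Could not open log file.");
foreach $line (<LOGFILE>) {
    chomp($line);
    ($site, $logName, $fullName, $date, $gmt,
        $req, $file, $proto, $status, $length) = split(' ', $line);

    # $date now looks like "[03/Jul/1996:06:56:12" - pull the time
    # out first, then keep the date while avoiding the '[' character.
    ($time) = $date =~ m!:(\d\d:\d\d:\d\d)$!;
    ($date) = $date =~ m!^\[([^:]+)!;

    $req   =~ s/^"//;     # remove the '"' from the request
    $gmt   =~ s/\]$//;    # remove the ']' from the GMT offset
    $proto =~ s/"$//;     # remove the '"' from the protocol
}
close(LOGFILE);
```

After the loop, the variables hold the values from the last entry read; in a real program the body of the loop is where each entry would be counted or reported on.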
If you print out the variables, you might get a display like this:
$site = ros.algonet.se
$logName = -
$fullName = -
$date = 09/Aug/1996
$time = 08:30:52
$gmt = -0500
$req = GET
$file = /~jltinche/songs/rib_supp.gif
$proto = HTTP/1.0
$status = 200
$length = 1543
You can see that after the split is done, further
manipulation is needed in order to "clean up" the values inside the variable. At
the very least, the square brackets and the double-quotes needed to be removed.
I prefer to use a regular expression to extract the information from the log file entries. I feel that this approach is more straightforward - assuming that you are comfortable with regular expressions - than the others. Listing 21.2 shows a program that uses a regular expression to determine the 11 items in the log entries.
| Pseudocode |
|
Turn on the warning option. Initialize $LOGFILE with the full path and name of the access log. Open the log file. Iterate over the lines of the log file. Each line gets placed, in turn, into $line. Define a temporary variable to hold a pattern that recognizes a single item. Use the matching operator to store the 11 items into pattern memory. Store the pattern memories into individual variables. close the log file. |
|
Listing 21.2-21LST02.PL - Using a Regular Expression to Parse the Log File Entry |
|
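A sketch of the regular-expression approach described by the pseudocode above might look like this. The sample log entry is invented so the sketch runs standalone; substitute your real access log.

```perl
#!/usr/bin/perl -w

# Create a tiny sample log so the sketch runs standalone.
# This entry is invented; use your real access log instead.
open(SAMPLE, ">access.log") or die("Could not create sample log.");
print SAMPLE ('ros.algonet.se - - [09/Aug/1996:08:30:52 -0500] ',
    '"GET /~jltinche/songs/rib_supp.gif HTTP/1.0" 200 1543', "\n");
close(SAMPLE);

$LOGFILE = "access.log";
open(LOGFILE) or die("Could not open log file.");
foreach $line (<LOGFILE>) {
    # A temporary variable holds a pattern that recognizes one item.
    $w = '(\S+)';
    $line =~ m!^$w $w $w \[([^:]+):(\d\d:\d\d:\d\d) ([^\]]+)\] "$w $w $w" $w $w!;

    # Store the pattern memories into individual variables.
    ($site, $logName, $fullName, $date, $time, $gmt,
        $req, $file, $proto, $status, $length) =
        ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11);
}
close(LOGFILE);
```

Notice that a single matching operator replaces all of the cleanup statements that the split() approach needed; the brackets and quotes are simply left outside the pattern memories.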
The main advantage to using regular expressions to extract information is the ease with which you can adjust the pattern to account for different log file formats. If you use a server that delimits the date/time item with curly brackets, you only need to change the line with the matching operator to accommodate the different format.
4.1.2. Example: Listing Access by Document¶
One easy and useful analysis that you can do is to find out how many times each document at your site has been visited. Listing 21.3 contains a program that reports on the access counts of documents beginning with the letter s.
| Note |
| The parseLogEntry() function uses $_ as the pattern space. This eliminates the need to pass parameters but is generally considered bad programming practice. But this is a small program, so perhaps it's okay. |
| Pseudocode |
|
Turn on the warning option. Define a format for the report's detail line. Define a format for the report's header line. Define the parseLogEntry() function. Declare a local variable to hold the pattern that matches a single item. Use the matching operator to extract information into pattern memory. Return a list that contains the 11 items extracted from the log entry. Open the logfile. Iterate over each line of the logfile. Parse the entry to extract the 11 items but only keep the file specification that was requested. Put the filename into pattern memory. Store the filename into $fileName. Test to see if $fileName is defined. Increment the file specification's value in the %docList hash. close the log file. Iterate over the hash that holds the file specifications. Write out each hash entry in a report. |
|
Listing 21.3-21LST03.PL - Creating a Report of the Access Counts for Documents that Start with the Letter S. |
|
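The pseudocode above could be realized along these lines. The sample log entries are invented so the sketch runs standalone, and the report formats are only one plausible layout; use your real access log in practice.

```perl
#!/usr/bin/perl -w

# Sample log entries, invented for illustration; use your real log.
open(SAMPLE, ">access.log") or die("Could not create sample log.");
print SAMPLE ('ros.algonet.se - - [09/Aug/1996:08:30:52 -0500] ',
    '"GET /~jltinche/songs/song2.gif HTTP/1.0" 200 1543', "\n");
print SAMPLE ('204.31.113.138 - - [03/Jul/1996:06:56:12 -0800] ',
    '"GET /PowerBuilder/Compny3.htm HTTP/1.0" 200 5593', "\n");
close(SAMPLE);

# The detail line of the report.
format =
@<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @>>>>
$document,                                  $count
.
# The header line of the report.
format STDOUT_TOP =
Access Counts for S* Documents                 Pg @<<
$%

Document                                    Access Count
.

# Matches against $_; see the note about this practice above.
sub parseLogEntry {
    my($w) = '(\S+)';
    m!^$w $w $w \[([^:]+):(\d\d:\d\d:\d\d) ([^\]]+)\] "$w $w $w" $w $w!;
    return($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11);
}

open(LOGFILE, "access.log") or die("Could not open log file.");
foreach (<LOGFILE>) {
    # Keep only the file specification, then isolate the filename.
    $fileSpec = (parseLogEntry())[7];
    ($fileName) = $fileSpec =~ m!/([^/]+)$!;
    $docList{$fileSpec}++ if defined($fileName) && $fileName =~ m!^s!;
}
close(LOGFILE);

foreach $document (sort keys %docList) {
    $count = $docList{$document};
    write;
}
```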
This program displays:
Access Counts for S* Documents                 Pg 1

Document                                    Access Count
/~bamohr/scapenow.gif                                 1
/~jltinche/songs/song2.gif                            5
/~mtmortoj/mortoja_html/song.html                     1
/~scmccubb/pics/shock.gif                             1
This program has a couple of points that deserve a comment or two. First, notice that the program takes advantage of the fact that Perl's variables default to a global scope. The main program loads $_ with each log file entry, and parseLogEntry() also directly accesses $_. This is okay for a small program, but for larger programs you need to use local variables. Second, notice that it takes two steps to specify files that start with a letter. The filename needs to be extracted from $fileSpec, and then the filename can be filtered inside the if statement. If the file that was requested has no filename, the server will probably default to index.html. However, this program doesn't take this into account; it simply ignores the log file entry if no file was explicitly requested.
You can use this same counting technique to display the most frequent remote sites that contact your server. You can also check the status code to see how many requests have been rejected. The next section looks at status codes.
4.1.3. Example: Looking at the Status Code¶
It is important for you to periodically check the server’s log file in order to determine if unauthorized people are trying to access secured documents. This is done by checking the status code in the log file entries.
Every status code is a three-digit number. The first digit defines how your server responded to the request; the last two digits do not have any categorization role. There are five values for the first digit:
- 1xx: Informational - Not used, but reserved for future use
- 2xx: Success - The action was successfully received, understood, and accepted.
- 3xx: Redirection - Further action must be taken in order to complete the request.
- 4xx: Client Error - The request contains bad syntax or cannot be fulfilled.
- 5xx: Server Error - The server failed to fulfill an apparently valid request.
Table 21.1 contains a list of the most common status codes that can appear in your log file. You can find a complete list on the http://www.w3.org/pub/WWW/Protocols/HTTP/1.0/spec.html web page.
| Status Code | Description |
|---|---|
| 200 | OK |
| 204 | No content |
| 301 | Moved permanently |
| 302 | Moved temporarily |
| 400 | Bad Request |
| 401 | Unauthorized |
| 403 | Forbidden |
| 404 | Not found |
| 500 | Internal server error |
| 501 | Not implemented |
| 503 | Service unavailable |
Status code 401 is logged when a user attempts to access a secured document and enters an incorrect password. By searching the log file for this code, you can create a report of the failed attempts to gain entry into your site. Listing 21.4 shows how the log file could be searched for a specific error code - in this case, 401.
| Pseudocode |
|
Turn on the warning option. Define a format for the report's detail line. Define a format for the report's header line. Define the parseLogEntry() function. Declare a local variable to hold the pattern that matches a single item. Use the matching operator to extract information into pattern memory. Return a list that contains the 11 items extracted from the log entry. Open the logfile. Iterate over each line of the logfile. Parse the entry to extract the 11 items but only keep the site information and the status code. If the status code is 401, then increment the counter for that site. close the log file. Check the site list to see if it has any entries. If not, display a message that says no unauthorized accesses took place. Iterate over the hash that holds the site names. Write out each hash entry in a report. |
|
Listing 21.4-21LST04.PL - Checking for Unauthorized Access Attempts |
|
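One way to sketch the pseudocode above is shown below. The sample entries, including the /secure/index.html path, are invented so the sketch runs standalone; use your real access log instead.

```perl
#!/usr/bin/perl -w

# Sample log entries, invented for illustration; use your real log.
open(SAMPLE, ">access.log") or die("Could not create sample log.");
print SAMPLE ('kairos.algonet.se - - [09/Aug/1996:08:31:02 -0500] ',
    '"GET /secure/index.html HTTP/1.0" 401 0', "\n");
print SAMPLE ('ros.algonet.se - - [09/Aug/1996:08:30:52 -0500] ',
    '"GET /~jltinche/songs/rib_supp.gif HTTP/1.0" 200 1543', "\n");
close(SAMPLE);

# The detail line of the report.
format =
@<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @>>>>
$site,                                      $count
.
# The header line of the report.
format STDOUT_TOP =
Unauthorized Access Report                     Pg @<<
$%

Remote Site Name                            Access Count
.

# Matches against $_; see the earlier note about this practice.
sub parseLogEntry {
    my($w) = '(\S+)';
    m!^$w $w $w \[([^:]+):(\d\d:\d\d:\d\d) ([^\]]+)\] "$w $w $w" $w $w!;
    return($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11);
}

open(LOGFILE, "access.log") or die("Could not open log file.");
foreach (<LOGFILE>) {
    # Keep only the site and the status code.
    ($curSite, $status) = (parseLogEntry())[0, 9];
    $siteList{$curSite}++ if $status == 401;
}
close(LOGFILE);

if (! %siteList) {
    print("No unauthorized accesses took place.\n");
}
else {
    foreach $site (sort keys %siteList) {
        $count = $siteList{$site};
        write;
    }
}
```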
This program displays:
Unauthorized Access Report                     Pg 1

Remote Site Name                            Access Count
ip48-max1-fitch.zipnet.net                            1
kairos.algonet.se                                     4
You can expand this
program’s usefulness by also displaying the logName and fullName items from the
log file.
4.1.4. Example: Converting the Report to a Web Page¶
Creating nice reports for your own use is all well and good. But suppose your boss wants the statistics updated hourly and available on demand? Printing the report and faxing it to the head office is probably a bad idea. One solution is to convert the report into a web page. Listing 21.5 contains a program that does just that. The program creates a web page that displays the access counts for the documents that start with 's'.
| Pseudocode |
|
Turn on the warning option. Define the parseLogEntry() function. Declare a local variable to hold the pattern that matches a single item. Use the matching operator to extract information into pattern memory. Return a list that contains the 11 items extracted from the log entry. Initialize some variables to be used later. The file name of the access log, the web page file name, and the email address of the web page maintainer. Open the logfile. Iterate over each line of the logfile. Parse the entry to extract the 11 items but only keep the file specification that was requested. Put the filename into pattern memory. Store the filename into $fileName. Test to see if $fileName is defined. Increment the file specification's value in the %docList hash. close the log file. open the output file that will become the web page. output the HTML header. start the body of the HTML page. output current time. start an unordered list so the subsequent table is indented. start an HTML table. output the heading for the two columns the table will use. Iterate over hash that holds the document list. output a table row for each hash entry. end the HTML table. end the unordered list. output a message about who to contact if questions arise. end the body of the page. end the HTML. close the web page file. |
|
Listing 21.5-21LST05.PL - Creating a Web Page to View Access Counts |
|
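A sketch along the lines of the pseudocode above follows. The sample log entry, the output file name docaccess.html, and the maintainer address webmaster@foo.com are all invented for illustration; substitute your own log file, page name, and address.

```perl
#!/usr/bin/perl -w

# Sample log entry, invented for illustration; use your real log.
open(SAMPLE, ">access.log") or die("Could not create sample log.");
print SAMPLE ('ros.algonet.se - - [09/Aug/1996:08:30:52 -0500] ',
    '"GET /~jltinche/songs/song2.gif HTTP/1.0" 200 1543', "\n");
close(SAMPLE);

# Matches against $_; see the earlier note about this practice.
sub parseLogEntry {
    my($w) = '(\S+)';
    m!^$w $w $w \[([^:]+):(\d\d:\d\d:\d\d) ([^\]]+)\] "$w $w $w" $w $w!;
    return($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11);
}

# The web page file name and maintainer address are hypothetical.
$LOGFILE    = "access.log";
$WEBPAGE    = "docaccess.html";
$MAINTAINER = 'webmaster@foo.com';

open(LOGFILE) or die("Could not open log file.");
foreach (<LOGFILE>) {
    $fileSpec = (parseLogEntry())[7];
    ($fileName) = $fileSpec =~ m!/([^/]+)$!;
    $docList{$fileSpec}++ if defined($fileName) && $fileName =~ m!^s!;
}
close(LOGFILE);

open(WEBPAGE, ">$WEBPAGE") or die("Could not create web page.");
print WEBPAGE ("<HTML><HEAD><TITLE>Access Counts</TITLE></HEAD>\n");
print WEBPAGE ("<BODY>\n");
print WEBPAGE ("Report created on ", scalar(localtime), ".<P>\n");
print WEBPAGE ("<UL>\n");       # indents the table that follows
print WEBPAGE ("<TABLE BORDER=1>\n");
print WEBPAGE ("<TR><TH>Document</TH><TH>Access Count</TH></TR>\n");
foreach $doc (sort keys %docList) {
    print WEBPAGE ("<TR><TD>$doc</TD><TD>$docList{$doc}</TD></TR>\n");
}
print WEBPAGE ("</TABLE>\n");
print WEBPAGE ("</UL>\n");
print WEBPAGE ("Questions? Contact ");
print WEBPAGE ("<A HREF=\"mailto:$MAINTAINER\">$MAINTAINER</A>.\n");
print WEBPAGE ("</BODY></HTML>\n");
close(WEBPAGE);
```

Running this program under cron every hour, with the output file placed somewhere in the server's document tree, satisfies the "updated hourly and available on demand" requirement.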