3. Form Processing
One of the most popular uses for CGI programs is to process information from HTML forms. This chapter gives you an extremely brief overview of HTML and forms. Next, you see how the form information is sent to CGI programs. After you have been introduced to form processing, a guest book application is developed.
3.1. A Brief Overview of HTML
HTML, or Hypertext Markup Language, is used by web programmers to describe the contents of a web page. It is not a programming language. You simply use HTML to indicate what a certain chunk of text is - such as a paragraph, a heading or specially formatted text. All HTML directives are specified using matched sets of angle brackets and are usually called tags. For example, `<B>` means that the following text should be displayed in bold. To stop the bold text, use the `</B>` directive. Most HTML directives come in pairs and surround the affected text.
HTML documents need to have certain tags in order for them to be considered "correct". The `<HEAD>..</HEAD>` set of tags surrounds the header information for each document. Inside the header, you can specify a document title with the `<TITLE>..</TITLE>` tags.
| Tip |
| HTML tags are case-insensitive. For example, `<TITLE>` is the same as `<title>`. |
After the document header, you need to have a set of `<BODY>..</BODY>` tags. Inside the document's body, you specify text headings by using a set of `<H1>..</H1>` tags. Changing the number after the H changes the heading level. For example, `<H1>` is the first level, `<H2>` is the second level, and so on.
You can use the `<P>` tag to indicate paragraph endings or use the `<BR>` tag to indicate a line break. The `<B>..</B>` and `<I>..</I>` tags are used to indicate bold and italic text.
The text and tags of the entire HTML document must be surrounded by a set of `<HTML>..</HTML>` tags. For example, a short document with two headers and three paragraphs would carry this text:

This is a level one header
This is the first paragraph.
This is the second paragraph and it has italic text.
This is a level two header
This is the third paragraph and it has bold text.

Most of the time, you will be inserting or modifying text inside the `<BODY>..</BODY>` tags.
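A CGI program ultimately prints a page like this, so the example document can be sketched as a Perl function that returns the tagged text. This is only an illustration: the exact tag placement is reconstructed from the tags described in this section, and the function name buildPage() and the title text are hypothetical.

```perl
use strict;
use warnings;

# Build the example page as one string. The tag placement is an assumption
# based on the tags described in this section; the title text is made up.
sub buildPage {
    return <<'END_HTML';
<HTML>
<HEAD><TITLE>This is the Title</TITLE></HEAD>
<BODY>
<H1>This is a level one header</H1>
This is the first paragraph.
<P>This is the second paragraph and it has <I>italic</I> text.
<H2>This is a level two header</H2>
<P>This is the third paragraph and it has <B>bold</B> text.
</BODY>
</HTML>
END_HTML
}

print buildPage();
```

Keeping the page in a function that returns a string, rather than printing it piece by piece, makes it easy to inspect the markup before it is sent to the browser.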
That's enough about generic HTML. The next section discusses Server-Side Includes. Today, Server-Side Includes are replacing some basic CGI programs, so it is important to know about them.
3.2. Server-Side Includes
One of the newest features that has been added to web servers is that of Server-Side Includes or SSI. SSI is a set of functions built into web servers that gives HTML developers the ability to insert data into HTML documents using special directives. This means that you can have dynamic documents without needing to create full CGI programs.
The inserted information can take the form of a local file or a file referenced by a URL. You can also include information from a limited set of variables - similar to environment variables. Finally, you can execute programs that can insert text into the document.
| Note |
| The only real difference between CGI programs and SSI programs is that CGI programs must output an HTTP header as their first line of output. See "HTTP Headers" in [the CGI chapter](./cgi.md) for more information. |
Most web servers need the file extension to be changed from html to shtml in order for the server to know that it needs to look for Server-Side directives. The file extension is dependent on server configuration, but shtml is a common choice.
All SSI directives look like HTML comments within a document. This way, the SSI directives will simply be ignored on web servers that do not support them.
Table 20.1 shows a partial list of SSI directives supported by the WebSite server from O'Reilly. Not all web servers will support all of the directives in the table. You need to check the documentation of your web server to determine what directives it will support.
| Note |
| Table 20.1 shows complete examples of SSI directives. You need to modify the examples so that they work for your web site. |
| Directive | Description |
|---|---|
| `<!--#config timefmt="..." -->` | Changes the format used to display dates. |
| `<!--#config sizefmt="..." -->` | Changes the format used to display file sizes. You may also be able to specify bytes (to display file sizes with commas) or abbrev (to display the file sizes in kilobytes or megabytes). |
| `<!--#config errmsg="..." -->` | Changes the format used to display error messages caused by wayward SSI directives. Error messages are also sent to the server's error log. |
| `<!--#echo var="?" -->` | Displays the value of the variable specified by ?. Several of the possible variables are mentioned in this table. |
| `<!--#echo var="DOCUMENT_NAME" -->` | Displays the full path and filename of the current document. |
| `<!--#echo var="DOCUMENT_URI" -->` | Displays the virtual path and filename of the current document. |
| `<!--#echo var="LAST_MODIFIED" -->` | Displays the last time the file was modified. It will use this format for display: 05/31/96 16:45:40. |
| `<!--#echo var="DATE_LOCAL" -->` | Displays the date and time using the local time zone. |
| `<!--#echo var="DATE_GMT" -->` | Displays the date and time using GMT. |
| `<!--#exec cgi="..." -->` | Executes a specified CGI program. It must be activated to be used. You can also use a cmd= option to execute shell commands. |
| `<!--#flastmod virtual="..." -->` | Displays the last modification date of the specified file given a virtual path. |
| `<!--#flastmod file="..." -->` | Displays the last modification date of the specified file given a relative path. |
| `<!--#fsize virtual="..." -->` | Displays the size of the specified file given a virtual path. |
| `<!--#fsize file="..." -->` | Displays the size of the specified file given a relative path. |
| `<!--#include virtual="..." -->` | Displays a file given a virtual path. |
| `<!--#include file="..." -->` | Displays a file given a relative path. The relative path can't start with the ../ character sequence or the / character to avoid security risks. |
SSI provides a fairly rich set of features to the programmer. You might use SSI if you had an existing set of documents to which you wanted to add modification dates. You might also have a file you want to include in a number of your pages - perhaps to act as a header or footer. You could just use the SSI include command on each of those pages, instead of copying the document into each page manually. When available, Server-Side Includes provide a good way to make simple pages more interesting.
Before Server-Side Includes were available, a CGI program was needed in order to automatically generate the last modification date text or to add a generic footer to all pages.
Your particular web server might have additional directives that you can use. Check the documentation that came with it for more information.
| Tip |
| If you'd like more information about Server-Side Includes, check out the following web site: http://www.sigma.net/tdunn/ Tim Dunn has created a nice site that documents some of the more technical aspects of web sites. |
| Caution |
| I would be remiss if I didn't mention the down side of Server-Side Includes. They are very processor intensive. If you don't have a high-powered computer running your web server and you expect to have a lot of traffic, you might want to limit the number of documents that use Server-Side Includes. |
3.3. HTML Forms
HTML forms are designed to let a web page designer interact with users by letting them fill out a form. The form can be composed of elements such as input boxes, buttons, checkboxes, radio buttons, and selection lists. All of the form elements are specified using HTML tags surrounded by a set of `<FORM>..</FORM>` tags. You can have more than one form per HTML document. There are several modifiers or options used with the `<FORM>` tag.
3.4. Handling Form Information
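There are two ways for your CGI program to receive form information - the GET method and the POST method. The transfer mechanism is specified in the `<FORM>` tag using the METHOD modifier. The GET method appends the form information to the URL of the request, and the web server makes it available to the CGI program in the QUERY_STRING environment variable.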
The GET method can't be used for larger forms because some web servers limit the length of the URL portion of a request. (Check the documentation on your particular server.) This means that larger forms might blow up if submitted using the GET method. For larger forms, the POST method is the answer.
The POST method sends all of the form information to the CGI program using the STDIN filehandle. The web server will set the CONTENT_LENGTH environment variable to indicate how much data the CGI program needs to read.
The rest of this section develops a function capable of reading both types of form information. The goal of the function is to create a hash that has one entry for each input field on the form.
The first step is simply to read the form information. The method used to send the information is stored in the REQUEST_METHOD environment variable. Therefore, we can examine it to tell if the function needs to look at the QUERY_STRING environment variable or the STDIN filehandle. Listing 20.1 contains a function called getFormData() that places the form information in a variable called $buffer regardless of the method used to transmit the information.
| Pseudocode |
| Define the getFormData() function. Initialize a buffer. If the GET method is used, copy the form information into the buffer. If the POST method is used, read the form information into the buffer. |

Listing 20.1-20LST01.PL - The First Step is to Get the Form Information.
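The pseudocode above can be sketched in Perl roughly as follows. This is a minimal sketch rather than the original listing: it assumes the web server sets REQUEST_METHOD, QUERY_STRING, and CONTENT_LENGTH as described in this section, and it returns the buffer so that the caller can inspect it, which is an illustrative choice.

```perl
use strict;
use warnings;

# Return the raw, still-encoded form information as a single string.
sub getFormData {
    my $buffer = '';
    my $method = $ENV{'REQUEST_METHOD'} || '';

    if ($method eq 'GET') {
        # GET: the server places the form information in QUERY_STRING.
        $buffer = $ENV{'QUERY_STRING'};
        $buffer = '' unless defined $buffer;
    }
    elsif ($method eq 'POST') {
        # POST: read CONTENT_LENGTH bytes from the STDIN filehandle.
        read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'} || 0);
    }

    return $buffer;
}
```

Reading only CONTENT_LENGTH bytes matters for POST requests, because the server is not required to signal end-of-file on STDIN once the form data has been sent.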
| Tip |
| Since a single function can handle both the GET and POST methods, you really don't have to worry about which one to use. However, because of the limitation regarding URL length, I suggest that you stick with the POST method. |
I'm sure that you find this function pretty simple. But you might be wondering what information is contained in the $buffer variable.
Form information is passed to a CGI program in name=value format and each input field is delimited by an ampersand (&). For example, if you have a form with two fields - one called name and one called age - the form information would look like this:
name=Rolf+D%27Barno&age=34
Can you see the two input fields?
First, split up the information using the & as the delimiter:
name=Rolf+D%27Barno
age=34
Next, split up the two input fields based on the = character:
Field Name: name Field Value: Rolf+D%27Barno
Field Name: age Field Value: 34
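The two splitting steps can be tried out with Perl's split() function. The sketch below is illustrative only: the helper name splitFields() is hypothetical, and the buffer is hard-coded instead of coming from a real form.

```perl
use strict;
use warnings;

my $buffer = 'name=Rolf+D%27Barno&age=34';

# Split the buffer into input fields, then each field into a name and a value.
sub splitFields {
    my ($data) = @_;
    my @pairs;
    foreach my $field (split(/&/, $data)) {
        # The limit of 2 keeps any later '=' characters inside the value.
        my ($name, $value) = split(/=/, $field, 2);
        push @pairs, [$name, defined $value ? $value : ''];
    }
    return @pairs;
}

foreach my $pair (splitFields($buffer)) {
    print "Field Name: $pair->[0] Field Value: $pair->[1]\n";
}
```

At this point the values are still URL encoded, which is why the name field shows Rolf+D%27Barno and not the text the user typed.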
Remember the section on URL encoding from Chapter 19? You see it in action in the name field. The name is really Rolf D'Barno. However, with URL encoding spaces are converted to plus signs and some characters are converted to their hexadecimal ASCII equivalents. If you think about how a single quote might be mistaken for the beginning of an HTML value, you can understand why the ASCII equivalent is used.
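To see where a string like Rolf+D%27Barno comes from, the encoding direction can be sketched in a few lines of Perl. The function name encodeURL() is hypothetical, and the set of characters left untouched (letters, digits, and a handful of punctuation marks) is an assumption about what a typical browser does, not something this chapter specifies.

```perl
use strict;
use warnings;

# Encode a string the way a browser might before submitting a form.
sub encodeURL {
    my ($text) = @_;
    # Replace every character outside the assumed "safe" set with %XX,
    # where XX is its hexadecimal ASCII code. Spaces are handled below.
    $text =~ s/([^A-Za-z0-9 _.*-])/sprintf('%%%02X', ord($1))/ge;
    # Spaces become plus signs.
    $text =~ tr/ /+/;
    return $text;
}

print encodeURL("Rolf D'Barno"), "\n";   # prints Rolf+D%27Barno
```

A literal plus sign in the input is converted to %2B, so the receiving program can still tell it apart from an encoded space.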
Let's add some features to the getFormData() function to split up the input fields and store them in a hash variable. Listing 20.2 shows the new version of the getFormData() function.
| Pseudocode |
| Declare a hash variable to hold the form's input fields. Call the getFormData() function. Define the getFormData() function. Declare a local variable to hold the reference to the input field hash. Initialize a buffer. If the GET method is used, copy the form information into the buffer. If the POST method is used, read the form information into the buffer. Iterate over the array returned by the split() function. Decode both the input field name and value. Create an entry in the input field hash variable. Define the decodeURL() function. Get the encoded string from the parameter array. Translate all plus signs into spaces. Convert characters coded as hexadecimal digits into regular characters. Return the decoded string. |
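A sketch that follows this pseudocode step by step might look like the code below. It is not the original listing: the hash name %frmFlds is hypothetical, and the guard against empty fields is an addition that the pseudocode does not mention.

```perl
use strict;
use warnings;

# Fill the hash referenced by the first argument with decoded form fields.
sub getFormData {
    my ($hashRef) = @_;
    my $buffer = '';
    my $method = $ENV{'REQUEST_METHOD'} || '';

    if ($method eq 'GET') {
        $buffer = defined $ENV{'QUERY_STRING'} ? $ENV{'QUERY_STRING'} : '';
    }
    elsif ($method eq 'POST') {
        read(STDIN, $buffer, $ENV{'CONTENT_LENGTH'} || 0);
    }

    foreach my $field (split(/&/, $buffer)) {
        next if $field eq '';                        # skip empty fields (added guard)
        my ($key, $value) = split(/=/, $field, 2);
        $value = '' unless defined $value;
        # Decode both the field name and its value before storing them.
        $hashRef->{decodeURL($key)} = decodeURL($value);
    }
}

# Reverse URL encoding: plus signs first, then %XX sequences.
sub decodeURL {
    my ($encoded) = @_;
    $encoded =~ tr/+/ /;
    $encoded =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $encoded;
}

my %frmFlds;                 # one entry per input field on the form
getFormData(\%frmFlds);
```

The plus signs are translated before the hexadecimal sequences are decoded so that an encoded plus sign (%2B) is not mistakenly turned into a space.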
The getFormData() function could be considered complete at this point. It correctly reads from both the GET and POST transmission methods, decodes the information, and places the input fields into a hash variable for easy access.

There are some additional considerations of which you need to be aware. If you simply display the information that a user entered, there are some risks involved. Let's take a simple example. What if the user enters `<B>Rolf</B>` in the name field and you subsequently display that field's value? Yep, you guessed it, Rolf would be displayed in bold! For simple formatting HTML tags this is not a problem, and may even be a feature. However, if the user entered an SSI tag, he or she may be able to take advantage of a security hole - remember the `<!--#exec -->` tag?

You can thwart would-be hackers by converting every instance of `<` to `&lt;` and of `>` to `&gt;`. The HTML standard allows for certain characters to be displayed using symbolic codes. This allows you to display a `<` character without the web browser thinking that a new HTML tag is starting.

If you'd like to give users the ability to retain the character formatting HTML tags, you can test for each tag that you want to allow. When an allowed tag is found, convert it back to using normal `<` and `>` characters.

You might want to check for users entering a series of `<P>` tags in the hopes of generating pages and pages of blank lines. Also, you might want to convert the line endings produced by the Enter key into spaces so that the line endings that the user entered are ignored and the text will wrap normally when displayed by a web browser. One small refinement of eliminating the line endings could be to convert two consecutive newlines into a paragraph (`<P>`) tag.

When you put all of these new features together, you wind up with a getFormData() function that looks like Listing 20.3.
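The clean-up steps described above can be sketched as a small helper that is applied to each decoded field value. This is an illustration under stated assumptions, not the original listing: the name cleanField() is hypothetical, and only the `<B>` and `<I>` tags are treated as allowed.

```perl
use strict;
use warnings;

# Make a decoded field value safe to display in an HTML page.
sub cleanField {
    my ($value) = @_;

    # Neutralize every tag, including SSI directives, by escaping angle brackets.
    $value =~ s/</&lt;/g;
    $value =~ s/>/&gt;/g;

    # Restore only the formatting tags we choose to allow (assumed here: B and I).
    $value =~ s/&lt;(\/?[BI])&gt;/<$1>/gi;

    # Normalize line endings, turn runs of blank lines into one paragraph tag,
    # and turn the remaining newlines into spaces.
    $value =~ s/\r\n?/\n/g;
    $value =~ s/\n{2,}/<P>/g;
    $value =~ s/\n/ /g;

    return $value;
}

print cleanField('<B>Rolf</B> <!--#exec cmd="ls" -->'), "\n";
```

Because `<P>` is not in the allowed list, any paragraph tags typed by the user stay escaped, and a run of several blank lines collapses into a single `<P>` tag, so neither trick can produce pages of empty space.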