Level: Introductory
Andrei Malicinski (malacins@us.ibm.com), IBM Application Integration Middleware Lab
Scott Dominick (scottdom@us.ibm.com), IBM Application Integration Middleware Lab
Tom Hartrick (thartric@us.ibm.com), IBM Application Integration Middleware Lab
01 Mar 2001
The best way to know whether your Web site is achieving its goals is to gather extensive traffic data. Part 1 (see Resources) of this two-part series explored several Web measurement approaches, such as network monitors and single-pixel solutions. Here in Part 2, the authors show you how to obtain detailed traffic measurements through analysis of HTTP server logs.
Introduction
"How many hits is my Web site getting?" "How many visitors are going to my site?" "What pages are people looking at?" "Where are my Web customers coming from?" These questions and others like them can be answered by including Web measurement technology in your Web site deployment solution. There are several Web measurement approaches that have been adopted by the industry, such as network monitors, single-pixel solutions, and HTTP server log analysis.
In part two of this series, we concentrate on the approach of obtaining measurements from HTTP server logs. To help you better understand HTTP server log analysis, this article contains an explanation of HTTP logging (what it is, what the log formats are, what information is available, and so on) and reviews the metrics available from raw log data. Basic metrics include the number of hits, the number of visitors, visitor duration, visitor origin (subdomain, referral link), visitor IP address, browser type, and platform type; we will also discuss advanced metrics, which are derived by manipulating the data through techniques such as categorization and aggregation. Additionally, it is possible to extend the reach of your solution by combining technologies such as data warehousing and data mining.
We will take a look at general HTTP server logging capabilities (currently available through prominent HTTP server vendors, each of which implements one of the industry standard logging formats) and explore how analytics concepts are applied to raw logged data. We will discuss real concepts that have been applied in practice, and show you how to obtain measurements from Web server logs -- just one of several Web measurement approaches available in the industry.
HTTP log analysis
HTTP log analysis is the analysis of log files produced by the HTTP servers in a Web server environment. Each HTTP server vendor (such as Apache, Microsoft Internet Information Server, Netscape, and Domino GO) provides logging capabilities with its products. These logging capabilities typically include configuration options that enable and disable logging as well as specify the type and quantity of data logged. Data is logged to files in one or more file formats. As Web server software has evolved, so has the variety of logging options and logging implementations. Over this period, the industry has settled on a small set of log file formats:
- NCSA Combined Log Format
- NCSA Separate Log Format (three-log Format)
- NCSA Common Log Format (access log)
- W3C Extended Log Format
A typical HTTP logging configuration results in a log entry for each HTTP request, otherwise known as a hit on the server. This entry contains detailed information about the resource request. Web analytics software can parse and process these log files, in batch, combining information from each request to give a view of the Web site's traffic. Figure 1 shows a simple data collection process.
Figure 1: Gathering and analyzing Web server logs
Let's examine HTTP log analysis by first looking at what information is available from Web server logs.
Web servers can be configured to record client requests in a server log. The server log contains the information that can later be used to track Web site traffic and usage. For example, when a client visits http://www.ibm.com, this request is logged in the server log of the Web server that received the request.
A Web page typically contains multiple resources, which results in multiple hits. For example, suppose the Web page basic.html contains three images: image1.gif, image2.gif, and image3.gif. When a user requests basic.html, we typically consider that a "hit." However, loading basic.html also triggers three separate requests for the image files embedded in the page. Each resource is requested separately from the server, meaning that the request for basic.html adds three additional hits and produces a total of four hits. In industry terminology, a page view is a hit to a resource that is a page, or readable text, as opposed to a hit to an image or other resource. In this example, there is only a single page view, but the Web server's log for this user's action would contain four entries.
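The distinction between hits and page views can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name is_page_view and the list of page extensions are hypothetical, and a real analysis tool would let you configure which resources count as pages.

```python
import os

# Assumption: resources with these extensions are treated as pages.
PAGE_EXTENSIONS = {".html", ".htm", ".shtml"}

def is_page_view(resource):
    """Return True if the requested resource looks like a page, not an image."""
    path = resource.split("?", 1)[0]          # ignore any query string
    if path.endswith("/"):
        return True                           # directory request, usually an index page
    extension = os.path.splitext(path)[1].lower()
    return extension in PAGE_EXTENSIONS

# The four requests generated by loading basic.html in the example above.
requests = ["/basic.html", "/image1.gif", "/image2.gif", "/image3.gif"]

hits = len(requests)
page_views = sum(1 for resource in requests if is_page_view(resource))
print(hits, page_views)
```

For the basic.html example, this counts four hits but only one page view.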
Log information
Depending on your HTTP server vendor, and the log formats that it supports, the raw information available for Web analytics may vary slightly. The following section details the most common pieces of information typically logged. Review the actual specifics of your log format to determine if your Web server is producing the information you need.
Remotehost
The Remotehost is the IP address or host/subdomain name of the HTTP client that made the HTTP resource request. The IP address of the client making the HTTP request is available to the HTTP server. Typically the remotehost field will contain this IP address in numeric form, such as 125.125.125.125. However, many HTTP servers allow you to turn on DNS resolution for this IP address. When DNS resolution is turned on, the HTTP server will attempt to resolve the IP address into a character-based hostname, or subdomain name such as ibm.com. In this case, the subdomain will be the value of the remotehost.
(Note: Typically HTTP servers are configured with DNS resolution turned off. DNS resolution is often a time-consuming process, and if a Web administrator's goal is to configure a Web site to serve each requested resource as quickly as possible, then it is unlikely DNS resolution will be active.)
The IP address available to the HTTP server is the IP address of the client process that made the actual HTTP resource request of the server. Often, this IP address does not uniquely identify end users who surf the Web with Web browsers; this IP address may be the IP address of a caching server, a proxy server, or one of a pool of IP addresses dynamically assigned by an ISP on each HTTP request.
Logname
This is an identifier (if available) used to identify the client making the HTTP request. RFC 931 contains further information on this field (see Resources).
In practice, most HTTP servers or Web sites do not make use of this field. Consult your Web server documentation for more information on how your server uses the logname field.
User ID
This is the username (or user ID) used by the client for authentication when the HTTP resource requested requires user authentication. If the resource requested does not require authentication, then this username is not filled in.
Date
This is the date and time stamp of the HTTP request; often this information is logged in Greenwich Mean Time (GMT). The actual date and time stamp format contains a time offset as well. With this offset, it's possible to obtain not only the GMT time, but also the local time of the HTTP server that logged the request.
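As a minimal sketch of working with this time stamp, the following Python parses a date field in the bracketed layout used by the NCSA log formats (for example, [10/Oct/1999:21:15:05 +0500]) and derives both the GMT time and the server's local time from the offset. The function name parse_log_date is hypothetical, and the month abbreviation is assumed to be English, because strptime's %b directive depends on the current locale.

```python
from datetime import datetime, timezone

def parse_log_date(field):
    """Parse a bracketed log date field into a timezone-aware datetime."""
    # %d accepts both two-digit and single-digit days; %z reads the offset.
    return datetime.strptime(field.strip("[]"), "%d/%b/%Y:%H:%M:%S %z")

stamp = parse_log_date("[10/Oct/1999:21:15:05 +0500]")
gmt_time = stamp.astimezone(timezone.utc)    # the same instant expressed in GMT
local_time = stamp.replace(tzinfo=None)      # wall-clock time on the logging server

print(gmt_time.isoformat())
print(local_time.isoformat())
```

Keeping the parsed value timezone-aware makes it straightforward to compare entries from servers that log with different offsets.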
HTTP request
The request is basically what the Web client requested. The request field actually contains three pieces of information:
- the requested resource (the main piece);
- the HTTP method (for example, GET or POST); and
- the HTTP protocol version (for example, HTTP/1.0 or HTTP/1.1)
The requested resource is the HTTP resource requested by the HTTP client. For example, consider a Web page that a user loads in his or her browser. When the user loads the page, the browser (an HTTP client) makes a request to the Web server for the page (the HTTP resource). The resource listed in the request field of the log file does not contain the protocol or hostname that make up the URL of the requested resource. For example, if the page the user loaded was http://www.ibm.com/index.html, then the value of the resource that is logged, without the protocol (http://) and without the hostname (www.ibm.com), would be /index.html.
The HTTP method is the method that the HTTP client uses to send its request for the resource. For example, when a user loads a page in his or her browser by entering the URL of the page in the browser entry field, the browser (an HTTP client) makes a "GET" request for the page. GET is the HTTP method by which the browser requests the resource. Another method is "POST." The browser may issue a POST method in the case where the user fills out a form on a Web page and submits the information by clicking on a button on the page that he or she is viewing. Whether the HTTP client issues a GET or a POST to resolve a link on a Web page depends on the implementation of the Web page as chosen by the Web site author.
The HTTP protocol indicates whether HTTP version 1.0 or version 1.1 was used in the communication between the HTTP client and the HTTP server.
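The three parts of the request field can be separated with a simple split. The sketch below assumes the field has already been extracted from the log record without its surrounding quotes; the function name split_request is hypothetical.

```python
def split_request(request):
    """Split a logged request field into (method, resource, protocol)."""
    parts = request.split()
    if len(parts) != 3:
        # Malformed requests do occur in real logs; report rather than guess.
        raise ValueError(f"unexpected request field: {request!r}")
    method, resource, protocol = parts
    return method, resource, protocol

method, resource, protocol = split_request("GET /index.html HTTP/1.0")
print(method, resource, protocol)
```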
HTTP status
The status field is a numeric code indicating the success or failure of the HTTP request. For example, if a user successfully loads a page in his or her browser, then the Web server will likely log a status code of 200, the value indicating success. A failed request will result in a different status code. Generally, status codes in the 200-299 range indicate success, 300-399 indicate redirection (such as a redirected link), 400-499 indicate client error, and 500-599 indicate server error. The HTTP specification documents each possible status code and its meaning.
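The ranges described above map naturally onto a small classification function. This is a sketch only; the function name status_class and the label strings are hypothetical choices, not part of any log format.

```python
def status_class(status):
    """Map a numeric HTTP status code to a coarse category."""
    if 200 <= status <= 299:
        return "success"
    if 300 <= status <= 399:
        return "redirect"
    if 400 <= status <= 499:
        return "client error"
    if 500 <= status <= 599:
        return "server error"
    return "other"

for code in (200, 302, 404, 500):
    print(code, status_class(code))
```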
Bytes
The bytes field is a numeric field containing the number of bytes of data transferred as part of the HTTP request. This value does not include the bytes of the HTTP header. For example, if you load an image that is 10K in size in a browser by entering the URL http://www.ibm.com/large10KImage.gif in your browser entry field, and the server is able to satisfy the request, you would see a byte count of 10,000.
Referral
The referral is the URL of the HTTP resource that referred the user to the resource that was requested. For example, if a Web client is browsing http://www.ibm.com/index.html and clicks on a link to a secondary page, then the initial page refers the user to the secondary page. The entry in the referral log for the secondary page will list the URL of the first page, http://www.ibm.com/index.html, as its referral.
Agent
The agent is an HTTP client making HTTP requests. It is standard procedure for an HTTP client, such as a Web browser, to identify itself by name when making an HTTP request. The Web server then writes this name in the agent log. When the HTTP client is a Web browser, the browser typically identifies itself with an agent name that indicates the browser name and version and the operating system of the client.
Basic Web metrics
Now that the kind of information contained in the Web server's log has been identified, what value is there in this data and how does a user make sense of it all? Some large Web servers record millions of hits a day -- a tremendous amount of information to process. What does all this data indicate about the traffic to the Web site? What relevant business and design questions can be answered?
Let's start off with some basic queries that can be answered from the raw data in the log file. Later, we will look at more complex queries that can be created by interpreting the data at a higher level. As you will see, there is meaningful information in each field of each record of the log file.
Remotehost (IP Address/subdomain)
In the last section, we learned that a client's IP Address can be translated into a subdomain (and/or a domain) by performing a DNS lookup. Knowing the subdomain that a client comes from can be extremely useful in providing a high level of demographics if user IDs or user cookies are not implemented. Subdomains can also answer questions like:
- How often does a competitor visit my site?
- Are most of my customers from companies or from ISPs?
- Are Web clients visiting my site from work or home?
- What are the top ISPs (for example, AOL.com, Mindspring.com)?
- Are any major universities using my Web site? (Note: University subdomains end with .edu.)
If user IDs or user cookies are not enabled, the IP address is the only way to identify a client: Which IP address has spent the most time at the site? Which IP address had the most visits for the week?
User ID
If present, User ID allows you to identify Web traffic by user, answering questions such as: Which user spent the most money for the month? Maybe that user should be awarded some sort of gift, or is now eligible for preferred customer status.
- Who are the top visitors by hits?
- Who are the top visitors by number of sessions?
Agent
As previously discussed, we can extrapolate the browser and the platform (operating system) being used by the client from the agent field in the log record, which can assist with the following question: What is the top browser that is being used? This might be useful to know because some robust Web server pages display differently, depending on the browser. If a Web page consistently displays improperly with a particular browser and that browser is used by more clients than expected, that page may need to be reconfigured so that it displays properly on the problem browser.
Similarly, if there is a problem platform that does not display a dynamic page properly, that could be identified.
- What is the top version of Netscape being used?
- What is the top version of Internet Explorer (IE) being used?
- How many clients are using a Unix-based OS?
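Questions like these can be approached by tallying agent strings. The sketch below is a rough heuristic, not a definitive parser: agent strings follow no single enforced layout, and the regular expression assumes the "Product/version ... (platform; ...)" shape seen in an agent such as Mozilla/3.0 [en] (WinNT; I). The names AGENT and browser_and_platform are hypothetical.

```python
import re
from collections import Counter

# Assumed shape: Product/version, optional text, then "(platform; ...)".
AGENT = re.compile(r'^(?P<product>[^/\s]+)/(?P<version>\S+).*?\((?P<platform>[^;)]+)')

def browser_and_platform(agent):
    """Return (browser, platform) guessed from an agent string."""
    match = AGENT.match(agent)
    if match is None:
        return ("unknown", "unknown")
    browser = f"{match['product']}/{match['version']}"
    return (browser, match["platform"].strip())

agents = [
    "Mozilla/3.0 [en] (WinNT; I)",
    "Mozilla/3.0 [en] (WinNT; I)",
    "Microsoft Internet Explorer - 5.0",
]
tally = Counter(browser_and_platform(agent) for agent in agents)
print(tally.most_common())
```

Agents that do not fit the assumed shape are counted as unknown rather than guessed at, which keeps the tally honest about what the heuristic could not recognize.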
HTTP status (return code)
Each time the Web server serves a resource, the Web server records a return code in the log file. As we mentioned, a return code status of 200 means that the requested resource was served to the client successfully. A status other than 200 means that there may have been a problem serving the resource to the client. A webmaster needs to know if there is a particular resource that has been requested often but not served to the client.
If the Web site recently changed names and the old name is a redirect, how many clients are still using the old name? Return codes of 301 or 302 would demonstrate this. If a user is still paying to host the old name, how can the user more effectively inform customers of the new name, so as not to incur the extra cost for the old Web site?
Referral
A referral indicates whether or not the client has arrived from another Web site. When users enter http://www.ibm.com in their browser, they will go directly to that page; however, when a user clicks an IBM advertisement on Yahoo.com that is linked to http://www.ibm.com, that would be noted as a referral in the IBM Web server log. Obviously, then, if you are paying to advertise on another Web site but no clients are coming from that site, it may be time to end the relationship. Figure 2 is an example of a referral configuration.
Figure 2: Typical referral configuration
Some questions pertaining to referrals are:
- What are the top referring Web sites?
- What are the top referring pages?
- What are the top referring commercial organizations?
- What are the top referring URLs?
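A ranking of top referring sites can be sketched by reducing each referral URL to its hostname and counting. The sample referral values here are illustrative, the function name referring_site is hypothetical, and a "-" is assumed to mean that no referral was logged.

```python
from collections import Counter
from urllib.parse import urlsplit

def referring_site(referral):
    """Return the hostname of a referral URL, or None if no referral was logged."""
    if referral == "-":
        return None
    return urlsplit(referral).hostname

# Illustrative referral field values from already-parsed log records.
referrals = [
    "http://www.sample.com/index.shtml",
    "http://www.sample.com/index.shtml",
    "http://www.yahoo.com/",
    "-",
]

sites = Counter(site for site in map(referring_site, referrals) if site)
top_sites = sites.most_common(1)
print(top_sites)
```

Counting the full referral strings instead of hostnames answers the related question of top referring pages.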
HTTP request (resource)
Resource measurements provide exact detail of a Web site's traffic, answering questions like:
- What was the total number of hits for the site last month?
- What is the most active day of the week?
- How many times was the site's home page hit last month?
- What are the top pages by time spent?
- What are the top images requested?
- What are the site's top entry pages?
- What are the site's top exit pages?
Bytes
The number of bytes served by the Web server can help users discover the amount of work that the Web server is performing. You can answer questions like:
- How many bytes were served during the sale period?
- How many bytes are served during an average visit?
- How many bytes were served to this user?
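Byte totals are simple sums over the bytes field. The sketch below totals bytes per remote host; the records are illustrative, and it assumes that a "-" in the bytes field means no data was transferred, which you should confirm against your server's documentation.

```python
from collections import defaultdict

def byte_count(field):
    """Convert a logged bytes field to an integer, treating '-' as zero (assumption)."""
    return 0 if field == "-" else int(field)

# Illustrative (remotehost, bytes) pairs from already-parsed log records.
records = [
    ("111.134.115.117", "55"),
    ("122.133.123.112", "47"),
    ("111.109.230.202", "47"),
    ("101.223.240.118", "98"),
    ("101.223.240.118", "50"),
    ("101.223.240.118", "-"),
]

bytes_by_host = defaultdict(int)
for host, field in records:
    bytes_by_host[host] += byte_count(field)

total_bytes = sum(bytes_by_host.values())
print(total_bytes, dict(bytes_by_host))
```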
Advanced Web metrics
The queries and metrics described above are readily available from the raw data in the log file. However, there is much more information that can be extrapolated by using more complex algorithms to interpret the data. In this section, we will look at some complex queries, and the kinds of information that can be gathered from the Web server's log.
Sessionization
The terms "sessionization" and "visits" are interchangeable. Sessionization is the determination of the number of visitors to a Web site. Sessionization algorithms take into consideration several pieces of information when sessionizing each hit, such as IP address, referral address, user agent (browser/platform), and time stamp, as well as a time out value. Each piece of information is prioritized. An algorithm must make a best guess as to where to sessionize a hit. Hit sessionization algorithms have been researched in the industry and several are available.
In fact, due to the variety of algorithms, sessionization standards have been established.
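To make the idea concrete, here is a deliberately simple timeout-based sketch, one of many possible approaches and not a standard algorithm. It assumes a visitor is identified by an (IP address, user agent) pair and that a gap of more than 30 minutes between hits starts a new visit; the names SESSION_TIMEOUT and count_visits are hypothetical.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)   # assumed time-out value

def count_visits(hits):
    """Count visits in an iterable of (visitor_key, timestamp) pairs."""
    last_seen = {}
    visits = 0
    for key, stamp in sorted(hits, key=lambda hit: hit[1]):
        previous = last_seen.get(key)
        # A first hit, or a hit after a long silence, opens a new visit.
        if previous is None or stamp - previous > SESSION_TIMEOUT:
            visits += 1
        last_seen[key] = stamp
    return visits

visitor_a = ("10.0.0.1", "Mozilla/3.0")
visitor_b = ("10.0.0.2", "Mozilla/3.0")
hits = [
    (visitor_a, datetime(1999, 6, 8, 8, 0)),
    (visitor_b, datetime(1999, 6, 8, 8, 5)),
    (visitor_a, datetime(1999, 6, 8, 8, 10)),
    (visitor_a, datetime(1999, 6, 8, 9, 0)),   # 50 minutes later: a new visit
]
visits = count_visits(hits)
print(visits)
```

Production algorithms weigh more signals, such as referral address and cookies, but the time-out comparison shown here is the core of the best guess.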
Categorization
Categorization is the process by which similar entities are grouped together based on pattern matching. This way, data from multiple entities can be combined into groups. For example, if a Web site has content pertaining to cars (file name cars.html) and content pertaining to airplanes (file name airplanes.html), then a category called "Transportation" could be created with patterns for *cars.html and *airplanes.html. Thus, if cars.html has three hits and airplanes.html has four hits, the Transportation category would report seven total hits. Perhaps the user wants a "competitors" subdomain category; the patterns would match the subdomains of the user's competitors.
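The Transportation example can be sketched with shell-style pattern matching from Python's standard library. The dictionary layout and the function name category_totals are hypothetical, and the /index.html entry is added only to show a resource that matches no category.

```python
from fnmatch import fnmatchcase

# Category name -> patterns, following the Transportation example above.
CATEGORIES = {"Transportation": ["*cars.html", "*airplanes.html"]}

def category_totals(hits_per_resource, categories):
    """Sum hits for every resource that matches one of a category's patterns."""
    totals = {name: 0 for name in categories}
    for resource, hits in hits_per_resource.items():
        for name, patterns in categories.items():
            if any(fnmatchcase(resource, pattern) for pattern in patterns):
                totals[name] += hits
    return totals

hits_per_resource = {"/cars.html": 3, "/airplanes.html": 4, "/index.html": 10}
totals = category_totals(hits_per_resource, CATEGORIES)
print(totals)
```

The same function works for a subdomain category if the keys are subdomains and the patterns match competitors' names.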
Aggregation
Aggregation is the process by which all combinations of entities and their resulting measures are combined. For example, a user might want to know how many times a given IP address had a return code that was not 200. The user could create an IP address/return code aggregate combination. Or, if a user wants to know how many times a particular resource had an unsuccessful return code, the user could create a resource/return code combination. To go even further, maybe there is a need to know how many times a specific user requested a particular resource with a particular return code. This would be a user ID/resource/return code combination.
Figure 3 shows all of the aggregate combinations for IP addresses and return codes. It tells you how many times a particular IP Address had a return code for a page that was not in the 200 range (meaning the user had trouble reaching the page). This information is extremely important because it would tell you how many users are trying to reach a page that is not responding properly. This kind of information can help determine how Web site maintenance and repair resources are directed.
Figure 3: Aggregate combinations for IP addresses and return codes
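An IP address/return code aggregate can be sketched as a counter keyed by the pair of values. The hit data below is illustrative only and does not reproduce the figure.

```python
from collections import Counter

# Illustrative (IP address, return code) pairs from already-parsed log records.
hits = [
    ("111.134.115.117", 404),
    ("122.133.123.112", 200),
    ("101.223.240.118", 200),
    ("101.223.240.118", 200),
    ("111.134.115.117", 404),
    ("101.223.240.118", 500),
]

# One count per (IP address, return code) combination.
aggregate = Counter(hits)

# Roll the combinations up to "non-200-range responses per IP address".
failures_by_ip = Counter()
for (ip, status), count in aggregate.items():
    if not 200 <= status <= 299:
        failures_by_ip[ip] += count

print(aggregate)
print(failures_by_ip)
```

Adding a third element to the key, such as the resource or user ID, yields the deeper combinations described above.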
Log files
The log files generated by the HTTP servers in your Web server environment provide information for Web site traffic measurements. HTTP server vendors (for example, Apache, Microsoft Internet Information Server, Netscape, Domino GO, and so on) provide logging capabilities with their products, along with logging configuration options for doing things like enabling and disabling logging and specifying the log file format.
There are several log file formats currently in use in the HTTP server industry. Again, the more prevalent formats are:
- NCSA Common Log Format (access log)
- NCSA Separate Log Format (three-log Format)
- NCSA Combined Log Format
- W3C Extended Log Format
Each hit or HTTP request from a Web client results in an entry in the log file. A log file is made up of multiple entries (or records), each of which has the same format. Below is an excerpt that contains five records, from five hits from a log file of type NCSA Combined Log Format. A discussion of the fields follows.
111.134.115.117 - - [08/Jun/1999:08:36:04 +0000] "GET /index.html HTTP/1.0" 404 55 "-" "Mozilla/3.0 [en] (WinNT; I)"
122.133.123.112 - - [08/Jun/1999:08:36:14 +0000] "GET /images/sp.gif HTTP/1.0" 200 47 "http://www.sample.com/index.shtml" "Mozilla/3.0 [en] (WinNT; I)"
111.109.230.202 - - [08/Jun/1999:08:36:15 +0000] "GET /images/music.gif HTTP/1.0" 200 47 "http://www.sample.com/index.shtml" "Mozilla/3.0 [en] (WinNT; I)"
101.223.240.118 - - [08/Jun/1999:08:36:17 +0000] "GET /images/net.gif HTTP/1.0" 200 98 "http://www.sample.com/index.shtml" "Mozilla/3.0 [en] (WinNT; I)"
101.223.240.118 - - [08/Jun/1999:08:36:19 +0000] "GET /images/header.gif HTTP/1.0" 200 50 "http://www.sample.com/index.shtml" "Mozilla/3.0 [en] (WinNT; I)"
NCSA Common Log Format
The National Center for Supercomputing Applications (NCSA) Common Log Format is a simple format containing only basic HTTP access information. The NCSA Common Log, sometimes referred to as the Access Log, is the first of three logs in the NCSA Separate Log Format. The NCSA log formats are based on NCSA httpd, and are widely accepted as standard among HTTP server vendors. The Common Log Format can also be thought of as the NCSA Combined Log Format without the referral and user agent (which are optional fields in the Combined Log Format). The Common Log contains the requested resource and a few other pieces of information, but does not contain referral, user agent, or cookie information. The information is contained in a single file. The syntax, an example, and descriptions of the various fields follow:
Common Log syntax with an example:
remotehost | logname | username | date | request | status | bytes
125.125.125.125 - dsmith [10/Oct/1999:21:15:05 +0500] "GET /index.html HTTP/1.0" 200 1043
remotehost: 125.125.125.125
The Remotehost is the IP address or host/subdomain name of the HTTP client that made the HTTP resource request.
logname: -
This is an identifier (if available) used to identify the client making the HTTP request. A "-" is used to indicate no logname present.
username: dsmith
The username (or user ID) is used by the client for authentication. A "-" is used to indicate that no username is present.
date: [10/Oct/1999:21:15:05 +0500]
This is the date and time stamp of the HTTP request.
The format of this date/time stamp is as follows:
[dd/MMM/yyyy:hh:mm:ss +0500]
[10/Oct/1999:21:15:05 +0500]
Here dd is the day of the month, MMM is the month, yyyy is the year, hh is the hour, mm is the minute, ss is the second, and +0500 is the time zone offset.
In practice, one will find that the day is typically logged in two-digit format even for single-digit days. For example, the second day of the month would be represented as 02. However, some HTTP servers do log a single digit day as a single digit. When parsing log records, be aware of both possible day representations.
request: "GET /index.html HTTP/1.0"
This is the HTTP request. The request field contains three pieces of information. The main piece is the requested resource (/index.html). The request field also contains the HTTP method (GET) and the HTTP protocol version (HTTP/1.0).
status: 200
The status is the numeric code indicating the success or failure of the HTTP request.
bytes: 1043
The bytes field is a numeric field containing the number of bytes of data transferred as part of the HTTP request, not including the HTTP header.
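The field layout described above can be captured in a regular expression. The following is a minimal sketch, not a complete parser: the names COMMON_LOG and parse_common are hypothetical, fields are assumed to be separated by single spaces, and a "-" in the bytes field is assumed to mean zero bytes.

```python
import re

# One named group per Common Log Format field, in the order described above.
COMMON_LOG = re.compile(
    r'(?P<remotehost>\S+) (?P<logname>\S+) (?P<username>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_common(line):
    """Parse one Common Log Format record into a dict, or return None."""
    match = COMMON_LOG.match(line)
    if match is None:
        return None                      # skip lines that do not fit the format
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

line = ('125.125.125.125 - dsmith [10/Oct/1999:21:15:05 +0500] '
        '"GET /index.html HTTP/1.0" 200 1043')
record = parse_common(line)
print(record)
```

Returning None for unparseable lines lets a batch job count and report bad records instead of stopping on the first one.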
NCSA Separate Log Format (three-log format)
The NCSA Separate Log Format -- sometimes called three-log Format -- refers to a log format in which the information gathered is separated into three separate files (or logs), rather than a single file as is the case with the other formats documented here. The three logs are often referred to as:
- Common log or access log
- Referral log
- Agent log
The three-log Format contains the basic information present in the NCSA Common Log Format in one file, and referral and user agent information in subsequent files. However, no cookie information is recorded in this log format. We will use the same example as above to discuss the three-log Format.
Common or access log
The first of the three logs is the common log, sometimes referred to as the access log, which is identical in format and syntax to the NCSA Common Log Format described above.
Referral log
The referral log is the second of the three logs. The referral log contains a corresponding entry for each entry in the common log. The referral log contains only two fields: a date stamp for correlating the entry with the corresponding entry in the common log, and the referrer, which is the referral address. Below is an example:
date | referrer
[10/Oct/1999:21:15:05 +0500] "http://www.ibm.com/index.html"
- date: This is the date and time stamp of the HTTP request. The date and time of an entry logged in the referral log corresponds to the resource access entry in the common log. As a result, the date and time of corresponding records from each of these logs will be the same. The syntax of the date stamp is identical to the date stamp in the common log.
- referrer: The referrer is the URL of the HTTP resource that referred the user to the resource requested. For example, if a Web client is browsing a page such as http://www.ibm.com/index.html, and clicks on a link to a secondary page, then the initial page has referred the user to the secondary page. The entry in the referral log for the secondary page will list the URL of the first page (http://www.ibm.com/index.html) as its referral. For example, the referral entry may look like the following: [10/Oct/1999:21:15:05 +0500] "http://www.ibm.com/index.html"
Agent log
The agent log is the third of the three logs making up the three-log Format. Like the referral log, the agent log contains a corresponding entry for each entry in the common log. The agent log contains two fields: The first is a date stamp for correlating the entry with the corresponding entry in the common log; the second is the agent name of the HTTP client which made the request for the resource. The file format and description follow:
date | agent
[10/Oct/1999:21:15:05 +0500] "Microsoft Internet Explorer - 5.0"
- date: This is the date and time stamp of the HTTP request. The date and time of an entry logged in the agent log corresponds to the resource access entry in the common log. Because information logged in the agent log supplements information logged in the common log, the date and time of corresponding records from each of these logs will be the same. The syntax of the date stamp is identical to the date stamp in the common log.
- agent: The agent is an HTTP client that makes HTTP requests. It is standard procedure for an HTTP client, such as a Web browser, to identify itself by name when making an HTTP request. It is not required, but most HTTP clients do identify themselves by name. The Web server writes this name in the agent log.
NCSA Combined Log Format
The NCSA Combined Log Format is an extension of the NCSA Common Log Format. The Combined Log Format contains the same information as the Common Log Format, adds the referral and the user agent, and optionally provides cookie information. The information is contained in a single file. A complete example of this format is shown below:
111.134.115.117 - - [08/Jun/1999:08:36:04 +0000] "GET /index.html HTTP/1.0" 404 55 "-" "Mozilla/3.0 [en] (WinNT; I)"
remotehost | logname | username | date | request | status | bytes | referral | agent | [cookies]
111.134.115.117 - - [08/Jun/1999:08:36:04 +0000] "GET /index.html HTTP/1.0" 404 55 "-" "Mozilla/3.0 [en] (WinNT; I)" "USERID=CustomerA;IMPID=01234"
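A Combined Log Format record can be matched with a regular expression that adds the referral, the agent, and an optional trailing cookie field to the Common Log fields. This is a sketch under assumptions: the names COMBINED_LOG and parse_combined are hypothetical, fields are assumed to be separated by single spaces, and the cookie field is assumed to be a final quoted string as in the layout above.

```python
import re

COMBINED_LOG = re.compile(
    r'(?P<remotehost>\S+) (?P<logname>\S+) (?P<username>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referral>[^"]*)" "(?P<agent>[^"]*)"'
    r'(?: "(?P<cookies>[^"]*)")?'        # optional cookie field
)

def parse_combined(line):
    """Parse one Combined Log Format record into a dict, or return None."""
    match = COMBINED_LOG.match(line)
    return match.groupdict() if match else None

plain = ('111.134.115.117 - - [08/Jun/1999:08:36:04 +0000] '
         '"GET /index.html HTTP/1.0" 404 55 "-" "Mozilla/3.0 [en] (WinNT; I)"')
with_cookies = plain + ' "USERID=CustomerA;IMPID=01234"'

print(parse_combined(plain)["cookies"])
print(parse_combined(with_cookies)["cookies"])
```

When the cookie field is absent, the cookies entry is None, so downstream code can tell "not logged" apart from an empty cookie string.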
Cookies
Cookies are pieces of information that the HTTP server can send back to the client along with the requested resources. A client's browser may store this information and subsequently send it back to the HTTP server upon making additional resource requests. In fact, the HTTP server may establish multiple cookies per HTTP request. Cookies take the form KEY = VALUE. Multiple cookie key-value pairs are delimited by semicolons (;). If your HTTP server's logging configuration indicates to log cookies, then each cookie that the HTTP server contains for the requested resource is logged.
The number of cookies used and their functions are under the control of the Web site implementor. One common use of cookies is to identify sessions. Multiple HTTP requests from the same client can each contain the same cookie value. Cookies used to identify sessions are often referred to as session cookies, and the session cookie values are often referred to as session IDs. Another common use of cookies is to identify users. Again, multiple HTTP requests from the same client can each contain the same cookie value (a user ID, for example). Cookies used to identify users are often referred to as user cookies and their values are referred to as user IDs.
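A logged cookie field in the KEY=VALUE form, with pairs separated by semicolons, can be turned into a dictionary with a few lines of Python. The function name parse_cookies is hypothetical, and a "-" is assumed to mean that no cookies were logged.

```python
def parse_cookies(field):
    """Split a logged cookie field into a dict of key/value pairs."""
    cookies = {}
    for pair in field.split(";"):
        key, separator, value = pair.strip().partition("=")
        if separator:                     # ignore fragments without "="
            cookies[key.strip()] = value.strip()
    return cookies

cookies = parse_cookies("USERID=CustomerA;IMPID=01234")
print(cookies)
```

With the cookies in a dictionary, a user cookie such as USERID in this example can serve as the visitor key for user-level metrics.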
W3C Extended Log Format
The W3C Extended Log Format is a flexible, extendable format for recording information about HTTP requests. Like the NCSA formats, this log format will log one entry (or one line) per HTTP request. The W3C Extended Log Format differs from the NCSA formats not only in syntax, though, as it also contains other useful, flexible features. The extended format allows multiple optional fields to be included or excluded independently of each other. Additionally, the extended format allows for special directives to be added to the file which contain information such as remarks or the file format, thereby allowing the log to change format at any time. The #Fields directive indicates the contents of each log entry. Each of the basic field types (for example, Remotehost, date/time, HTTP request, HTTP status, bytes, referral, agent, and cookies) is available in the W3C Extended Log Format. Other types are also supported by the log format.
In addition to entries, the log may also contain directives. In fact, each line in the log file may contain either a directive or an entry. Directives contain information about the log itself. A line beginning with a "#" character is a directive line. Valid directives include the following:
- Version: <integer>.<integer> The version of the extended log file format used.
- Fields: [<specifier>...] Lists of field identifiers specifying the information recorded in each entry.
- Software: string Identifies the software that generated the log.
- Start-Date: <date> <time> The date and time at which the log was started.
- End-Date: <date> <time> The date and time at which the log was finished.
- Date: <date> <time> The date and time at which the entry was added.
- Remark: <text> Any remarks or comments.
Log entries contain information about HTTP requests. Fields are separated by spaces, and the dash character (-) is used as the place holder for a field in which no data is available to log.
The following is a simple excerpt from a log file of type W3C Extended Log Format containing only a few fields. The #Fields directive indicates the contents of each log entry:
#Version: 1.0
#Date: 12-Jan-1996 00:00:00
#Fields: time cs-method cs-uri
00:34:23 GET /foo/bar.html
12:21:16 GET /foo/bar.html
12:45:52 GET /foo/bar.html
12:57:34 GET /foo/bar.html
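Because the #Fields directive names the columns, a reader for this format can build each entry as a dictionary keyed by field identifier. The sketch below is a simplified illustration: the function name parse_extended is hypothetical, and it assumes that no field value contains spaces, so quoted strings with embedded spaces are not handled.

```python
def parse_extended(lines):
    """Parse W3C Extended Log Format lines into a list of dicts."""
    fields = []
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Directive line: "#Name: value". Only Fields changes how we parse.
            name, _, value = line[1:].partition(":")
            if name.strip() == "Fields":
                fields = value.split()
            continue
        entries.append(dict(zip(fields, line.split())))
    return entries

log_lines = [
    "#Version: 1.0",
    "#Date: 12-Jan-1996 00:00:00",
    "#Fields: time cs-method cs-uri",
    "00:34:23 GET /foo/bar.html",
    "12:21:16 GET /foo/bar.html",
    "12:45:52 GET /foo/bar.html",
    "12:57:34 GET /foo/bar.html",
]
entries = parse_extended(log_lines)
print(entries[0])
```

Re-reading the field list whenever a new #Fields directive appears is what lets the log change format partway through a file.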
Conclusion
As you can see, there are several technological approaches to measuring Web traffic. Analyzing HTTP logs is a common approach often adopted by companies interested in analyzing their sites' usage traffic. For starters, HTTP logging support is likely already available as part of the HTTP software package already being used in Web site deployment. From there one must select an analysis software product or service provider, and then decide what metrics to collect. It is our hope that these articles will help you understand the HTTP server log analysis approach. Happy analyzing!
Glossary of terms
Browser: The Web browser used by a visitor to access the Web site.
Bytes transferred: The number of bytes transferred to the client Web browser as a result of a request.
Domain: The unique name that identifies an Internet site (EDU, ORG, COM).
Duration: The amount of time spent on a page, in seconds.
Duration per visit: The amount of time spent in a given visit, in seconds.
Entry resource: The first resource viewed as part of a visit.
Exit resource: The last resource viewed as part of a visit.
Hit: A browser request for any one item, such as a page, graphic, or other resource. It may take several hits to bring up a single Web page as displayed in a browser.
Hits per visit: The number of hits occurring in a given visit.
Page view: The number of deliberate requests to a given URL. For example, one Web page that contains three frames and 12 artwork files would generate one page view, but 15 hits. This calculation is an approximation based on the time, sequence, and referral page from which various resources were requested.
Page views per visit: The number of page views occurring in a given visit.
Platform: The operating system (for example, AIX, Windows NT, and so on) used by a visitor to access the Web site.
Referral: The resource from which a visitor requests another resource, expressed as a URL.
Resource: An item that can be requested by a Web browser (for example, HTML files, artwork files, and so on).
Return code: The result status of an HTTP request that indicates the success or failure of the request.
Server error: An error occurring at the server while processing a client's request.
Subdomain: The text name of the item to the left of the domain (ibm.com, microsoft.com).
User Agent: The browser and platform used by a visitor to access the Web site.
Visit: A continuous period of activity by one visitor to a Web site, usually with no more than 30 minutes between requests. A visit can also be referred to as a session.
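Several of the terms above (visit, hits per visit, entry resource, exit resource) can be derived from raw hits once they are grouped into visits. The following Python sketch shows one way to do that grouping. It rests on assumptions that are not part of the definitions: the visitor is identified by IP address alone, a gap of more than 30 minutes between requests starts a new visit, and the names group_into_visits and SESSION_TIMEOUT are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, List, Tuple

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold

Hit = Tuple[str, datetime, str]  # (visitor id, request time, resource)


def group_into_visits(hits: List[Hit]) -> Dict[str, List[List[str]]]:
    """Group each visitor's hits into visits, split on long gaps between requests."""
    by_visitor: Dict[str, List[Tuple[datetime, str]]] = defaultdict(list)
    for visitor, ts, resource in hits:
        by_visitor[visitor].append((ts, resource))

    visits: Dict[str, List[List[str]]] = {}
    for visitor, events in by_visitor.items():
        events.sort(key=lambda e: e[0])  # order each visitor's hits by time
        sessions: List[List[str]] = []
        last_ts = None
        for ts, resource in events:
            # Start a new visit on the first hit or after a long period of inactivity.
            if last_ts is None or ts - last_ts > SESSION_TIMEOUT:
                sessions.append([])
            sessions[-1].append(resource)
            last_ts = ts
        visits[visitor] = sessions
    return visits


day = datetime(2001, 3, 1)
hits = [
    ("10.0.0.1", day.replace(hour=9, minute=0), "/index.html"),
    ("10.0.0.1", day.replace(hour=9, minute=0, second=1), "/logo.gif"),
    ("10.0.0.1", day.replace(hour=9, minute=10), "/products.html"),
    ("10.0.0.1", day.replace(hour=10, minute=15), "/index.html"),
    ("10.0.0.2", day.replace(hour=9, minute=5), "/index.html"),
]

visits = group_into_visits(hits)
for visitor, sessions in visits.items():
    for session in sessions:
        # Entry resource is the first hit of the visit; exit resource is the last.
        print(visitor, "hits:", len(session), "entry:", session[0], "exit:", session[-1])
```

In this sample data the first visitor produces two visits, because 65 minutes pass between the request for /products.html and the later request for /index.html.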
Two solutions from IBM
As discussed in this article, HTTP server log file analysis can be an effective method for monitoring a Web site's traffic. IBM offers two solutions based on this approach: IBM WebSphere Site Analyzer and IBM Surfaid Analytics.
IBM WebSphere Site Analyzer
IBM WebSphere Site Analyzer is an installable product that analyzes Web server logs in NCSA Combined Log Format, NCSA Separate Log Format (three-log format), and W3C Extended Log Format. IBM WebSphere Site Analyzer, V3.5 provides analysis for enterprise Web site visitor trends, usage and content, and WebSphere Commerce Suite reporting. WebSphere Site Analyzer can help you make factual e-business decisions. You can use WebSphere Site Analyzer to detect visitor trends and preferences, manage Web site content and structure, and improve the overall effectiveness of Web initiatives and campaigns.
Site Analyzer is available on a wide variety of platforms and supports multiple configurations. It offers both sample report queries and a Query Builder that allows custom SQL queries to be built. Site Analyzer also offers categorization and aggregation of data, which give the user a powerful and detailed view of the processed data. Site Analyzer provides a robust charting utility. Charts are created as GIF files, which are included in the final reports (rendered in HTML format). With Site Analyzer you can publish the HTML reports to a Web server for viewing by multiple users. More information on Site Analyzer can be found at: http://www.ibm.com/software/webservers/siteanalyzer. Figure 4 shows the analysis process of Site Analyzer. Data is collected from various Web servers in a central webmart (database), where it can be manipulated and reported on.
Figure 4: An example of Web analytics reporting extended using OLAP and data mining
IBM Surfaid Analytics
IBM Surfaid Analytics is IBM's premier Application Service Provider (ASP) offering. In an ASP offering, customers either send their HTTP logs to the service provider or allow the service provider to access their Web server logs. The service provider then analyzes the server logs and provides reports back to its customers. IBM Surfaid Analytics has provided Web analysis for such events as the US Olympics and the Grammy Awards. For more information on Surfaid, visit the Surfaid site (see Resources), call 817-62-SURF, or send e-mail to: surfaid@us.ibm.com. Figure 5 illustrates the IBM Surfaid Analytics process:
Figure 5: IBM's Surfaid Analytics process
Trademarks
The following are trademarks of International Business Machines Corporation: IBM, OS/2, WebSphere.
Java and all Java-based trademarks and logos are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both.
Microsoft, Windows, Windows NT, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both.
UNIX is a registered trademark of the Open Group in the United States and other countries.
All other marks are the property of their respective owners.
© IBM Corporation 2001. All rights reserved.
Resources
About the authors
Andrei Malacinski is a software engineer for IBM in the Web software development area. He works at the IBM Application Integration Middleware Lab in Research Triangle Park, NC. Andrei has been both a developer and team lead on a number of IBM application development products for Windows, OS/2, and UNIX platforms. He is currently the team lead and lead developer of the IBM WebSphere Site Analyzer. He can be reached at malacins@us.ibm.com.
Scott Dominick is a software engineer for IBM in the IBM Application Integration Middleware Lab at Research Triangle Park, NC. He received his Bachelor's degree from North Carolina State University in 1992. He is currently working on the IBM WebSphere Site Analyzer in the WebSphere Tools Development area. He can be reached at scottdom@us.ibm.com.
Tom Hartrick is currently a software engineering manager for the IBM WebSphere Site Analyzer product, working at the IBM Application Integration Middleware Lab in Research Triangle Park, NC. Tom received his Bachelor's degree in Computer Science from Rochester Institute of Technology, and has previously been development manager for WebSphere Application Server as well as other software projects. He can be reached at thartric@us.ibm.com.