A Longer Look At The LISNews Numbers

Submitted by Blake on Mon, 11/22/2004 - 15:50

Every month I post some basic statistics about LISNews. I thought I'd write something up that explains what those numbers mean and where they come from, to help everyone understand not just the LISNews numbers, but web stat numbers in general.

Here's a well known secret about web stats: they're less than perfect. While they may be less than perfect, they're all we have to go on. Without using some of the more invasive methods like cookie pushing, we just can't be quite sure who's visiting our site, where they came from, what they did while here, and why they left. These are the things that drive webmasters crazy. To really be able to design a usable and user-friendly site it helps to know why people come to your site, how they found it, what they used it for, and why they left. These are hard questions to answer when all you have is a line of text that looks like this:

888.665.555.1212 - - [07/Nov/2004:05:32:14 -0500] "GET /images/layout/slc.gif HTTP/1.1" 200 139 "http://www.lisnews.com/articles/04/06/30/2012205.shtml?tid=30" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

They say "write what you know" so I'll be focusing on LISNews, and the stats package we use at LISHost, Urchin. First let's quickly look at where most websites normally get their numbers from. Most server log files look something like this:

888.665.555.1212 - - [07/Nov/2004:05:32:14 -0500] "GET /images/layout/slc.gif HTTP/1.1" 200 139 "http://www.lisnews.com/articles/04/06/30/2012205.shtml?tid=30" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

From this we can learn a few things: the IP address of our visitor, which will often lead us back to their domain as well, so we can often figure out where in the world that computer was located; the date and time the hit was recorded; the type of request and the protocol used; the server status code returned; the referring page; and finally the type of browser used to make the request. All of this on a single line. Each time a file is requested from the server, a similar line is written to the log file. For all but the most dedicated administrators, these lines are fed to programs that work through them and present the numbers in a more manageable and readable format. Before I talk about stats packages I'll break down a log file line and cover what we learn from each bit.

Basically we have a mini text database of all requests served by our web server program, Apache. Each request gets its own line, and within a line each bit of information we can use is separated by a space (a dash marks an empty field). Let's take a look at our example:

888.665.555.1212: This is the user's IP address (in reality this is an IP address I made up). It tells us where in the world this request was made from, that is, where we sent this file. Using a DNS server it is often possible to look up the requesting computer's domain name as well.

[07/Nov/2004:05:32:14 -0500]: Next we have the date and time the request was recorded, along with the server's offset from GMT.

"GET /images/layout/slc.gif HTTP/1.1" Now something not as easy read. GET, followed by /images/layout/slc.gif followed by HTTP/1.1. This tells us the server returned a GET request to a browser, and the name of the file, followed by the protocol it used to make the transfer.

200 139: Now a couple of seemingly random numbers pop up to confuse us even more. The first is a status code; servers have a standard set of numbers they use to report how things are going. A 404 means the file is missing, 200 means all is well, and there are several other possibilities as well. The second number is the size, in bytes, of the file that was sent.

"http://www.lisnews.com/articles/04/06/30/2012205.shtml?tid=30" A rather standard looking URL. This is the referral, that page that was used to find whatever page was just served.

"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)" And last, but not least, the browser. This is a good place to learn how many search engines are hitting your site. Often you'll see Googlebot as a type of browser.

So stack a few million similar lines on top of each other, one for each hit, and you've got yourself an exact representation of what's going on with your site. Better yet, sit there and just watch the hits roll in by using a few command line programs (tail -f access_log, or perhaps tail -f access_log | grep -v .gif to filter out the images) to get a feel for what's happening in real time. While it's not fair to say these numbers are misleading, it may be more accurate to say these numbers can be misleading if left unanalyzed.

Remember, each line represents any request made to the web server for any reason. An impatient reader could have reloaded a single page 6 times trying to make it move along quicker; an out of control bot could've gotten stuck in an endless loop and loaded the same 2 pages 10,000 times. Stranger things have happened, and we've seen many of them here @LISNews. So to get an accurate count of how many people are visiting the site on any given day, it takes a combination of human powered brains and computer powered muscle.
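Spotting that out-of-control bot is mostly a matter of counting hits per IP and eyeballing anything far above the rest. A quick sketch, using a hypothetical traffic mix and an arbitrary threshold:

```python
from collections import Counter

def top_hitters(ips, threshold=1000):
    """Return (ip, hit_count) pairs for addresses at or above the threshold,
    busiest first. A runaway bot will stand out immediately."""
    counts = Counter(ips)
    return [(ip, n) for ip, n in counts.most_common() if n >= threshold]

# Hypothetical month of traffic: one stuck bot plus a few normal readers.
ips = ['9.9.9.9'] * 5000 + ['1.2.3.4'] * 40 + ['5.6.7.8'] * 12
print(top_hitters(ips))  # [('9.9.9.9', 5000)]
```

The human-powered part is deciding what to do with what the computer finds: is 5,000 hits a bot, a proxy server for a whole library system, or just a very dedicated reader?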

Urchin is the computer power behind most of the numbers I share for LISNews each month. I thought it might be interesting to first look at Urchin and how it does things, and then follow that up with a comparison to a popular (free) stats package to see how the two systems crunch the same numbers. Remember, all stats packages do things a bit differently, so your site may report numbers in a completely different way depending on how things are recorded and measured.

So let's have a look at Urchin. Urchin is expensive, powerful, pretty, and quite darn nice, if you ask me. You can take a tour at Urchin to see what it looks like; we're currently running version 5.7. Urchin is probably one of the more expensive log analyzers out there, but it's also one of the nicest. One of its best features is the SVG graphs. Scalable Vector Graphics (SVG) allow some nice interaction with the graphs generated by the system. SVG currently has limited support in Firefox, so you may find yourself firing up IE to look at some of the more interesting charts.

One of the more important things to consider with any stats package is how it defines certain terms that are of central importance when digging through your stats: pageviews, hits, sessions, visitors. These are the important numbers, and they are also numbers that are open to some interpretation. Here is how Urchin, and therefore LISNews, defines some of the numbers I report each month.

Hit - A hit is simply any request to the web server for any type of file. This can be an HTML page, an image (jpeg, gif, png, etc.), a sound clip, a cgi script, and many other file types. An HTML page can account for several hits: the page itself, each image on the page, and any embedded sound or video clips. Therefore, the number of hits a website receives is not a valid popularity gauge, but rather is an indication of server use and loading.

Pageview - A page is defined as any file or content delivered by a web server that would generally be considered a web document. This includes HTML pages (.html, .htm, .shtml), script-generated pages (.cgi, .asp, .cfm, etc.), and plain-text pages. It also includes sound files (.wav, .aiff, etc.), video files (.mov, etc.), and other non-document files. Only image files (.jpeg, .gif, .png), javascript (.js) and style sheets (.css) are excluded from this definition. Each time a file defined as a page is served, a pageview is registered by Urchin.

Page - Also known as a web page, a page is defined as a single file delivered by a web server that contains HTML or similar content. Any file that is not specifically a GIF, JPEG, PNG, JS (javascript), or CSS (style sheet) is considered a page.
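The page/pageview rule above boils down to a simple filename test: anything that isn't an image, javascript, or a style sheet counts. A minimal sketch of that rule (the extension list follows the definitions above; treat it as illustrative, not Urchin's exact internals):

```python
# File extensions that do NOT register as pageviews, per the definition above.
NON_PAGE_EXTENSIONS = {'.gif', '.jpg', '.jpeg', '.png', '.js', '.css'}

def is_pageview(path):
    """True if a request for this path would count as a pageview."""
    path = path.split('?')[0].lower()   # drop any query string, ignore case
    return not any(path.endswith(ext) for ext in NON_PAGE_EXTENSIONS)

print(is_pageview('/article.pl?sid=04/11/22'))   # True  - script-generated page
print(is_pageview('/images/layout/slc.gif'))     # False - image, hit only
```

So our example log line, a request for slc.gif, would count as a hit but not as a pageview.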

Session - A Session is a defined quantity of visitor interaction with a website. The definition will vary depending on how Visitors are tracked. Some common visitor tracking methods and corresponding Session definitions:
• IP-based Visitor Tracking: A Session is a series of hits from one visitor (as defined by the visitor's IP address) wherein no two hits are separated by more than 30 minutes. If there is a gap of 30 minutes or more from this visitor, an additional Session is counted.
• IP+User Agent Visitor Tracking: A Session is a series of hits from one visitor (as defined by the visitor's IP address and user-agent, such as Netscape 4.72) wherein no two hits are separated by more than 30 minutes. If there is a gap of 30 minutes or more from this visitor, an additional Session is counted.
• Unique Visitor Tracking (cookie-based, such as Urchin's UTM): A Session is a period of interaction between a visitor's browser and a particular website, ending upon the closure of the browser window or shut down of the browser program.
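The IP-based definition above is easy to sketch in code: sort the hits by time, and any gap of more than 30 minutes from the same address starts a new session. A minimal illustration with made-up timestamps:

```python
SESSION_GAP = 30 * 60  # the 30-minute rule, in seconds

def count_sessions(hits):
    """Count IP-based sessions. hits is an iterable of (ip, unix_timestamp)
    pairs; a new session starts whenever an IP has been quiet > 30 minutes."""
    last_seen = {}
    sessions = 0
    for ip, ts in sorted(hits, key=lambda h: h[1]):
        if ip not in last_seen or ts - last_seen[ip] > SESSION_GAP:
            sessions += 1
        last_seen[ip] = ts
    return sessions

# One visitor reads three pages quickly, comes back over an hour later;
# a second visitor makes a single request: 3 sessions total.
hits = [('1.2.3.4', 0), ('1.2.3.4', 60), ('1.2.3.4', 120),
        ('1.2.3.4', 3900), ('5.6.7.8', 10)]
print(count_sessions(hits))  # 3
```

The IP+user-agent variant is the same logic keyed on the (ip, agent) pair instead of the IP alone, which helps separate multiple people sharing one address.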

Visitor - A Visitor is a construct designed to come as close as possible to defining the number of actual, distinct people who visited a website. There is of course no way to know if two people are sharing a computer from the website's perspective, but a good visitor-tracking system can come close to the actual number. The most accurate visitor-tracking systems generally employ cookies to maintain tallies of distinct visitors.

So for example, this month (Nov 2004) Urchin says the following:
Total Sessions:   42,961
Total Pageviews: 143,518
Total Hits:      412,331

Does that mean we really had almost 43,000 visitors? No. Does it mean we came close to that? Well, that depends on what "close" means. Having 43,000 sessions in a month means that somewhere in the neighborhood of maybe 30,000 folks dropped in for a visit. If your sister-in-law stopped over at your house 6 times last month, you could say she was responsible for 6 sessions at your house. You didn't have 6 different visitors, but rather one person came over 6 times and did something to annoy you (or maybe that's just me). One question that has always nagged at me is just how many unique and different people view a page using human eyes on any given day @LISNews. This number is almost impossible to answer. I'll actually go so far as to say it IS impossible to answer accurately; we can only make an educated guess.

To get a more accurate count of actual human visitors to the site we'd first need to use IP addresses as the most accurate gauge of uniqueness. So far this month Urchin says we've had 9,681 unique IP addresses. The top 10 addresses were responsible for about 18% of all sessions, that's about 7,700 out of about 43,000. This is still an imperfect measurement thanks to caching servers, DHCP, and firewalls, but it's probably the best we can do. Then, we'd need to subtract out as many bots, search engines, feed readers, and anything else we can identify as non-human. Then we'd be left with a more accurate measurement of people who viewed at least a page @LISNews. This month, after a considerable amount of addition and subtraction, I've made an educated guess that approximately 25% of IP addresses that access LISNews are some sort of automated computerized non-humanoid thingy.
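The back-of-the-envelope math above is simple enough to spell out. Starting from the unique IP count and subtracting the estimated bot share (both numbers are the educated guesses from the text, not precise measurements):

```python
unique_ips = 9681      # unique IP addresses this month, per Urchin
bot_fraction = 0.25    # rough share judged to be bots, crawlers, feed readers

# Rough ceiling on distinct human visitors this month.
estimated_humans = round(unique_ips * (1 - bot_fraction))
print(estimated_humans)  # 7261
```

So the honest answer to "how many people read LISNews this month?" is something like "probably around seven thousand, give or take," which is a long way from 43,000 sessions.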

My guess would be that net ratings companies like Media Metrix and Nielsen have similar, if not worse, issues with reliability and consistency. The web was not designed to do much of what we use it for today. Keeping accurate counts of visitors is just one more example of how we have improvised any number of hacks on our way to a better system.

The standard Apache web logs do an excellent job of telling us about what pages and files were most popular. LISNews has served almost 143,000 pages so far this month. The most popular being:
      Page                    Pageviews   % of total
 1.   /article.pl                20,087       14.00%
 2.   /lisnews.rss               14,295        9.96%
 3.   /                           8,863        6.18%
 4.   /rss/descriptions.rss       5,315        3.70%
 5.   /comments.pl                4,659        3.25%
 6.   /index.rss                  3,847        2.68%
 7.   /journal.pl                 3,457        2.41%
 8.   /rss/popular.rss            3,175        2.21%
 9.   /robots.txt                 3,020        2.10%
10.   /article.php3               2,722        1.90%
      Total                      69,440       48.38%

So the top 10 pages were responsible for almost half of all pages served. Another strength of the Slashcode is its ability to store a large amount of other data in the Slash database. There's an entire second set of more accurate numbers that I'll write up some day: things like who posts the most comments, how many people have logged in this month, and how many comments were moderated up to 5.

I'll write up some comparisons between Urchin and some other stats packages, and really crunch some numbers at a later date.