Hyperactive spiders: Google's no longer number one

I've always been astonished at the sheer number of Googlebot hits at Cites & Insights--averaging 30 a day for a site that has new content around once every four weeks.

(I eventually realized that Googlebot may be crawling the entire site each time, so that it's really more like 60 hits done once every couple of days...still quite a lot.)

In previous statistics, Googlebot was always way out ahead of any other spider.

That's no longer true. Beginning last December (I think), and continuing strong since then, there's a new champion for hyperactivity: Inktomi Slurp.

Actually, for the month of January 2005, Googlebot's third. Here's what I see:

  • Inktomi Slurp: 1420 hits
  • MSN Robot: 479 hits--yes, MSN is getting serious about search
  • Googlebot: 446 hits

After that, it drops rapidly: Gigablast Robot with 85, Turnitin Robot (!?!) with 75, FAST Enterprise Crawler with 68...and 40-odd others, down to SKIZZLE! Distributed Internet Spider and seven others with one each.

In all, spiders seem to account for just over 10% of the hits--but, fortunately, only about 1.5% of the unique visitors.


Here's how the numbers shake down for LISNews, according to Urchin.

Bloglines 68,108
NewzCrawler 54,072
msnbot 43,920
Googlebot 42,249
YahooFeedSeeker 24,933
NetNewsWire 10,136
SharpReader 9,970
Mediapartners Google 9,712
NewsGatorOnline 9,663

Those are hits for the month of January. #1 by far was listed as "Mozilla Compatible Agent" at 299,695 hits for the month; I'm not even sure what that means. There's also a bunch of generic stuff in there that I just left out.

The Bloglines one is interesting. Bloglines is supposed to restrict its checking to once an hour--but it's probably treating each journal with at least one subscriber as a separate blog. So, let's see, adding the feed for LISNews itself, that suggests that 91 journals currently have at least one Bloglines subscriber, and that one of those journals only showed up about halfway through January.

Or not.
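
For anyone who wants to check the Bloglines guess, here is a rough sketch of the arithmetic in Python. It assumes Bloglines really does request each feed exactly once an hour, every hour of a 31-day January, which is only the "supposed to" behavior mentioned above and not something the logs confirm. The variable names are made up for illustration.

```python
# Back-of-the-envelope check of the Bloglines estimate.
# Assumption: one request per feed per hour, all month long.
BLOGLINES_HITS = 68_108      # January hits, from the Urchin list above
HOURS_IN_JANUARY = 31 * 24   # 744 hours

# Feeds polled for the whole month, plus leftover hits that
# don't add up to a full month of hourly polling.
whole, remainder = divmod(BLOGLINES_HITS, HOURS_IN_JANUARY)

# Same figure expressed as a fractional number of feeds.
feeds = BLOGLINES_HITS / HOURS_IN_JANUARY

print(f"full-month feeds: {whole}")
print(f"leftover hits: {remainder} "
      f"({remainder / HOURS_IN_JANUARY:.0%} of a month)")
print(f"estimated feeds: {feeds:.2f}")
```

Under that assumption the hits divide into 91 full-month feeds with 404 left over, a bit more than half a month of hourly polling for one more feed. That is where the "91, with one arriving about halfway through January" reading comes from. If Bloglines polls more or less often than hourly, the estimate scales accordingly.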

What is surprising about the Turnitin robot? That it's hitting your site that much (or at all), or that it exists?

Some people have questioned the legality of the Turnitin Robot's crawling, given that it's using your content for profit without asking permission.

I agree. I wish more eyes were looking at what Turnitin is doing. The thing I'm wondering is, should I advise students to start putting creative copyright licenses on their papers before turning them in?
Or, for that matter, should I put these on my own papers?

That it's hitting my site that much. I can't imagine anyone plagiarizing from Cites & Insights (but I don't have a vivid imagination).

Given the CC license I use, my only complaint with Turnitin would be if they were republishing my stuff and selling it. Using my stuff as part of their database: I don't see a CC problem with that, and I don't have a personal problem with it.

You are far more generous and forgiving than I, Walt.