Herodotus: A Peer-to-Peer Web Archival System

Ender, The Duke of URL spotted An Interesting MIT Paper [That’s a PDF, Google has an HTML Cache As Well] that describes the design and implementation of Herodotus, a peer-to-peer web archival system. Like the Wayback Machine, a website that currently offers a web archive, Herodotus periodically crawls the World Wide Web and stores copies of all downloaded web content. Unlike the Wayback Machine, Herodotus does not rely on a centralized server farm. Instead, many individual nodes spread out across the Internet collaboratively perform the task of crawling and storing the content. This allows a large group of people to contribute idle computer resources to jointly achieve the goal of creating an Internet archive.

Herodotus uses replication to ensure the persistence of data as nodes join and leave. It is implemented on top of Chord, a distributed peer-to-peer lookup service, and is written in C++ on FreeBSD. Their analysis, based on an estimated size of the World Wide Web, shows that a set of 20,000 nodes would be required to archive the entire web, assuming that each node has a typical home broadband Internet connection and contributes 100 GB of storage.
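To give a feel for the Chord layer the paper builds on: Chord hashes both node addresses and keys (here, URLs) onto a circular identifier space, and a key is stored at its "successor", the first node whose ID follows the key's on the ring. Below is a minimal, hypothetical C++ sketch of that idea, with replicas placed on the next few successors as DHT storage layers commonly do; std::hash stands in for Chord's SHA-1, and the node names and replica count are illustrative, not taken from the paper.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using NodeId = uint64_t;

// Placeholder for the SHA-1 hashing real Chord uses to place IDs on the ring.
NodeId chord_id(const std::string& s) {
    return std::hash<std::string>{}(s);
}

// Given a ring of node IDs -> node addresses, return the key's successor and
// the following k-1 nodes, a common way DHT storage layers choose replica sets.
std::vector<std::string> replica_set(const std::map<NodeId, std::string>& ring,
                                     const std::string& url, size_t k) {
    std::vector<std::string> out;
    if (ring.empty()) return out;
    auto it = ring.lower_bound(chord_id(url));   // first node at or after the key
    for (size_t i = 0; i < k && i < ring.size(); ++i) {
        if (it == ring.end()) it = ring.begin(); // wrap around the ring
        out.push_back(it->second);
        ++it;
    }
    return out;
}

int main() {
    // Hypothetical node addresses joining the ring.
    std::map<NodeId, std::string> ring;
    for (const std::string& node : {"node-a:4000", "node-b:4000", "node-c:4000", "node-d:4000"})
        ring[chord_id(node)] = node;

    // The node responsible for a crawled page, plus two replicas on its successors.
    for (const std::string& holder : replica_set(ring, "http://example.com/index.html", 3))
        std::cout << holder << "\n";
}
```

Because responsibility for a URL is determined purely by hashing, any node can work out where a page belongs without a central coordinator, which is what lets the crawling and storage work be spread across many independently operated machines.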