Herodotus: A Peer-to-Peer Web Archival System

Ender, The Duke of URL spotted An Interesting MIT Paper [That’s a PDF, Google has an HTML Cache As Well] that describes the design and implementation of Herodotus, a peer-to-peer web archival system. Like the Wayback Machine, a website that currently offers a web archive, Herodotus periodically crawls the World Wide Web and stores copies of all downloaded web content. Unlike the Wayback Machine, Herodotus does not rely on a centralized server farm. Instead, many individual nodes spread out across the Internet collaboratively perform the task of crawling and storing the content. This allows a large group of people to contribute idle computer resources to jointly achieve the goal of creating an Internet archive.

Herodotus uses replication to ensure the persistence of data as nodes join and leave. It is implemented on top of Chord, a distributed peer-to-peer lookup service, and is written in C++ on FreeBSD. Their analysis, based on an estimated size of the World Wide Web, shows that a set of 20,000 nodes would be required to archive the entire web, assuming that each node has a typical home broadband Internet connection and contributes 100 GB of storage.
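To give a feel for the Chord layer the paper builds on: Chord hashes both node addresses and keys (here, URLs) onto a circular identifier space, and a key is stored at its "successor", the first node whose ID follows the key's on the ring. Below is a minimal, hypothetical C++ sketch of that idea, with replicas placed on the next few successors as DHT storage layers commonly do; std::hash stands in for Chord's SHA-1, and the node names and replica count are illustrative, not taken from the paper.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

using NodeId = uint64_t;

// Placeholder for the SHA-1 hashing real Chord uses to place IDs on the ring.
NodeId chord_id(const std::string& s) {
    return std::hash<std::string>{}(s);
}

// Given a ring of node IDs -> node addresses, return the key's successor and
// the following k-1 nodes, a common way DHT storage layers choose replica sets.
std::vector<std::string> replica_set(const std::map<NodeId, std::string>& ring,
                                     const std::string& url, size_t k) {
    std::vector<std::string> out;
    if (ring.empty()) return out;
    auto it = ring.lower_bound(chord_id(url));   // first node at or after the key
    for (size_t i = 0; i < k && i < ring.size(); ++i) {
        if (it == ring.end()) it = ring.begin(); // wrap around the ring
        out.push_back(it->second);
        ++it;
    }
    return out;
}

int main() {
    // Hypothetical node addresses joining the ring.
    std::map<NodeId, std::string> ring;
    for (const std::string& node : {"node-a:4000", "node-b:4000", "node-c:4000", "node-d:4000"})
        ring[chord_id(node)] = node;

    // The node responsible for a crawled page, plus two replicas on its successors.
    for (const std::string& holder : replica_set(ring, "http://example.com/index.html", 3))
        std::cout << holder << "\n";
}
```

Because responsibility for a URL is determined purely by hashing, any node can work out where a page belongs without a central coordinator, which is what lets the crawling and storage work be spread across many independently operated machines.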