Finally! A Use For That 80 Terabyte Thumb Drive You Didn't Know What To Do With

80 terabytes of archived web crawl data available for research
Internet Archive crawls and saves web pages and makes them available for viewing through the Wayback Machine because we believe in the importance of archiving digital artifacts for future generations to learn from. In the process, of course, we accumulate a lot of data.

We are interested in exploring how others might be able to interact with or learn from this content if we make it available in bulk. To that end, we would like to experiment with offering access to one of our crawls from 2011, which consists of about 80 terabytes of WARC files containing captures of about 2.7 billion URIs. The files contain text content and any media that we were able to capture, including images, Flash, and video.
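To get a feel for the scale, here is a back-of-envelope sketch of the average stored size per capture. It uses only the two approximate figures quoted above. It assumes that "terabyte" means 10^12 bytes, and the function name is purely illustrative.

```python
# Rough estimate only: both inputs are the approximate figures from the announcement.
TOTAL_BYTES = 80 * 10**12   # assumption: "80 terabytes" taken as decimal terabytes
TOTAL_URIS = 2.7 * 10**9    # "about 2.7 billion URIs"


def average_capture_size(total_bytes: float, total_uris: float) -> float:
    """Return the mean number of stored bytes per captured URI."""
    if total_uris <= 0:
        raise ValueError("total_uris must be positive")
    return total_bytes / total_uris


avg_bytes = average_capture_size(TOTAL_BYTES, TOTAL_URIS)
print(f"~{avg_bytes / 1000:.1f} KB per capture on average")
```

With these inputs the estimate comes out to roughly 30 KB per captured URI. If the WARC files are stored compressed, as they often are, this reflects compressed size rather than the size of the original pages and media.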
