World’s Largest Library?

Feedmag has an interview with Brewster Kahle, the founder of Alexa, who has built what the company claims is the largest library in the world. Alexa has collected thirty terabytes of data on its servers, archiving both the web itself and the patterns of traffic flowing through it. It’s interesting to see what he considers a library, and how much it costs to catalog a book (hint: he says that’s a bad thing).

“In just three years we got bigger than the Library of Congress, the biggest library on the planet,” he says, arms outstretched, smiling. “So the question is: What do we do now?”

More from Feedmag

We now have about thirty terabytes of archival material that we data mine. And that’s 1.5 times the size of all of the books in the Library of Congress. So we’re now at an interesting point: we’re now beyond the largest collection of information ever accumulated by humans. We’ve gotten somewhere! [laughs] We use as our original inspiration the Library of Alexandria. Because they were the first people that tried to collect it all. And they started to actually understand the intersection between completely different self-consistent belief systems. They knew what the Egyptians, Romans and Greeks, Hebrews, Hittites, Sumerians, Babylonians: they knew the mythologies, because they had it all in one place. And they had the scholars to stare at it and try to make the disjunctions conjunctions and start to get an idea of what humans are. The dream is that we’re in another one of those positions. They got up to five hundred thousand books. Of course, they were scrolls. The Library of Congress, the largest library now, is seventeen million. Only thirty-four times more than what we had in 300 B.C. It indicates that the technology hasn’t scaled. But now we’ve broken through into a new technology that allows us to bypass the Library of Congress in very little time, and the sky’s the limit. What can we discover about ourselves as a species? As different peoples? Are we couch potatoes or do we actually have independent will? Do we have interests that go beyond the fifteen demographics of slotted marketing hell? And what we’re finding is, people are interesting, diverse and peculiar. They are constantly looking for new things that are of interest to them.


KAHLE: Oh, yes. Auto-cataloging is the only way to scale. It costs forty-five dollars to catalog a book in terms of just taking author, title, when it was copyrighted, what subject index should it go into. Forty-five dollars! The Web is about twenty million different sites in terms of content areas that sort of make sense to catalog, and it’s growing at an astounding rate. That would mean if you tried to catalog it by hand and tried to scale it to the size of the Net, that you’d have to spend almost a billion dollars to catalog the Web today. And a year from now, you’ll have to spend another billion dollars. And it won’t be up to date. So we needed new techniques. The search engine work that was done in the sixties by Gerard Salton and Mike Lesk: phenomenal work. It’s been doing great. But if we’re really going to get an idea of what the Net looks like, when every suburb of Denpasar is on the Web and their soccer team schedules are on the Net, and you’re trying to find where the game is, how are you going to find it by typing a few key words? You’re not. You’re going to have to have these tools that go and say: “these are the suburbs of Denpasar on our Web sites.” It has to be automated, otherwise my worst nightmare is it all becomes five thousand channels of nothing on the Web.
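
For anyone who wants to check the "almost a billion dollars" figure, here is a rough back-of-the-envelope sketch in Python. The two numbers come straight from the interview; the function name and the assumption that each site counts as one catalog entry at the per-book price are mine, added only for illustration.

```python
# Figures quoted by Kahle in the interview.
COST_PER_ITEM_USD = 45      # cost to hand-catalog one book
WEB_SITES = 20_000_000      # "about twenty million different sites"


def manual_catalog_cost(items: int, cost_per_item: int = COST_PER_ITEM_USD) -> int:
    """Total cost in dollars of cataloging `items` entries by hand.

    Assumption (not from the interview): one site equals one catalog
    entry, priced the same as a book.
    """
    return items * cost_per_item


total = manual_catalog_cost(WEB_SITES)
print(f"${total:,}")  # $900,000,000
```

That comes to nine hundred million dollars, which matches his "almost a billion." The sketch does not model the second billion a year later, because the interview gives no growth rate to work from.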