Full-text Searching in Books

Scott Boren has a great look at electronic searching of the full-text of books. He covers Several Book Search Web Sites, Text Archives for older materials, Proprietary Book Search sites, some Smaller Collections, Google Book Search, some Emerging Projects, and some sites for Further Reading.

Electronic searching of the full text of books goes beyond the index and table of contents to search for any and all text in one book or in many. A phrase, number, word, or any string of characters can be searched. A9 (which uses the same book database as Amazon), Google Book Search, and Live Search Books can all search inside the book, or search inside many books at once. The book search sites listed in this paper are available free of charge, with the exception of those under “Three Proprietary Book Search Products” below. As Greg Notess pointed out at the Computers in Libraries conference on April 18, 2007 (hereafter Notess 2007, April), another use of book search is to verify citations and to find the source of a passage, as when a patron brings a photocopied chapter and asks, “What book does this come from? I need to find the source.” It can also show when a particular word, number, or phrase was mentioned. (For early mentions of words, the Oxford English Dictionary is also good.)
The texts on these web sites are images of the books’ pages. The underlying database, however, will occasionally differ from the page images you see because of errors introduced by the optical character recognition (OCR) software. In other words, what you get in a search is not what you see. Remember how books are converted to computer text: the physical books are first scanned, and the scanned file then goes through the OCR process. The OCR-generated file is what the computer actually searches, and it is not 100% accurate (Notess 2007, April). So if one is searching the books at Amazon and the phrase “one and one make one” is not found, it may in fact exist in that same corpus of books but be missed because of faulty OCR. If we assume an error rate of 2%, a page of 500 characters contains 10 incorrect characters (Mr. Notess’ example). These errors may well result in not finding phrases that actually exist in the book. Thus OCR errors adversely affect recall: not all of the relevant documents are retrieved. One workaround is to search shorter pieces of the phrase, since a shorter string is less likely to contain a misread character. So instead of searching “All the fish in the ocean”, I could search “fish in the ocean” or “All the fish”. Also, the faster the OCR process, the more errors are introduced into the searchable database. Sometimes the rights holders send a PDF or a word processing document of the book’s contents, which should have fewer errors.
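
The arithmetic behind the 2% example, and the reason shorter phrases are a safer bet, can be sketched in a few lines of Python. This is only an illustration under a simplifying assumption that is mine, not Mr. Notess’: every character (spaces included) is misread independently with the same probability. Real OCR errors tend to cluster, so treat the numbers as rough. The function names are made up for this sketch.

```python
# Illustrative sketch only. Assumes each character, spaces included, is
# misread independently with the same probability; real OCR errors cluster.


def expected_errors(num_chars: int, error_rate: float) -> float:
    """Expected number of misread characters in a stretch of text."""
    return num_chars * error_rate


def exact_match_probability(phrase: str, error_rate: float) -> float:
    """Chance that every character of the phrase was recognized correctly."""
    return (1.0 - error_rate) ** len(phrase)


def shorter_phrases(phrase: str, min_words: int = 3) -> list[str]:
    """Contiguous sub-phrases of at least min_words words, longest first."""
    words = phrase.split()
    result = []
    for size in range(len(words) - 1, min_words - 1, -1):
        for start in range(len(words) - size + 1):
            result.append(" ".join(words[start:start + size]))
    return result


if __name__ == "__main__":
    # A 500-character page at a 2% error rate.
    print(expected_errors(500, 0.02))

    # Compare the full phrase with each shorter fallback search.
    full = "All the fish in the ocean"
    print(full, round(exact_match_probability(full, 0.02), 2))
    for candidate in shorter_phrases(full):
        print(candidate, round(exact_match_probability(candidate, 0.02), 2))
```

Under this model the 25-character phrase “All the fish in the ocean” survives OCR intact about 60% of the time, while the 12-character “All the fish” survives about 78% of the time, which is why trying the shorter pieces is worthwhile when the full phrase is not found.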

A nice chart comparing Amazon, Google Books, Internet Archive Text Archive, and Live Search Books is provided by Greg Notess at this link: http://www.searchengineshowdown.com/booksearch/



Book Search Web Sites Organized by Category

The categories serve only as a very general organizational scheme.



At Least Some Access To Newer Materials

Amazon.com- Both A9 (http://a9.com) and Amazon.com search inside the same book database, which covers some of the current books at Amazon. (More about this later.) Search and see which books mention your search term. (Using quotes for phrases is useful here, although underlying algorithms sometimes take over. In other words, it doesn’t work in the literal, rule-governed way a Dialog or Lexis search would work. Try “guitar Zen” and it will return “Zen guitar” among the results.)
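
To make the contrast concrete, here is a toy Python sketch of the two behaviors: a literal phrase match of the kind a Dialog or Lexis search performs, and a looser match that accepts the words in any order. It is not Amazon’s actual algorithm, which is not described here; the function names and the sample title are hypothetical and chosen only to mirror the “guitar Zen” example.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def literal_phrase_match(query: str, text: str) -> bool:
    """True only if the query words appear adjacent and in the same order."""
    q, t = tokenize(query), tokenize(text)
    return any(t[i:i + len(q)] == q for i in range(len(t) - len(q) + 1))


def any_order_match(query: str, text: str) -> bool:
    """True if every query word appears somewhere in the text, in any order."""
    return set(tokenize(query)) <= set(tokenize(text))


if __name__ == "__main__":
    title = "Zen Guitar"  # hypothetical sample title
    print(literal_phrase_match("guitar Zen", title))  # strict, word order matters
    print(any_order_match("guitar Zen", title))       # loose, word order ignored
```

A strict phrase search rejects “guitar Zen” against a title containing “Zen Guitar”, while the looser match accepts it, which is the kind of result described above.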

For copyright holders, participation in Search Inside the Book! or offering up Chapter 1 of your novel is, of course, purely voluntary. But Amazon offers a strong incentive to list books with at least some content available to web surfers: selling more books at the world’s largest bookstore. Because Amazon (and its A9 web site) searches the largest number of recent books, I consider it the best research tool for searching inside a very large collection of recent books.

Google Book Search http://books.google.com/- See “Update on Google Book Search” below.

-Google Scholar http://scholar.google.com/ is still a work in progress. (It had launched by November 18, 2004, the date of a Search Engine Watch story by Danny Sullivan that described it as a new service.) Google Scholar is a mix of scholarly and not-so-scholarly journals, course materials, books, and other sources. In Google Scholar you can sometimes find works that cite a given work: find the article first, then click “Cited by”. But it’s not perfect. There were reports that Google Scholar mistakes the citing journal for the cited journal. ISI Web of Knowledge http://isiknowledge.com/ works better for showing who cites whom. ISI still works at NLE as of 5/15/07. Also, in Google Scholar, you can find related articles. The full text is sometimes available through Google Scholar. To see if NLE has the work, you can click on “Resources at My Library”, and if we have it you can connect to its NLE library information automatically. Last I heard, new articles haven’t been added to the Google Scholar database for a while, so you may not find new documents there.

Live Search Books http://search.live.com/results.aspx?q=&scope=books is Microsoft’s offering.



Text Archives That Are Primarily Older Material

Authorama http://www.authorama.com/full/ has old books that are available for use as public domain works (see http://creativecommons.org/licenses/publicdomain/).

-Internet Archive Text Archive (http://www.archive.org/details/texts) is one of the major text archives. The Internet Archive Text Archive works just fine from Department of Education computers although the Wayback Machine (www.archive.org) is blocked. In addition, the Text Archive above has a very good list of other book search web sites.

-Open Library http://www.openlibrary.org/ contains 13 books: old ones (mostly) like an illustrated An International Episode by Henry James and one new one: The Open Library by Brewster Kahle. Clicking on a page turns the page. Searching inside the book was not offered (5/18/07).

-Project Gutenberg http://www.gutenberg.org/ started in 1971 and is the first digitized book venture. It has 20,000 free ebooks in its OPAC-like catalog.

-The University of Michigan’s OPAC http://mirlyn.lib.umich.edu/ has some electronic books that Google doesn’t index, such as government documents. Go to advanced search and choose “Electronic Resources”.



Three Proprietary Book Search Products

Amazon- If you buy a book from Amazon, you can often get an electronic version of it through the Amazon web site for a fee.

-In Netlibrary.com, we at the National Library of Education can either search inside one book or search all the books at once. There are 18 books we purchased access to and 3,400 very old books. You can print from these books, something you cannot really do for in-copyright books at such sites as Google Books or Amazon. However, printing at http://www.netlibrary.com/ is limited to one page at a time and to a certain quota per day.

-Psycbooks contains 1,377 books and 21,314 chapters. It can be accessed through FirstSearch. I went to Psycinfo in the database portal off of ConnectEd. From there I clicked “List All Databases”. Then, “Social Sciences” from the drop-down menu. It does search across all books but apparently just the citation and the abstract. Searching within each book is possible with the find command (control-F).



Smaller Collections That Offer At Least Some Access to Newer Materials

-Professor Lawrence Lessig’s 2004 book Free Culture: How Big Media Uses Technology and the Law to Lock Down Culture and Control Creativity can be purchased or is available free under a Creative Commons license at http://www.free-culture.cc/freecontent/ . Lessig gathered updates and further research on his 2000 Code and Other Laws of Cyberspace, called it Codev2 and made it available for free at http://codev2.cc/. Or you can buy a bound copy. Proceeds from both books benefit such charities as Creative Commons and Public Knowledge.

-55 Ways to Have Fun With Google by Philipp Lenssen, released under a Creative Commons Attribution-NonCommercial-ShareAlike license, can be downloaded for free at http://www.55fun.com/book.pdf or purchased.

-Publishers’ web sites: The National Academies Press (www.nap.edu/), which allows you to read the entire book online and print it, Random House http://www.randomhouse.com/, and Harper Collins http://www.harpercollins.com/ are examples.



A multi-booksearch tool: Booksearch x 3

Booksearch x 3, at http://kokogiak.com/booksearch/, searches three major web collections of books: Amazon’s book search via A9, Google Books, and Microsoft’s Live Search Books. (This was pointed out by Greg Notess in “Switching Your Search Engines” Online, May-June 2007 pg. 44-46.)



Update on Google Book Search

Google considers books published before 1923 to be out of copyright and in the public domain: definitely fair game for inclusion in Google Book Search (Notess 2007, April). (Items published before 1923 are in the public domain according to the copyright laws. See the excellent summary chart at “Copyright Term and the Public Domain in the United States” http://www.copyright.cornell.edu/training/Hirtle_Public_Domain.htm, provided by Cornell’s Copyright Information Center.) But for items published from 1923 to the present, if there is no agreement from the publisher then those books are not included in Google Books.
-Oddly enough, Google does not scan government documents published from 1923 to the present, although they are freely reproducible and copying them with appropriate attribution seems legal (Notess 2007, April). Google is erring on the side of extreme caution.
-Some periodicals are available (Notess 2007, April).
-Back in 2005, Google had plans to scan even copyrighted books. Under this early plan, books would be scanned but only snippets would be displayed, not the whole text of the book. All books would be included unless the copyright owner opted out. The argument was essentially that this was within the bounds of fair use. Publishers brought suit. The last I heard, in an Information Today article, the suits were still pending. As Notess (2007, April) pointed out, Google Book Search operates much differently now. If there is no agreement between Google and the rights holder of a book that’s in copyright, then the book is not included in a search-inside-the-book style offering. In other words, if there is no agreement, then there is no searching inside the book and no snippets either. The information on Google’s web site agrees with Notess’ statement. As Google states, “Once you send us your titles (or upload them as PDF files), we’ll add them to our index for free”, implying that recent book titles have to be sent to Google one way or the other. Also, the form Google uses to sign up books for Google Book Search requires that you be the copyright owner. If Google prevails and virtually all books can be included in Google Books (with an opt-out provision, of course), there will be more books to computer-search and to view snippets from.

BTW, a Google account is required to see some of the Google Book Search results. Also see About Google Book Search http://books.google.com/intl/en/googlebooks/about.html.



Two Emerging Projects that are digitizing books

-The Million Book Project consists of three web sites:
US site: http://tera-3.ul.cs.cmu.edu/
India site (Digital Library of India): http://dli.iiit.ac.in/
China site: http://www.ulib.org.cn/ is supposed to be their web site, but don’t bother trying it because it doesn’t work. (I tried to access it May 4, 2007.) I couldn’t find a working site using Google or Yahoo with such searches as China site “Million Book Project” site:cn .

Over 600,000 books have been scanned so far. Most are in the public domain, but permissions have been obtained to include over 60,000 copyrighted works. US government works will be included. There is searchable content in English (as well as in other languages) on the U.S. and India sites.



The Open Content Alliance: Yahoo Digitizes Books Too

According to an October 2005 New York Times article, “The new project, called the Open Content Alliance, has the wide-ranging goal of digitizing historical works of fiction along with specialized technical papers. In addition to Yahoo, its members include the Internet Archive, the University of California, and the University of Toronto, as well as the National Archive in England and others.” Source: “In Challenge to Google, Yahoo Will Scan Books” Katie Hafner (2005, October 3) New York Times http://www.nytimes.com/2005/10/03/business/03yahoo.html?ex=1285992000&en=73ae4d4eda58b32a&ei=5090&partner=rssuserland&emc=rss (This article is accessible on the open web, no passwords necessary.) According to the Open Content Alliance web site (www.opencontentalliance.org/), although no content is available now, it will be accessible soon through its web site and through Yahoo!.



Further Reading

“Google Book Search Has Far to Go” by Mick O’Leary Information Today Vol. 23 No. 10 – Nov. 2006 http://www.infotoday.com/it/nov06/OLeary.shtml
Mr. O’Leary likes the little-known Library Search which, in addition to WorldCat, searches 15 other union catalogs representing 30 countries. To search library catalogs, go to Google Advanced Book Search http://books.google.com/advanced_book_search and click on “library catalogs”. But, as O’Leary summarizes, “The plan to scan library collections is hindered by copyright problems. Searching in-print books works, but it is inferior to Amazon’s Search Inside! program”. However, in my opinion, Google Book Search is good to have in the toolbox. Just as librarians search more than one search engine, it’s useful to search more than one book search site if you are not finding what you need. You can use Booksearch x 3 to search Amazon Books, Google Books, and Live Search Books. But if you are still not finding what you need, you may want to search Archive.org’s Text Archive and Project Gutenberg, particularly if you are looking for older material.

“Google Book Search Libraries and Their Digital Copies: What Now?”
By Jill Grogg and Beth Ashmore. Searcher April 2007. Pg 18-27.
As an added benefit of participating in the Google Book Search Library Project, libraries get digital copies of each of the books scanned. This article reports on the plans and deeds of the ten participating research libraries, what they will do with their digital copies, and on digitization efforts.

Von Bubnoff, Andreas (2005, December 1) Science in the Web Age: The real death of print Nature 438, 550-552. Available on the Internet for free at http://www.nature.com./ A well-researched paper from 2005.

The preceding discussion of full-text searching in books was influenced by Greg Notess’ presentation on April 18, 2007 at the Computers in Libraries conference. The PowerPoint of Greg Notess’ (2007, April) presentation is at http://www.searchengineshowdown.com/booksearch/cil07/ . I found his presentation very stimulating and combined it with my own efforts to gather information on book search.

Thanks to Ann Slattery for her suggestions on such useful things as usage, grammar, and punctuation.

-Scott Boren
The views herein represent my views and are not necessarily those of the National Library of Education.