Can Full-Text Search Replace Metadata?

From a review by Jeffrey Beall of a presentation by Eric Hellman at ALA Annual 2011 that touched on this question:

Hellman's talk was among the most arrogant and flippant I had ever attended at an ALA conference. His talk was supposed to be about linked data, but he exploited his position as speaker to unwarrantedly trash libraries, library standards, and librarians.

By way of GMANE, you can read what the folks at AUTOCAT had to say in discussing the matter further. Links to the slides used are also discussed in that AUTOCAT thread.

Comments

Sounds like my kind of guy

"Metadata is useful, but not essential" --> "they haven't set the bar too high for speakers"

This reminds me of the talk by Jeff Trzeciak on organizational change, and how the first comment I saw stated that the misuse of an apostrophe in "PhD's" in one of the slides made him not worth listening to.

I'm not a cataloger or database designer, but I don't understand how the same computing power that can do something like "map keywords to subject headings" cannot also take the full text of records and calculate co-occurrences, match them from any keyword search, and produce the same type of results as using controlled vocabulary would, but without any manually created subject headings existing.
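To make the idea concrete, here is a minimal sketch of what "calculate co-occurrences and match them from any keyword search" could look like. It is only an illustration under simplifying assumptions: the function names (`build_cooccurrence`, `expand_query`, `search`) and the sample records are hypothetical, co-occurrence is counted once per document, and there is no stemming, stop-word removal, or weighting.

```python
import re
from collections import Counter, defaultdict
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_cooccurrence(docs):
    """Count, for every pair of words, how many documents contain both."""
    co = defaultdict(Counter)
    for text in docs:
        terms = sorted(set(tokenize(text)))
        for a, b in combinations(terms, 2):
            co[a][b] += 1
            co[b][a] += 1
    return co


def expand_query(term, co, top_n=3):
    """Return the words that most often share a document with `term`."""
    return [word for word, _ in co[term].most_common(top_n)]


def search(term, docs, co, top_n=3):
    """Return indexes of documents matching the term or its expansions."""
    terms = {term, *expand_query(term, co, top_n)}
    return [i for i, text in enumerate(docs) if terms & set(tokenize(text))]


# Hypothetical sample records, invented for illustration.
docs = [
    "heart attack treatment",
    "myocardial infarction treatment",
    "heart attack myocardial infarction",
    "garden soil",
]
co = build_cooccurrence(docs)
hits = search("heart", docs, co)
```

With these sample records, a search for "heart" also retrieves the second record, which never uses the word, because the expansion terms link the two. The same mechanism pulls in loosely related words such as "treatment", which is where the precision of a controlled vocabulary would differ.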

So how did that guy get invited to speak?

He is obviously naive enough to believe that, just because full-text search meets his marginal needs, it is all anyone needs. Yet another fallacy (I just don't know which one). Even though he may only need to gather up enough quotes to flesh out his next paper for publication, our society as a whole needs more than just rehashed papers. A lot of money is spent uncovering new knowledge which then gets buried - "Raiders of the Lost Ark" style - in the vast warehouse that is our collective store of data.

A) Only certain papers get published at all. The rest languish on some professor's web site or hidden in a drawer.
B) Without metadata to fine-tune a search, many of the slim proportion of papers that do see the light of day will be missed by almost any search specific enough to avoid wading through thousands of matches.
C) Even with metadata, if it isn't standardized then it provides almost no benefit when you consider the vast quantity of information that will be unfindable because it is practically impossible for searchers to become familiar with all the different metadata standards out there. Not to mention the subtle differences in implementations within any one standard. And you thought the browser wars were bad.
D) Without metadata tagging individual paragraphs, sentences, phrases, or words within a document it is almost impossible to distinguish between all the different meanings for a particular word or phrase. Simply using the alternative definitions listed in a regular dictionary is not enough. Each different academic discipline has its own unique set of meanings for similar or identical terms used in other fields, or even within a field but in different parts of the world.
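Point D can be illustrated with a small sketch. It assumes a hypothetical `discipline` field on each record, standing in for whatever controlled metadata a real system would carry; the records and function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    title: str
    text: str
    discipline: str  # hypothetical controlled-vocabulary field


def full_text_search(records, word):
    """Match on the plain text only: every sense of the word comes back."""
    return [r.title for r in records if word in r.text.lower().split()]


def fielded_search(records, word, discipline):
    """Match on the text, then narrow by the metadata field."""
    return [
        r.title
        for r in records
        if word in r.text.lower().split() and r.discipline == discipline
    ]


# Invented sample records: one word, three unrelated meanings.
records = [
    Record("Mitosis in yeast", "The cell divides after DNA replication", "biology"),
    Record("Tower handoff", "Each cell covers a radio coverage area", "telecommunications"),
    Record("Prison reform", "Inmates spend hours per day in a cell", "criminology"),
]

everything = full_text_search(records, "cell")
biology_only = fielded_search(records, "cell", "biology")
```

Here the plain-text search returns all three records for "cell", while the fielded search returns only the biology one. The sketch says nothing about who assigns the field or how consistently, which is the subject of point C.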

I believe we need much more metadata, rather than less. Only metadata can save us from the chaos and waste that results from having far more data than we can possibly turn into knowledge. I know some people who are doing good work in building taxonomies and topic networks using only the plain text of biological species descriptions. However, it is very early days indeed. Besides, I truly believe the day we have a computer smart enough to truly search through all the plain text of every document on the internet and build all the connections necessary to fully utilize that knowledge will also be the day computers are smart enough to decide humans are unnecessary.

Context, ambiguity, interpretation

In addition to Anonymous' great points, the meaning of a particular document is often not spelled out in plain language within the document itself. Especially in the humanities, where symbolism and satire are common, how can a computer (at this point in time) correctly interpret and analyze the "big picture" of a document and its historical contexts to make any determinations about subject and meaning? Swift's "A Modest Proposal" is a classic example where the literal text is not the real point of the document. A computer system *may* be able to examine this document, link other reviews and papers to it that explain the social commentary and satire, and end up with those perspectives and appropriate keywords linked to the document, but it still took previous human interpretation and documentation of the work to make that possible. What about a new work uploaded to a system? Without a human to review and interpret it first, and write something about its meaning, such a work may be tragi-comically lost under headings like "babies" or "population control." What, then, is the point of having the computer assign terms if a human has to write about the interpretation first anyway?

Even though scientific literature is usually much more straightforward, there still may be situations where relevant terms based on the context may not be exactly given in the document itself. Synonyms may be inappropriate based on the context as well. Until we perfect a computer intelligence on par with a human mind (in which event we might have bigger problems to worry about...), no, we cannot do away with human indexing and metadata.

Only a Trillion

>>Even though scientific literature is usually much more straightforward, there still may be situations where relevant terms based on the context may not be exactly given in the document itself.

This issue would come up for an article on Thiotimoline. Thiotimoline is a fictitious chemical compound conceived by science fiction author Isaac Asimov and first described in a spoof scientific paper titled "The Endochronic Properties of Resublimated Thiotimoline".

See: http://en.wikipedia.org/wiki/Thiotimoline

A version of the article can be found in Asimov's book Only a Trillion.

Only a Trillion also has an article titled "The Sound of Panting" that discusses how quickly scientific information was being published and changing. Asimov was arguing that it was hard to keep up with the literature, and this was in 1955.
