My advisor, Ed Felten, has a post examining the problem of metadata errors in Google’s Book Search catalog:
Some of the errors are pretty amusing, including Dickens writing books before he was born, a Bob Dylan biography published in the nineteenth century, Moby Dick classified under “computers”. Nunberg called this a “train wreck” and blamed Google’s overaggressive use of computer analysis to extract bibliographic information from scanned images.
Things really got interesting when Google’s Jon Orwant replied (note that the red text starting “GN” is Nunberg’s response to Orwant), with an extraordinarily open and constructive discussion of how the errors described by Nunberg arose, and the problems Google faces in trying to ensure accuracy of a huge dataset drawn from diverse sources.
Orwant starts, for example, by acknowledging that Google’s metadata probably contains millions of errors. But he asserts that that is to be expected, at least at first: “we’ve learned the hard way that when you’re dealing with a trillion metadata fields, one-in-a-million errors happen a million times over.”
Ed’s conclusion is a good illustration of the difference between top-down and bottom-up thinking:
What’s most interesting to me is a seeming difference in mindset between critics like Nunberg on the one hand, and Google on the other. Nunberg thinks of Google’s metadata catalog as a fixed product that has some (unfortunately large) number of errors, whereas Google sees the catalog as a work in progress, subject to continual improvement. Even calling Google’s metadata a “catalog” seems to connote a level of completion and immutability that Google might not assert. An electronic “card catalog” can change every day — a good thing if the changes are strict improvements such as error fixes — in a way that a traditional card catalog wouldn’t.
Top-down thinkers want to build finished products whose errors have all been corrected before release. Bottom-up thinkers recognize that this is impossible, so they accept that some errors will occur and focus on building processes that reduce the number of errors over time. For really big projects, the top-down approach is simply delusional: if you think your billion-record dataset has no errors, it’s more likely that you’re fooling yourself than that you actually have a perfect quality-control system.
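To put rough numbers on this, here is a minimal sketch of the arithmetic behind Orwant's remark and the billion-record claim. It assumes, purely for illustration, that every field or record is wrong independently with the same probability; the function names and the per-record error rates are hypothetical, not figures from Google.

```python
import math


def expected_errors(num_items: int, error_rate: float) -> float:
    """Expected number of bad items if each one is wrong with probability error_rate."""
    return num_items * error_rate


def prob_no_errors(num_items: int, error_rate: float) -> float:
    """Chance that every item is correct, assuming independent errors.

    Computes (1 - error_rate) ** num_items via log1p so that tiny rates
    do not get lost to floating-point rounding.
    """
    return math.exp(num_items * math.log1p(-error_rate))


# Orwant's example: a trillion fields, one-in-a-million errors.
print(expected_errors(10**12, 1e-6))  # about a million

# A billion-record dataset at two assumed (hypothetical) error rates.
billion = 10**9
print(prob_no_errors(billion, 1e-9))  # one-in-a-billion: still only about a 37% chance of a clean dataset
print(prob_no_errors(billion, 1e-6))  # one-in-a-million: effectively zero
```

Even under the generous assumption of a one-in-a-billion error rate per record, a billion-record dataset is more likely than not to contain at least one error, which is why the bottom-up focus on processes for finding and fixing errors is the realistic one.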
The kind of transparency Google is practicing here is also crucial to bottom-up efforts. Given that large, complex systems inevitably have errors, it’s important for the institutions in charge of those systems to be open about the kinds of errors that occur and the steps being taken to correct them. This has two benefits. First, third parties will often be able to help correct errors, but they can only do that if they’re given reasonable access to the dataset. Second, and more important, it allows users of the dataset to understand the appropriate level of skepticism they should apply to information they find in it. A bottom-up world is a world in which end users have to take a bit more responsibility for verifying information they receive from not-necessarily-authoritative sources. Right now, the Google Book Search dataset has enough errors that you should really double-check its answers against other sources in cases where accuracy is a high priority.