Many people still resent the demise of their favorite childhood bookstore. For me, it was the Book Rack, a sunny little shop on Main Street, with two rooms of shelves steeped in the spice of aging paperbacks. When I was in middle school, and the local library started to cut back its hours, I was there every week with my little cardboard box filled with the last week’s reading—mostly science fiction, with a few scattered literary classics. I would trade them in for store credit and then prowl the shelves for more.
In the 1990s, everything began to change. One by one, the smaller stores were pushed under or online by the economies of scale created by the Internet. Suddenly, the competition, pricing, supply, and demand weren’t local; they were global. The Book Rack was absorbed into another store in a nearby town that got walk-in tourist traffic and, eventually, closed its doors for good.
Around the time that the Book Rack was picking up stakes, two Stanford graduate students, Larry Page (a University of Michigan alumnus) and Sergey Brin, were collaborating on a National Science Foundation grant at the Stanford Library. Although they worked for the library, theirs was primarily a computer-science project: How can useful information be culled from the massive amounts of raw data on the World Wide Web? The question sounded easy enough. The solution, however, proved to be complex.
Page and Brin teamed up to develop a search-engine prototype that, in a few years, would become Google, the largest search engine on the Internet. Following its initial public offering, in 2004, the cash-rich company began to move beyond organizing the data of others to creating content of its own. Page and Brin were taking Google back to its roots and into the library. In December, Google announced that it had teamed up with the libraries at Stanford, the University of Michigan, Harvard, and Oxford, and the New York Public Library to digitally scan their collections and make them searchable online. This marked a major expansion of Google Print, an existing digital book program on the company’s web site.
Google, it seems, doesn’t want to be just a reliable search engine with a no-frills home page and a funny name. It wants to become a cultural institution, a fifth estate, the Library of Alexandria in a digital age. “Google’s mission is to organize the world’s information,” Page said in a press release issued with the announcement. “And we’re excited to be working with libraries to help make this mission a reality.”
The sheer size of the project inspires incredulity. The five libraries are some of the biggest on earth, containing more than fifty-seven million books. Google would like to scan every one, but so far the libraries have committed to digitizing about ten million books, at least until they see how it works.
Most of the books will come from the University of Michigan, which has already agreed to let Google scan its entire collection, an effort Google estimates will take six years. This aggressive time line has generated a considerable amount of skepticism among librarians and digitization experts. There are about seven million books in the university’s library, with, say, an average of 250 pages each. That’s 1.75 billion total pages, requiring a scanning pace of 800,000 pages per day, or about 33,000 pages per hour, twenty-four hours per day for six years. Each scanned book will be run through optical-character-recognition (OCR) software to convert the photographs of the pages into searchable text. The data—images and characters—will be stored on a huge network of thousands of computers.
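The back-of-the-envelope arithmetic behind that time line can be checked in a few lines. The book count, page estimate, and six-year window are the figures given above; the calculation assumes round-the-clock scanning with no downtime.

```python
# Sanity check of the scanning pace implied by the Michigan numbers.
books = 7_000_000          # volumes in the university's library (article's figure)
pages_per_book = 250       # rough average assumed in the article
years = 6                  # Google's estimated time line

total_pages = books * pages_per_book               # 1.75 billion pages
pages_per_day = total_pages / (years * 365)        # roughly 800,000
pages_per_hour = pages_per_day / 24                # roughly 33,000

print(f"{total_pages:,} pages total")
print(f"{pages_per_day:,.0f} pages per day, {pages_per_hour:,.0f} per hour")
```

The numbers come out almost exactly as the article states: about 800,000 pages a day, every day, for six years.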
Google isn’t about to say how it’s going to do it. As an anonymous source at one of the libraries put it, “They’re very secretive.” Consequently, Google has made the libraries sign nondisclosure agreements and keeps the press far away from the company’s proprietary technology. But, in bits and bytes, the following picture has emerged.
Google has, apparently, developed a faster scanning camera than any on the market. The fastest commercially available automated book scanners, like the Kirtas APT BookScan 1200, boast rates of more than a thousand pages per hour. Google, according to Daniel Brandt of www.google-watch.org, a web site that keeps tabs on the search behemoth, seems to have doubled that rate, scanning almost as fast as a person can turn the pages. At Michigan, which has served as a kind of test bed for Google’s scanning technology, workers are believed to turn the pages manually, averaging about fifty books per day each. According to Michigan librarian John Wilkin, Google has already digitized books numbering in the “low tens of thousands.” Most, however, are not yet available online. As the project picks up steam, it will likely evolve toward complete automation. Andrew Herkovic, who is assigned to the project at the Stanford library, couldn’t describe the scanning process in detail. “I haven’t seen it,” he told me by phone. “But they’ve certainly put a lot of engineering effort into how to do this.”
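A rough extrapolation from the figures reported above suggests how much manual labor the implied pace would demand. All of the inputs are the article’s estimates; the station count is illustrative, not anything Google has disclosed.

```python
# How many manual scanning stations would the six-year pace require?
pages_per_book = 250                 # article's assumed average
books_per_worker_per_day = 50        # reported manual page-turning rate at Michigan
target_pages_per_day = 800_000       # pace implied by the six-year time line

pages_per_worker_per_day = books_per_worker_per_day * pages_per_book  # 12,500
workers_needed = target_pages_per_day / pages_per_worker_per_day

print(f"{pages_per_worker_per_day:,} pages per worker per day")
print(f"~{workers_needed:.0f} stations running daily to hit the target pace")
```

At fifty books a day per worker, hitting the target would take on the order of sixty-four scanning stations running every day, which helps explain the expectation that the process will evolve toward complete automation.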
In January, a couple of weeks after the library project was announced, I met with Paul LeClerc, CEO of the New York Public Library (NYPL), in his office at the Forty-second Street main branch in Manhattan. The NYPL counts itself as one of the five great libraries in the world, and book for book, it is more than a match for most university research libraries. The library has agreed to enter into a pilot program of about 50,000 books and plans to expand if everything goes well. Google employees will set up shop in the library and scan a selected portion of the library’s massive collection. As of this writing, exactly what collections are going to be digitized has yet to be decided. Various non-English-language books, however, will be included. (France’s national librarian warned that the Anglo-American focus of Google’s digitization efforts thus far threatens to further the dominance of the English language on the web, a concern Google is at pains to ease.)
I asked LeClerc why the NYPL was taking a gradual approach. “Well, I think it’s like learning to swim,” he said. “I don’t want to jump in and sink or swim. This just seemed to all of us here to be a prudent way to start.”
Harvard, too, is taking it slowly. The university will begin with a random sampling of 40,000 books from its holdings of more than sixteen million volumes. The university is testing the waters. Peter Kosewski, a library spokesman, foresees the pilot and its evaluation taking two years.
Stanford, Google’s birthplace, is still working out the kinks in what officials envision as a gradual ramp-up to production. The plan is to check out books to Google, as the library would to any other patron, only in much higher volume. The books will then be loaded onto trucks and taken to the Googleplex, Google’s corporate headquarters, seven miles or so from the university.
“It’s an ugly, nasty, complicated process for us to maintain control of books while we hand them off,” Herkovic said. The library expects to open the floodgates eventually, he continued, saying that they planned to scan “a lot,” but that Stanford had never associated itself with a specific number. He said he expects about two million books to be scanned in the first cycle, over the next five or six years. (Oxford, the other participating university, declined to be interviewed for this story.)
By comparison with the other libraries, Michigan’s commitment to scan everything is both radical and a product of its early adoption of digitization. Beginning in 1995, with funding from the Andrew W. Mellon Foundation and in partnership with Cornell University, Michigan began scanning books and journals to create “a digital library of primary sources in American social history from the antebellum period through reconstruction.” Nearly 12,000 volumes comprising 3.5 million pages have been scanned. The university will provide a printed copy of any book at a cost of nine cents per page, or $40 for a hardcover edition. Wilkin says he envisions Michigan doing something similar with the digital copies of the books it receives from Google. Whether the search-engine company also intends to offer print-on-demand copies of out-of-copyright books is still a mystery. It is already clear, though, that Google’s Internet technology disabled the print function in every web browser I tried, making the digital books a read-only experience. The company will generate revenue from Google Print by selling advertising on its web pages.
When Google publicly announced its plans to digitize entire libraries, many people hailed the advantages of online versions of millions of books: Google would provide universal access to major libraries; full-text searches could eliminate cumbersome indexes; lost books would find new life online; and no one would need to buy an out-of-copyright book ever again.
Some librarians and academics were more skeptical, wondering how this new wealth of information would be used. “This sort of endeavor will change everything we do,” said Sue Waterman, a librarian at Johns Hopkins University. “We all know that students, especially undergraduates, only want information that they can access online. This project will limit further what they are willing to look at. Research will, for at least a large segment of the population, be limited to what is chosen for this project…It’s hard enough to get students to go into the stacks for information now.”
English lecturer Paul Tankard of the University of Otago in Dunedin, New Zealand, has a similar view. “Increasingly, what we read are decontextualized texts—or parts of texts,” he said. “It’s like reading a book via the index…You find (with a good index) what you’re looking for, but only what you’re looking for.” The knowledge gained from reading whole texts, Tankard noted, is lost in a keyword search.