Digitization 101: Newspaper


Friday, May 20, 2011

Article: Google Shuts Down Ambitious Newspaper Scanning Project

Yes, Google is shutting down one of its digitization efforts. In a statement to Search Engine Land, a Google spokesperson said:

Users can continue to search digitized newspapers at http://news.google.com/archivesearch, but we don’t plan to introduce any further features or functionality to the Google News Archives and we are no longer accepting new microfilm or digital files for processing.

Google's efforts were in partnership with several North American newspapers, ProQuest and Heritage Microfilm, according to a 2008 news report.

In reporting on Google's decision, the Boston Phoenix wrote:

News Archive was generally a good deal for newspapers -- especially smaller ones like ours, who couldn't afford the tens or hundreds of thousands of dollars it would have cost to digitally scan and index our archives -- and a decent bet for Google. It threaded a loophole for newspapers, who, in putting pre-internet archives online, generally would have had to sort out tricky rights issues with freelancers -- but were thought to have escaped those obligations due to the method with which Google posted the archives. (Instead of posting the articles as pure text, Google posted searchable image files of the actual newspaper pages.) Google reportedly used its Maps technology to decipher the scrawl of ancient newsprint and microfilm; but newspapers are infamously more difficult to index than books, thanks to layout complexities such as columns and jumps, which require humans or intense algorithmic juju to decode. Here's two wild guesses: the process may have turned out to be harder than Google anticipated. Or it may have turned out that the resulting pages drew far fewer eyeballs than anyone expected.

The lesson is that jumping on the Google bandwagon can be a good thing, if the wagon keeps moving. Those involved in Microsoft's book digitization program learned the same lesson the hard way.

Addendum (10:53 a.m.): Gary Price at INFOdocket wrote a good piece on this. Price noted:

New leadership is in place at Google and new leadership can often bring changes. This is likely one of them.

Tuesday, May 25, 2010

ICON: International Coalition on Newspapers

Since there is no central repository of digitization programs, I'm always pleased when I discover a list that tries to be comprehensive in some way. As the International Coalition on Newspapers web site says:

ICON provides a freely accessible database of bibliographic information for more than 25,000 newspaper titles from participating institutions.

The web site is continually updated. Date ranges on newspapers are noted, when applicable.

Friday, July 31, 2009

The Paper of Record - good news

On July 13, I finally blogged about the changes and problems with the Paper of Record, which had been bought by and integrated into Google. Commenters noted that there were still problems. Well, today that blog post received the following comment (links added):

Finally, we've just completed another go around with Google and www.paperofrecord.com will return in it's normal form at World Vital Records. Institutional subscribers will access POR in the format most academics and students have grown accustomed to. A press release will follow in the next few days.

Best-

Bob Huggins
Founder
PaperofRecord.com

I know many people who will be anxious to read the press release. I'm sure that once the Paper of Record is available through World Vital Records, researchers will test the archive for completeness. I hope that they truly do find everything that has been lacking on the Google site.

Technorati tags: Google, Newspaper

Monday, July 13, 2009

Problems and resolution with the Paper of Record (Google)

Back in Dec. 2008, I noted that Google had purchased the Paper of Record. At that time, the Paper of Record had 20 million digitized historical newspaper pages. This came a few months after Google announced a newspaper digitization project with ProQuest and Heritage Microfilm (post, post). The April 2009 Google Book Search newsletter said:

Try a search for "Americans walk on moon" on Google News Archive Search, and you'll be able to find and read an original article from a 1969 edition of the Pittsburgh Post-Gazette. Not only will you be able to search these newspapers, you'll also be able to browse through them exactly as they were printed -- photographs, headlines, articles, advertisements and all.

While this alerted (or re-alerted) people to the fact that Google was adding newspaper content, at least one email discussion list had already begun discussing the acquisition, and its effect on research, in January. At some point, the PaperOfRecord.com web site was redirected to http://news.google.com/archivesearch. Although that seems minor, researchers from around the world noted that content once available through the Paper of Record was missing from the Google site.

In February, a Google employee said in email (as part of the discussion):

We're currently working on the most effective way to search and browse this valuable content. We're doing our best to find a solution to include as much of the acquired content as possible.

While a lot of this content has been made available through Archive search, we're still refining processes to include incompatible newspaper images in our index. We're also working with certain publishers to acquire the rights to display their content. All of this takes time, and we appreciate your patience. We're constantly making improvements to ensure the best user experience.

Researchers wondered why Google had not left the old PaperOfRecord.com site available while it went through this transition. Google's blindness to Paper of Record users made matters worse. Several things happened between February and June, when matters seemed to be resolved (article). In his article on the topic, Robert B. Townsend said:

Regrettably, this proves yet again Roy Rosenzweig’s warning to the profession six years ago about the “the fragility of evidence in the digital era.” While it may be beyond our capacity to adjust copyright laws and the behavior of large corporations (however well meaning), as a profession we can and perhaps should develop new habits for working with digital materials—by copying down information when we see it online, and not becoming overly dependent on any one data source or having illusions about its permanence.

In early June, a Google employee provided this information on the content from the Paper of Record:

4.91M articles representing 522 titles obtained from Paper of Record are now live on Google News Archive search. This includes previously live content as well as content added as of this week from Paper of Record, all free of charge. Please note that all articles from these titles may not be comprehensively available, but will otherwise be made available in browse-only mode within 3 months. The full list is here [2].
~0.5M pages representing 381 titles obtained from Paper of Record will be made available in browse-only mode within 3 months, also free of charge. The full title list is here [3]. Many of the images we obtained were of low quality, and we were therefore unable to get quality text after following the OCR process. We are working to put up content from these titles so that they can be browsed.
Finally, for these 10 titles here [4], we don't have the rights to display these newspapers. We've reached out to the publishers who hold rights to these papers, but not all want to participate in Google's programs. To access these, you may need to travel to a library if you can't find an online source, or contact the publisher directly.

So, nine months after announcing the acquisition of the Paper of Record (and actually three years after it had secretly acquired the database), Google finally was able to provide information that users needed. In between, Google frustrated researchers who wrote blog posts, articles, and letters of protest. Google's inability to be customer focused left a bad taste in many people's mouths.

I heard today that one question remains: will the Paper of Record (or WorldVitalRecords.com, which seems to have access to the same content) make institutional access available to historical and genealogical societies? Evidently societies have inquired about this but have not received a response. I believe (and please correct me if I'm wrong) that part of the issue is that the Google search interface is not robust enough.

Finally, while some people saw the acquisition as moving Google one step closer to world domination, what it really showed was:

Google can be sneaky in its dealings.
Google doesn't have the users' best interests in mind.
We cannot have an illusion over the permanence of any content.

Sadly, every day we all become more reliant on Google. Google, however, is not some government agency that receives public oversight. Google is a large for-profit company. If it becomes the center of all of our universes (whether we like it or not), it will make a profit.

BTW on the gossipy side of things, this blog, Gawker, carries news tidbits about Google that some might find interesting (e.g., which executives are leaving the company like Doug Bowman).

Thanks to Rod Nelson for alerting me to the Paper of Record story. Rod, sorry that it took me so long to dig into it.

Technorati tags: Google, Newspaper

Friday, July 10, 2009

Newspaper digitization

I'm going through notes from various conferences I attended this spring and have come across notes from a session at the Society of Ohio Archivists Annual Conference, where members of the Ohio Historical Society talked about newspaper digitization. This past winter they began a two-year newspaper digitization program under the auspices of "Chronicling America". Here are my notes:

Newspapers have not had a standard format over the years, which makes them more difficult to digitize.
Chronicling America is using a standard set of practices that were outlined by the National Digital Newspaper Program (NDNP).
Ohio Historical Society is selecting one newspaper from each of its 10 regions.
Difficulties have included copyright on the microfilm as well as some technology concerns.
They are doing three levels of quality control.
Scanning at 300-400 dpi, grayscale. They are creating a TIFF file (master), then derivative files (PDF and JPEG2000 files) as well as OCR'd text.
Metadata is being embedded into the files themselves so that the metadata can travel with the files. (As much metadata is embedded as possible.)
They are using descriptive, structural, administrative, technical and preservation metadata.
Rather than plain OCR, they are doing optical word recognition (OWR), which tries to predict what the word is, not just what the characters are.
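
To illustrate the difference between character-level and word-level recognition mentioned in the last note, here is a minimal Python sketch. It is not the Ohio Historical Society's method, which my notes do not describe in detail. The lexicon, the function names and the similarity cutoff are all assumptions made for illustration: the idea is simply that a word-level step can compare raw OCR output against a list of known words and pick the most plausible one.

```python
import difflib

# Hypothetical lexicon (an assumption for this sketch). A real system would
# use a far larger word list suited to the period and region of the paper.
LEXICON = ["newspaper", "digitization", "microfilm", "archive", "historical", "society"]


def correct_word(ocr_word, lexicon=LEXICON, cutoff=0.8):
    """Return the closest lexicon word for a raw OCR token.

    Matching is done on the lowercased token, so the original case is not
    preserved when a match is found. If no lexicon word is similar enough,
    the OCR token is returned unchanged.
    """
    lowered = ocr_word.lower()
    if lowered in lexicon:
        return lowered
    # get_close_matches ranks candidates by difflib's similarity ratio
    # and drops anything scoring below the cutoff.
    matches = difflib.get_close_matches(lowered, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else ocr_word


def correct_line(ocr_line):
    """Apply word-level correction to each whitespace-separated token."""
    return " ".join(correct_word(token) for token in ocr_line.split())
```

For example, correct_line("Ohio rnicrofilm newspapcr") gives "Ohio microfilm newspaper": the two misread tokens are close enough to lexicon entries to be replaced, while "Ohio" has no close match and is left alone. The cutoff is the main design choice here; set it too low and correctly read but unusual words (names, places) get "corrected" into the wrong thing.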

If this topic interests you, the Documents section of the project wiki contains links to both presentations the team did at the SOA Annual Conference.

Technorati tags: Digitization, Newspaper

Tuesday, May 26, 2009

Open Loops: Articles worth noting

"Open loops" can be defined as unfinished business or, in my case, blog posts never written. Here are three articles that I meant to write blog posts about, but didn't. At this point, this may be just good documentation for the future.

Ancestry.com starts worldwide archiving project (April 18, 2009)
Digitization and democracy (Feb. 9, 2009) about the Library of Congress
Heritage Microfilm Announces International Initiative (Feb. 7, 2009) about newspaper digitization

Technorati tags: Digitization, Newspaper

Tuesday, December 09, 2008

Google buys Paper of Record

Google is in the news again. Quoting SearchEngineWatch.com:

Google has completed the purchase of 20 million digitized historical newspaper pages from PaperofRecord.com. The two have had an agreement for two years and has now concluded in a sale that was voted on by shareholders of PaperofRecord's parent company, Cold North Wind, Inc.

This will definitely boost their historic newspaper digitization initiative (post, post). However, as Steve Arnold said in his blog post:

My thought is that this acquisition may be like putting a toe in the water. If it “feels” good, the GOOG may start making commercial databases free to users. The content becomes a platform for the online ads. With commercial database publishers hanging on to an outmoded business model, the commercial database sector could suffer sharp revenue drops. Libraries will point users to “free” services and if these prove satisfactory, commercial databases may be starved for revenue.

Google's vortex is getting stronger...

Technorati tags: Digitization, Google, Newspaper

Friday, December 05, 2008

Article: The Current State-of-art in Newspaper Digitization: A Market Perspective

Yes, this is an old article (Jan. 2008), but not out of date. It is worth perusing, especially if you're into newspaper digitization. The article is the result of research and surveys conducted with newspaper digitization programs and vendors. It discusses (in somewhat of a bird's eye view):

Market parties
Digital imaging
Optical Character Recognition (OCR)
Zoning and segmentation
Metadata
Searchability
Presentation

Technorati tags: Digitization, Newspaper