Wednesday, June 23, 2010

Optical character recognition (OCR) in Google Docs

A couple of months ago, my co-worker, Mike, showed up at my desk with a pile of paper, each of the yellowed sheets densely covered with an ancient-looking typewriter font. His wife had recently discovered parts of her family chronicles in the attic, typed up by her grandmother many years ago! Now he was wondering if there was a way for her to continue writing the chronicles in Google Docs.

The papers sat on my desk for a while, but recently, I returned them to Mike with a smile, cheerfully telling him that what started as my 20% project is now ready for everyone to use -- Google Docs now officially supports importing scanned documents. What we launched as an experimental feature for the Documents List Data API last year is now available on the upload page: check the “Convert text from PDF or image files to Google Docs documents” option, upload your scanned images (JPEG, GIF, PNG) or PDFs, and Google Docs will extract text and formatting from the scans for you to edit away.

For the technically curious: we’re using Optical Character Recognition (OCR) that our friends from Google Books helped us set up. OCR works best with high-resolution images, and not all formatting may be preserved. The original images will be included in the new document to make it easier for you to correct mistakes. Supported languages include English, French, Italian, German and Spanish, with more languages and character sets on their way. We’re looking forward to getting feedback from you while we keep improving the feature over the coming months.

And Mike’s scanned family chronicles have even been extended by an additional chapter in Google Docs: his wife recently had a baby boy named James!

Posted by: Jaron Schaeffer, Software Engineer, Google Docs
