Google Search Engine Update 31 October 2008 – OCR Update for PDFs

Today there were quite a few changes in the Google search results pages, with many sites dancing a jolly little jig while Google made an important update to its search algorithm. I was curious, so I did a bit of digging and eventually found a possible reason on Google’s official blog. Here is an excerpt from their announcement:

“Every day, people all over the world post scanned documents online - everything from official government reports to obscure academic papers. These files usually contain images of text, rather than the text themselves. But all of these documents have one thing in common: someone somewhere thought they were valuable enough to share with the world.

In the past, scanned documents were rarely included in search results as we couldn’t be sure of their content. We had occasional clues from references to the document– so you might get a search result with a title but no snippet highlighting your query. Today, that changes. We are now able to perform OCR on any scanned documents that we find stored in Adobe’s PDF format. This Optical Character Recognition (OCR) technology lets us convert a picture (of a thousand words) into a thousand words — words that can be searched and indexed, so that these valuable documents are more easily found. This is a small but important step forward in our mission of making all the world’s information accessible and useful.

While we’ve indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read. Scanning is the reverse of printing. Printing turns digital words into text on paper, while scanning makes a digital picture of the physical paper (and text) so you can store and view it on a computer. The scanned picture of the text is not quite the same as the original digital words, however — it is a picture of the printed words. Often you can see tell-tale signs: the ring of a coffee cup, ink smudges, or even fold creases in the pages.

To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read as a zero, the letter ‘O’, just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process.” Posted by Evin Levey, Product Manager
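To get a feel for the circle problem described in the quote, here is a toy Python sketch of my own. It is not how Google does it (the announcement gives no implementation details), and the function names and the `?` placeholder for an unreadable round glyph are assumptions made up purely for illustration. The idea is simply to guess between the digit 0 and the letter O by looking at the neighbouring characters.

```python
def resolve_round_glyph(left: str, right: str) -> str:
    """Guess whether an ambiguous round glyph is the digit 0 or the letter O.

    Hypothetical helper: counts digit and letter neighbours and picks the
    digit only when digits are in the majority.
    """
    neighbours = [c for c in (left, right) if c]
    digits = sum(c.isdigit() for c in neighbours)
    letters = sum(c.isalpha() for c in neighbours)
    return "0" if digits > letters else "O"


def clean_token(token: str, ambiguous: str = "?") -> str:
    """Replace each ambiguous marker in a token using its neighbours."""
    out = []
    for i, ch in enumerate(token):
        if ch != ambiguous:
            out.append(ch)
            continue
        left = token[i - 1] if i > 0 else ""
        right = token[i + 1] if i + 1 < len(token) else ""
        out.append(resolve_round_glyph(left, right))
    return "".join(out)


print(clean_token("2?08"))    # 2008
print(clean_token("REP?RT"))  # REPORT
```

Even this tiny rule has to fall back on a guess when there is no context at all, which hints at why real OCR is so painstaking and error-prone.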

So, today’s update (which actually looks like it started yesterday) should allow us to find more documents online. Many companies publish reports in PDF format, and until now these were rarely found in search results; that should change soon. Hopefully some of the negative effects of the Google update will settle down before long, and people will start to get their traffic levels back. Note that this was not a PageRank update. In fact, some would say that PageRank updates are meaningless now: toolbar PageRank is only a guide, while real PageRank is updated constantly. Or so I have been informed!