Optical Character Recognition (OCR): The Sherpa Guide

Since Sherpa Software’s Discovery Attender was introduced in 2004, customers have used it to locate email messages and other electronically stored information that contain specific text or meet other criteria, such as age or size. Text searching is by far the most widely used of its features. Discovery Attender can search across vast expanses of data to retrieve the “needle in the haystack” that litigation or compliance demands and that would be impossible to find manually.


Without getting mired in technical details, Discovery Attender performs this magic by recognizing the actual text in the documents. It ignores presentation details such as fonts and colors and looks only at the underlying character values. It can even search for characters in non-Latin writing systems, such as Cyrillic, Hebrew, and Chinese. If you can enter it into Discovery Attender’s search fields, the program knows how to find it in your electronically stored data.
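To make the idea of matching on character values concrete, here is a minimal Python sketch. It is not Discovery Attender’s implementation; the function names and the sample documents are invented for illustration. It only shows that a search compares characters, whatever script they come from, and knows nothing about fonts or colors.

```python
import unicodedata

def normalize(text):
    """Fold case and apply NFKC so equivalent character sequences compare equal."""
    return unicodedata.normalize("NFKC", text).casefold()

def find_matches(documents, query):
    """Return the names of documents whose text contains the query."""
    needle = normalize(query)
    return [name for name, body in documents.items() if needle in normalize(body)]

# Hypothetical sample data: document name -> extracted text.
documents = {
    "memo.txt": "Quarterly CONTRACT review is due Friday.",
    "note_ru.txt": "Пожалуйста, проверьте договор до пятницы.",
    "note_he.txt": "נא לבדוק את החוזה עד יום שישי.",
}

print(find_matches(documents, "contract"))  # ['memo.txt']
print(find_matches(documents, "Договор"))   # ['note_ru.txt']
print(find_matches(documents, "החוזה"))     # ['note_he.txt']
```

The same comparison works for Latin, Cyrillic, and Hebrew text because each is just a sequence of character values once it has been extracted from the document.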

Despite its ability to scan for text across multiple content sources such as Exchange Mailboxes, archives, documents on hard drives and file shares, SharePoint, Lotus Notes – even web-based mail, such as Office 365, Gmail and Hotmail – Discovery Attender does still have some limitations.

One limitation that customers run into from time to time is the need to recognize text on a JPG, GIF, or other similar image document. In these cases, there isn’t actual text on the document, at least not in the sense that someone actually sat down and entered keystrokes from a keyboard. Text in an image document may have started out as pure text, but now exists strictly as an image – a “photograph” of text, but not actual, individual characters that can be “read” electronically.




One technique to remedy this problem is Optical Character Recognition, or OCR. Originally conceived as a way to translate reading material for the visually impaired, OCR distinguishes alphabetic and numeric characters from other image “noise” in a document. Essentially, OCR converts an image of text into editable, searchable data. It does this by isolating blocks and lines of text within an image and estimating which image components are likely to be written characters. The program then compares these candidate characters against a known set of character images and judges which specific character each one most likely is. Early OCR systems could recognize only a limited set of fonts, but today’s technology is much more adept at interpreting a variety of alphabets and character sets.
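The two steps described above, isolating candidate characters and then comparing them against known character images, can be sketched in a few lines of Python. This is a toy illustration under heavy assumptions: the “image” is a tiny grid of `#` and `.` pixels, the three templates are invented for the example, and glyphs are assumed to be separated by a blank column. Production OCR engines are far more sophisticated, and nothing here reflects how any Sherpa Software product works.

```python
# Known character images (hypothetical 3x5 templates): '#' is ink, '.' is background.
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
}

def split_glyphs(rows):
    """Cut the image into glyphs wherever a fully blank column appears."""
    width = len(rows[0])
    glyphs, start = [], None
    for x in range(width + 1):
        inked = x < width and any(row[x] == "#" for row in rows)
        if inked and start is None:
            start = x                      # a new glyph begins here
        elif not inked and start is not None:
            glyphs.append([row[start:x] for row in rows])
            start = None                   # the glyph ended at the previous column
    return glyphs

def score(glyph, template):
    """Fraction of pixels that agree; 0.0 when the shapes differ."""
    if len(glyph) != len(template) or len(glyph[0]) != len(template[0]):
        return 0.0
    total = len(glyph) * len(glyph[0])
    same = sum(g == t for grow, trow in zip(glyph, template)
               for g, t in zip(grow, trow))
    return same / total

def recognize(rows):
    """Label each glyph with the best-scoring template character."""
    return "".join(
        max(TEMPLATES, key=lambda ch: score(glyph, TEMPLATES[ch]))
        for glyph in split_glyphs(rows)
    )

image = [
    "###..#..###",
    "..#.##..#.#",
    ".#...#..#.#",
    ".#...#..#.#",
    ".#..###.###",
]

print(recognize(image))  # 710
```

Because the match is scored rather than exact, a glyph with a stray or missing pixel still resolves to its closest template, which is a simple version of the “additional judgments” the program makes about each probable character.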

At Sherpa Software, our engineers are currently working on incorporating OCR technology into Discovery Attender and integrating character recognition with its extensive searching and results organization capabilities. Be on the lookout for an exciting announcement soon!

