
EXPOSING HIDDEN DATA

NOVEMBER 11, 2015

One of the recurring themes in big data is that we are confronted with a tsunami of unstructured data. We then rush out to look for tools to digest and process it. For example, a favorite approach is to index everything and then use a search engine to work with the data. We search when we have lost something. In reality, oftentimes it is not that something was lost, but that we used a workflow that hides the desired piece of information. In this case, the solution is not a big data tool but a change of workflow. Here we look at scanned documents.
THE PROBLEM WITH PAPER
In the late 80s, before the Internet became commercial and the data interchange standard XML was developed, industry moved away from business transactions via letters, fax, and telex. Driven by the meme of just-in-time manufacturing, suppliers had to connect via modem to their customer’s enterprise resource planning (ERP) systems to fetch order data and submit invoices. A few agile companies became very successful by writing programs that bridged the supplier and customer ERP systems through SQL queries and data transformations.

This was a long time ago and supply chain management has become very efficient. There are still a small number of business segments that operate on dead trees, for example the real estate escrow industry. Real estate transactions require a large number of legal documents, which are mailed on paper or faxed (a faxed document has legal status; an email attachment does not).

The physical storage of documents is expensive because it takes a lot of space in what is usually premium real estate where these businesses are located. The common approach is to scan the documents and keep the images online, while the originals can be warehoused in an inexpensive location.

When a detail has to be verified in an old real estate transaction, it might be difficult to find the document images. For example, the documents may be organized by customer name, but the detail to be verified is the ratio between the structure and the lot size, or the existence of a granny unit. Such data is hidden, because it is not visible to an information retrieval system—a human must browse all the documents and inspect them visually.

In one example, this caused three business issues:

  1. Almost 500 agents spend a quarter of each day searching for internal data, resulting in a significant productivity loss. The goal is one sixth of the day.
  2. Customer satisfaction is decreasing as inquiry response times have grown from two weeks to over a month in 2015 because of inadequate and incomplete search results. The goal is a response time of less than a week.
  3. A unique document syntax in each of 32 states leads to incomplete retrieval results, and the user is unaware of it.

OCR: AUTOMATED DIGITIZATION
The data is exposed by transforming it into text and tables that can be indexed and searched by computer. This transformation from atoms to bits used to be carried out by armies of typists, but for the last 60 years computers have performed it, in a process generally known as OCR (optical character recognition). Once each document contains the actual text behind each image, the document repository can be indexed for a search engine.

There are various OCR products on the market and there is a wide range of prices. How do you choose the best one to expose your hidden data? Each product performs these steps: document clean-up & segmentation, binarization, training, spell checking, and context analysis. The products use very similar algorithms, and the differences are in how well they perform each step.
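
To make one of these steps concrete, here is a minimal sketch of binarization using the Pillow imaging library; the file name and the fixed threshold are assumptions for illustration, and production OCR engines use adaptive thresholding rather than a single global cutoff.

    from PIL import Image

    # Minimal binarization sketch: convert a scanned page to black and white
    # with a fixed global threshold. File name and threshold are placeholders;
    # real products use adaptive (e.g., locally varying) thresholding.
    THRESHOLD = 128

    page = Image.open("scanned_page.png").convert("L")   # grayscale
    binary = page.point(lambda p: 255 if p > THRESHOLD else 0, mode="1")
    binary.save("scanned_page_bw.png")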
MEASURING THE EFFECTIVENESS OF OCR
The art is to measure the performance (accuracy) in a way that gives you an idea of how well they perform the steps that matter for your particular document images and your particular processing requirements. The standard accuracy metric is to measure precision and recall by performing a number of queries on the document repository and comparing the results from the original text with those from the OCR-ed text.

Remember that precision is the probability that a randomly selected retrieved document is relevant, while recall is the probability that a randomly selected relevant document is retrieved in a search. When we need a single metric, we use the harmonic mean of precision and recall, which is known as F1 in information retrieval.
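
As a minimal sketch of these definitions (the document sets are placeholders), precision, recall, and F1 for a single query can be computed from the set of retrieved documents and the set of relevant documents:

    def precision_recall_f1(retrieved, relevant):
        """Compute precision, recall, and their harmonic mean (F1) for one query.

        `retrieved` and `relevant` are sets of document identifiers; the relevant
        set would come from queries against the original text, the retrieved set
        from queries against the OCR-ed text."""
        retrieved, relevant = set(retrieved), set(relevant)
        hits = len(retrieved & relevant)
        precision = hits / len(retrieved) if retrieved else 0.0
        recall = hits / len(relevant) if relevant else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # Hypothetical example: 8 of 10 retrieved documents are relevant,
    # out of 12 relevant documents in total.
    print(precision_recall_f1(range(10), list(range(8)) + [20, 21, 22, 23]))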

We tested four leading OCR products with 200 document images from the UNLV/ISRI corpus, which is used to test OCR algorithms. Measuring the number of recognition errors can be done in a fully automated way by computer using the Levenshtein distance, sometimes called the string edit distance (SED). We obtained these results:

  1. OCR-1: F1 = 94 ± 9
  2. OCR-2: F1 = 93 ± 10
  3. OCR-3: F1 = 93 ± 7
  4. OCR-4: F1 = 92 ± 9
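
To make the error measurement concrete, here is a minimal sketch of the Levenshtein (string edit) distance between a line of ground-truth text and the corresponding OCR output; the two strings are made-up examples.

    def levenshtein(a, b):
        """String edit distance: minimum number of insertions, deletions,
        and substitutions needed to turn string `a` into string `b`."""
        # Dynamic programming over a single row of the distance matrix.
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                current.append(min(previous[j] + 1,                  # deletion
                                   current[j - 1] + 1,               # insertion
                                   previous[j - 1] + (ca != cb)))    # substitution
            previous = current
        return previous[-1]

    # Hypothetical ground truth vs. OCR output for one line of a document.
    print(levenshtein("Parcel No. 042-117-330", "Parce1 No. O42-117-33O"))  # -> 3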

Our first conclusion is that today all OCR products deliver a similar, very high quality. Therefore, there is no excuse to store document images without recognition. Scanners come with an OCR program, and at the very least it should be enabled so you can index and search your documents.
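
Even a minimal setup goes a long way. Here is a sketch using the open-source Tesseract engine through the pytesseract package; the file name is a placeholder, and the OCR program bundled with your scanner serves the same purpose.

    from PIL import Image
    import pytesseract  # open-source Tesseract engine; one possible choice, not an endorsement

    # Recognize the text behind a scanned page so it can be indexed and searched.
    # "deed_page_1.png" is a placeholder file name.
    image = Image.open("deed_page_1.png")
    text = pytesseract.image_to_string(image)

    # At this point `text` can be fed to whatever indexing and search system you use.
    print(text[:200])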
MAKING OCR WORK FOR YOUR BUSINESS
If you look at the above results carefully, you will notice that, given the calculated standard deviations, the results could exceed 100%. This numerical result is due to the fact that the Levenshtein distances are not normally distributed: they have high negative skewness (this is bad) and pronounced kurtosis (this is good). We have to plot the data:

[Figure: distribution of the F1 scores]
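
A quick way to verify this is to look at the shape of the per-document score distribution, for example with SciPy and Matplotlib; the scores array below is a made-up placeholder.

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.stats import skew, kurtosis

    # Placeholder: per-document F1 scores (in percent) for one OCR product.
    scores = np.array([96, 95, 97, 94, 93, 96, 95, 60, 98, 97, 55, 96])

    print("skewness:", skew(scores))      # strongly negative: a long tail of badly recognized documents
    print("kurtosis:", kurtosis(scores))  # positive: most documents cluster tightly near the top

    plt.hist(scores, bins=20)
    plt.xlabel("per-document F1 (%)")
    plt.ylabel("number of documents")
    plt.show()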

We have to look at the actual recognition errors—in particular the outliers—in the documents and examine why they occurred. It turns out that there are two possible causes. The first is that an OCR engine might not have been trained on documents similar to yours. If this happens, choose an OCR product that can be trained on your own document images.

The second cause is that the spelling checker can only check dictionary words, not terms like part numbers or parcel numbers. This is the context analysis step mentioned above in the recognition pipeline. You need to add a tool that lets you check and correct your peculiar terms, like parcel numbers, social security numbers, etc. In the case of parcel numbers, you want to check them against the comprehensive plan, and maybe go even deeper than that: on a parcel zoned residential, reject a permit for an office building.
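
As a minimal sketch of such a check, a simple pattern test can flag OCR-ed tokens that cannot be valid parcel numbers; the format used here is a made-up example, since real formats vary by county.

    import re

    # Hypothetical parcel number format: three groups of three digits,
    # e.g. "042-117-330". Real formats vary by county and must be adapted.
    PARCEL_RE = re.compile(r"^\d{3}-\d{3}-\d{3}$")

    def looks_like_parcel_number(token):
        """Flag OCR-ed tokens that do not match the expected parcel number pattern."""
        return bool(PARCEL_RE.match(token))

    print(looks_like_parcel_number("042-117-330"))  # True
    print(looks_like_parcel_number("O42-117-33O"))  # False: letter O recognized instead of zero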

A fast algorithm related to the problem of finding approximate substrings in a string in linear time was developed by Horst Bunke. Given a word, Bunke’s algorithm can find the word with smallest Levenshtein distance among a set of prototype words in linear time.
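
For illustration, here is a straightforward brute-force sketch of the same idea (not Bunke's linear-time algorithm): snap a suspicious OCR-ed token to the closest prototype in a list of known valid values, using the Levenshtein distance.

    def levenshtein(a, b):
        """Levenshtein (string edit) distance, as in the sketch above."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def nearest_prototype(word, prototypes):
        """Return the prototype word closest to `word` in edit distance.

        Brute-force sketch only: Bunke's algorithm achieves this in linear
        time, which matters when the prototype set is large."""
        return min(prototypes, key=lambda p: levenshtein(word, p))

    # Hypothetical list of valid parcel numbers for a county.
    valid_parcels = ["042-117-330", "042-117-331", "118-220-005"]
    print(nearest_prototype("O42-117-33O", valid_parcels))  # -> "042-117-330"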


As we have seen, adjusting workflows to add OCR exposes all the unstructured data hidden inside paper documents. Combined with an effective indexing and search system like the Peaxy Hyperfiler, this critical business data becomes available for future use. Your data is much more valuable when it is exposed fully and correctly!