
Stats Talk: An OCR Experiment and Preserving Validity in Adaptive Data Analysis

November 15, 2016 OCR, Data Validity, Analytics

One of the fundamental principles in statistics is that you must design an experiment in full detail before you start collecting and analyzing any data. This is also known as preregistration, and the resulting analysis is called nonadaptive. Unfortunately, in the heat of battle this principle is often forgotten, sometimes on purpose, because following it can be very expensive. Here is a new solution to this problem.

Designing my OCR Experiment

When I embarked on my adventure to compare OCR algorithms to expose hidden data, I had no idea how to design the experiment. On the one hand, the technology is very old and I expected the performance to be similar. On the other hand, there was a huge price range and I expected vendors to have some secret sauce. I did not know what performance metric to use, because OCR errors are much harder to spot than human typing errors, so I needed an automatic method. I wondered if I could just treat the ground truth and the OCR output as two strings and compute their Levenshtein distance. If so, how would the distances be distributed? Are they normal? Are they well behaved, or are there pesky outliers that must be analyzed?
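To make the metric concrete, here is a minimal sketch of what I had in mind: a plain dynamic-programming Levenshtein distance between a ground-truth string and an OCR output string, normalized by the length of the ground truth. The function names are mine, for illustration only, and are not taken from any OCR toolkit.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                # deletion
                current[j - 1] + 1,             # insertion
                previous[j - 1] + (ca != cb),   # substitution
            ))
        previous = current
    return previous[-1]


def ocr_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Edit distance normalized by the length of the ground truth."""
    return levenshtein(ground_truth, ocr_output) / max(len(ground_truth), 1)


print(ocr_error_rate("reusable holdout", "reusab1e ho1dout"))  # two substitutions
```

One such error rate per test document yields the distribution whose shape I was wondering about.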

This is a common situation in statistics: life is never as easy as it is in textbook examples. In real life, the practice of data analysis is an adaptive process in which new analyses are chosen on the basis of data exploration and the results of previous analyses. The key point is that you must not reuse the same data. If you do, you end up doing what is called p-hacking and producing invalid research findings.

In my case, preserving validity was easy. Although producing the document test set was tedious, I decided I would bite the bullet and produce a fresh set once the experiment was finalized. The less obvious requirement was not to use the same OCR programs for the design as for the final test. Here I used some heuristics (a euphemism for “slightly fishy hand-waving”). In the old days, OCR input mostly came from scanners, while today a lot of input comes from mobile device cameras, so OCR vendors have had to change their algorithms. With this in mind, I used six old programs to design the experiment and then used the latest versions for the final test (which is based on only four programs, because one had a firewall issue with its license validation and another was too expensive for my budget).

A Framework for Preserving Validity

In most practical cases we are not so lucky, because the data sets are difficult to come by. Fortunately, recently a group of eminent researchers got together and developed a method for preserving validity in adaptive data analysis. Their paper, “The reusable holdout: Preserving validity in adaptive data analysis,” was published in Science in August 2015.

First I need to straighten out my terminology. In the OCR case, the first document test set with the old programs is called the training set. The final document test set with the new programs is called the holdout set. The popular approach to avoid the issue of adaptivity is to validate data-dependent hypotheses or statistics on the holdout set. As the data analyst, you start by partitioning data samples randomly into training data and holdout data. You interact with the training set to obtain a data statistic of interest. The statistic is then validated by computing its value on the holdout set. Because the holdout was drawn from the same data distribution independently of the statistic, standard statistical inference procedures can safely be used.
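In code, the basic (non-reusable) holdout workflow looks roughly like this; the split ratio, the simulated error rates, and the choice of the mean as the statistic of interest are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are per-document OCR error rates.
samples = rng.normal(loc=0.05, scale=0.02, size=1000)

# Partition the samples randomly into training data and holdout data.
indices = rng.permutation(len(samples))
train, holdout = samples[indices[:700]], samples[indices[700:]]

# Explore the training data freely to choose a statistic of interest...
train_estimate = train.mean()

# ...then validate that statistic once on the untouched holdout data.
holdout_estimate = holdout.mean()
print(f"training: {train_estimate:.4f}  holdout: {holdout_estimate:.4f}")
```

As long as the holdout is consulted only once, for a statistic chosen without looking at it, standard inference applies.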

The drawback of this basic approach is that the holdout set is not reusable. If you use the outcome of the validation to select an additional data statistic, that statistic is no longer independent of the holdout data, and further use of the holdout set for validation can lead to incorrect statistical inference. To preserve statistical validity, the only known safe approach is to collect new data for a fresh holdout set like I did in the OCR case. As the authors write, this conservative approach can be very costly and thus is frequently abused, resulting in overfitting to the holdout set.
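To see why reuse is dangerous, here is a toy simulation of my own (it is in the spirit of the paper’s numerical experiments, but not taken from them): with purely random labels, adaptively keeping only the features that also “validate” on the holdout produces a classifier that looks good on both sets even though there is no signal at all.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 2000  # few samples, many candidate features, zero real signal

X_train, y_train = rng.standard_normal((n, d)), rng.choice([-1, 1], n)
X_hold, y_hold = rng.standard_normal((n, d)), rng.choice([-1, 1], n)

def correlations(X, y):
    return X.T @ y / len(y)

# Adaptive step: keep features whose correlation with the label looks strong
# on the training set AND has the same sign on the holdout set.
c_train, c_hold = correlations(X_train, y_train), correlations(X_hold, y_hold)
selected = np.where((np.abs(c_train) > 0.1) &
                    (np.sign(c_train) == np.sign(c_hold)))[0]

# A simple sign-vote classifier built from the selected features.
weights = np.sign(c_train[selected])

def accuracy(X, y):
    return np.mean(np.sign(X[:, selected] @ weights) == y)

# Both numbers drift well above 50% on pure noise, because the holdout
# was consulted during feature selection and is no longer independent.
print(f"train accuracy:   {accuracy(X_train, y_train):.2f}")
print(f"holdout accuracy: {accuracy(X_hold, y_hold):.2f}")
```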

Protecting the Holdout Set

Their solution uses a technique from differential privacy, a notion of privacy preservation in data analysis, which consists of hiding the actual holdout data set and adding Laplace noise whenever the data set is interrogated. By way of history, Pierre-Simon Laplace published his first law of errors in 1774, when he noted that the frequency of an error could be expressed as an exponential function of its magnitude once its sign was disregarded. Differential privacy ensures that the probability of observing any outcome from an analysis is essentially unchanged by modifying any single data set element. Such a condition is often called a stability guarantee.
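As a minimal sketch of the idea (the noise scale here is illustrative, not the calibrated value from the paper): instead of returning the exact holdout mean, the mechanism returns the mean perturbed by Laplace noise.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_mean(values: np.ndarray, scale: float = 0.01) -> float:
    """Answer a mean query with Laplace noise instead of the exact value.

    Changing any single element of `values` moves the exact mean by at most
    (max - min) / len(values); noise on a comparable scale masks that
    influence, which is the essence of the differential privacy guarantee.
    """
    return float(values.mean() + rng.laplace(loc=0.0, scale=scale))

holdout_error_rates = rng.normal(loc=0.05, scale=0.02, size=300)
print(noisy_mean(holdout_error_rates))
```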

In a nutshell, the reusable holdout mechanism proposed by the eminent researchers is simply this: access the holdout set only via a differentially private mechanism. The intuition is that if you can learn about the data set in aggregate while provably learning very little about any individual data element, then you can control the information leaked and thus prevent overfitting. In their paper, the authors present an implementation of the reusable holdout, called Thresholdout, and show that it provably validates a large number of adaptively chosen statistics.
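Here is a condensed sketch of how Thresholdout behaves, based on my reading of the paper; the threshold, tolerance, and noise scales below are illustrative placeholders rather than the carefully calibrated values the authors analyze.

```python
import numpy as np

rng = np.random.default_rng(3)

def thresholdout(train, holdout, query, threshold=0.04, sigma=0.01):
    """Answer a statistic adaptively while shielding the holdout set.

    `query` maps a data set to a number (for example the mean error rate).
    If the training and holdout answers agree to within a noisy threshold,
    only the training answer is released; otherwise a noisy holdout answer
    is released. Either way the exact holdout value is never exposed.
    """
    train_value = query(train)
    holdout_value = query(holdout)
    noisy_gap = abs(train_value - holdout_value) + rng.laplace(scale=sigma)
    if noisy_gap <= threshold + rng.laplace(scale=sigma):
        return train_value
    return holdout_value + rng.laplace(scale=sigma)

# Example: validating the mean per-document error rate via the mechanism.
train = rng.normal(loc=0.05, scale=0.02, size=700)
holdout = rng.normal(loc=0.05, scale=0.02, size=300)
print(thresholdout(train, holdout, np.mean))
```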

You, the analyst, are given unfettered access to the training data set but can only access the holdout set via this differentially private mechanism, which allows you to validate statistics on the holdout set. The reusable holdout gives you a general and principled method to perform multiple validation steps, where previously the only known safe approach was to collect a fresh holdout set each time a statistic depends on the outcomes of previous validations.

This new methodology was difficult to design and prove statistically correct. The result is a simple procedure based on differential privacy that allows you to safely keep improving your experiments. Now, you no longer have an excuse for p-hacking!

(Cherry on top: the paper has a link to the Python code for the numerical experiments reported in their figures.)