(There wasn’t going to be a Part III to this series, but recently I tried to explain SGT to a colleague and found myself hopelessly tongue-tied. As SGT is an important technique in Bayesian Filtering, it deserves an article of its own.)
In Part II of this series, I mentioned that the biggest problem facing the implementors of a Bayesian filter is the problem of infinite surprise: the poisoning of the algorithm by probabilities of zero or one (absolute certainties). The practice of avoiding such probabilities goes by the name Cromwell’s Rule.
We avoided the problem by invoking additive smoothing, replacing the raw frequency estimates with \[ \hat p_i = {r_i + \alpha\over N + \alpha |V|} \] where \(r_i\) is the number of observations (possibly zero) of word \(i\), \(N\) is the total number of observations, and \(|V|\) is the size of our vocabulary. In the example from before, \(\alpha\) was set to 1 (a special case known as Laplace smoothing) to keep the arithmetic simple.
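For concreteness, here is a quick Python sketch of that estimator. The vocabulary, counts, and function name are invented purely for illustration:

```python
from collections import Counter

def additive_smoothing(counts, vocab, alpha=1.0):
    """Estimate p(word) as (r_i + alpha) / (N + alpha * |V|)."""
    n = sum(counts.values())          # N: total number of observations
    denom = n + alpha * len(vocab)    # N + alpha * |V|
    return {w: (counts.get(w, 0) + alpha) / denom for w in vocab}

# Toy data: "viagra" was never seen, yet it still gets a non-zero probability.
vocab = {"hello", "meeting", "viagra"}
counts = Counter({"hello": 3, "meeting": 5})
print(additive_smoothing(counts, vocab, alpha=1.0))  # alpha = 1 is Laplace smoothing
```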
This worked, but Laplace is a terrible estimator. If you add too large a number, the unseen words dominate the seen words, and you end up guessing according to your prior assumption. Adding too small a number can skew results the opposite way, by penalizing classes that have too many zero-count words. What we need is a better estimator, something that will take the underlying distribution of words into account.
Remember the motto of the true Bayesian: the only thing we know about the future is the past. Instead of asking the question, “what is the probability that the next word is \(X\)?” we can ask the more general question, “what is the probability that the next word has a zero count?” In other words, if we treat the word under analysis as just another data point, what is the probability that the word will make the transition from zero to one? Well, how many times did a word in the past end up with exactly a count of one at the end of our analysis?
Let \(C_r\) be the set of words which have a count of \(r\), and let \(N_r\equiv|C_r|\). By definition, \(N_1\) words ended the analysis with a “one” count, out of a total count of \(N\). By the reasoning above, \(p(x\in C_0) = N_1/N\). Assuming that the words in \(C_0\) are all roughly equally likely, we can say \(p(x = X) = N_1/(N\cdot N_0)\) for any \(X\in C_0\). Extrapolating this line of thinking to non-zero counts, we get the general formula \[p(x = X) = {1\over N}(r+1){N_{r+1}\over N_r}\] for any \(X\in C_r\). This is known as the Turing Estimator. (Yes, that Turing.)
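In code, the Turing Estimator is little more than a lookup in the count-of-counts table \(N_r\). The sketch below uses invented word counts; handling \(r=0\) would additionally require knowing \(N_0\), the number of vocabulary words never observed:

```python
from collections import Counter

def count_of_counts(word_counts):
    """N_r: the number of distinct words observed exactly r times."""
    return Counter(word_counts.values())

def turing_estimate(r, N, Nr):
    """p(x = X) = (1/N) * (r + 1) * N_{r+1} / N_r, for any word X seen r times."""
    return (r + 1) * Nr.get(r + 1, 0) / (N * Nr[r])

# Invented counts for illustration.
word_counts = {"the": 4, "a": 3, "spam": 1, "ham": 1, "egg": 2}
N = sum(word_counts.values())       # total observations (11)
Nr = count_of_counts(word_counts)   # {1: 2, 2: 1, 3: 1, 4: 1}
print(turing_estimate(1, N, Nr))    # probability of any particular once-seen word
```

That `Nr.get(r + 1, 0)` is the weak spot: whenever \(N_{r+1}\) is zero, the estimate collapses to zero, which is exactly the problem described next.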
Of course, you cannot get something for nothing, and in this case we’ve replaced the problem of zero probabilities for zero-count words with the problem of zero probabilities for words where \(N_{r+1}\) is zero. We need a non-zero estimate \(\hat N_r\) for all values of \(r\). When such smoothed values of \(\hat N_r\) are used in place of \(N_r\), the estimator is known as the Good-Turing Estimator.
It might help to look at a graph of \(r\) vs. \(N_r\) for inspiration:
Well, that’s not very helpful, except that its shape is vaguely hyperbolic. Perhaps switching to a log-log plot will be more illuminating:
Bingo! That looks linear, with a couple of caveats: there’s that staircasing effect for large values of \(r\), and a slight bend in the line for small values of \(r\). Let’s ignore the bend for now and tackle the staircase.
As Kronecker reportedly said, “God made the integers; all else is the work of man.” In nature, especially in counting, one rarely encounters anything but integers. If I give you five coins, ask you to flip each one, and then give me the total number of heads, you will never say “3.6.” You will say either 2 or 3 (most likely), followed by 1 or 4, and occasionally we might be lucky and hear a 0 or 5. The expected number, however, is 2.5: the average value over a large number of trials.
Similarly, if you roll a single die and report the number of sixes you see, you will give either a zero or a one. If we were to plot the outcome of fifty trials…
…we would see a lot of zeroes with the occasional one approximately one-sixth of the time. This is what is happening in the staircase: high-frequency words are rare and their expected counts are less than one, but because we are dealing with observations, we end up with small integers surrounded by zeroes. Specifically, if we have an expected count of \(1/N\), we would expect to see one “one” surrounded by \(N-1\) zeroes on either side. This gives us a hint on how to effectively recreate expected values from observed counts.
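A throwaway simulation of the die example (my own illustration) makes the point: every observation is an integer, yet the average of many trials approximates the fractional expectation of \(1/6\).

```python
import random

random.seed(1)  # make the toy run reproducible

# Fifty trials: roll one die, record the number of sixes seen (always 0 or 1).
trials = [1 if random.randint(1, 6) == 6 else 0 for _ in range(50)]

print(trials)                     # nothing but 0s and the occasional 1
print(sum(trials) / len(trials))  # close to the expected 1/6 for a typical run
```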
Let \(q = \max\{x : x < r \land N_x \ne 0\}\) and \(s = \min\{y : y > r \land N_y \ne 0\}\); in other words, the nearest counts to the left and right of \(r\) whose \(N\) values are non-zero. Where \(q\) and \(s\) both exist, let \(Z_r = 2N_r/(s-q)\). Where did the two come from? In short, each point shares the surrounding zeroes with its neighbors, so we have to avoid counting them twice.
What about the points where either \(q\) or \(s\) does not exist? Gale defines \(Z_r=N_r/s\) when \(q\) does not exist, and \(Z_r=N_r/q\) when \(s\) does not exist. I prefer dropping those points for two reasons. First, for large values of \(r\), \(s\gg q\), so estimating \(s=q\) for the final point (as Gale is effectively doing) seems to add unnecessary error. Second, \(N_1\) is usually much lower than what the linear model we are about to build would predict, and including it would skew the fit. As they say, your mileage may vary.
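Here is how that averaging step might look in Python, following the endpoint-dropping choice described above; the count-of-counts table is made up for the example:

```python
def smooth_counts(Nr):
    """Z_r = 2 * N_r / (s - q), where q and s are the nearest counts below and
    above r that also have non-zero N_r.  Points where q or s does not exist
    (the first and last observed r) are simply dropped, as discussed above."""
    rs = sorted(Nr)   # the observed values of r, in increasing order
    return {r: 2 * Nr[r] / (s - q) for q, r, s in zip(rs, rs[1:], rs[2:])}

# Invented count-of-counts table, staircase and all.
Nr = {1: 120, 2: 40, 3: 24, 4: 13, 5: 15, 7: 10, 10: 1, 12: 1, 20: 1}
print(smooth_counts(Nr))  # e.g. Z_10 = 2*1/(12-7) = 0.4
```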
So, we now have a new set of data points. What does the plot look like now?
The staircase is gone, and we have strong linearity. Time to find those missing \(N_r\) values!
Linear regression will find the parameters for the line \[\log_{10}(\hat Z_r) = m\log_{10}(r)+b\] or equivalently, \[\hat Z_r=10^b\cdot r^m.\] Plugging our estimates for \(Z_r\) into the original Turing formula gives us: \[p(x=X)={1\over N}(r+1){10^b\cdot(r+1)^m\over 10^b\cdot r^m}\] or with a little algebraic manipulation: \[p(x=X)={1\over N}\,r\left(1+{1\over r}\right)^{m+1}\]
This is the Linear Good-Turing estimator.
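A sketch of the fit and the resulting estimator, using made-up \(Z_r\) values; note that the intercept \(b\) cancels in the ratio, so only the slope \(m\) actually enters the probability:

```python
import math

def fit_loglog(Z):
    """Ordinary least squares for log10(Z_r) = m * log10(r) + b."""
    rs = sorted(Z)
    xs = [math.log10(r) for r in rs]
    ys = [math.log10(Z[r]) for r in rs]
    x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
    m = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return m, y_bar - m * x_bar

def lgt_probability(r, N, m):
    """Linear Good-Turing: p(x = X) = (1/N) * r * (1 + 1/r)**(m + 1)."""
    return r * (1 + 1 / r) ** (m + 1) / N

# Invented smoothed counts and total; only the slope m feeds the estimator.
Z = {2: 40.0, 3: 24.0, 4: 13.0, 5: 10.0, 7: 4.0, 10: 0.4, 12: 0.2}
m, b = fit_loglog(Z)
print(m, lgt_probability(3, N=1000, m=m))
```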
Of course, this is meaningless when \(r=0\). It also tends to overestimate for small values of \(r\), as can be seen when we superimpose the line over the graph of \(Z_r\):
To solve these problems, we use the Turing estimator for small values of \(r\) and the linear estimator for large values of \(r\). The switchover point is difficult to determine. Gale uses the standard deviation of the Turing estimator (a formidable piece of algebra that I won’t reproduce here); I tend to be lazy and simply switch over at the first value of \(r\) for which the Turing estimate exceeds the linear one. Again, YMMV.
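To wrap up, here is one way the combined estimator might be sketched with that lazy switchover rule. This is my own framing rather than a reference implementation: I also force a switch no later than the first \(r\) with \(N_{r+1}=0\), where the Turing estimate is unusable, and the \(r=0\) case is left out because it requires \(N_0\).

```python
def make_sgt(N, Nr, m):
    """Return p(r): the probability of a particular word seen r >= 1 times.
    Turing estimates are used below the switchover point, the linear
    Good-Turing estimate at or above it."""
    def linear(r):
        return r * (1 + 1 / r) ** (m + 1) / N

    def turing(r):
        return (r + 1) * Nr.get(r + 1, 0) / (N * Nr[r])

    # Lazy switchover: the first r where Turing exceeds the linear estimate.
    # (Also switching when N_{r+1} = 0 is my own addition, since Turing would
    # otherwise report an impossible zero there.)
    switch_at = None
    for r in sorted(Nr):
        if Nr.get(r + 1, 0) == 0 or turing(r) > linear(r):
            switch_at = r
            break

    def p(r):
        if switch_at is not None and r < switch_at:
            return turing(r)
        return linear(r)

    return p

# Invented table and slope; p(1) falls on the Turing side, p(5) on the linear one.
p = make_sgt(N=1000, Nr={1: 120, 2: 40, 3: 24, 4: 13, 5: 15, 7: 10}, m=-1.2)
print(p(1), p(5))
```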