Topics in AI: Bayesian Filtering: Part II

All right, last time we saw how a test is dependent not only on the effectiveness of the test (in terms of false positives and negatives) but also on the prevalence of what we are testing for.

Let’s look at the consequences of this in terms of spam filtering. Here is an artificial corpus of messages, based (loosely) on my inbox.

Word frequency in e-mail corpus by category
Spam Non-spam Overall
Nigeria 1 0 1
Bayesian 0 1 1
Credit 4 2 6
Program 1 3 4
Thank 4 4 8
Total 10 10 20

Let’s take a qualitative look at this data. The first two rows of the table describe a perfect filter: if the e-mail contains “Nigeria,” then with 100% certainty it is spam; if the e-mail contains “Bayesian,” then with 100% certainty it is not spam. The problem is that these words appear infrequently (~ 10%) so their effectiveness is limited.

“Credit” and “Program” are better in terms of frequency, but not in terms of separation since they appear in both spam and non-spam messages. “Thank” is the worst—it carries no information whatsoever since it appears equally in both.

Before we develop our simple spam filter from this corpus, we need to get a concept out of the way: statistical independence. Simply put, statistical independence says that the outcome of one event has no impact on another event. In the case of rolling snake eyes, for instance, the odds of rolling a “one” on the first die is 1/6, and the odds of rolling a “one” on the second die is also 1/6, independent of what we got on the first die.

The occurrence of words in an e-mail message are not statistically independent (seeing “Nigeria” almost guarantees the occurrence of “credit” elsewhere in the message) but for analysis it is often easier to assume that they are. This is called the Naïve Bayes assumption, and it significantly mitigates complexity in designing our filter.

The nice thing about statistical independence is that the probability of both events is simply their product. We encounter this every day: the odds of rolling snake eyes is simply 1/6 × 1/6 = 1/36.

We’ll apply the independence assumption to our filter and say that the probability of seeing a set of words given that the message is spam is simply the product of the probabilities of each word given that the message is spam:

P(Credit, Thank | spam) = 40% × 40%
= 16%
P(Credit, Thank | ¬spam) = 20% × 40%
= 8%

Note that these are the likelihoods of seeing this combination of words in specific classes of messages, not the actual probability of the message being spam given the presence of those words. These probabilities correspond to the effectiveness of our cancer test, and as we saw in that case, we need to incorporate our knowledge of the likelihood of cancer overall to calculate a “true” probability.

On the other hand, these likelihoods are a good estimator of the true probabilities, especially if we assume nothing about the incoming message. In this case, we have what is known as a maximum likelihood classifier. If we incorporate an estimate of the probability of the message being spam before we examine it (otherwise known as the prior probability), then we get what is known as a maximum a posteriori (posterior) classifier. As the size of the corpus grows, the two tend to converge.

Unfortunately there are a couple of serpents in our little Bayesian Garden of Eden: “Nigeria” and “Bayesian.” Note that:

P(Nigeria, Bayesian | spam) = 10% × 0%
= 0%
P(Nigeria, Bayesian | ¬spam) = 0% × 10%
= 0%

In other words, the probability that those words appear in combination in a spam message is 0%, and the probability that they appear in a non-spam message is also 0%. Yet if you were to forward this article to a friend, they would both appear in an arguably non-spam message.

The problem lies within our model of estimating probabilities. We have been using the frequency of words in our corpus to estimate the true (unknown) probabilities of the world at large. Yet there is a slim (but non-zero) chance that we can get an e-mail mentioning “Nigeria” that is not spam. What we need to do is adjust our probability estimates so that they are all non-zero. The easiest way to do this is Laplace Estimation, which inserts a synthetic entry for each class into our corpus containing every word. In essence, we add 1 to every count:

Laplace estimate of word frequency
Spam Non-spam Overall
Nigeria 2 1 3
Bayesian 1 2 3
Credit 5 3 8
Program 2 4 6
Thank 5 5 10
Total 15 15 30

(In a tiny corpus such as ours this adds far too much mass to unseen examples, but for the purposes of pedagogy it will do.)

There are no more zeros to poison our products, so all the probability estimates will be non-zero. This technique is called smoothing, and there are other approaches than Laplace of varying complexity and effectiveness.

P(Nigeria, Bayesian | spam) = 13% × 7%
= 0.9%
P(Nigeria, Bayesian | ¬spam) = 7% × 13%
= 0.9%

Note that without further evidence, the probability that the message is spam is 50%, but at least we have eliminated the contradiction.

The Mathematical Interpretation

Mathematically, x and y are considered independent if and only if P(x,y) = P(x) ⋅ P(y).

Assuming the words in a message are statistically independent, we get:

P(words | spam) = P(w | spam)
w ∈ words

Since these probabilities are vanishingly small, we run into the real danger of underflowing the precision of our computer. We perform a popular mathematical transformation by applying logarithms:

ln P(words | spam) = ln P(w | spam)
w ∈ words

We can also apply logarithms to Bayes’ Theorem:

ln P(spam | words) = ln P(words | spam) + ln P(spam) - ln P(words)

We still have the problem of that P(words) term. We could either calculate it or find a reason to eliminate it. Let’s try eliminating it.

Note that our classifier has been finding actual probabilities and flagging the message as spam if it exceeds a threshold. Another approach would be to simply flag it as spam if the probability of it being spam exceeds the probability of it being non-spam:

Reject if P(spam | words) > P(¬spam | words)

Since logarithm is monotonic (that is, a > b ⇔ ln a > ln b), we can substitute the logarithms into this equation:

P(spam | words) > P(¬spam | words)
ln P(spam | words) > ln P(¬spam | words)
ln P(words | spam) + ln P(spam) - ln P(words) > ln P(words | ¬spam) + ln P(¬spam) - ln P(words)
ln P(words | spam) + ln P(spam) > ln P(words | ¬spam) + ln P(¬spam)

In the last line, we added ln P(words) to both sides of the inequality, eliminating it from our calculation. Now we can replace the ln P(words | spam) with our sum from earlier, producing our final classifier:

ln P(spam) + ln P(w | spam) > ln P(¬spam) + ln P(w | ¬spam)
w ∈ words w ∈ words
ln P(spam) + ln P(w | spam)
w ∈ words
ln P(¬spam) + ln P(w | ¬spam)
w ∈ words

Note that a zero probability is reflected here as a negatively infinite value. Some researchers refer to the quantity -ln P(x) as the surprise of x; thus, zero probabilities poisoning the classifier’s calculation is known as the problem of infinite surprise. Smoothing, as mentioned above, is the best way of eliminating this problem.