Topics in AI: Bayesian Filtering: Part I

As a model for this essay, I am going to try to follow (at least initially) the non-algebraic approach taken by Colin Bruce in his amusing book, Conned Again, Watson! (Amazon - BN), which I highly recommend. Buy two; they’re cheap.

Bayes’ Theorem (or, Why There are Lies, Damn Lies, and Statistics)

The basic principles of probability elude people. Most understand that if you have a fair die, you will roll any given number with probability 1/6. More sophisticated people understand that with two dice, the probability of snake eyes (double ones) is 1/36, while the probability of a seven is a much larger 1/6.

But if a relatively accurate test for a rare form of cancer shows up positive, should we assume that the patient has the cancer? Most people would, and therein lies the danger of not understanding probability; specifically, Bayes’ Theorem.

Let’s associate some numbers with our cancer test. First off, the cancer is rare: only one out of every five hundred people (0.2%) has it. Second, our test is extremely accurate: it has a 90% positive accuracy and an 85% negative accuracy. Note that the test has two numbers denoting its accuracy: the first tells us the likelihood of the test returning positive when the patient has the cancer; the second tells us the likelihood of the test returning negative when the patient is healthy.

Let us assume that we have tested 10,000 people. What are the odds that a person who tested positive has the cancer? 90%, right?

WRONG!

What most people fail to take into account is that the cancer is extremely rare. Let’s first examine the subgroup of subjects who have the cancer. We know one in five hundred have it, so we expect 20 people of our test group to have the cancer. Of this subgroup, the test has a 90% chance of returning positive. So 18 of this group will get a positive result, and the remaining two will get a negative result.

Similar reasoning applies to the 9,980 people who do not have the cancer. The test will return negative for 85% of them and positive for the remainder. So of this subgroup, we will see 8,483 negative results, and 1,497 positive results.

            Negative   Positive
Healthy        8,483      1,497
Cancerous          2         18
Overall        8,485      1,515

Take note of that second column: that’s 1,497 cases of “Healthy” versus 18 cases of “Cancerous.” In other words, even after receiving a positive test result, the odds of the patient having the rare form of cancer have risen from 0.2% to a whopping 18/1,515 = 1.2%! So much for the infallibility of modern medical diagnoses.
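If you would rather let a computer do the bookkeeping, here is a small Python sketch of the same arithmetic. The variable names are my own; only the rates (1-in-500 prevalence, 90% positive accuracy, 85% negative accuracy) come from the example above.

    population = 10_000
    prevalence = 1 / 500            # P(C): one in five hundred has the cancer
    sensitivity = 0.90              # chance of a positive result when cancerous
    specificity = 0.85              # chance of a negative result when healthy

    cancerous = population * prevalence             # 20
    healthy = population - cancerous                # 9,980

    true_positives = cancerous * sensitivity        # 18
    false_negatives = cancerous - true_positives    # 2
    true_negatives = healthy * specificity          # 8,483
    false_positives = healthy - true_negatives      # 1,497

    total_positives = true_positives + false_positives      # 1,515
    print(true_positives / total_positives)                 # ~0.0119, about 1.2%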

The Mathematical Interpretation

Let’s convert these numbers into something more formal by working with probabilities rather than subgroups and percentages. First we note that the probability of having the cancer without any other information is P(C) = 0.0020, and the corresponding probability of not having the cancer is P(¬C) = 0.9980, since it is a yes-or-no proposition.

We have four probabilities associated with the test: two for when the subject is healthy, and two for when the subject is cancerous. We denote conditional probabilities by putting a bar between the unknown and the known; e.g., P(⊕ | C) = 0.9000 says that there is a 90% chance of the test returning positive given that the subject has the cancer. Likewise, P(⊖ | C) = 0.1000, P(⊕ | ¬C) = 0.1500, and P(⊖ | ¬C) = 0.8500.

Let’s now examine the table. The upper-left cell contains the subgroup that is healthy AND returns negative on the test. This joint probability is denoted by P(⊖, ¬C) = 0.8483. Remember how we derived 8,483 in the first place: we took the number of subjects who were healthy (9,980) and found the fraction of this subgroup that would return negative (85%). Taking a fraction is simply multiplying, so the relationship between P(¬C), P(⊖ | ¬C), and P(⊖, ¬C) should be clear:

P(⊖, ¬C) = P(⊖ | ¬C) ⋅ P(¬C)

(Well, clear if you do math for a living. Others may have to play with the numbers to make sure it works out.)
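For those in the latter group, the check takes a couple of lines of Python (the variable names are just my shorthand for the symbols above):

    p_not_c = 0.9980                # P(¬C): probability of being healthy
    p_neg_given_not_c = 0.8500      # P(⊖ | ¬C): negative result given healthy

    p_neg_and_not_c = p_neg_given_not_c * p_not_c
    print(p_neg_and_not_c)          # 0.8483, matching 8,483 out of 10,000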

Looking at the other joint probabilities, a general pattern emerges: we can find the joint probability from the conditional one (or vice versa) if we know the probability of the condition. Algebraically, we can denote this relationship as:

P(x, y) = P(x | y) ⋅ P(y) = P(y | x) ⋅ P(x)

Note the last equality. We’ve just shown how to convert a conditional probability to a joint probability and back again. With a little algebraic manipulation, the last equality becomes:

P(y | x) = P(x | y) ⋅ P(y) / P(x)

This is Bayes’ Theorem, and its importance to probabilistic analysis cannot be overemphasized. Let’s plug our numbers into the formula and see what the probability is of having the cancer given that the test returned positive (the denominator P(⊕) = 0.1515 is simply the overall positive rate from the table, 1,515 out of 10,000):

P(C | ⊕) = P(⊕ | C) ⋅ P(C) / P(⊕)
         = (0.9000 × 0.0020) / 0.1515
         = 0.0119

which is what we calculated before using the table. Bada-bing, bada-boom, bada-Bayes.
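Here is the same calculation wrapped in a tiny Python function, sticking to the numbers above. Note how the denominator P(⊕) is rebuilt from the test’s two accuracy numbers rather than read off the table; the helper name bayes is mine, not anything standard:

    def bayes(p_x_given_y, p_y, p_x):
        # Bayes' Theorem: P(y | x) = P(x | y) * P(y) / P(x)
        return p_x_given_y * p_y / p_x

    p_c = 0.0020                    # P(C), the prior probability of cancer
    p_pos_given_c = 0.9000          # P(+ | C)
    p_pos_given_not_c = 0.1500      # P(+ | not C)

    # P(+) by summing over both possibilities (the law of total probability)
    p_pos = p_pos_given_c * p_c + p_pos_given_not_c * (1 - p_c)    # 0.1515

    print(bayes(p_pos_given_c, p_c, p_pos))    # ~0.0119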

“But Rob,” I hear you cry, “what does any of this have to do with spam?” Glad you asked. We can equate “cancer” with “spam” and “test” with “suspicious words in the message like Viagra or Nigeria.” Then, using Bayes’ Theorem, we can rewrite the problem of spam detection as:

P(spam | words) = P(words | spam) ⋅ P(spam) / P(words)

In English, this reads as “the probability of an e-mail message being spam given the suspicious words in the message is equal to the probability of seeing those words in a known spam message times the probability of getting spam divided by the probability of seeing those words in any e-mail message.” Set a threshold on the computed probability and you have yourself a spam filter.

The probability on the left is unknown, but all of the probabilities on the right can be readily estimated from a corpus of spam messages. Just how is a topic for another day.
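In the meantime, to make the idea concrete, here is a deliberately over-simplified Python sketch that scores a message on a single suspicious word. Every probability in it is invented for illustration; a real filter would estimate them from a corpus and combine many words:

    p_spam = 0.5                    # prior: assume half of all mail is spam (made up)
    p_word_given_spam = 0.20        # P(word appears | spam) -- made up
    p_word_given_ham = 0.001        # P(word appears | not spam) -- made up

    # P(word) by summing over both possibilities (the law of total probability)
    p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

    p_spam_given_word = p_word_given_spam * p_spam / p_word
    print(p_spam_given_word)        # ~0.995

    THRESHOLD = 0.9
    print("spam" if p_spam_given_word > THRESHOLD else "not spam")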


Part II of this topic is now available.