Posted by: yanzhang | November 23, 2011

## On Being Naive

I gave a talk at the MIT applied math seminar recently. Some people wanted a blog post, so here it is.

Some of my mathematical hobbies include probability and machine learning. I have recently realized that many simple but effective ideas of these fields really all come from one thing: an independence assumption. It was only until I saw the same example in several different guises, however, before I really caught on. As something Occam would surely approve of, the extreme naiveté this approach embraces can actually go a long way. Our key player is simple: we say that $A$ and $B$ are conditionally independent given $C$ if $P(A|B, C) = P(A | C)$. This can be written in the more symmetric form $P(A, B|C) = P(A|C) P(B|C)$. Now I will tell a few stories. Most of these should be old for a specialist, but I hope I’ve included some remarks that even they may appreciate.

Weighing Evidence

(this is mostly stolen from inspired by Jaynes from Probability Theory: the Logic of Science) Suppose we have two hypotheses about the state of the world, say $E$ and $\overline{E}$, and we know that exactly one of them is true. Now suppose we are getting consecutive pieces of data $D_1, D_2, \ldots$.  How does this data update our belief in $E$ or its complement?

A standard use of Bayes’s Theorem gives that $P(D_1, D_2, \ldots , E) = P(E) P(D_1 | E)P(D_2 | D_1, E) P(D_3 | D_1, D_2, E)\cdots$ Doing the same for $\overline{E}$ and dividing gives

$\dfrac{P(D_1, D_2, \ldots, E)}{P(D_1, D_2, \ldots, \overline{E})} = \dfrac{P(E)}{P(\overline{E})} \dfrac{P(D_1 | E)}{P(D_1 | \overline{E})} \dfrac{P(D_2 | D_1, E)}{P(D_2 | D_1, \overline{E})}\cdots$

Here, let’s make the naive assumption that the $D_i$‘s are conditionally independent given $E$ or $\overline{E}$. Not only are these different, we in fact want the slightly stronger assumption that the $D_i$‘s are conditionally independent of the intersection of the previous $D_i$‘s given $E$ or $\overline{E}$. If so, on the right we just get a product of $\frac{P(D_i | E)}{P(D_i | \overline{E})}$. We want to take logs here and rewrite our equation as

$O(E | D_1, D_2, \ldots) = O(E) + \sum_i \log(\frac{P(D_i | E)}{P(D_i | \overline{E})}),$

where $O(E|D)$ is the odds $\log(\frac{P(E|D)}{P(\overline{E}|D)})$. We now make the natural guess we are in the world $E$ exactly when this is positive (which corresponds to the $P(E|D_1, \ldots)$ having higher probability).

The cute thing about this situation is that we are really “weighing” our evidence, as on a balance! Each new piece of data just contributes a number, and we mentally keep a tally and just decide yes or no given the sign of the final sum. Our original “bias” is exactly the odds of $E$ given no other information, which exactly corresponds to the Bayesian information contained in our prior knowledge of $E$, so the entire prior information comes into play as a “head start” bias in one direction, as if it were a single instance of data with some weight! This is a very clean way to make decisions, and really makes binary hypothesis testing very intuitive (it is fun/frustrating to try to generalize this to more than $2$ hypotheses, where there are quite a few unexpected pitfalls). The key, however, was our conditional independence assumption.

Naive Bayes

A very naive approach to spam filtering is the following (generative) model. Let nature choose whether a message is spam (call this hypothesis $H$) with some probability $P(H)$ and then, for each word in the dictionary, pick whether each word $w_i$ is in the message (abuse notation and call this the event $w_i$) with probability $P(w_i | H)$ (or $P(w_i|\overline{H})$). You then pick maximum likelihood over sample data to learn the probabilities, and on a new piece of data just get the $w_i$‘s and see whether $H$ or $\overline{H}$ was more likely.

So what was the naive assumption? It was that the events $w_i$ were conditionally independent given either hypothesis. Cutely, the final log odds $\frac{P(H|D)}{P(\overline{H}|D)}$ is additive because of this (an exercise I leave to the reader) and extremely easy to calculate (and fast for computers, too!). The punchline is that this additivity is exactly the same additivity we had in the hypothesis testing. The expressions work out such that the inclusion/exclusion of every word will add some positive/negative number to a running total, biased in the beginning by some initial value determined by $P(H)$ with no other information, and we choose to classify a message as spam based on whether this running total is positive or not.

Before continuing, I want to put in e-print my long-running gripe that “Naive Bayes” is a misnomer, since the only thing naive I can possibly think about doing with Bayes’s Theorem is conditional independence, and so every such algorithm should be called “naive.” The name offers no information about the particular algorithm that is associated with spam filtering, and the graphical network it corresponds to is not even the unique simplest graph (you can reverse all the arrows, for instance, and get Noisy-Or or any other ICI, but I admit those have another layer of complexity that I am not going into here). Unsurprisingly, there are actually many flavors of Naive Bayes depending on where you want to insert your naivete — see Metsis, Androutsopoulos and Paliouras, Spam Filtering with Naive Bayes – Which Naive Bayes?

The Unreasonable Effectiveness of Naive Bayes

What’s the yoga here? Here’s my take: the conditional independence became a multiplicative condition and thus an additive condition, so the convenience of independence corresponds to the convenience of linearity. Thus, the hyperbolic punchline of this post is that “independence is linearity.”

I see a strange phenomenon (at least among pure mathematicians casually talking about applied math; I’m sure applied mathematicians have a better intuition) that people are very comfortable accepting linear approximations but not as comfortable accepting independence, whereas at least in my very simple setup they are exactly the same. I will audaciously extend my analogy to say that this intuition is inconsistent, and I don’t know why people seem to be completely fine with logistic regression (which really just says the log-odds is additive and is thus a third story equivalent to the two stories I’ve told in this blog post!) while careful about making disclaimers about Naive Bayes.

In fact, Naive Bayes, contrary to popular opinion, is actually also very good (and provably optimal, with certain definitions of optimal) when events are very dependent! It is only in the middle regions where it suffers, and it really doesn’t suffer by much. We also overestimate its problems because we like to think in terms of $L^2$ but errors in classifying are frequently done under zero-one loss (this is a really interesting nuance that I would love to talk about some other time, but this post has gotten long enough). For a more in-depth look, see Domingos and Pazzani, On the Optimality of the Simple Bayesian Classifier under Zero-One Loss.

Appendix: Noisy-Or and Bayesian Networks

When we talk about conditional independence, we really should take the setup of Bayesian networks, which gives a natural excuse to introduce Naive Bayes’ much lesser well-known sister, Noisy-Or (that often does better than Naive Bayes!). I spent some time in my talk going over the basics of d-separation, Markov blankets, etc.. However, I realized that I had no real interesting observations so I won’t talk too much about it in blog format, where the reader is very close to Google and smarter people who know much more than I do. I did have one silly “original” contribution, however, so I share it here.

Here is an example that I thought was surprisingly clean and possibly helpful for someone interested in the basics of Bayesian networks: consider the events A(AC), B(Battery), and C(Computer), corresponding to whether the corresponding electronic gizmo is on or off (with the computer connected to both the AC and the battery). This corresponds to a Bayesian network with two edges $A \rightarrow C$ and $B \rightarrow C$.

It is obvious that $A$ and $B$ are independent until $C$ is observed, which makes them conditionally dependent; if you know the Computer is on or off, then the other two power sources’ integrities are coupled. Otherwise, your blissful ignorance gives you no information. This is starkly different from every other orientation (3 possible) of the edges, where $A$ and $B$ are dependent but conditionally independent given $C$! This quirkiness makes the weird d-separation criterion necessary, and I thought this example very mnemonically convenient for marking the “bad” edge-orientation.

-Yan

Also, \dfrac{}{} is nice to know: $\frac{a}{b}$ versus $\dfrac{a}{b}$.