How to remember and understand Bayes' theorem

When writing this post, I consulted the first six hits that Google returned when I searched for “How to remember Bayes’ theorem”, as well as two books I had that seemed relevant. All of these resources are listed in the references section at the end of the post. Credit for the good ideas contained in this post goes to the creators of those resources. Any mistakes are mine.

Motivation for Bayes’ theorem

Bayes’ theorem is a formula for computing conditional probabilities. It can also be a way of systematically updating the degree of certainty with which we hold beliefs in a world where we can only perform imperfect tests of those beliefs. The latter way of thinking about Bayes’ theorem is sometimes referred to as the “diachronic interpretation” [8]. The “chron” in “diachronic” refers to time, which is how you can remember that this interpretation is about updating beliefs as time goes on (and more evidence is collected).

Being able to rationally update beliefs given some evidence is useful because all tests are imperfect, yet we still want to change our beliefs rationally when we collect new evidence related to those beliefs. Bayes’ theorem can be a way of doing just that.

Why is it important to realize that all tests are imperfect? We are interested in learning about an underlying reality, and to learn about that underlying reality we have to perform tests. A common mistake is to conflate the results of those tests with the underlying reality itself. This conflation would only be valid if we had access to a perfect test; real tests can succeed and fail in multiple ways. For the purposes of the following discussion, we will consider hypotheses that are either true or false. Bayes’ theorem can be extended to analyze hypotheses with a larger (but finite) number of possibilities and, indeed, to continuous distributions.

True positives and true negatives are test results that accurately reflect positive and negative underlying realities, respectively. False positives and false negatives are testing failures that incorrectly identify negative and positive underlying realities, respectively. The existence of these testing failures is why we must be cautious when interpreting the results of tests.

There are lots of good examples out there that demonstrate why it is a mistake to conflate the results of an imperfect test with the underlying reality. Imagine that you are living during the Second Red Scare and you find out that Joseph McCarthy has determined that your neighbor is a communist. How likely is it, then, that your neighbor is a communist? You might suspect that Joe’s communist detector has a very high false positive rate, and you’ll want to take that into account when updating your belief about your neighbor. Let’s see how to do that.

Remembering Bayes’ theorem

To the uninitiated, Bayes’ theorem can look mysterious and can certainly be hard to remember. Have a look at the way that it is most commonly presented:

\begin{align} P(A|B) = \frac{P(B|A)P(A)}{P(B)} \end{align}

If I were to try to read this out loud, I might say “The probability of $A$ given $B$ is equal to the probability of $B$ given $A$ times the probability of $A$, divided by the probability of $B$.” Not terribly memorable or meaningful. Let’s see if we can do better. First, we’ll work on how to remember the formula in the first place and then talk about how to better understand what it means. Consider the very simple diagram below, which represents the part of our sample space that we are presently interested in [5].

If we draw samples repeatedly from within the colored region, we will sometimes draw a blue sample (only $A$ is true), sometimes draw a yellow sample (only $B$ is true), and sometimes draw a green sample (both $A$ and $B$ are true). With Bayes’ theorem, we are interested in answering the question, “Given that I know $B$ is true, what is the probability that $A$ is also true?” In terms of our diagram above, this is equivalent to asking, “What is the ratio of the area of the green region to the sum of the areas of the green and yellow regions?” In symbolic terms, that is:

\begin{align} P(A|B) = P(A\cap B)/P(B) \end{align}
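To make the area-ratio picture concrete, here is a minimal simulation sketch. The probabilities in it are made up purely for illustration; the point is that estimating $P(A|B)$ as the fraction of $B$-samples in which $A$ also holds agrees with the ratio $P(A\cap B)/P(B)$.

```python
import random

# Hypothetical probabilities, chosen only for illustration.
p_a = 0.4              # P(A)
p_b_given_a = 0.5      # P(B|A)
p_b_given_not_a = 0.2  # P(B|not A)

random.seed(0)
n, b_count, both_count = 100_000, 0, 0
for _ in range(n):
    a = random.random() < p_a
    b = random.random() < (p_b_given_a if a else p_b_given_not_a)
    b_count += b            # sample landed in the B region (yellow or green)
    both_count += a and b   # sample landed in the overlap (green)

# Monte Carlo estimate of P(A|B): "green" samples over "green + yellow" samples
print("simulated P(A|B):", both_count / b_count)

# Direct ratio P(A and B) / P(B) from the same assumed numbers
p_ab = p_a * p_b_given_a
p_b = p_a * p_b_given_a + (1 - p_a) * p_b_given_not_a
print("P(A and B)/P(B): ", p_ab / p_b)
```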

We can ask the equivalent question for $P(B|A)$ and we will obtain:

\begin{align} P(B|A) = P(B\cap A)/P(A) \end{align}

The intersection ($\cap$), or overlap, of two sets is independent of the order, i.e., $P(A\cap B)=P(B\cap A)$. Another way of saying the same thing is that conjunction is commutative [8]. The conjunction of $A$ and $B$ simply refers to the intersection of $A$ and $B$, and an operation is said to be commutative if swapping the order of its operands always gives the same result. So if we rewrite the second equation as $P(B\cap A) = P(B|A)P(A)$ and substitute it into the first, we obtain Bayes’ theorem.

\begin{align} P(A|B) = P(B|A)P(A)/P(B) \end{align}
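As a sanity check, we can plug in the made-up numbers from the simulation sketch above ($P(A) = 0.4$, $P(B|A) = 0.5$, and $P(B) = 0.32$):

\begin{align} P(A|B) = \frac{P(B|A)P(A)}{P(B)} = \frac{0.5 \times 0.4}{0.32} = 0.625 \end{align}

which matches the direct ratio $P(A\cap B)/P(B)$ computed there.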

Now that we know how to quickly rederive Bayes’ theorem, let’s see if we can understand better what it actually means.

Understanding Bayes’ theorem

To make Bayes’ theorem easier to understand, let’s first change the variable names to something that makes a stronger connection with our new understanding of the motivation for Bayes’ theorem [6]. Instead of $A$ and $B$, let’s use $H$ (for hypothesis) and $E$ (for evidence). With that variable change, we now have:

\begin{align} P(H|E) = \frac{P(E|H)P(H)}{P(E)} \end{align}

If we try reading just the left hand side of the equation, it would be, “The probability of my hypothesis given the evidence is equal to …”. That’s an improvement over talking about $A$ and $B$ because it makes our intentions clear: we want to calculate the probability of our hypothesis being true given that we have observed some evidence related to that hypothesis.

It is useful at this point to bring in some terms that are commonly used in the context of Bayesian reasoning. You will often hear people referring to the prior and posterior distributions. The prior distribution $P(H)$ reflects our belief about the hypothesis prior to observing the evidence $E$. The posterior distribution $P(H|E)$ is our updated belief after observing the evidence $E$. In other words, we “update” our prior beliefs $P(H)$ by multiplying them by $P(E|H)/P(E)$ to obtain our new (posterior) beliefs $P(H|E)$.
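As a quick worked example with made-up numbers: suppose our prior is $P(H) = 0.01$, the likelihood is $P(E|H) = 0.9$, and the overall probability of observing the evidence is $P(E) = 0.108$ (we will see below where a number like this can come from). The update factor is then $0.9/0.108 \approx 8.3$, so

\begin{align} P(H|E) = P(H)\,\frac{P(E|H)}{P(E)} = 0.01 \times \frac{0.9}{0.108} \approx 0.083 \end{align}

Observing the evidence made the hypothesis roughly eight times more plausible, but it is still far from certain.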

The term $P(E|H)$ is referred to as the likelihood. In words, the likelihood is the probability of a positive test given that the hypothesis is true. $P(E)$ is the probability of a positive test regardless of whether or not the hypothesis is true. For the purposes of understanding the meaning of the right hand side of Bayes’ theorem, it can be useful to decompose $P(E)$ as follows:

\begin{align} P(E) = P(E|H)P(H)+P(E|\lnot H)P(\lnot H) \end{align}

Here we’ve introduced a new piece of notation: $\lnot H$ is the negation of $H$. The above equation says that “the probability of observing $E$ is equal to the probability of observing $E$ when $H$ is true times the probability of $H$ being true, plus the probability of observing $E$ when $H$ is not true times the probability of $H$ being untrue.” Essentially, we have just enumerated all of the possible ways of obtaining a positive test and weighted them by the appropriate probabilities. A set of hypotheses that covers all possibilities and does not overlap is sometimes referred to as “collectively exhaustive” and “mutually exclusive” [8].
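Continuing with the made-up numbers from above, if $P(H) = 0.01$, $P(E|H) = 0.9$, and the false positive rate is $P(E|\lnot H) = 0.1$, then

\begin{align} P(E) = 0.9 \times 0.01 + 0.1 \times 0.99 = 0.009 + 0.099 = 0.108 \end{align}

which is exactly the $P(E)$ used in the update above. Notice that most of the positive tests come from the false positive term, simply because $H$ is rarely true.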

$P(E|H)P(H)$ is the fraction of true positives, i.e., the fraction of all cases in which the hypothesis is actually true and the test comes back positive. $P(E|\lnot H)P(\lnot H)$ is the fraction of false positives, i.e., the fraction of all cases in which the hypothesis is actually false but the test (erroneously) comes back positive anyway. And, if you notice, the numerator on the right hand side is also equal to $P(E|H)P(H)$, i.e., the numerator is equal to the fraction of true positives. So we might choose to express Bayes’ theorem in the following way:

The probability that H is true given that I observed E is equal to the fraction of true positives divided by the sum of the fraction of true positives plus the fraction of false positives.

In symbolic form, this looks like:

\begin{align} P(H|E) = \frac{P(E|H)P(H)}{P(E|H)P(H)+P(E|\lnot H)P(\lnot H)} \end{align}
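Returning to the McCarthy example, here is a minimal sketch that applies this expanded form. The numbers are entirely hypothetical: a prior of 1% that any given neighbor is a communist, a detector that flags 90% of actual communists ($P(E|H) = 0.9$), and a 40% false positive rate ($P(E|\lnot H) = 0.4$).

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' theorem with the denominator expanded into
    the true positive and false positive fractions."""
    true_pos = p_e_given_h * prior             # P(E|H) P(H)
    false_pos = p_e_given_not_h * (1 - prior)  # P(E|not H) P(not H)
    return true_pos / (true_pos + false_pos)

# Hypothetical numbers for the McCarthy example.
print(posterior(prior=0.01, p_e_given_h=0.9, p_e_given_not_h=0.4))  # ~0.022
```

Even after being flagged, the posterior probability that your neighbor is a communist is only about 2%, because the false positive fraction swamps the true positive fraction.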

Finally, note that both the true positive rate and the false positive rate have a component that is related to the accuracy of the test ($P(E|H)$ and $P(E|\lnot H)$) and a component that is related to the underlying reality ($P(H)$ and $P(\lnot H)$). Naively conflating the results of a test with the underlying reality is, therefore, more and more problematic as $P(E|\lnot H)$ or $P(\lnot H)$ (or both) get farther from 0 and closer to 1.

Still confused? It’s normal (no pun intended). I recommend reading through the references below (and others) until it clicks. Different explanations will be effective for different people. If you’re interested in rationally updating your beliefs (or teaching computers to do likewise), it will certainly be worth your while.

References

  1. https://stats.stackexchange.com/questions/156852/what-do-did-you-do-to-remember-bayes-rule
  2. https://www.mathsisfun.com/data/bayes-theorem.html
  3. https://betterexplained.com/articles/an-intuitive-and-short-explanation-of-bayes-theorem/
  4. http://yudkowsky.net/rational/bayes
  5. https://movieblow.wordpress.com/2011/11/06/how-to-remember-bayes-theorem-without-really-trying/
  6. https://mathblog.com/how-to-remember-the-key-bayes-formula-in-statistics/
  7. http://web4.cs.ucl.ac.uk/staff/D.Barber/textbook/090310.pdf
  8. http://www.greenteapress.com/thinkbayes/thinkbayes.pdf

Written on May 20, 2019