Activity


Probability Essentials



In this activity, we'll explore conditional probability through data and look at multiple ways of displaying categorical information. Our situation: the search for terrorists in the United States.

We will start with the hypothetical data suggested in a recent article (pdf) by John Allen Paulos. (Note: you should read the article after this Activity.)

• Suppose that 1,000 of the 300,000,000 residents of the United States are involved in terrorist activity.

• Suppose further that the government's terror detection system is "99% accurate" in two senses: 99% of all terrorists are flagged as terrorists, and 99% of all non-terrorists are correctly identified as non-terrorists.

Our goal is to answer the following question: If our terror detection system flags someone as a terrorist, what is the probability he really is a terrorist?

Let's first try to display the data we have in a table. We might begin this way:

	Yes	No	Total
Terrorist?
Flagged?
Total			300,000,000

Will this table format work? No: in this table, we cannot cross-classify individuals (e.g., where do we fill in the number of flagged terrorists?). In the language of probability, we must have mutually exclusive rows in our table. The events "Terrorist" and "Flagged" are not mutually exclusive. Students often make this mistake in building a contingency table, especially if the two category labels are similar, like "has a Visa card" and "has a MasterCard."

Instead, we have to format the table as below:

	Terrorist	Non-terrorist	Total
Flagged
Not Flagged
Total

Use the hypothetical numbers from above to fill in all 9 cells of the table. I'll get you started: the total number of terrorists is 1,000, and the grand total number of people is 300,000,000. You can find two of the interior cells using the 99% accuracy rates, then the remaining cells are deduced by simple subtraction. Think you have the answer? Compare your numbers to the table below:

	Terrorist	Non-terrorist	Total
Flagged	990	2,999,990	3,000,980
Not Flagged	10	296,999,010	296,999,020
Total	1,000	299,999,000	300,000,000
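If you'd like to check the table by computer, here is a minimal sketch in Python (our choice of language, not part of the original activity; the variable names are illustrative). It follows the same steps: two interior cells from the 99% rates, then the rest by subtraction.

```python
# Hypothetical inputs from the activity.
population = 300_000_000
terrorists = 1_000
accuracy_pct = 99  # percent, applied to both terrorists and non-terrorists

non_terrorists = population - terrorists

# Two interior cells follow from the accuracy rate.
# Integer arithmetic keeps the head counts exact.
flagged_terrorists = terrorists * accuracy_pct // 100
unflagged_non_terrorists = non_terrorists * accuracy_pct // 100

# The remaining interior cells follow by subtraction.
unflagged_terrorists = terrorists - flagged_terrorists
flagged_non_terrorists = non_terrorists - unflagged_non_terrorists

# Each row holds (Terrorist, Non-terrorist) counts.
table = {
    "Flagged": (flagged_terrorists, flagged_non_terrorists),
    "Not Flagged": (unflagged_terrorists, unflagged_non_terrorists),
}

for label, (t_count, n_count) in table.items():
    print(f"{label:<12}{t_count:>8,}{n_count:>14,}{t_count + n_count:>14,}")
print(f"{'Total':<12}{terrorists:>8,}{non_terrorists:>14,}{population:>14,}")
```

The row totals are computed as sums rather than typed in, so an arithmetic slip in any one cell shows up immediately in the margins.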

We can now answer lots of questions about this hypothetical terror detection system. For example, the probability someone is really a terrorist, given that he was flagged, is just 990/3,000,980, or about 0.033%. In other words, only about 1 in every 3,000 flagged people is actually a terrorist. Scary, eh?

It might be clear to some of you why this percent is so low, but we'll explore this a little later in the Activity.

To comfort you, let's compute the probability someone is not a terrorist, given he is not flagged: 296,999,010/296,999,020 = 99.9999966%. Whew!

How do these numbers relate to the standard conditional probability formula? Let T and F stand for the events "Terrorist" and "Flagged" respectively. We want P(T|F), which by definition equals P(T and F)/P(F). We still need the table to find both of these quantities.

P(T and F) is the probability someone is a terrorist and is flagged, equaling 990/300,000,000 according to the table. Likewise, P(F) is the probability someone is flagged, which the table shows to be 3,000,980/300,000,000. Take the ratio, and the denominators of 300,000,000 cancel.
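The cancellation can be checked exactly with a short Python sketch. Using the standard library's Fraction type is our assumption for illustration (it avoids rounding), and the variable names are ours; the counts are read straight off the table.

```python
from fractions import Fraction

population = 300_000_000
terrorist_and_flagged = 990   # cell: Terrorist column, Flagged row
flagged_total = 3_000_980     # row total: Flagged

# Unconditional probabilities, both over the whole population.
p_t_and_f = Fraction(terrorist_and_flagged, population)
p_f = Fraction(flagged_total, population)

# Conditional probability formula: P(T|F) = P(T and F) / P(F).
p_t_given_f = p_t_and_f / p_f

# The population size cancels, leaving the ratio of the two table counts.
assert p_t_given_f == Fraction(terrorist_and_flagged, flagged_total)

print(float(p_t_given_f))  # about 0.00033, i.e. roughly 0.033%
```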

A contingency table is the ideal way to display the relationship between two categorical variables, especially when we have whole numbers (rather than just percents). It takes some time to assemble the entire table, but it allows students to easily find conditional probabilities once the table is complete.

When we only have percentages, and some of those percentages are conditional probabilities, a tree diagram is ideal. In our terrorism problem, let's say we didn't know the population size, but only that P(T) = t; i.e., 100t% of all U.S. residents are involved in terrorist activity. Since that information is not conditional, Terrorist versus Non-terrorist (T and T^c, where the superscript c denotes the complement) can form the first branches of our tree. You should draw the diagram yourself; one branch has probability t, so what's the other branch's probability?

From each of these primary branches, draw two secondary branches: Flagged and Not Flagged (F and F^c). Persist with "99% accuracy" for now: P(F|T) = 99%, and which other conditional probability equals 99%? What conditional probabilities are on the remaining two secondary branches?

With your tree diagram complete, you can again find P(T|F). Use the formula P(T|F) = P(T and F)/P(F) as your inspiration, and follow the branches.

Here's what you should get: the numerator has only one term, since only one "node" corresponds to T and F; its probability is t * 0.99. The denominator has two terms, since there are two "nodes" corresponding to event F; their collective probability is t * 0.99 + (1-t) * 0.01.

And so, P(T|F) = 0.99t/[0.99t + 0.01(1-t)]. If you plug in t = 1,000/300,000,000, you'll get our answer from earlier: about 0.033%.
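Here is the tree calculation as a minimal Python sketch. The function name and its accuracy parameter are hypothetical, introduced only for illustration; the arithmetic is the "multiply along the paths, then add the flagged leaves" rule described above.

```python
def p_terrorist_given_flagged(t, accuracy=0.99):
    """P(T|F) from the tree diagram.

    t        -- P(T), the proportion of residents who are terrorists
    accuracy -- P(F|T) and P(not F | not T), assumed equal as in the activity
    """
    # Multiply probabilities along each path that ends in "Flagged".
    terrorist_then_flagged = t * accuracy
    non_terrorist_then_flagged = (1 - t) * (1 - accuracy)

    # Numerator: the one leaf for "T and F". Denominator: both leaves for F.
    return terrorist_then_flagged / (
        terrorist_then_flagged + non_terrorist_then_flagged
    )


t = 1_000 / 300_000_000
print(f"{p_terrorist_given_flagged(t):.3%}")  # 0.033%
```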

Two pedagogical notes, before this Activity really gets interesting. First, students master tree diagrams faster than you'd expect, especially if you lead them through the fundamentals a few times (unconditional probabilities on the first branches, multiply probabilities along paths, etc.). Second, you might recognize our fractional answer above as a form of Bayes' Formula, and you're right! Bayes' Formula is not on the AP syllabus, but your students should still be able to answer conditional probability problems like this terrorism question. (They're expected to use a tree diagram or construct a contingency table.)

Digging deeper
Why is P(T|F) so low? Or, equivalently, why is the proportion of "false positives" among all flagged individuals, which equals 1-P(T|F) in our scenario, so high?

We found that P(T|F) = 0.99t/[0.99t + 0.01(1-t)], where t equals the proportion of all U.S. residents involved in terrorist activity. Make a graph of this probability as a function of t (what is the domain of t?). What do you observe?

With the correct graph, you'll see that the "true positive" rate is low when t is extremely low, but that rate improves dramatically even for modest values of t. Using the Trace tool on your calculator (or basic algebra), find the lowest value of t for which the "true positive" rate is at least 90%. (You should get t = 8.33%.)
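For those taking the "basic algebra" route, setting 0.99t/[0.99t + 0.01(1-t)] equal to the target and solving for t gives a closed form. The Python sketch below evaluates it; the function name and parameters are our own illustrative choices, not part of the original activity.

```python
def min_base_rate(target, accuracy=0.99):
    """Smallest t for which P(T|F) reaches the target.

    Solving accuracy*t / (accuracy*t + (1-accuracy)*(1-t)) = target for t gives
    t = target*(1-accuracy) / (accuracy*(1-target) + target*(1-accuracy)).
    """
    miss_rate = 1 - accuracy  # chance that a non-terrorist is flagged
    return target * miss_rate / (accuracy * (1 - target) + target * miss_rate)


t_min = min_base_rate(0.90)
print(f"{t_min:.2%}")  # 8.33%
```

With the default 99% accuracy the expression reduces to 0.009/0.108, which is exactly 1/12.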

What's going on mathematically? If almost nobody is a terrorist, then 99% of all terrorists is a tiny number relative to 1% of everyone else. So, the pool of "flagged" individuals consists almost entirely of innocent people who were flagged by mistake. In real life, the solution is obvious: anti-terrorist agencies use a set of criteria to narrow down the field of "likely" terrorists.

The same issue arises in medical testing, where you might have heard the term "false positive" before. If every American were screened for AIDS or some other rare disease, most positive tests would be false, and panic would ensue. Instead, the government does not mandate so-called "mass screening"; only those in high-risk groups are encouraged to get tested. Furthermore, those who receive a positive test result are encouraged to get re-tested.

Let's explore one last parameter: the "accuracy rate" of our detection system. For simplicity, we'll stick with the same parameter for both terrorists and non-terrorists: P(F|T) = P(F^c|T^c) = p. Then the conditional probability of a correct flag equals P(T|F) = p*t/[p*t + (1-p)*(1-t)].

So we can graph this function; let's set t = 10% to begin. Graph 0.10p/[0.10p + 0.90(1-p)], and you shouldn't be surprised at the result: the probability that a flagged person really is a terrorist increases monotonically with p. How does the graph change if t = 0.01%? Plot both graphs on the same axes. Notice that when t is really small, we need p to be extremely close to 1 to have even a moderate "true positive" rate. We saw this with the original (hypothetical) data: even with 99% "accuracy," the proportion of terrorists among all flagged individuals was just 0.033%.
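If a graphing calculator isn't handy, a few tabulated values tell the same story. The Python sketch below is only an illustration: the function names and the particular values of p are our assumptions, and the second function comes from solving the formula above for p.

```python
def p_correct_flag(p, t):
    """P(T|F) when P(F|T) = P(not F | not T) = p and P(T) = t."""
    return p * t / (p * t + (1 - p) * (1 - t))


def accuracy_needed(target, t):
    """Value of p at which P(T|F) equals the target, for a given t.

    Obtained by solving p*t / (p*t + (1-p)*(1-t)) = target for p.
    """
    return target * (1 - t) / (t * (1 - target) + target * (1 - t))


# Compare the two scenarios side by side for a few accuracy rates.
for p in (0.5, 0.9, 0.99, 0.999, 0.9999):
    at_10_percent = p_correct_flag(p, 0.10)
    at_tiny_rate = p_correct_flag(p, 0.0001)
    print(f"p = {p:<6}  t = 10%: {at_10_percent:.4f}   t = 0.01%: {at_tiny_rate:.4f}")

# Accuracy required for a flag to be right half the time when t = 0.01%.
print(accuracy_needed(0.5, 0.0001))
```

With t = 0.01%, even p = 0.99 leaves P(T|F) below 1%, and p must reach about 99.99% before a flag is right half the time.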