In this activity, we'll explore conditional probability through data and look at multiple ways of displaying categorical information. Our situation: the search for terrorists in the United States.
We will start with the hypothetical data suggested in a recent article (pdf) by John Allen Paulos. (Note: you should read the article after this Activity.)
• Suppose that 1,000 of the 300,000,000 residents of the United States are involved in terrorist activity.
• Suppose further that the government's terror detection system is "99% accurate" in two senses: 99% of all terrorists are flagged as terrorists, and 99% of all non-terrorists are correctly identified as non-terrorists.
Our goal is to answer the following question: If our terror detection system flags someone as a terrorist, what is the probability he really is a terrorist?
Will this table format work? No: in this table, we cannot cross-classify individuals (e.g., where do we fill in the number of flagged terrorists?). In the language of probability, we must have mutually exclusive rows in our table. The events "Terrorist" and "Flagged" are not mutually exclusive. Students often make this mistake in building a contingency table, especially if the two category labels are similar, like "has a Visa card" and "has a MasterCard."
Use the hypothetical numbers from above to fill in all 9 cells of the table. I'll get you started: the total number of terrorists is 1,000, and the grand total number of people is 300,000,000. You can find two of the interior cells using the 99% accuracy rates, then the remaining cells are deduced by simple subtraction. Think you have the answer? Compare your numbers to the table below:
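If you'd like to check your arithmetic, here is a minimal sketch in Python that fills in the nine cells from the hypothetical numbers above. The variable names are our own, and the row and column layout (Terrorist / Not terrorist by Flagged / Not flagged) is an assumption about how you set up your table.

```python
# Hypothetical inputs from the activity (not real statistics).
population = 300_000_000
terrorists = 1_000
accuracy_pct = 99  # applies to both senses of "99% accurate"

non_terrorists = population - terrorists

# Two interior cells come straight from the 99% rates (exact integer arithmetic).
flagged_terrorists = terrorists * accuracy_pct // 100
unflagged_non_terrorists = non_terrorists * accuracy_pct // 100

# The remaining cells follow by subtraction and addition.
unflagged_terrorists = terrorists - flagged_terrorists
flagged_non_terrorists = non_terrorists - unflagged_non_terrorists
total_flagged = flagged_terrorists + flagged_non_terrorists
total_unflagged = unflagged_terrorists + unflagged_non_terrorists

rows = [
    ("", "Flagged", "Not flagged", "Total"),
    ("Terrorist", flagged_terrorists, unflagged_terrorists, terrorists),
    ("Not terrorist", flagged_non_terrorists, unflagged_non_terrorists, non_terrorists),
    ("Total", total_flagged, total_unflagged, population),
]
for label, *cells in rows:
    line = f"{label:<14}"
    for c in cells:
        # Numbers get thousands separators; header strings are right-aligned as-is.
        line += f"{c:>14,}" if isinstance(c, int) else f"{c:>14}"
    print(line)
```

Integer arithmetic is used here so that every cell is an exact count rather than a rounded decimal.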
It might be clear to some of you why this percent (990/3,000,980, or about 0.033%) is so low, but we'll explore this a little later in the Activity.
To comfort you, let's compute the probability someone is not a terrorist, given he is not flagged: 296,999,010/296,999,020 = 99.9999966%. Whew!
How do these numbers relate to the standard conditional probability formula? Let T and F stand for the events "Terrorist" and "Flagged" respectively. We want P(T|F), which by definition equals P(T and F)/P(F). We still need the table to find either of these quantities.
P(T and F) is the probability someone is a terrorist and is flagged, equaling 990/300,000,000 according to the table. Likewise, P(F) is the probability someone is flagged, which the table shows to be 3,000,980/300,000,000. Take the ratio, and the denominators of 300,000,000 cancel, leaving 990/3,000,980, or about 0.033%.
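The cancellation can be sketched with exact fractions. This is only an illustration using the table values quoted above; the variable names are our own.

```python
from fractions import Fraction

population = 300_000_000
p_t_and_f = Fraction(990, population)        # P(T and F), from the table
p_f = Fraction(3_000_980, population)        # P(F), from the table

# Dividing the two probabilities cancels the shared denominator.
p_t_given_f = p_t_and_f / p_f
print(p_t_given_f == Fraction(990, 3_000_980))
print(f"P(T|F) = {float(p_t_given_f):.4%}")

# The same formula applied to the complements: P(Tc|Fc) = P(Tc and Fc)/P(Fc).
p_tc_given_fc = Fraction(296_999_010, population) / Fraction(296_999_020, population)
print(f"P(Tc|Fc) = {float(p_tc_given_fc):.7%}")
```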
From each of these primary branches, draw two secondary branches: Flagged and Not Flagged (F and Fc). Persist with "99% accuracy" for now: P(F|T) = 99%, and which other conditional probability equals 99%? What conditional probabilities are on the remaining two secondary branches?
Here's what you should get: the numerator has only one term, since only one "node" corresponds to T and F; its probability is t * 0.99. The denominator has two terms, since there are two "nodes" corresponding to event F; their collective probability is t * 0.99 + (1-t) * 0.01.
And so, P(T|F) = 0.99t/[0.99t + 0.01(1-t)]. If you plug in t = 1000/300,000,000, you'll get our answer from earlier.
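To check the plug-in step, here is a small sketch of the tree-diagram formula as a Python function. The function name and the default accuracy argument are our own choices for illustration.

```python
def prob_terrorist_given_flagged(t, accuracy=0.99):
    """P(T|F) from the tree diagram, assuming P(F|T) = P(Fc|Tc) = accuracy."""
    true_flags = t * accuracy                # node: T and F
    false_flags = (1 - t) * (1 - accuracy)   # node: Tc and F
    return true_flags / (true_flags + false_flags)

t = 1000 / 300_000_000
print(f"P(T|F) = {prob_terrorist_given_flagged(t):.4%}")
```

The result should agree with the ratio 990/3,000,980 from the table, up to floating-point rounding.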
We found that P(T|F) = 0.99t/[0.99t + 0.01(1-t)], where t equals the proportion of all U.S. residents involved in terrorist activity. Make a graph of this probability as a function of t (what is the domain of t?). What do you observe?
With the correct graph, you'll see that the "true positive" rate, P(T|F), is low when t is extremely low, but that it improves dramatically even for modest values of t. Using the Trace tool on your calculator (or basic algebra), find the lowest value of t for which the "true positive" rate is at least 90%. (You should get t ≈ 8.33%, or exactly 1/12.)
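If you don't have a graphing calculator handy, the "basic algebra" route can be sketched in Python. Setting at/[at + (1-a)(1-t)] ≥ q and solving for t gives t ≥ q(1-a)/[a(1-q) + q(1-a)], where a is the accuracy and q is the target rate. The function names and the symbols a and q are ours, not part of the Activity.

```python
def prob_terrorist_given_flagged(t, accuracy=0.99):
    """P(T|F), assuming P(F|T) = P(Fc|Tc) = accuracy."""
    return t * accuracy / (t * accuracy + (1 - t) * (1 - accuracy))

def min_base_rate(target, accuracy=0.99):
    """Smallest t with P(T|F) >= target, from solving the inequality for t."""
    miss = 1 - accuracy
    return target * miss / (accuracy * (1 - target) + target * miss)

t_star = min_base_rate(0.90)
print(f"threshold t = {t_star:.2%}")

# A few points on the curve, to see how quickly it rises.
for t in (0.000001, 0.0001, 0.01, t_star, 0.25, 0.5):
    print(f"t = {t:<10.6f} P(T|F) = {prob_terrorist_given_flagged(t):.4f}")
```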
The same issue arises in medical testing, where you might have heard the term "false positive" before. If every American were screened for HIV or some other rare condition, most positive tests could well be false, and panic would ensue. Instead, the government does not mandate so-called "mass screening"; only those in high-risk groups are encouraged to get tested. Furthermore, those who receive a positive test result are encouraged to get re-tested.
So we can graph P(T|F) as a function of p, the common accuracy P(F|T) = P(Fc|Tc); let's set t = 10% to begin. Graph 0.10p/[0.10p + 0.90(1-p)], and you shouldn't be surprised at the result: our ability to flag the terrorists increases monotonically with p. How does the graph change if t = 0.01%? Plot both graphs on the same axes. Notice that when t is really small, we need p to be extremely large to have even a moderate "true positive" rate. We saw this with the original (hypothetical) data: even with 99% "accuracy," the proportion of terrorists among all flagged individuals was just 0.033%.
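As a stand-in for the two graphs, here is a minimal sketch that tabulates the function for both values of t. It assumes p denotes the common accuracy, P(F|T) = P(Fc|Tc) = p; the particular p values in the list are our own picks.

```python
def prob_terrorist_given_flagged(t, p):
    """P(T|F) when the base rate is t and P(F|T) = P(Fc|Tc) = p."""
    return t * p / (t * p + (1 - t) * (1 - p))

accuracies = [0.5, 0.9, 0.99, 0.999, 0.9999]  # illustrative values of p
for t in (0.10, 0.0001):                      # t = 10% and t = 0.01%
    print(f"t = {t:.2%}")
    for p in accuracies:
        print(f"  p = {p:<7} P(T|F) = {prob_terrorist_given_flagged(t, p):.4f}")
```

Compare the two columns of output row by row to see how much larger p must be when t is tiny.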
Finally, if your calculator or computer can graph in 3D, explore what happens when P(F|T) and P(Fc|Tc) are not the same. Even if you can't make the graphs, at least work out how to adjust the formula. If P(F|T) = p1 and P(Fc|Tc) = p2, what is P(T|F)?
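Once you have worked out your own adjusted formula, you can compare it against this sketch. It follows the same tree-diagram reasoning as before, with p1 on the "T then F" branch and 1 - p2 on the "Tc then F" branch. The function name is ours.

```python
def prob_terrorist_given_flagged(t, p1, p2):
    """P(T|F) when P(F|T) = p1 and P(Fc|Tc) = p2."""
    true_flags = t * p1                # node: T and F
    false_flags = (1 - t) * (1 - p2)   # node: Tc and F
    return true_flags / (true_flags + false_flags)

t = 1000 / 300_000_000
for p1, p2 in [(0.99, 0.99), (0.9999, 0.99), (0.99, 0.9999)]:
    print(f"p1 = {p1}, p2 = {p2}: P(T|F) = {prob_terrorist_given_flagged(t, p1, p2):.4%}")
```

Try other pairs to see which of the two accuracies matters more when t is very small.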