Comparing Two Populations

 Activity

Students have a hard time understanding whether or not to treat a dataset as "paired." In this activity, we're going to see why "pairing" is important, and why it's wrong to treat paired data as two independent samples.

1) Download the acid rain data set and load it into Fathom.

2) This data set includes pH measurements from rain samples at 32 locations in a particular county. At room temperature, a substance is considered a "base" if its pH is above 7.0 and "acidic" if its pH is below 7.0. Pure rain falling through clean air registers a pH value of 5.7, so rainwater in its natural state is considered acidic.

The first variable, "lastyear", consists of measurements taken from water samples at the 32 locations last year. County officials are concerned that the rainwater is becoming more acidic (that is, that its pH is falling), which would indicate a problem with pollution. The second variable, "thisyear", consists of measurements taken this year at the same sites as in "lastyear". Our goal is to determine whether the mean pH level has fallen since last year.

As a first step, we'll consider the variables separately (unpaired). Make an appropriate graphical summary. (a) Describe the graph. (b) What would you conclude?

3) Find a 95% confidence interval for the difference in the mean pH level for the current year and for last year. What do you conclude about the difference of the means?

4) Perform a hypothesis test with a 5% significance level to see whether the mean pH level is lower this year.
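For readers who want to check the Fathom output for steps 3 and 4 by hand, here is a minimal Python sketch of the unpaired calculation. It is an illustration under stated assumptions, not the activity's prescribed method: the function name `unpaired_summary` is hypothetical, the eight pH readings are made up (they are NOT the acid rain data set), and a normal critical value is used in place of the t critical value that a t-based procedure would use, so the interval comes out slightly narrower than a t interval.

```python
import math
from statistics import NormalDist, mean, variance


def unpaired_summary(x, y, level=0.95):
    """Treat x and y as independent samples and summarise mean(x) - mean(y)."""
    diff = mean(x) - mean(y)
    # Standard error under independence: the two sampling variances simply add.
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    # Normal critical value (about 1.96 for 95%); an assumption of this sketch.
    k = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (diff - k * se, diff + k * se)
    # One-sided p-value for the alternative mean(x) < mean(y).
    p_lower = NormalDist().cdf(diff / se)
    return diff, se, ci, p_lower


# Hypothetical pH readings for illustration only; NOT the acid rain data set.
lastyear = [5.1, 4.8, 5.6, 5.3, 4.5, 5.9, 5.0, 5.4]
thisyear = [5.0, 4.6, 5.5, 5.3, 4.3, 5.7, 4.9, 5.2]

diff, se, ci, p_lower = unpaired_summary(thisyear, lastyear)
print(f"difference = {diff:.4f}, SE = {se:.4f}")
print(f"approx. 95% CI = ({ci[0]:.4f}, {ci[1]:.4f}), one-sided p = {p_lower:.4f}")
```

With these made-up values the interval straddles zero, so an unpaired analysis would not detect a drop in pH.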

Now we'll "pair" the data. The reason for pairing is that the measurements in each year come from the same sites. So, for example, the first value in "lastyear" is taken from the same site as the first value in "thisyear". Because of this, we have good reason to suspect that the assumption that the two samples are independent has been violated.

5) Create a new variable called "diff". In Fathom, click twice on the collection to open the inspector. (If the inspector is already open, skip this step.) On the inspector, click on the "New Attribute" field and name the new attribute "diff". Click twice on the "formula" field that corresponds to this new attribute. Type "thisyear-lastyear"; the new variable will now hold this year's value minus last year's for each site.

6) Make an appropriate graphical summary of the difference between the two years. Describe the graph and state your (preliminary) conclusion.

7) Find a 95% confidence interval for the mean of the differences. What do you conclude?

8) Perform a hypothesis test with a 5% significance level on the mean of the differences to see if the pH level is lower this year.
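The paired calculation in steps 7 and 8 can be sketched the same way: reduce each site to a single difference, then treat the differences as one sample. This is a hedged illustration only. The function name `paired_summary` is hypothetical, the readings are made up (NOT the acid rain data set), and a normal critical value stands in for the t critical value.

```python
import math
from statistics import NormalDist, mean, stdev


def paired_summary(x, y, level=0.95):
    """Summarise the per-site differences x[i] - y[i] as a single sample."""
    if len(x) != len(y):
        raise ValueError("paired samples must have the same length")
    d = [xi - yi for xi, yi in zip(x, y)]
    d_bar = mean(d)
    # Standard error of the mean difference: SD of the differences over sqrt(n).
    se = stdev(d) / math.sqrt(len(d))
    # Normal critical value (about 1.96 for 95%); an assumption of this sketch.
    k = NormalDist().inv_cdf(0.5 + level / 2)
    ci = (d_bar - k * se, d_bar + k * se)
    # One-sided p-value for the alternative "mean difference < 0".
    p_lower = NormalDist().cdf(d_bar / se)
    return d_bar, se, ci, p_lower


# Hypothetical pH readings for illustration only; NOT the acid rain data set.
lastyear = [5.1, 4.8, 5.6, 5.3, 4.5, 5.9, 5.0, 5.4]
thisyear = [5.0, 4.6, 5.5, 5.3, 4.3, 5.7, 4.9, 5.2]

d_bar, se, ci, p_lower = paired_summary(thisyear, lastyear)
print(f"mean difference = {d_bar:.4f}, SE = {se:.4f}")
print(f"approx. 95% CI = ({ci[0]:.4f}, {ci[1]:.4f}), one-sided p = {p_lower:.6f}")
```

For these made-up readings the whole interval lies below zero, because the site-to-site variation cancels when each site is compared with itself.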

Why did we get different conclusions?

9) To see this, make a scatterplot of "thisyear" against "lastyear". What's the correlation?

10) Remember that the width of a confidence interval depends on the standard error of the estimator; the width of a confidence interval is 2*K*SE, where K is the critical value for the chosen confidence level. Let Xbar represent the average of this year, and Ybar represent the average of last year. Then Var(Xbar-Ybar) = sigma_x^2/n + sigma_y^2/n - 2*rho*sigma_x*sigma_y/n, where rho is the correlation between X and Y and n is the number of sites. (To refresh your memory, review the data collection section of Unit 7.)
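The variance of the difference of the two sample means can be worked out in a few lines. The sketch below assumes that different sites are independent of one another, so only same-site pairs (X_i, Y_i) contribute to the covariance, and that every site has the same correlation rho and the same standard deviations sigma_x and sigma_y. It uses the `align*` environment from the amsmath package.

```latex
\begin{align*}
\operatorname{Var}(\bar{X} - \bar{Y})
  &= \operatorname{Var}(\bar{X}) + \operatorname{Var}(\bar{Y})
     - 2\,\operatorname{Cov}(\bar{X}, \bar{Y}) \\
\operatorname{Cov}(\bar{X}, \bar{Y})
  &= \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \operatorname{Cov}(X_i, Y_j)
   = \frac{1}{n^2} \cdot n \, \rho \, \sigma_x \sigma_y
   = \frac{\rho \, \sigma_x \sigma_y}{n} \\
\operatorname{Var}(\bar{X} - \bar{Y})
  &= \frac{\sigma_x^2 + \sigma_y^2 - 2 \rho \, \sigma_x \sigma_y}{n}
\end{align*}
```

Every term is divided by n, so the correlation term shrinks at the same rate as the two variance terms as the number of sites grows.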

For this dataset the correlation between X and Y is about 0.9. The standard error is the square root of the variance above. If we ignore the fact that X and Y are correlated, which is equivalent to setting rho=0, we drop the correlation term and, because rho is positive here, get a larger standard error and therefore a wider confidence interval. So when X and Y are positively correlated, ignoring the pairing gives a confidence interval that's too wide, and you might miss an interesting difference in means. On the other hand, if X and Y are negatively correlated, you'll get a confidence interval that's too narrow, and might mistakenly think there's a difference when there's really not! This illustrates what can happen when you do an "unpaired" test with paired data.

If Xbar represents the average of thisyear, and Ybar the average of lastyear, then the standard error is SD(Xbar-Ybar). Calculate this assuming (a) that Xbar and Ybar are independent and (b) that the true correlation is the same as the sample correlation that you calculated in (9).
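As a rough check on this calculation, the following Python sketch computes SD(Xbar-Ybar) under both assumptions. The helper names `sample_correlation` and `se_of_difference` are hypothetical, and the readings are made up for illustration (they are NOT the acid rain data set). The last lines also show that version (b), when fed the sample SDs and the sample correlation, reproduces the standard error obtained directly from the "diff" variable.

```python
import math
from statistics import mean, stdev


def sample_correlation(x, y):
    """Pearson correlation of two equal-length samples."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def se_of_difference(sd_x, sd_y, n, rho=0.0):
    """SD(Xbar - Ybar) for n pairs when X and Y have correlation rho."""
    return math.sqrt((sd_x**2 + sd_y**2 - 2 * rho * sd_x * sd_y) / n)


# Hypothetical pH readings for illustration only; NOT the acid rain data set.
lastyear = [5.1, 4.8, 5.6, 5.3, 4.5, 5.9, 5.0, 5.4]
thisyear = [5.0, 4.6, 5.5, 5.3, 4.3, 5.7, 4.9, 5.2]

n = len(thisyear)
sx, sy = stdev(thisyear), stdev(lastyear)
r = sample_correlation(thisyear, lastyear)

se_independent = se_of_difference(sx, sy, n)       # (a) rho = 0
se_correlated = se_of_difference(sx, sy, n, rho=r)  # (b) rho = sample correlation

# Standard error computed directly from the per-site differences.
diffs = [t - l for t, l in zip(thisyear, lastyear)]
se_from_diffs = stdev(diffs) / math.sqrt(n)

print(f"r = {r:.3f}")
print(f"(a) SE assuming independence = {se_independent:.4f}")
print(f"(b) SE using sample r        = {se_correlated:.4f}")
print(f"SE from the differences      = {se_from_diffs:.4f}")
```

The agreement between (b) and the SE from the differences is an algebraic identity for sample statistics, which is why analysing "diff" as a single sample automatically accounts for the correlation.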