Two-Variable Relationships
Teaching Tips
• When it comes to interpretation, the slope often carries quite a bit of meaning and the intercept does not. It is not unusual for the intercept to be meaningless: many variables, height for example, never take on the value 0.
• Interpreting the slope in the context of the data collected, and without implying a causal connection between x and y, is hard. You want to be very careful not to use the algebraic interpretation of slope: "as x goes up by 1 unit, y changes by b units." One reason for this caution is that quite often in a data set we never observe x changing for any given individual. If y represents sons' heights and x represents fathers' heights, then interpreting the slope to mean that as a father's height increases by 1 inch the average son's height goes up by b inches is nonsensical, since fathers do not grow.
• For most observational studies (the fathers' and sons' heights in the previous bullet are a good example), a good interpretation of the slope compares how y values differ for different x values. For example, if the slope between father and son heights were 0.5, we could correctly interpret this slope to mean "Sons whose fathers' heights differ by 1 inch differ in average height by 0.5 inches."
• The regression of y on x is not the same as that of x on y. However, the correlation between y and x is the same as between x and y.
• Correlation appears in several guises. First, it provides a quick, numerical summary of the "strength" of a linear relationship. In this context, correlation only makes sense if the relationship is indeed linear. Second, the slope of the regression line is proportional to the correlation coefficient: slope = r * (SD of y) / (SD of x). Third, the square of the correlation, called "r-squared", measures the "fit" of the regression line to the data. The standard phrase is: r-squared measures the percent of variation in y explained by the variable x. This sentence, while precise, has no meaning to most people and will need to be carefully explained. (See the demonstration.) However, r-squared is easy to use: if it is low (near 0), there is still a lot of unexplained variability in the data. If it is close to 1, the regression line does a good job of fitting the data. I like to explain r-squared in the context of prediction: a high r-squared means that if you tell me the value of x, my prediction of y will be pretty close to what we actually observe. But if r-squared is low, my prediction might be pretty far off.
• Sometimes students will equate a steep slope with a high
value of the correlation coefficient. This is an easy mistake to make,
because the slope does depend directly on the correlation coefficient.
However, the ratio of the standard deviations of y to x plays an equal
role, and so one should not think "steep slope == high r".
• This is a complex topic that we cover fairly quickly. We'll cover more of this in the next unit. But now is probably a good time to give you some organizing principles. I like to teach this by presenting it the same way I would analyze data:
1) Plot y versus x and give a verbal description in the context of the data. In particular, we want to know if the relationship looks linear.
2) If it looks linear, then compute the correlation as a means of quantifying the strength of this relationship.
3) If it looks linear, compute the regression line as a means of making predictions about future observations, of quantifying the rate of "change" between x and y, or of understanding how "typical" y-values differ for different values of x.
4) Now look at residuals to see if the relationship really looks linear. You already did this with the scatterplot, but looking at residuals gives you a "sharper" picture. It's as if the residual plot has focused your vision a bit and might point out non-linearity that you couldn't see before. You also want to see if there are any influential outliers that might affect your interpretation of the slope.
5) If you still think the data are linear, write an interpretation of the intercept (if applicable) and the slope, taking into account any influential observations.
6) Assuming the data are linear, calculate r-squared. This
functions as a bit of "currency" for consumers of your regression. If
your r-squared is "good", they'll buy your explanation. If not, they
might think twice. For example, I might have confirmed to everyone's
satisfaction that there is a linear relationship between the amount of
protein in a city's diet and the average price of steak in that city.
But if my r-squared is very low, this means there's so much variability
that my regression line might be of little practical use.
• Give your students practice reading and interpreting computer printouts from regressions. This has appeared on the AP exam.
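
Several of the tips above can be checked numerically. The following is a minimal Python sketch, not part of the original materials: the height values are made up for illustration, and the helper names (`mean`, `sd`, `correlation`, `least_squares`) are hypothetical. It computes the correlation, the regression line, the residuals, and r-squared by hand, and also fits the regression of x on y and a rescaled copy of y for comparison.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(values)
    return sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def correlation(x, y):
    """Pearson correlation coefficient r between x and y."""
    mx, my = mean(x), mean(y)
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cross / ((len(x) - 1) * sd(x) * sd(y))


def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line predicting y from x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


# Made-up heights in inches, for illustration only (not real data).
fathers = [65.0, 67.0, 68.0, 70.0, 72.0, 74.0]
sons = [67.0, 68.5, 67.5, 70.5, 70.0, 72.5]

r = correlation(fathers, sons)
intercept, slope = least_squares(fathers, sons)

# Residuals: observed y minus the value the line predicts.
residuals = [s - (intercept + slope * f) for f, s in zip(fathers, sons)]

# r-squared as the share of variation in y accounted for by the line.
ss_total = sum((s - mean(sons)) ** 2 for s in sons)
ss_resid = sum(e ** 2 for e in residuals)
r_squared = 1 - ss_resid / ss_total

# Regressing x on y gives a different line, although r is unchanged.
_, slope_x_on_y = least_squares(sons, fathers)

# Rescaling y (here, multiplying by 10) makes the slope steeper
# but leaves the correlation where it was.
sons_scaled = [10 * s for s in sons]
_, slope_scaled = least_squares(fathers, sons_scaled)
r_scaled = correlation(fathers, sons_scaled)

print(f"r = {r:.3f}, slope = {slope:.3f}, r-squared = {r_squared:.3f}")
print(f"slope of x on y = {slope_x_on_y:.3f}, 1/slope = {1 / slope:.3f}")
print(f"rescaled y: slope = {slope_scaled:.3f}, r = {r_scaled:.3f}")
```

Students can compare `slope` with `r * sd(sons) / sd(fathers)`, and compare `slope_x_on_y` with `1 / slope`, to see that the x-on-y line is not simply the y-on-x line turned around. Plotting `residuals` against `fathers` would give the residual plot described in step 4.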