Regression Revisited

 Main Concepts


• The slope and intercept we compute in least squares regression are statistics, based on our sample data. They are estimates of the corresponding parameters: the slope and intercept we would observe if (1) the relationship between our variables were truly linear and (2) we had data on the entire population.
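
As a concrete sketch (the data and variable names here are made up), this is how the sample slope and intercept might be computed in Python:

```python
import numpy as np
from scipy import stats

# Hypothetical sample: hours studied (x) vs. exam score (y).
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([52, 55, 61, 60, 68, 70, 74, 79], dtype=float)

# The fitted slope and intercept are statistics computed from this one
# sample; they estimate the unknown population slope and intercept.
fit = stats.linregress(x, y)
print(f"sample slope b1     = {fit.slope:.3f}")
print(f"sample intercept b0 = {fit.intercept:.3f}")
```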

• The regression model is more than just the equation of a line; it is a model that tells us what reality should look like. In particular, the model says that for any given x value, the observed y value deviates from the line by a normally distributed amount. Furthermore, the standard deviation of these deviations must be the same for all x values. Together, these are pretty strong assumptions! But remember: just because the model claims to reflect reality doesn't mean this is so.
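
A minimal simulation of what the model claims, assuming made-up values for the population slope, intercept, and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up population parameters, purely for illustration.
beta0, beta1, sigma = 10.0, 2.5, 3.0

# The model: for each x, the observed y is the line value plus a normal
# deviation whose standard deviation sigma is the SAME at every x.
x = rng.uniform(0, 10, size=100)
y = beta0 + beta1 * x + rng.normal(0, sigma, size=x.size)
```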

• The predicted y-value given by the regression line is interpreted as the mean value of all possible y's that we could observe for that particular x value (assuming the model is good, of course).
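
One can see that interpretation numerically by simulating many possible y's at a single x value (reusing the made-up parameters above) and comparing their average to the line:

```python
import numpy as np

rng = np.random.default_rng(1)
beta0, beta1, sigma = 10.0, 2.5, 3.0  # made-up parameters, as before

# Many possible observations at the single value x = 4.
x0 = 4.0
ys = beta0 + beta1 * x0 + rng.normal(0, sigma, size=100_000)

print(f"line value at x0:     {beta0 + beta1 * x0:.2f}")
print(f"mean of simulated ys: {ys.mean():.2f}")  # very close to the line value
```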

• Our model for the population has two further assumptions: (1) the standard deviation of these various normal distributions for the y variable is the same for every x-value, and (2) for each x, the corresponding y-values are independent. If (1) is violated, you'll see the cloud of points get wider and narrower at different x values (assuming you have enough data to see this). A typical pattern is for the cloud to get gradually wider or narrower as x increases. This is most easily detected in a plot of the residuals against the x values. (Incidentally, if you are looking for a good new vocabulary word -- and this is not on the AP test -- the term for the violation of assumption (1) is "heteroskedasticity". Go ahead and say it out loud; it trips pleasantly off the tongue.) Violations of (2) are harder to detect and usually require fairly thorough knowledge of how the data were recorded and collected.
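
Here is a sketch of that residual plot, using fabricated data whose spread grows with x (the classic funnel pattern):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

rng = np.random.default_rng(2)

# Fabricated data where the spread of y about the line grows with x,
# violating the equal-standard-deviation assumption.
x = rng.uniform(1, 10, size=200)
y = 5 + 2 * x + rng.normal(0, 0.5 * x)  # noise sd proportional to x

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

# Residuals vs. x: a cloud that widens as x increases suggests
# heteroskedasticity.
plt.scatter(x, residuals, s=10)
plt.axhline(0, color="gray")
plt.xlabel("x")
plt.ylabel("residual")
plt.show()
```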

• Under the set of conditions described above, our statistics (sample slope and intercept) are unbiased estimators of the corresponding parameters.
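
A quick simulation sketch of unbiasedness (parameters again invented): draw many samples from the model, fit each one, and the sample slopes average out to the population slope:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
beta0, beta1, sigma = 10.0, 2.5, 3.0  # made-up population parameters

slopes = []
for _ in range(5_000):
    x = rng.uniform(0, 10, size=30)
    y = beta0 + beta1 * x + rng.normal(0, sigma, size=30)
    slopes.append(stats.linregress(x, y).slope)

# Under the model's conditions, the average of the sample slopes sits
# very close to the true slope beta1 = 2.5.
print(f"mean of sample slopes: {np.mean(slopes):.3f}")
```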

• Under those same conditions, we can perform statistical inference procedures on the slope and intercept, though inference on the intercept is far less common (and less important). In particular, we can perform a "model utility test," which considers the null hypothesis that the population slope is actually zero. If this hypothesis is true, then our linear model is not "useful," in the sense that our explanatory variable does not help us predict the value of our response variable.
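
For what it's worth, scipy reports this test directly: the p-value returned by linregress is for the two-sided test of the null hypothesis that the slope is zero. A sketch with the same hypothetical data as before:

```python
import numpy as np
from scipy import stats

# Same hypothetical sample as in the first sketch.
x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([52, 55, 61, 60, 68, 70, 74, 79], dtype=float)

fit = stats.linregress(x, y)

# Model utility test: H0 says the population slope is 0 (x is not useful).
# The test statistic is t = b1 / SE(b1); linregress reports its p-value.
t_stat = fit.slope / fit.stderr
print(f"t = {t_stat:.2f}, p-value = {fit.pvalue:.4f}")
```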

• Even in statistical inference, association (correlation) does not imply causation. If we were to reject the aforementioned null hypothesis and conclude that our linear model is "useful," we still could not conclude that our explanatory variable causes the observed responses.