17 – Linear Regression
Introduction
Regression is a toolkit for developing models of cause and effect between one ratio scale dependent (response) variable and one (simple linear regression) or more (multiple linear regression) ratio scale independent (predictor) variables. By convention, the dependent variable is denoted by Y and the independent variables by X1, X2, …, Xn for n independent variables. Like ANOVA, linear regression is a special case of the general linear model, first introduced in Chapter 12.7.
Components of a statistical model
Regression statistical methods return model estimates of the intercept and slope coefficients, plus statistics of regression fit (e.g., R², aka “R-squared,” the coefficient of determination).
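Chapters 17.1 – 17.9 cover the simple linear model

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

Chapters 18.1 – 18.5 cover the multiple linear regression model

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_n X_n + \varepsilon$$

where α or β0 represents the Y-intercept, β or β1, β2, … βn represent the regression slopes, and ε is the error term, the part of Y not explained by the predictors.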
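As a minimal sketch of how these components appear in R, the example below fits a simple linear model with `lm()` and extracts the intercept, the slope, and R². The `cars` data set (stopping distance versus speed) ships with R and is used here only for illustration; it is an assumption of this sketch, not an example taken from the chapter.

```r
# Illustration only: 'cars' is built into R (50 cars; speed in mph,
# stopping distance in ft). It is not data from this chapter.
fit <- lm(dist ~ speed, data = cars)  # model: dist = intercept + slope * speed + error

estimates <- coef(fit)                # named vector: "(Intercept)" and "speed"
r_squared <- summary(fit)$r.squared   # coefficient of determination

print(estimates)
print(r_squared)
```

Calling `summary(fit)` prints the same estimates together with their standard errors and tests, which later sections of the chapter take up.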
Regression and correlation test linear hypotheses
We state that the relationship between two variables is linear (the alternate hypothesis) or it is not (the null hypothesis). The difference? Correlation is a test of linear association (are the variables correlated?). A correlation may suggest possible causation, but it is not sufficient evidence for causation: we do not imply that one variable causes another to vary, even if the correlation between the two variables is large and positive. Correlations are typically applied to data sets that were not collected under an explicit experimental design intended to test specific hypotheses of cause and effect.
Linear regression, however, is to cause and effect as correlation is to association. With regression and ANOVA, we are indeed making a case for a particular understanding of the cause of variation in a response variable: modeling cause and effect is the goal. Regression, ANOVA, and other general linear models are designed to permit the statistician to control for the effects of confounding variables provided the causal variables themselves are uncorrelated.
When should one use correlation, and when should one apply linear regression to a data set? For two ratio scale variables, one can always apply either approach. Correlation is always appropriate if both variables are measured in the course of the study, whereas a regression modeling approach is implied where one variable was the result of manipulation by the researcher.
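That either approach can be applied to the same pair of ratio scale variables can be seen numerically. The sketch below, again using R's built-in `cars` data purely as an assumed illustration, runs `cor.test()` and `lm()` on the same two variables and checks three standard identities for simple linear regression: r² equals R², the slope equals r multiplied by the ratio of the standard deviations, and the t statistic for the correlation equals the t statistic for the slope.

```r
# Illustration only: built-in 'cars' data, not data from this chapter.
r_test <- cor.test(cars$speed, cars$dist)   # Pearson correlation (the default method)
fit    <- lm(dist ~ speed, data = cars)     # regression of dist on speed

r     <- unname(r_test$estimate)
slope <- unname(coef(fit)["speed"])

# The squared correlation equals R-squared from the simple regression
same_r2 <- isTRUE(all.equal(r^2, summary(fit)$r.squared))

# The slope is the correlation rescaled by the ratio of standard deviations
same_slope <- isTRUE(all.equal(slope, r * sd(cars$dist) / sd(cars$speed)))

# The t statistic for the correlation matches the t statistic for the slope
t_slope <- summary(fit)$coefficients["speed", "t value"]
same_t  <- isTRUE(all.equal(unname(r_test$statistic), t_slope))

print(c(same_r2 = same_r2, same_slope = same_slope, same_t = same_t))
```

Because the arithmetic agrees, the choice between the two rests on the study design and on whether a causal claim is being made, not on the numbers themselves.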
Assumptions of linear regression
The key assumption in linear regression is that a straight line indeed is the best fit of the relationship between dependent and independent variables. The additional assumptions of parametric tests (Chapter 13) also hold. In Chapter 18 we conclude with an extension of regression from one to many predictor variables and the special and important topic of correlated predictor variables or multicollinearity.
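A common first check of the straight-line assumption is to look at the residuals of a fitted model. The sketch below is one possible approach, not a prescribed procedure: it uses the built-in `cars` data as an assumed example, plots residuals against fitted values, draws a normal Q-Q plot, and applies the Shapiro-Wilk test to the residuals.

```r
# Illustration only: built-in 'cars' data, not data from this chapter.
fit <- lm(dist ~ speed, data = cars)

# Residuals vs. fitted values: curvature suggests a straight line is not
# adequate; a funnel shape suggests unequal variance.
plot(fitted(fit), resid(fit), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Normal Q-Q plot of the residuals: points far from the line suggest non-normality.
qqnorm(resid(fit))
qqline(resid(fit))

# Shapiro-Wilk test of normality applied to the residuals
sw <- shapiro.test(resid(fit))
print(sw$p.value)
```

Graphical checks are usually read alongside any formal test, since a test alone says little about how an assumption is violated.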
Build a statistical model, make predictions
In our exploration of linear regression we begin with simple linear regression with one predictor variable, fit by ordinary least squares (OLS). Practical aspects of model diagnostics are presented. Regression may be used to describe a relationship or to provide a predictive statistical framework. In Chapter 18 we extend regression from one to many predictor variables and conclude with a discussion of model selection. Throughout, multiple ways to analyze linear regression models with Rcmdr and R are presented; we will continue to emphasize the general linear model approach, but note that the linear model option in Rcmdr provides a number of default features that are conveniently available.
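To make the predictive use of a regression model concrete, the sketch below fits a model and then calls `predict()` for new predictor values. The `cars` data and the two speeds chosen are assumptions for illustration only; both speeds fall inside the observed range of the data (4 to 25 mph), so no extrapolation is involved.

```r
# Illustration only: built-in 'cars' data, not data from this chapter.
fit <- lm(dist ~ speed, data = cars)

# Hypothetical new observations; the column name must match the predictor name.
new_speeds <- data.frame(speed = c(10, 20))

# Point predictions with 95% prediction intervals for individual new observations
pred <- predict(fit, newdata = new_speeds, interval = "prediction", level = 0.95)
print(pred)  # columns: fit (prediction), lwr and upr (interval limits)
```

Setting `interval = "confidence"` instead returns the narrower interval for the mean response at each speed, not for a single new observation.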
References
Linear regression is a huge topic; the references I include are among my favorites on the subject, but they are only a small and incomplete sampling. For simplicity, I merged the references for Chapter 17 and Chapter 18 into one page at References and suggested readings (Ch17 & 18).
Chapter 17 contains
- Introduction
- Simple Linear Regression
- Relationship between the slope and the correlation
- Estimation of linear regression coefficients
- OLS, RMA, and smoothing functions
- Testing regression coefficients
- ANCOVA – analysis of covariance
- Regression model fit
- Assumptions and model diagnostics for Simple Linear Regression
- References and suggested readings (Ch17 & 18)