17.7 – Regression model fit

Introduction

In Chapters 17.5 and 17.6 we introduced the example of tadpole body size and oxygen consumption. We ran a simple linear regression, with the following output from R:

RegModel.1 <- lm(VO2~Body.mass, data=example.Tadpole)

summary(RegModel.1)

Call:
lm(formula = VO2 ~ Body.mass, data = example.Tadpole)

Residuals:
    Min      1Q    Median       3Q       Max 
-202.26 -126.35     30.20    94.01    222.55

Coefficients:
                Estimate     Std. Error    t value    Pr(>|t|) 
(Intercept)      -583.05         163.97     -3.556     0.00451 ** 
Body.mass         444.95          65.89      6.753   0.0000314 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 145.3 on 11 degrees of freedom
Multiple R-squared: 0.8057, Adjusted R-squared: 0.788 
F-statistic: 45.61 on 1 and 11 DF, p-value: 0.00003144

You should be able to pick out the estimates of the slope and intercept from the table (the intercept was -583 and the slope was 445). Additionally, as part of your interpretation of the model, you should be able to report how much of the variation in VO2 was explained by tadpole body mass (the coefficient of determination, R2, was 0.81, which means about 81% of the variation in oxygen consumption by tadpoles is explained by knowing the body mass of the tadpole).
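If you prefer to extract these quantities from the model object rather than reading them off the printout, base R provides extractor functions. A quick sketch:

coef(RegModel.1)                    # intercept and slope
summary(RegModel.1)$r.squared       # R-squared
summary(RegModel.1)$adj.r.squared   # adjusted R-squared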

What's left to do? We need to evaluate how well our model fits the data, i.e., we evaluate regression model fit. By fit we mean: how well does our model agree with the raw data? With a poor fit, the model predictions are far from the raw data; with a good fit, the model accurately predicts the raw data. In other words, if we draw a line through the raw data, a good-fitting line passes through or near most of the points, while a poorly fitting line rarely passes through the cloud of points.

How do we judge fit? We can do this by evaluating the error component of the model relative to the portion of the model that explains the data. Additionally, we can perform a number of diagnostics of the model relative to the assumptions we made to perform linear regression. These diagnostics form the subject of Chapter 17.8. Here, we ask: how well does the model,

    \begin{align*} \dot{V}O_{2} = b_{0}+b_{1}\left ( Body \ mass \right ) \end{align*}

fit the data?
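A quick visual check of fit is to plot the raw data and overlay the fitted line. A minimal sketch, assuming the example.Tadpole data frame from Chapters 17.5 and 17.6 is still loaded:

plot(VO2 ~ Body.mass, data = example.Tadpole,
     xlab = "Body mass", ylab = "VO2")   # scatterplot of the raw data
abline(RegModel.1)                       # overlay the least-squares line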

Model fit statistics

The second part of fitting a model is to report how well the model fits the data. The next sections address this aspect of model fitting. The first thing to focus on is the magnitude of the residuals: the greater the spread of the residuals, the less well the fitted line explains the data.
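The residuals themselves are easy to inspect from the fitted model; for example, the following sketch reproduces the Min/1Q/Median/3Q/Max block shown in the lm() output (with the mean added):

summary(residuals(RegModel.1))   # five-number summary (plus mean) of the residuals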

In addition to the output from the lm() function, which focuses on the coefficients, we typically also generate the ANOVA table.

library(car)   # Anova() with Type II tests comes from the car package

Anova(RegModel.1, type="II")
Anova Table (Type II tests)

Response: VO2
                Sum Sq   Df    F value       Pr(>F) 
Body.mass       962870    1     45.605   0.00003144 ***
Residuals       232245   11 
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Standard error of regression

S, the residual standard error (also called the standard error of the regression), is an overall measure of the accuracy of the fitted line: it tells us how well the regression predicts the response variable from the independent variable. A large value of S indicates a poor fit. One equation for S is given by

    \begin{align*} S = \sqrt{\frac{SS_{residual}}{n-2}} \end{align*}

In the above example, S = 145.3 (reported as "Residual standard error" in the regression output above). We can see that if SSresidual is large, S will be large, indicating a poor fit of the linear model to the data. However, by itself S is not of much value as a diagnostic because it is difficult to know what to make of a value like 145.3. Is this a large value for S? Is it small? We don't have any context against which to judge S, so additional diagnostics have been developed.
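To see where 145.3 comes from, we can compute S by hand from the fitted model. A short sketch using base R extractor functions:

SS.resid <- deviance(RegModel.1)      # SS_residual, the sum of squared residuals
df.resid <- df.residual(RegModel.1)   # n - 2 = 11 for this model
sqrt(SS.resid / df.resid)             # S = 145.3
sigma(RegModel.1)                     # R reports the same value directly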

Coefficient of determination

R2, the coefficient of determination, is also used to describe model fit. R2, the square of the simple product moment correlation r, can take on values between 0 and 1 (0% to 100%). A good model fit has a high R2 value. In our example above, R2 = 0.8057, or 80.57%. One equation for R2 is given by

    \begin{align*} R^2 = \frac{SS_{regression}}{SS_{total}} \end{align*}

A value of R2 close to 1 means that the regression "explains" nearly all of the variation in the response variable, and would indicate the model is a good fit to the data.
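We can confirm the value reported by R from the sums of squares in the ANOVA table. A sketch using base R's anova(), which for a one-predictor model gives the same sums of squares as car's Anova():

ss <- anova(RegModel.1)[["Sum Sq"]]   # SS_regression = 962870, SS_residual = 232245
ss[1] / sum(ss)                       # R-squared = 0.8057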

Adjusted R-squared

Before moving on we need to remark on the difference between R2 and adjusted R2. For simple linear regression there is but one predictor variable, X; for multiple regression there can be many additional predictor variables. Without some correction, R2 will increase with each additional predictor variable. This doesn't mean the model is more useful, however, and in particular, one cannot compare R2 between models with different numbers of predictors. Therefore, an adjustment is used so that the coefficient of determination remains a useful way to assess how reliable a model is and to permit comparisons between models. Thus, we have the adjusted \bar{R}^2, which is calculated as

    \begin{align*} \bar{R}^2 = 1 - \frac{SS_{residual}}{SS_{total}} \cdot \frac{DF_{total}}{DF_{residual}} \end{align*}

In our example above, adjusted R2 = 0.788, or 78.8%.
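A quick check of the arithmetic against the formula above, plugging in the sums of squares and degrees of freedom from the ANOVA table:

SS.resid <- 232245; SS.total <- 962870 + 232245     # from the ANOVA table
DF.total <- 12; DF.resid <- 11                      # n - 1 and n - 2 for n = 13
1 - (SS.resid / SS.total) * (DF.total / DF.resid)   # 0.788, matching the lm() output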

Which should you report? Adjusted R2, because it corrects for the number of parameters in the model.

Both \bar{R}^2 and S are useful for regression diagnostics, a topic which we will discuss next (Chapter 17.8).

Questions

  1. True or False. The simple linear regression is called a “best fit” line because it maximizes the squared deviations for the difference between observed and predicted Y values.
  2. True or False. Residuals in regression analysis are best viewed as errors committed by the researcher. If the experiment was designed better, or if the instrument was properly calibrated, then residuals would be reduced. Explain your choice.
  3. The USA is finishing the 2020 census as I write this note. As you know, the census is used to reapportion Congress and also to determine the number of Electoral College votes. In honor of the election for US President that's just days away, in the next series of questions in this chapter and in subsequent sections of Chapters 17 and 18, I'll ask you to conduct a regression analysis on the Electoral College. For starters, make the regression of Electoral votes on 2010 census population. (Ignore the other columns for now; just focus on POP_2010 and Electoral.) Report the
    • regression coefficients (slope, intercept)
    • percent of the variation in electoral college votes explained by the regression (R2).
  4. Make a scatterplot and add the regression line to the plot (see the starter sketch after this list).
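A starter sketch for these questions, assuming you have read the data set below into a data frame; the name electoral is a placeholder, not a name supplied by the text:

# 'electoral' is a hypothetical data frame holding the table below
elect.lm <- lm(Electoral ~ POP_2010, data = electoral)
summary(elect.lm)                              # coefficients and R-squared
plot(Electoral ~ POP_2010, data = electoral)   # scatterplot
abline(elect.lm)                               # add the regression line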

Data set

USA population year 2010 and year 2019 with Electoral College counts

State                 Region     Division            POP_2010   POP_2019  Electoral
Alabama               South      East South Central   4779736    4903185          9
Alaska                West       Pacific               710231     731545          3
Arizona               West       Mountain             6392017    7278717         11
Arkansas              South      West South Central   2915918    3017804          6
California            West       Pacific             37253956   39512223         55
Colorado              West       Mountain             5029196    5758736          9
Connecticut           Northeast  New England          3574097    3565287          7
Delaware              South      South Atlantic        897934     982895          3
District of Columbia  South      South Atlantic        601723     705749          3
Florida               South      South Atlantic      18801310   21477737         29
Georgia               South      South Atlantic       9687653   10617423         16
Hawaii                West       Pacific              1360301    1415872          4
Idaho                 West       Mountain             1567582    1787065          4
Illinois              Midwest    East North Central  12830632   12671821         20
Indiana               Midwest    East North Central   6483802    6732219         11
Iowa                  Midwest    West North Central   3046355    3155070          6
Kansas                Midwest    West North Central   2853118    2913314          6
Kentucky              South      East South Central   4339367    4467673          8
Louisiana             South      West South Central   4533372    4648794          8
Maine                 Northeast  New England          1328361    1344212          4
Maryland              South      South Atlantic       5773552    6045680         10
Massachusetts         Northeast  New England          6547629    6892503         11
Michigan              Midwest    East North Central   9883640    9883635         16
Minnesota             Midwest    West North Central   5303925    5639632         10
Mississippi           South      East South Central   2967297    2976149          6
Missouri              Midwest    West North Central   5988927    6137428         10
Montana               West       Mountain              989415    1068778          3
Nebraska              Midwest    West North Central   1826341    1934408          5
Nevada                West       Mountain             2700551    3080156          6
New Hampshire         Northeast  New England          1316470    1359711          4
New Jersey            Northeast  Mid-Atlantic         8791894    8882190         14
New Mexico            West       Mountain             2059179    2096829          5
New York              Northeast  Mid-Atlantic        19378102   19453561         29
North Carolina        South      South Atlantic       9535483   10488084         15
North Dakota          Midwest    West North Central    672591     762062          3
Ohio                  Midwest    East North Central  11536504   11689100         18
Oklahoma              South      West South Central   3751351    3956971          7
Oregon                West       Pacific              3831074    4217737          7
Pennsylvania          Northeast  Mid-Atlantic        12702379   12801989         20
Rhode Island          Northeast  New England          1052567    1059361          4
South Carolina        South      South Atlantic       4625364    5148714          9
South Dakota          Midwest    West North Central    814180     884659          3
Tennessee             South      East South Central   6346105    6829174         11
Texas                 South      West South Central  25145561   28995881         38
Utah                  West       Mountain             2763885    3205958          6
Vermont               Northeast  New England           625741     623989          3
Virginia              South      South Atlantic       8001024    8535519         13
Washington            West       Pacific              6724540    7614893         12
West Virginia         South      South Atlantic       1852994    1792147          5
Wisconsin             Midwest    East North Central   5686986    5822434         10
Wyoming               West       Mountain              563626     578759          3
