18.5 – Selecting the best model
Introduction
This is a long entry in our textbook, with many topics to cover. We discuss aspects of model fitting, from why model fitting is done to how to do it and what statistics are available to help us decide on the best model. Model selection depends very much on the intent of the study. For example, if the purpose of model building is to provide the best description of the data, then in general one should prefer the full (also called the saturated) model. On the other hand, if the purpose of model building is to make a predictive statistical model, then a reduced model may prove to be a better choice. This text deals mostly with the latter context of model selection: finding a justified reduced model.
From full model to best subset model
Model building is an essential part of being a scientist. As scientists, we seek models that explain as much of the variability in a phenomenon as possible, yet remain simple enough to be of practical use.
Having just completed the introduction to multiple regression, we now move to the question of how to pick the best model.
We distinguish between a full model, which includes as many variables (predictors, factors) as the regression function can work with while still returning interpretable (if not always statistically significant) output, and a saturated model.
The saturated model is the one that includes all possible predictors, factors, and interactions in your experiment. In well-behaved data sets the full model and the saturated model will be the same model, but they need not be. For example, if two predictor variables are highly collinear, then the regression fit may return an error.
For those of you working with meta-analysis problems, you are unlikely to be able to run a saturated model because some levels of a key factor will not be available in all, or even most, of the papers. Thus, in order to get the model to run, you start dropping factors, or you start nesting factors. Once you cannot get anything more into the model, that is your “full” model. Technically we would not call it saturated because there were other factors; they just did not have enough data to work with, or they were essentially redundant with something else in the model.
Identify the model that does run to completion as your full model and proceed to assess model fit criteria for that model, and all reduced models thereafter.
In R (Rcmdr) you know you have found the full model when the output lacks “NA” strings (missing values). Use the full model to report the values for each coefficient, i.e., to conduct the inferential statistics.
Get the estimates directly from the output of the regression function. The sign of an estimate tells you the direction of the effect: if the estimate for sample size is positive, you can say that studies with more samples were more likely to report more cases of the disease.
Remember, the experimental units are the papers themselves, so studies with larger numbers of subjects are going to find more cases of the disease. We would worry big time about your project if we did not see a statistically significant, positive coefficient for sample size.
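As a sketch of what this looks like at the script level (the data frame papers and its columns here are hypothetical stand-ins, not an actual course data set):

# Hypothetical meta-analysis data: one row per paper
fit.full <- lm(cases ~ sampleSize + studyDesign + region, data = papers)

any(is.na(coef(fit.full)))  # TRUE would mean R dropped an aliased/collinear term
summary(fit.full)           # report these estimates; check the sign on sampleSize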
Example: A multiple linear model
For illustration, here’s an example output following a run with the linear model function on an experimental data set.
The variables were
BMI = Dependent variable, continuous
Age = Independent variable, continuous
CalsPDay = Independent variable, continuous
CholPDay = Independent variable, continuous
Sex = Independent variable, categorical
Smoke = Independent variable, categorical
lm(formula = BMI ~ Age + CalsPDay + CholPDay + Sex + Smoke + Sex:Smoke,
   data = BMI)

Residuals:
    Min      1Q  Median      3Q     Max
-9.9685 -3.3766 -0.6609  2.5090 22.3482

Coefficients:
                        Estimate Std. Error t value  Pr(>|t|)
(Intercept)           25.9351297  3.7205047   6.971 1.708e-09 ***
Age                       -0.007      0.051   -0.14     0.890
CalsPDay              -0.0005757  0.0009882  -0.583     0.562
CholPDay               0.0103521  0.0060722   1.705     0.093 .
Sex[T.M]              -0.8529925  2.2209045  -0.384     0.702
Smoke[T.Yes]          -1.1670159  1.9134734  -0.610     0.544
Sex[T.M]:Smoke[T.Yes]  0.9261469  2.8510680   0.325     0.746
The Y-variable was BMI, and the predictor variables included sex (male, female), smoking status (yes, no), and their interaction, plus two measures of diet quality (calories per day and cholesterol per day).
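For reference, output like that above is produced by a call along these lines (the model name matches the ANOVA output later in this section):

LinearModel.1 <- lm(BMI ~ Age + CalsPDay + CholPDay + Sex + Smoke + Sex:Smoke,
                    data = BMI)
summary(LinearModel.1)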
Question. Write out the equation in symbol form.
We see that none of the factors or covariates were statistically significant, so we would not dwell on whether the effects were positive or negative.
But, for didactic purposes, imagine the P-value for CholPDay were less than 0.05 (and therefore statistically significant). We would report the value (0.0103521, which I would round to 0.01) and note that individuals who had more cholesterol in their diet per day tended to have higher BMI. That is, the sign of the coefficient relates back to the most important thing about your study: the biological interpretation.
Now is a good time to be clear about HOW you report statistical results. DO NOT SIMPLY COPY AND PASTE EVERYTHING into your report. For the estimates above, you would report every coefficient, but not every digit. Here is how the output should be reported in your Project paper.
Table 1. Coefficients from full model.
                        Estimate     SE      t   P-value
Intercept                 25.935  3.721   6.97  < 0.0001
Age                       -0.007  0.051  -0.14    0.8892
Calories/Day              -0.001  0.001  -0.58    0.5622
Cholesterol/Day            0.010  0.006   1.71    0.0929
Sex                       -0.853  2.221  -0.38    0.7021
Smoke                     -1.167  1.915  -0.61    0.5440
Interaction Smoke:Sex      0.926  2.851   0.33    0.7463
Looks better, doesn’t it?
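One quick way to have R do the rounding for you (a sketch, using the model name from the ANOVA output below):

round(coef(summary(LinearModel.1)), 3)   # coefficient table rounded to 3 decimals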
Once you have the full model, use this model for the inferential statistics. Use the significance tests of each parameter in the model from the corresponding ANOVA table. Now, where is the ANOVA table? Remember, right after running the linear regression,
Rcmdr: Models → Hypothesis testing → ANOVA tables
Accept the default (Type II tests: partial, obeying marginality), and, boom! Out pops the ANOVA table you should be familiar with.
From the ANOVA table you will tell me whether a Factor is significant or not. You report the ANOVA table in your paper. You describe it.
Now, the next step is to decide on the best model: can a better model (fewer parameters, Occam’s razor) be found? Identify the parameter in the ANOVA table with the highest P-value and remove it from the model when you run the regression again. Repeat these steps, returning the ANOVA table and checking the estimates and P-values, until you have a model with only statistically significant parameters.
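At the script level, drop1() offers a related view: for each term that can be removed from the model, it reports a significance test. A minimal sketch, assuming the full model was saved as LinearModel.1:

# F-test for each droppable term; only terms whose removal obeys marginality are
# listed (the Sex and Smoke main effects stay while their interaction is present)
drop1(LinearModel.1, test = "F")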
Find the best model
Output from R follows
Anova(LinearModel.1, type="II")
Anova Table (Type II tests)

Response: BMI
           Sum Sq Df F value Pr(>F)
Age          0.62  1  0.0196  0.890
CalsPDay    10.79  1  0.3394  0.562
CholPDay    92.37  1  2.9065  0.093
Sex          1.52  1  0.0478  0.828
Smoke        8.84  1  0.2782  0.600
Sex:Smoke    3.35  1  0.1055  0.746
Residuals 2129.34 67
This is my full model, and I would already anticipate the need to reduce it because none of the factors are statistically significant. By the criterion that simpler models are better, I would proceed first to drop the interaction. See below for more on selecting the best model.
But first, I want to take up an important point about your models that you may not have had a chance to think about. The order in which parameters enter your model can affect the significance and value of the estimates themselves. The order of entry in the model above reads from top to bottom: Age was first, followed in sequence by CalsPDay, CholPDay, and so on. By convention, enter the covariates (the ratio-scale predictors) first; that is what I did above.
Here’s the output from a model in which I used a different order of parameters.
Anova(LinearModel.2, type="II")
Anova Table (Type II tests)

Response: BMI
           Sum Sq Df F value  Pr(>F)
Sex          1.52  1  0.0478 0.82764
Smoke        8.84  1  0.2782 0.59964
Age          0.62  1  0.0196 0.88918
CalsPDay    10.79  1  0.3394 0.56215
CholPDay    92.37  1  2.9065 0.09286 .
Sex:Smoke    3.35  1  0.1055 0.74631
Residuals 2129.34 67
The output is the same!!! So why did I give you a warning about parameter order? Run the ANOVA table command again, but this time select the Type III test, i.e., partial, ignoring marginality.
> Anova(LinearModel.2, type="III")
Anova Table (Type III tests)

Response: BMI
             Sum Sq Df F value    Pr(>F)
(Intercept) 1544.34  1 48.5929 1.708e-09 ***
Sex            4.69  1  0.1475   0.70214
Smoke         11.82  1  0.3720   0.54400
Age            0.62  1  0.0196   0.88918
CalsPDay      10.79  1  0.3394   0.56215
CholPDay      92.37  1  2.9065   0.09286 .
Sex:Smoke      3.35  1  0.1055   0.74631
Residuals   2129.34 67
The output has changed; in fact, it now reports a significance test for the intercept, and the P-values match the t-tests from the linear model output above. Try again, this time selecting Type I, sequential.
> anova(LinearModel.2)
Analysis of Variance Table

Response: BMI
          Df  Sum Sq Mean Sq F value  Pr(>F)
Sex        1    0.68   0.681  0.0214 0.88402
Smoke      1    2.82   2.816  0.0886 0.76690
Age        1    3.44   3.436  0.1081 0.74333
CalsPDay   1    2.27   2.272  0.0715 0.78998
CholPDay   1   96.30  96.299  3.0301 0.08633
Sex:Smoke  1    3.35   3.354  0.1055 0.74631
Residuals 67 2129.34  31.781
Here, at last, we see the effect of order of entry. So, while we are still working to learn all of the issues of statistics, and of model fitting in particular, I have purposefully restricted you to Type II analyses; obeying marginality correctly handles most issues about order of entry.
Status check
Where are we??? Recall that the purpose of all of this effort is to find the best-supported model. The question we are working on is whether the full model is the best model or whether a reduced model can be supported.
We go back to my first full model output from ANOVA.
Model 1
Anova(LinearModel.1, type="II")
Anova Table (Type II tests)

Response: BMI
           Sum Sq Df F value  Pr(>F)
Age          0.62  1  0.0196 0.88918
CalsPDay    10.79  1  0.3394 0.56215
CholPDay    92.37  1  2.9065 0.09286 .
Sex          1.52  1  0.0478 0.82764
Smoke        8.84  1  0.2782 0.59964
Sex:Smoke    3.35  1  0.1055 0.74631
Residuals 2129.34 67
We have two factors (Sex, Smoke), three covariates (Age, CalsPDay, CholPDay), and one two-way interaction (Sex:Smoke). We would write our full model then as

$$\text{BMI} = \beta_0 + \beta_1\,\text{Age} + \beta_2\,\text{CalsPDay} + \beta_3\,\text{CholPDay} + \beta_4\,\text{Sex} + \beta_5\,\text{Smoke} + \beta_6\,(\text{Sex} \times \text{Smoke}) + \varepsilon$$
Get and save in your output the ANOVA table for this full model. Proceed to test a series of nested reduced models. Start by dropping the interaction term, consistent with our Occam’s razor approach (a script-level sketch follows).
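Each reduced model can be built from the previous one with update(); a sketch for this first drop (the model names are my own):

library(car)                                         # provides the Anova() function
Model2 <- update(LinearModel.1, . ~ . - Sex:Smoke)   # drop the interaction term
Anova(Model2, type = "II")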
Model 2
Anova Table (Type II tests)

Response: BMI
           Sum Sq Df F value  Pr(>F)
Age          0.63  1  0.0201 0.88760
CalsPDay    10.45  1  0.3331 0.56572
CholPDay    96.30  1  3.0704 0.08424 .
Sex          1.52  1  0.0484 0.82650
Smoke        8.84  1  0.2819 0.59720
Residuals 2132.69 68
Next, reduce by identifying the factor or predictor with the highest P-value. Looks like “Age” is next.
Model 3
Anova Table (Type II tests)

Response: BMI
           Sum Sq Df F value  Pr(>F)
CalsPDay     9.87  1  0.3194 0.57381
CholPDay    97.78  1  3.1627 0.07974 .
Sex          2.55  1  0.0824 0.77488
Smoke        8.61  1  0.2786 0.59934
Residuals 2133.32 69
Next up, drop the “Sex” parameter.
Model 4
Anova Table (Type II tests)

Response: BMI
           Sum Sq Df F value  Pr(>F)
CalsPDay    10.45  1  0.3424 0.56032
CholPDay    96.12  1  3.1502 0.08027 .
Smoke       12.42  1  0.4070 0.52557
Residuals 2135.87 70
Next? You would select CalsPDay, right?
Model 5
Anova Table (Type II tests)

Response: BMI
           Sum Sq Df F value  Pr(>F)
CholPDay    90.24  1  2.9852 0.08837 .
Smoke       13.95  1  0.4613 0.49923
Residuals 2146.32 71
And finally, we remove the “Smoke” factor.
Model 6
Anova Table (Type II tests)

Response: BMI
           Sum Sq Df F value Pr(>F)
CholPDay    77.93  1  2.5974 0.1114
Residuals 2160.26 72
Oops!
What happened to Model 6? Nothing remains significant? Panic? What is the point??? Arggh, Dr Dohm…!!!
Easy there…. Take a deep breath, and guess what? Your best model needs to have significant (or at least marginally significant) parameters in it, right? When we dropped Smoke, even the marginal CholPDay effect weakened (P went from 0.088 to 0.111), so your best-fit model is Model 5. That model will be our candidate for best fit as we proceed to complete our model building.
Now we proceed to gather supporting evidence for our candidate best model. We are going to use an information criterion approach.
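One quick script-level check of this kind uses AIC, one of the criteria introduced in the next section. A sketch, assuming the models were saved under the names used above:

AIC(LinearModel.1, Model5, Model6)   # lower AIC indicates the better-supported model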
Use a fit criterion for determining model fit
To help us evaluate the evidence in favor of one model over another, there are a number of statistics one may calculate to provide a single number for each model for comparison purposes. The criteria available to us include Mallow’s Cp, adjusted R², the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC).
We already introduced the coefficient of determination, R², as a measure of fit; in general we favor models with larger values of R². However, R² will always be larger for models with more parameters. Thus, the other evaluators attempt to adjust for the number of parameters in the model and how each contributes to improved fit. For illustrative purposes we will use Mallow’s Cp. The equation for Mallow’s Cp in linear regression is

$$C_p = \frac{SSE_p}{MSE_{\text{full}}} - (n - 2p)$$

where p is the number of parameters in the model, SSE_p is the residual sum of squares of the candidate model, MSE_full is the residual mean square of the full model, and n is the number of observations. Mallow’s Cp is thus approximately equal to the number of parameters in the model plus an additional amount due to lack of fit of the model (i.e., large residuals). All else being equal, we favor the model in which Cp is close to the number of parameters in the model.
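To make the formula concrete, Cp for our candidate Model 5 could be computed by hand along these lines (a sketch, assuming the full and reduced models were saved under the names used earlier):

sse.p    <- sum(residuals(Model5)^2)        # residual SS of the candidate model
mse.full <- summary(LinearModel.1)$sigma^2  # residual mean square of the full model
n <- nobs(Model5)                           # number of observations
p <- length(coef(Model5))                   # number of parameters, intercept included
sse.p / mse.full - (n - 2 * p)              # Mallow's Cp; compare it to p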
In Rcmdr, select Models → Subset model selection … (Fig. 1)
Figure 1. Rcmdr popup menu, Subset model selection…
From the menu, select the criterion and how many models to return. The function returns a graph that can be used to interpret which model is best given the selection criterion used. Below is an example (although for a different data set!) for Mallow’s Cp (Fig. 2).
Figure 2. Mallow’s Cp plot
Let’s break down the plot. First, define the axes. The vertical axis is the range of Cp values calculated for each model. The horizontal axis is categorical and reads from left to right: Intercept, Inspection[T.Yes], etc., up to Suspended. Looking into the graph itself, we see horizontal bars; the extent of shading indicates which terms were in the model corresponding to each Cp value. For example, the lowest bar, associated with a Cp value of 8, extends all the way across the graph. This says that the model evaluated included all of the variables and was therefore the saturated, or full, model. The next bar from the bottom is missing only one block (lgCarIns), which tells us the Cp value of 6 corresponds to a reduced model, and so forth.
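For reference, the Rcmdr menu item is built on the leaps package (an assumption worth checking on your installation); roughly the same plot for our BMI example could be produced in a script with:

library(leaps)
# evaluate all subsets of the five predictors (interaction omitted for simplicity)
subsets <- regsubsets(BMI ~ Age + CalsPDay + CholPDay + Sex + Smoke, data = BMI)
plot(subsets, scale = "Cp")   # shaded-bar plot like Fig. 2, ranked by Mallow's Cp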
Cross-validation
Once you have identified your best-fit model, you proceed to run the diagnostic plots. For the rest of the discussion we return to our first example.
Rcmdr: Models → Graphs → Basic diagnostic plots.
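At the script level, the equivalent is the plot method for fitted linear models (a sketch, assuming your best-fit model was saved as Model5):

plot(Model5, which = 1:2)   # 1 = residuals vs fitted, 2 = normal Q-Q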
We’ll just concern ourselves with the first row of plots (Fig. 3).
Figure 3. Diagnostic plots
The left plot shows the residuals versus the predicted (fitted) values; if you see a trend here, the assumption of linearity has been violated. The second plot is a test of the assumption of normality of the residuals. Interpret them (Residuals OK? Residuals normally distributed? Yes/no), and you are done. Here, I would say I see no real trend in the residuals vs. fitted plot, so the assumption of a linear fit is OK. For normality, there is a tailing off at the larger values of the residuals, which might be of some concern (and I would start thinking about possible leverage problems), but nothing dramatic. I would conclude that our Model 5 is a good-fitting model and one that could be used to make predictions.
Now, if you think for a moment, you should identify a logical problem: we used the same data to “check” the model fit as we did to build the model in the first place. In particular, if the model is intended to make predictions, it would be advisable to check its performance (e.g., does it make reasonable predictions?) by supplying new data, data not used to construct the model. If new data are not available, one acceptable practice is to divide the full data set into at least two subsets, one used to develop the model (sometimes called the calibration or training data set) and the other used to test the model. The benefits of cross-validation include testing for influential points, guarding against overfitting of model parameters, and a reality check on the predictions generated from the model.
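Here is a minimal sketch of that split-sample idea for our example (the 70/30 split and the seed are arbitrary choices):

set.seed(42)                                  # arbitrary; makes the split reproducible
train.rows <- sample(nrow(BMI), size = round(0.7 * nrow(BMI)))
train <- BMI[train.rows, ]                    # calibration (training) subset
test  <- BMI[-train.rows, ]                   # hold-out (testing) subset
fit.train <- lm(BMI ~ CholPDay + Smoke, data = train)   # refit Model 5 on training data
pred <- predict(fit.train, newdata = test)
sqrt(mean((test$BMI - pred)^2))               # RMSE of predictions on the hold-out data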
Questions
[pending]