 InfluentialPoints.com
Biology, images, analysis, design...
 Use/Abuse Principles How To Related
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)  ### Principles

The approach we took to simple linear regression generalizes directly to multiple explanatory variables. Hence for two explanatory variables we can write:

 Y = β0 + β1X 1 + β2X2 + ε
where
• Y is the value of the response which is predicted to lie on the best-fit regression plane'
• β0, is the intercept (the value of Y when both X1 and X2 are 0),
• β1 is the partial regression coefficient which quantifies the sensitivity of Y to change in X1 , adjusting for the effect of X2 on Y.
• β2 is the partial regression coefficient which quantifies the sensitivity of Y to change in X2, adjusting for the effect of X1 on Y,
• ε is a random error term.

The model can be further generalized to any number of explanatory variables. Note that the slope parameters are termed partial regression coefficients because they measure the change in Y per unit change in the respective X whilst holding all other X variables constant. If the X variables are measured in different units, it may be preferable to use standardized coefficients that are independent of the units the variables are measured in.

It is important to note that just considering one explanatory variable if several are acting on a response variable can be very misleading. Associations can be masked if important explanatory variables are left out of the model resulting in biased coefficients. In the epidemiological field this has been termed misspecification bias. The same may apply if some of the relationships are not linear.

### Estimation of model parameters

Model parameters in a multiple regression model are usually estimated using ordinary least squares minimizing the sum of squared deviations between each observed value and predicted values. It involves solving a set of simultaneous normal equations, one for each parameter in the model. A maximum likelihood estimation of parameters will give the same result if errors are normal.

We do not give the full computational details for multiple regression as it is unlikely to be instructive, and in practice multiple regression is always done by matrix methods using a computer programme. However, details of the manual calculation of multiple regression coefficients can be found in Sokal & Rohlf (1995). A summary of the matrix method is given in Quinn & Keough (2002). ### Significance testing in multiple regression

#### Overall significance

The overall significance of a multiple regression can be assessed using analysis of variance in the same way as with a simple linear regression.

 Source of variationRegressionErrorTotal df 1n − 2n − 1 SSSSRegSSErrorSSTotal MSMSReg (s2 )MSError (s2Y.X) F-ratioMSReg / MSError P value

where:
• SSTotal is the total sums of squares, or Σ(Y − )2,
• SSRegression is the sums of squares explained by the regression, or Σ(  )2,
• SSError is the unexplained error, or Σ(Y− )2
• is the expected (predicted) value of Y for each value of X,
The F-test in the ANOVA of a multiple regression tests the null hypothesis that all the partial regression slopes equal zero (β1 = ...= βk = 0).

#### Significance of partial regression coefficients

The null hypothesis that any particular partial regression coefficient (βi) equals zero can be tested with a t-test or with a partial F-test.

t-test of coefficient

 t = bi sbi
where
• t is the estimated t-statistic; under the null hypothesis it is a random quantile of the t-distribution with (n − p) degrees of freedom. Note we define p as the number of predictors including the intercept, in other words k + 1 where k is the number of predictor variables.
• bi is the estimated regression coefficient,
• sbi is the standard error of bi.

Partial F-test of coefficient An equivalent test can be carried out by simply dropping the predictor variable of interest from the full model and recalculating the analysis of variance table for the reduced model. A test of the predictor variable can be obtained by calculating the F-ratio of the difference in the residual sums of squares from the full to the reduced model divided by the MSresidual of the full model:

 F = MSdiff SSresid
where
• F is the estimated F-statistic; under the null hypothesis it is a random quantile of the F-distribution with 1 and (n − p) degrees of freedom; p is the number of predictors including the intercept, in other words k + 1 where k is the number of predictor variables.
• MSdiff is the difference between SSresid for the full and reduced models
• SSresid is the residual sums of squares for the full model.

### Nominal Explanatory Variables

Nominal variables (such as gender) can be included by using dummy variables (also known as indicator variables) that take set values or codes to represent the different levels of the nominal variable. For two levels just one dummy variable is required; for multiple (n) levels n-1 dummy variables are required.

The simplest coding system for a binary variable is known as reference coding where the reference group (say females) is coded with 0 and the other group (males) as 1. With this coding the coefficient b for the dummy variable will indicate the average difference between males and females after adjusting for all other effects.

The alternative is to use effects coding where females are coded with 1 and males with −1 . With this coding the coefficient b for the dummy variable will indicate the deviation of the response for males and females from the average of both males and females.

### Interactions

If interactions are not included in the model, it is assumed that effects of the explanatory variables are all additive. In other words if you have two explanatory variables X1 and X2 the change in the response variable for a unit change in X1 remains the same whatever the level of X2. This is often not the case - in fact much more common that effects are not additive. In ANOVA we removed the interaction terms if they were not significant (but remembering that power to detect them was very limited) But if you have multiple explanatory variables the potential number of interactions you can add are enormous! Hence the usual practice is to stick to first order interactions (between two variables) .

The model for two factors then becomes:

 Y = b0 + b1X 1 + b2X2+ b3X 1X2
This can be rewritten as:
 Y = b0 + b1X 1 + (b2+ b3X1)X2
or
 Y = b0 + b2X 2 + (b1+ b3X1)X2

This makes it clear that the effect of X1 on Y also depends on the level of X2; similarly the effect of X2 on Y also depends on the level of X1 (each appear in the equation twice);

If one of the explanatory variables is nominal variable with two levels (a binary variable) then interactions have a special interpretation. Let X1 be a continuous predictor variable and X2 a binary variable with values of 0 = male and 1 female. Then if we specify there is no interaction, we end up with two simple linear regressions:

For males X2=0
 Y = b0 + b1X1
For females X2=1
 Y = (b0 + b2) + b1X1

The two regression lines will be parallel with the difference between males and females (the vertical distance between the lines) being b2.

If on the other hand an interaction term is included as in (1) above, the lines can have different slopes and the models for males and females reduce to:
 Y = b0 + b1X1
 Y = (b0 + b2) + (b1 + b3X1

The difference between slopes is b3 - if this coefficient is statistically significant we can conclude that the slopes for males and females are different. This will give same result as a t-test of slopes but is a neater way to do it!

### Assessing importance of explanatory variables

Relative importance attempts to quantify an individual explanatory variable's contribution to a multiple regression model. If all explanatory variables are uncorrelated this is straightforward. Each variable's contribution could be assessed as the R2 from the univariate regression, which sum to the full multiple regression R2. However, this only works if explanatory variables are uncorrelated and in observational studies there is nearly always a degree of correlation. In this situation it has been argued that relative importance should consider both its direct effect (its correlation with the response variable) and its effect when combined with the other variables in the regression equation (see Johnson & Lebreton (2004) for more on this).

#### Simple relative importance measures

1. The first simple measure only considers the variable's direct effect. We compare what each regressor alone is able to explain using the univariate r2 values. This method is not very useful because it gives no indication of the variable's effect when combined with the other variables, and moreover individual r2values will (almost certainly) not sum to the overall model R2. It is option first in the R package relaimpo.
2. An alternative approach is to use the change in R2 if the variable is dropped from model. This has been termed the usefulness of the variable. This gives the same order of importance as the significance test of individual partial coefficients either with the t-test or the F-test. This measure of importance only considers its effect when combined with other variables. This may of course be precisely what you want, but be aware that if you have two explanatory variables both of which are highly correlated with the response variable (and hence with each other), only one of these will appear to be important using this index. It is option last in the R package relaimpo.

3. Another common practice is to calculate for each variable its standardized regression coefficient. This is calculated by multiplying the coefficient by the standard deviation of the explanatory variable and then dividing it divided by the standard deviation of the response variable. A standardized coefficient is interpreted as the standard deviation change in the dependent variable when the explanatory variable is changed by one standard deviation, holding all other variables constant.

 b'i = bi × sY sXi
where
• b'i is the standardized partial regression coefficient of the ith explanatory variable, • bi is the partial regression coefficient of the ith explanatory variable,
• sY is the standard deviation of the response variable Y,
• sXi is the standard deviation of the ith explanatory variable.

These standardized coefficients have the advantage of being comparable from one independent variable to the other because the unit of measurement has been eliminated. The (squared) standardized coefficient is option beta in the R package relaimpo.

Standardized coefficients have the disadvantage that they are no longer readily interpretable in terms of the original data. Moreover Bring et al. (1994) have suggested that conventional methods to calculate standardized coefficients are incorrect. They argue that since the coefficient is calculated on the basis that all other variables are held constant, the standard deviation of the explanatory variable (sXi) should be calculated on the same basis. Hence we should use a partial standard deviation (holding all other Xi constant) instead of the ordinary standard deviation. If this is done the estimates of relative importance are then equivalent to those indicated by significance tests of the coefficients and change in R2 when variables are dropped from the model.

#### Computer intensive relative importance metrics

There are more sophisticated (and some would argue much better) indices of relative importance, but until recently their use has been constrained by the fact that they are computer intensive. They have the advantage that they decompose the total r2 into non-negative contributions that sum to the total r2.

The approach is based on doing an ANOVA using sequential sums of squares. Division of the sequential sums of squares for each explanatory variable by the total sums of squares gives sequential r2 contributions which (not surprisingly) sum to the total. The problem is that each ordering of variables will produce different values. This dependence on ordering can be dealt with by taking simple unweighted averages over the different orderings (Lindeman et al. (1980)). The approach was later developed further using a variety of measures of fit under the name hierarchical partitioning (Chevan & Sutherland (1991) ). More recently a method using weighted averages has been proposed by Feldman (2005) . Note however that there is no unique unchallenged metric for relative importance when there are correlated regressors - important thing is to decide on which to use at the start so as not to bias results by choosing which metric gives the desired answer!

### Model selection criteria

The most commonly used criterion used to select a variable is the P-value derived from a partial F test of the coefficient. However, this has all sorts of problems, not the least of which is overfitting. Hence increasingly other criteria are being used to select the variables to enter a model:
1. Adjusted r2

r2 cannot itself be used as the criterion by which to select a model because it always increases when you add an extra variable. Hence, use of r2 (and in fact of several other criteria) would inevitable lead to overfitting. Instead one uses the adjusted r2 which is calculated using mean squares rather than sums of squares. Note adjusted r2 can be negative if you have an especially poorly fitting regression - the apparent contradiction of a having a negative squared value is explained by the fact that it is a n adjusted squared value - not a squared adjusted one!

2. Mallow's Cp

Mallow's Cp is related to the adjusted r2 but has a heavier penalty for increasing the number of explanatory variables. As such it more robust to overfitting. Mallow' Cp is also closely related to AIC (see below). It is calculated as follows:

Cp = SSRp/MSRM - n + 2p
where SSRp is the summed squared residual error in Y from a model with p coefficients, and MSRM is the mean squared residual error in Y from the maximal model.

Unlike the adjusted r2 it is not measured on a 0-to-1 scale. Its values are typically positive and greater than 1. The 'best' models will be the ones with the lowest values of Cp. Since E(SSRp) = (n-p)MSRM Cp

• For a full model E(Cp) = p and good fits are expected to approximately equal p.
• For a bad fit model Cp is expected to be much greater than p.
• So optimal models have Cp ≤ p and a low p.
3. Akaike's Information criterion (AIC)

Assuming errors in Y are normal,

AICN=n{log(RSS/n)} + 2p where n{log(RSS/n)} is the model's deviance, RSS is the residual sum of squares, n is the number of observations and p is the number of coefficients.

Or, using the adjusted R2,

AICR2 = n{log([1-R2]/n)} + 2p

Assuming the MLE is normal, where χ2 is used to test the overall model,

AICχ2 = χ2 + 2p

The quasi information criterion is used where c is the inflation factor,

QAIC = AIC/c

To reduce overfitting in small samples there is a 'second order' correction, where n is large it approaches zero,

AICc = AIC + 2p(p+1)/(n-p-1)

Assuming errors in Y are normal, McQuarrie & Tsai (1998) strongly recommend their corrected measure,

AICu = log(RSS/[n-p]) + (n+p)/(n-p-2)
Especially where n is small and p large, bias correction via bootstrapping may largely eliminate overfitting. (see Ishiguro et al. (1997) )
4. #### Bayesian or Schwarz information criterion (BIC)

This is a Bayesian equivalent of AIC, where Lp is the maximized log likelihood given p regression coefficients,
-2Lp +p.ln(n)
It has a heavier penalty for increasing the number of explanatory variables than AIC, and as such is more robust to overfitting

### Model selection methods

Model/variable selection methods are summarized below:

1. Forward selection
This method starts with the null model. For each variable, the test-statistic (commonly the F-ratio) is calculated as a measure of the variable's contribution to the model. The variable with the largest value that is greater than a preset value is added to the model. Then the test-statistic is calculated again for each of the variables still remaining, and the evaluation process is repeated. Thus, variables are added sequentially the model one by one until no remaining variable produces a test statistic value that is greater than the preset value. Variables that have made it into the model are not tested again and cannot therefore be removed.
2. Backward elimination
This method starts with the maximal model and deletes variables one by one from the model until all remaining variables contribute something significant to the dependent variable. At each step, the variable showing the smallest contribution to the model is deleted. One common practice to identify confounders is to check what effect removal of a variable has on the other coefficients when it is dropped. If removal of a variable affects them by more than some predetermined amount (say 25%) the variable is forced into the model even if its effect is not significant.

3. Stepwise
This method is a modification of the forward selection approach. Variables are added to the model one at a time but each time a variable is added a backward elimination step occurs to test whether any variables entered previously can be removed.

4. All possible subsets
The least biased strategy is to compare all possible subsets. However, it does require considerable computing power which was often not available until recently and it is prone to overfitting. Nevertheless it may be a viable strategy if there are relatively few variables.

5. Cross validation

This may well be the optimal method for model selection although it still seems to be little used in biological applications. The commonest form is k-fold cross validation. The data are first split into k exclusive chunks of data. The model is built with k-1 chunks and then tested with the rest of the data. This process is repeated until each chunk has been used for testing. The model which has the smallest errors is taken as the best model (see .

#### Problems of selection methods

We have discussed this extensively in the core text and only summarize the main points here. One first needs to consider what the modelling is designed to achieve: is it to obtain the best single 'predictive' model or is it an explanatory approach for developing new insights and investigating causal relationships between variables. If the aim is only to get the best predictive model, then one may not bother too much about how many variables are included or how much sense it makes. If the aim is to investigate causal relationships, then the approach is rather different (for example the multicollinearity issue must be tackled). It should also be remembered that there will usually be several best models - traditionally researchers have narrowed these down to one best model, but there is increasingly a tendency to give the group of 'best' models rather than just one 'winning' model. This makes much more sense!

All the above methods of model selection are potentially open to abuse, but some more so than others. Stepwise methods in particular yield confidence intervals for both effects and predicted values that are falsely narrow. The regression coefficients are biased, and P-values do not have the proper meaning. These problems are even worse if important variables are missing or mis-specified or collinear with other variables. Apart from cross-validation all the above methods will tend to overfit, and hence will perform poorly if applied to a different data set than that used to develop them.

A more general problem arises if regression modelling is applied blindly relying on the automated (or semi-automated) procedure of the software. As Henderson &Velleman (1981) point out the data analyst knows more than the computer - failure to use that knowledge produces inadequate data analysis. Before submitting variables to the model, one may need to transform a variable to increase the information content - or make its distribution more symmetrical - or make the relationship more linear. Constructed variables (say X1/X2) may be more useful than original variables especially if there is major collinearity. ### Assumptions

1. Model is correctly specified
It is assumed that all important explanatory variables are included in the model in the correct form.

2. Linearity
It is assumed that each explanatory variable is related linearly to the response variable through its regression coefficient.

3. Normality and Equal Variance
Multiple regression assumes that the residuals are normally distributed and have equal variance across the range of values of the explanatory variables. These assumptions are best evaluated with graphical methods and related statistics to assess the residuals as detailed in the core text If the assumptions do not hold, the response variable can (sometimes) be transformed or robust regression methods can be used instead 4. Multicollinearity

This term describes the situation when two variables are so highly correlated that it is not possible to reliably estimate their individual regression coefficients. The term was originally used for when there was an exact linear relationship between two variables (for example the same variable put in the model but with different units of measurement). It is now commonly used for where there is simply a high degree of correlation between two explanatory variables. It is a particular problem with environmental variables such as temperature and relative humidity because there is a clear inverse relationship between the two.

Note however that multicollinearity does not affect the ability of the equation to predict response. So if multiple regression is being used purely for predictive purposes, then it may not be a problem. But it does make it very difficult to estimate the contribution of individual variables. For example if collinear variables are dropped or added, coefficients may change suddenly and unpredictably. Also the precision of estimates will be reduced because standard errors will be too large and important variables may be omitted. Multicollinearity can even yield a significant overall regression model fit in which none of the individual regression coefficients are significant

The degree of collinearity of variables in a model can be assessed by estimating tolerance of a variable as [1-r2] (where r2 is the coefficient of determination) when regressed on another. Tolerance is more commonly expressed as the variance inflation factor (VIF) (1/tolerance) so the higher the VIF, the greater the degree of collinearity. Unfortunately there is no theoretical basis for saying what value of the VIF indicates unacceptable collinearity. Some authorities suggest that any value of VIF > 10 is unacceptable; others put the limit at VIF > 5. Nor is there any theoretical basis for saying what action one should take. The options are to remove the variable (which runs the risk of mis-specification bias if it is significant), to use another variable that captures the information in correlated variables (for example with principal components regression or with a simple constructed derived variable), or (if it is a curvilinear relationship) to use centering. This transforms the variable by subtracting the mean from all of the observations.

5. Outliers & influential data points
Unusual observations may heavily influence the magnitude of the estimate of one or more of the regression coefficients. These points are called outliers if the value of the response variable is unusually high or lows. They are called high leverage points if value of an explanatory variable is unusually high or low. Scatterplots of data help to identify such points along with regression diagnostics, such as Cook's distance, the diagonal values of the hat matrix, and Studentized residuals. Except where otherwise specified, all text and images on this page are copyright InfluentialPoints, all rights reserved. Images not copyright InfluentialPoints credit their source on web-pages attached via hypertext links from those images.