Principles
The approach we took to simple linear regression generalizes directly to multiple explanatory variables. Hence for two explanatory variables we can write:
Y = β_{0} + β_{1}X _{1} + β_{2}X_{2} + ε

where
 Y is the value of the response which is predicted to lie on the bestfit regression plane'
 β0, is the intercept (the value of Y when both X_{1} and X_{2} are 0),
 β_{1} is the partial regression coefficient which quantifies the sensitivity of Y to change in X_{1} , adjusting for the effect of X_{2} on Y.
 β_{2} is the partial regression coefficient which quantifies the sensitivity of Y to change in X_{2}, adjusting for the effect of X_{1} on Y,
 ε is a random error term.

The model can be further generalized to any number of explanatory variables. Note that the slope parameters are termed partial regression coefficients because they measure the change in Y per unit change in the respective X whilst holding all other X variables constant. If the X variables are measured in different units, it may be preferable to use standardized coefficients that are independent of the units the variables are measured in.
It is important to note that just considering one explanatory variable if several are acting on a response variable can be very misleading. Associations can be masked if important explanatory variables are left out of the model resulting in biased coefficients. In the epidemiological field this has been termed misspecification bias. The same may apply if some of the relationships are not linear.
Estimation of model parameters
Model parameters in a multiple regression model are usually estimated using ordinary least squares minimizing the sum of squared deviations between each observed value and predicted values. It involves solving a set of simultaneous normal equations, one for each parameter in the model. A maximum likelihood estimation of parameters will give the same result if errors are normal.
We do not give the full computational details for multiple regression as it is unlikely to be instructive, and in practice multiple regression is always done by matrix methods using a computer programme. However, details of the manual calculation of multiple regression coefficients can be found in Sokal & Rohlf (1995). A summary of the matrix method is given in Quinn & Keough (2002).
Significance testing in multiple regression
Overall significance
The overall significance of a multiple regression can be assessed using analysis of variance in the same way as with a simple linear regression.
Source of variation Regression Error Total
 df 1 n − 2 n − 1
 SS SS_{Reg} SS_{Error} SS_{Total
}  MS MS_{Reg} (s^{2}_{}) MS_{Error} (s^{2}_{Y.X})
 Fratio MS_{Reg} / MS_{Error}
 P value

where:
 SS_{Total} is the total sums of squares, or Σ(Y − )^{2},
 SS_{Regression} is the sums of squares explained by the regression, or Σ( − )^{2},
 SS_{Error} is the unexplained error, or Σ(Y− )^{2}
 is the expected (predicted) value of Y for each value of X,
The
Ftest in the ANOVA of a multiple regression tests the null hypothesis that all the partial regression slopes equal zero (β
_{1} = ...= β
_{k} = 0).
Significance of partial regression coefficients
The null hypothesis that any particular partial regression coefficient (β_{i}) equals zero can be tested with a ttest or with a partial Ftest.
ttest of coefficient
t
 =
 b_{i}


s_{bi}

where
 t is the estimated tstatistic; under the null hypothesis it is a random quantile of the tdistribution with (n − p) degrees of freedom. Note we define p as the number of predictors including the intercept, in other words k + 1 where k is the number of predictor variables.
 b_{i} is the estimated regression coefficient,
 s_{bi} is the standard error of b_{i}.

Partial Ftest of coefficient
An equivalent test can be carried out by simply dropping the predictor variable of interest from the full model and recalculating the analysis of variance table for the reduced model. A test of the predictor variable can be obtained by calculating the Fratio of the difference in the residual sums of squares from the full to the reduced model divided by the MS_{residual} of the full model:
F
 =
 MS_{diff
} 

SS_{resid}

where
 F is the estimated Fstatistic; under the null hypothesis it is a random quantile of the Fdistribution with 1 and (n − p) degrees of freedom; p is the number of predictors including the intercept, in other words k + 1 where k is the number of predictor variables.
 MS_{diff} is the difference between SS_{resid} for the full and reduced models
 SS_{resid} is the residual sums of squares for the full model.

Nominal Explanatory Variables
Nominal variables (such as gender) can be included by using dummy variables (also known as indicator variables) that take set values or codes to represent the different levels of the nominal variable. For two levels just one dummy variable is required; for multiple (n) levels n1 dummy variables are required.
The simplest coding system for a binary variable is known as reference coding where the reference group (say females) is coded with 0 and the other group (males) as 1.
With this coding the coefficient b for the dummy variable will indicate the average difference between males and females after adjusting for all other effects.
The alternative is to use effects coding where females are coded with 1 and males with −1 . With this coding the coefficient b for the dummy variable will indicate the deviation of the response for males and females from the average of both males and females.
Interactions
If interactions are not included in the model, it is assumed that effects of the explanatory variables are all additive. In other words if you have two explanatory variables X1 and X2 the change in the response variable for a unit change in X1 remains the same whatever the level of X2. This is often not the case  in fact much more common that effects are not additive. In ANOVA we removed the interaction terms if they were not significant (but remembering that power to detect them was very limited) But if you have multiple explanatory variables the potential number of interactions you can add are enormous! Hence the usual practice is to stick to first order interactions (between two variables) .
The model for two factors then becomes
Y = b_{0} + b_{1}X _{1} + b_{2}X_{2}+ b_{3}X _{1}X_{2}

This can be rewritten as:
Y = b_{0} + b_{1}X _{1} + (b_{2}+ b_{3}X_{1})X_{2}

or
Y = b_{0} + b_{2}X _{2} + (b_{1}+ b_{3}X_{1})X_{2}

This makes it clear that the effect of X_{1} on Y also depends on the level of X2; similarly the effect of X2 on Y also depends on the level of X_{1} (each appear in the equation twice);
If one of the explanatory variables is nominal variable with two levels (a binary variable) then interactions have a special interpretation. Let X_{1} be a continuous predictor variable and X_{2} a binary variable with values of 0 = male and 1 female. Then if we specify there is no interaction, we end up with two simple linear regressions:
For males X_{2}=0
For females X
_{2}=1
Y = (b_{0} + b_{2}) + b_{1}X_{1}

The two regression lines will be parallel with the difference between males and females (the vertical distance between the lines) being b_{2}.
If on the other hand an interaction term is included as in (1) above, the lines can have different slopes and the models for males and females reduce to:
Y = (b_{0} + b_{2}) + (b_{1} + b_{3}X_{1}

The difference between slopes is b_{3}  if this coefficient is statistically significant we can conclude that the slopes for males and females are different. This will give same result as a ttest of slopes but is a neater way to do it!
Assessing importance of explanatory variables
Relative importance attempts to quantify an individual explanatory variable's contribution to a multiple regression model. If all explanatory variables are uncorrelated this is straightforward. Each variable's contribution could be assessed as the R^{2} from the univariate regression, which sum to the full multiple regression R^{2}. However, this only works if explanatory variables are uncorrelated and in observational studies there is nearly always a degree of correlation. In this situation it has been argued that relative importance should consider both its direct effect (its correlation with the response variable) and its effect when combined with the other variables in the regression equation (see Johnson & Lebreton (2004) for more on this).
Simple relative importance measures
 The first simple measure only considers the variable's direct effect. We compare what each regressor alone is able to explain using the univariate r^{2} values. This method is not very useful because it gives no indication of the variable's effect when combined with the other variables, and moreover individual r^{2}values will (almost certainly) not sum to the overall model R^{2}. It is option first in the R package relaimpo.
An alternative approach is to use the change in R^{2} if the variable is dropped from model. This has been termed the usefulness of the variable. This gives the same order of importance as the significance test of individual partial coefficients either with the ttest or the Ftest. This measure of importance only considers its effect when combined with other variables. This may of course be precisely what you want, but be aware that if you have two explanatory variables both of which are highly correlated with the response variable (and hence with each other), only one of these will appear to be important using this index. It is option last in the R package relaimpo.
Another common practice is to calculate for each variable its standardized regression coefficient. This is calculated by multiplying the coefficient by the standard deviation of the explanatory variable and then dividing it divided by the standard deviation of the response variable. A standardized coefficient is interpreted as the standard deviation change in the dependent variable when the explanatory variable is changed by one standard deviation, holding all other variables constant.
b'_{i}
 =
 b_{i}
 ×
 s_{Y
} 

s_{Xi
} 
where
 b'_{i} is the standardized partial regression coefficient of the ith explanatory variable,
 b_{i} is the partial regression coefficient of the ith explanatory variable,
 s_{Y} is the standard deviation of the response variable Y,
 s_{Xi} is the standard deviation of the ith explanatory variable.

These standardized coefficients have the advantage of being comparable from one independent variable to the other because the unit of measurement has been eliminated. The (squared) standardized coefficient is option beta in the R package relaimpo.
Standardized coefficients have the disadvantage that they are no longer readily interpretable in terms of the original data. Moreover Bring et al. (1994) have suggested that conventional methods to calculate standardized coefficients are incorrect. They argue that since the coefficient is calculated on the basis that all other variables are held constant, the standard deviation of the explanatory variable (s_{Xi}) should be calculated on the same basis. Hence we should use a partial standard deviation (holding all other X_{i} constant) instead of the ordinary standard deviation. If this is done the estimates of relative importance are then equivalent to those indicated by significance tests of the coefficients and change in R^{2} when variables are dropped from the model.
Computer intensive relative importance metrics
There are more sophisticated (and some would argue much better) indices of relative importance, but until recently their use has been constrained by the fact that they are computer intensive. They have the advantage that they decompose the total r^{2} into nonnegative contributions that sum to the total r^{2}.
The approach is based on doing an ANOVA using sequential sums of squares. Division of the sequential sums of squares for each explanatory variable by the total sums of squares gives sequential r^{2} contributions which (not surprisingly) sum to the total. The problem is that each ordering of variables will produce different values. This dependence on ordering can be dealt with by taking simple unweighted averages over the different orderings (Lindeman et al. (1980)). The approach was later developed further using a variety of measures of fit under the name hierarchical partitioning (Chevan & Sutherland (1991)). More recently a method using weighted averages has been proposed by Feldman (2005). Note however that there is no unique unchallenged metric for relative importance when there are correlated regressors  important thing is to decide on which to use at the start so as not to bias results by choosing which metric gives the desired answer!
Model selection criteria
The most commonly used criterion used to select a variable is the
Pvalue derived from a partial
F test of the
coefficient. However, this has all sorts of problems, not the least of which is overfitting.
Hence increasingly other criteria are being used to select the variables to enter a model:
 Adjusted r^{2}
r^{2} cannot itself be used as the criterion by which to select a model because it always increases when you add an extra variable. Hence, use of r^{2} (and in fact of several other criteria) would inevitable lead to overfitting. Instead one uses the adjusted r^{2} which is calculated using mean squares rather than sums of squares. Note adjusted r^{2} can be negative if you have an especially poorly fitting regression  the apparent contradiction of a having a negative squared value is explained by the fact that it is a n adjusted squared value  not a squared adjusted one!
Mallow's Cp
Mallow's Cp is related to the adjusted r^{2} but has a heavier penalty for increasing the number of explanatory variables. As such it more robust to overfitting. Mallow' C_{p} is also closely related to AIC (see below). It is calculated as follows:
C_{p} = SSR_{p}/MSR_{M}  n + 2p
where SSR_{p} is the summed squared residual error in Y from a model with p coefficients, and MSR_{M} is the mean squared residual error in Y from the maximal model.
Unlike the adjusted r^{2} it is not measured on a 0to1 scale. Its values are typically positive and greater than 1. The 'best' models will be the ones with the lowest values of Cp. Since E(SSR_{p}) = (np)MSR_{M} C_{p}
 For a full model E(C_{p}) = p and good fits are expected to approximately equal p.
 For a bad fit model C_{p} is expected to be much greater than p.
 So optimal models have C_{p} ≤ p and a low p.
Akaike's Information criterion (AIC)
Assuming errors in Y are normal,
AIC_{N}=n{log(RSS/n)} + 2p
where n{log(RSS/n)} is the model's deviance, RSS is the residual sum of squares, n is the number of observations and p is the number of coefficients.
Or, using the adjusted R^{2},
AIC_{R}2 = n{log([1R^{2}]/n)} + 2p
Assuming the MLE is normal, where χ^{2} is used to test the overall model,
AIC_{χ}2 = χ^{2} + 2p
The quasi information criterion is used where c is the inflation factor,
QAIC = AIC/c
To reduce overfitting in small samples there is a 'second order' correction, where n is large it approaches zero,
AIC_{c} = AIC + 2p(p+1)/(np1)
Assuming errors in Y are normal, McQuarrie & Tsai (1998) strongly recommend their corrected measure,
AIC_{u} = log(RSS/[np]) + (n+p)/(np2)
Especially where n is small and p large, bias correction via bootstrapping may largely eliminate overfitting.
(see Ishiguro et al. (1997))
Bayesian or Schwarz information criterion (BIC)
This is a Bayesian equivalent of AIC, where L_{p} is the maximized log likelihood given p regression coefficients,
2L_{p} +p.ln(n)
It has a heavier penalty for increasing the number of explanatory variables than AIC, and as such is more robust to overfitting
Model selection methods
Model/variable selection methods are summarized below:
 Forward selection
This method starts with the null model. For each variable, the teststatistic (commonly the Fratio) is calculated as a measure of the variable's contribution to the model. The variable with the largest value that is greater than a preset value is added to the model. Then the teststatistic is calculated again for each of the variables still remaining, and the evaluation process is repeated. Thus, variables are added sequentially the model one by one until no remaining variable produces a test statistic value that is greater than the preset value. Variables that have made it into the model are not tested again and cannot therefore be removed.
Backward elimination
This method starts with the maximal model and deletes variables one by one from the model until all remaining variables contribute something significant to the dependent variable. At each step, the variable showing the smallest contribution to the model is deleted. One common practice to identify confounders is to check what effect removal of a variable has on the other coefficients when it is dropped. If removal of a variable affects them by more than some predetermined amount (say 25%) the variable is forced into the model even if its effect is not significant.
Stepwise
This method is a modification of the forward selection approach. Variables are added to the model one at a time but each time a variable is added a backward elimination step occurs to test whether any
variables entered previously can be removed.
All possible subsets
The least biased strategy is to compare all possible subsets. However, it does require considerable computing power which was often not available until recently and it is prone to overfitting. Nevertheless it may be a viable strategy if there are relatively few variables.
Cross validation
This may well be the optimal method for model selection although it still seems to be little used in biological applications. The commonest form is kfold cross validation. The data are first split into k exclusive chunks of data. The model is built with k1 chunks and then tested with the rest of the data. This process is repeated until each chunk has been used for testing. The model which has the smallest errors is taken as the best model (see .
Problems of selection methods
We have discussed this extensively in the core text and only summarize the main points here. One first needs to consider what the modelling is designed to achieve: is it to obtain the best single 'predictive' model or is it an explanatory approach for developing new insights and investigating causal relationships between variables. If the aim is only to get the best predictive model, then one may not bother too much about how many variables are included or how much sense it makes. If the aim is to investigate causal relationships, then the approach is rather different (for example the multicollinearity issue must be tackled). It should also be remembered that there will usually be several best models  traditionally researchers have narrowed these down to one best model, but there is increasingly a tendency to give the group of 'best' models rather than just one 'winning' model. This makes much more sense!
All the above methods of model selection are potentially open to abuse, but some more so than others. Stepwise methods in particular yield confidence intervals for both effects and predicted values that are falsely narrow. The regression coefficients are biased, and Pvalues do not have the proper meaning. These problems are even worse if important variables are missing or misspecified or collinear with other variables. Apart from crossvalidation all the above methods will tend to overfit, and hence will perform poorly if applied to a different data set than that used to develop them.
A more general problem arises if regression modelling is applied blindly relying on the automated (or semiautomated) procedure of the software. As Henderson &Velleman (1981) point out the data analyst knows more than the computer  failure to use that knowledge produces inadequate data analysis. Before submitting variables to the model, one may need to transform a variable to increase the information content  or make its distribution more symmetrical  or make the relationship more linear. Constructed variables (say X1/X2) may be more useful than original variables especially if there is major collinearity.
Assumptions
Model is correctly specified
It is assumed that all important explanatory variables are included in the model in the correct form.
Linearity
It is assumed that each explanatory variable is related linearly to the response variable through its regression coefficient.
Normality and Equal Variance
Multiple regression assumes that the residuals are normally distributed and have equal variance across the range of values of the explanatory variables. These assumptions are best evaluated with graphical methods and related statistics to assess the residuals as detailed in the core text If the assumptions do not hold, the response
variable can (sometimes) be transformed or robust regression methods can be used instead
Multicollinearity
This term describes the situation when two variables are so highly correlated that it is not possible to reliably estimate their individual regression coefficients. The term was originally used for when there was an exact linear relationship between two variables (for example the same variable put in the model but with different units of measurement). It is now commonly used for where there is simply a high degree of correlation between two explanatory variables. It is a particular problem with environmental variables such as temperature and relative humidity because there is a clear inverse relationship between the two.
Note however that multicollinearity does not affect the ability of the equation to predict response. So if multiple regression is being used purely for predictive purposes, then it may not be a problem. But it does make it very difficult to estimate the contribution of individual variables. For example if collinear variables are dropped or added, coefficients may change suddenly and unpredictably. Also the precision of estimates will be reduced because standard errors will be too large and important variables may be omitted. Multicollinearity can even yield a significant overall regression model fit in which none of the individual regression coefficients are significant
The degree of collinearity of variables in a model can be assessed by estimating tolerance of a variable as [1r^{2}] (where r^{2} is the coefficient of determination) when regressed on another. Tolerance is more commonly expressed as the variance inflation factor (VIF) (1/tolerance) so the higher the VIF, the greater the degree of collinearity. Unfortunately there is no theoretical basis for saying what value of the VIF indicates unacceptable collinearity. Some authorities suggest that any value of VIF > 10 is unacceptable; others put the limit at VIF > 5. Nor is there any theoretical basis for saying what action one should take. The options are to remove the variable (which runs the risk of misspecification bias if it is significant), to use another variable that captures the information in correlated variables (for example with principal components regression or with a simple constructed derived variable), or (if it is a curvilinear relationship) to use centering. This transforms the variable by subtracting the mean from all of the observations.
Outliers & influential data points
Unusual observations may heavily influence the magnitude of the estimate
of one or more of the regression coefficients. These points
are called outliers if the value of the response variable is unusually high or lows. They are called high leverage points if value of an explanatory variable is unusually high or low. Scatterplots of data help to identify such points along with regression diagnostics, such as Cook's distance, the diagonal values of the hat
matrix, and Studentized residuals.