 
Multiple linear regression
Worked example 1
Our first worked example uses data from Climent on a (possibly hypothetical) study on factors affecting number of man hours worked at a hospital.
Factors affecting no. person hours worked at hospitals

 Response variable  Explanatory variables

Hosp  Monthly mean personhours  Mean daily patient load  Monthly Xrays  Monthly beddays  Eligible Population  Mean length of stay

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
 566.5 696.8 1033.2 1603.6 1611.4 1613.3 1854.2 2160.6 2305.6 3503.9 3571.9 3741.4 4026.5 10343.8 11732.2 15414.9 18854.4
 15.570 44.020 20.420 18.740 49.200 44.920 55.480 59.280 94.390 128.020 96.000 131.420 127.210 252.900 409.200 463.700 510.220
 2463 2048 3940 6505 5723 11520 5779 5969 8461 20106 13313 10771 15543 36194 34703 39204 86533
 472.9 1339.8 620.3 568.3 1497.6 1365.8 1687.0 1639.9 2872.3 3655.1 2912.0 3921.0 2865.7 7684.1 12446.3 14098.4 15524.0
 18.000 9.500 12.800 36.700 35.700 24.000 43.300 46.700 78.700 180.500 60.900 103.700 126.800 157.700 169.400 331.400 371.600
 4.4500 6.9200 4.2800 3.9000 5.5000 4.6000 5.6200 5.1500 6.1800 6.1500 5.8800 4.8800 5.5000 7.0000 10.7800 7.0500 6.3500

Consider nature of study
The response variable is the monthly mean personhours. We are unsure of the purpose of the study  whether to obtain a predictive equation (so as to predict manpower need) or as an explanatory study to try to understand what factors affect manpower needs. Since most of the variables (apart from population) can only be determined post hoc, we will assume it is an explanatory study. This also seems likely because if one is trying to predict weekly manpower need one would certainly need to take into account climatic factors such as average temperature (hospitals tend to be much busier in winter than summer). This means that we will not just try to get the model that has the highest r^{2}, but will instead focus more on the minimum adequate model approach (whilst not being over ready to abandon variables in the interests of parsimony).
Hospitals have been ordered by monthly mean person hours. Even a cursory examination of the data reveals that the first four of the explanatory variables are fairly clearly related to the response variable and to each other. This would certainly be expected for patient load, number of xrays and number of beddays  and we might immediately question whether all these variables are required. However, we will start by getting a general idea of the distributions of the variables.
Draw boxplots
Plot out data to get a visual assessment of the distribution of the response and explanatory variables.
The response variable and the first four of the explanatory variables all have very similar right skewed distributions whilst mean stay is more symmetrically distributed  apart from a single unusually high value (hospital #15). If only the distribution of the response variable was skewed, one would certainly consider a transformation  but under the circumstances we will proceed without any transformations (at least for now).
Examine relationships between variables
Plot out data to get a visual assessment of the relationships between variables.
If you look along the top row you can see plots of the response variable (manh) against each of the explanatory variables. When taken on their own, each appears to be (positively) related to manh. However the relationships are so similar (especially with load, xray, beds and popn), that we are likely to have problems with multicollinearity. This is most marked (not surprisingly!) with patient load and number of beddays which show an almost perfect correlation.
We can also examine this using a cross correlation matrix.
Again very high correlations of the response variable with all of the first four explanatory variables  and similarly high correlations between those variables. We could try some constructed variables  or (more easily) throw out some of the variables which carry the same information. We will do this on the basis of each variable's variance inflation factor with (or without) simultaneous consideration of the Pvalue for each variable.
Examine 'full' model with variance inflation factors
Only one of these variables (xray) is coming out significant at present, although stay is close to significance. We will first select factors entirely on the basis of having the highest variance inflation factor (VIF) if it is greater than 10, following the procedure used by Climent.
Removal of collinear variables
Below we have simply removed variables on the basis of the highest VIF, recalculating the new VIF values each time. The variable 'load' is the first to go, followed by 'beds' and 'popn', so we end up with a model containing only two explanatory variables 'xray' and 'stay'.
Since both variables are significant, it seems unlikely that removal of either of them would improve the model (you can check with using update followed by an ANOVA), so we will proceed with checking diagnostics.

The first plot (top left) shows the residuals plotted against the fitted values. It shows little evidence of variance increasing for larger values of the response variable, although we appear to have an outlier in observation #16. The next plot (top right) the square root of the standardized residuals plotted against the fitted values. This again suggests that observation #16 is not behaving well, and might make us concerned that variance is indeed increasing for larger values of the response variable. Both the normal QQ plot and the Cooks distance plot reinforce our concerns about observation #16.
One option is to remove observation #16  but remember that the full procedure must be run again. Below we carry out the same analysis but without observation #16.
We clearly have a better fitting model with no clear outliers (but then this is fairly inevitable after rejecting an outlier!). Still some concerns about a variance mean relationship  but cannot readily be solved with a transformation of the response variable (either log or square root).
Now a slightly different procedure by which to eliminate collinear variables  will select on the basis of both highest variance inflation factor (VIF) and the P value and will use border line VIF of 10 rather than 5.
This produces a more satisfying (comprehensible) model taking account of eligible population along with number of x rays and period of stay.


