InfluentialPoints.com
Biology, images, analysis, design...
 Use/Abuse Principles How To Related
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

# Multiple linear regression

#### Worked example 1

Our first worked example uses data from Climent on a (possibly hypothetical) study on factors affecting number of man hours worked at a hospital.

 Factors affecting no. person hours worked at hospitals Response variable Explanatory variables Hosp Monthly meanperson-hours Mean dailypatient load Monthly X-rays Monthly beddays EligiblePopulation Mean length of stay 1234567891011121314151617 566.5696.81033.21603.61611.41613.31854.22160.62305.63503.93571.93741.44026.510343.811732.215414.918854.4 15.57044.02020.42018.74049.20044.92055.48059.28094.390128.02096.000131.420127.210252.900409.200463.700510.220 24632048394065055723115205779596984612010613313107711554336194347033920486533 472.91339.8620.3568.31497.61365.81687.01639.92872.33655.12912.03921.02865.77684.112446.314098.415524.0 18.0009.50012.80036.70035.70024.00043.30046.70078.700180.50060.900103.700126.800157.700169.400331.400371.600 4.45006.92004.28003.90005.50004.60005.62005.15006.18006.15005.88004.88005.50007.000010.78007.05006.3500

1. #### Consider nature of study

The response variable is the monthly mean person-hours. We are unsure of the purpose of the study - whether to obtain a predictive equation (so as to predict manpower need) or as an explanatory study to try to understand what factors affect manpower needs. Since most of the variables (apart from population) can only be determined post hoc, we will assume it is an explanatory study. This also seems likely because if one is trying to predict weekly manpower need one would certainly need to take into account climatic factors such as average temperature (hospitals tend to be much busier in winter than summer). This means that we will not just try to get the model that has the highest r2, but will instead focus more on the minimum adequate model approach (whilst not being over ready to abandon variables in the interests of parsimony).

Hospitals have been ordered by monthly mean person hours. Even a cursory examination of the data reveals that the first four of the explanatory variables are fairly clearly related to the response variable and to each other. This would certainly be expected for patient load, number of x-rays and number of bed-days - and we might immediately question whether all these variables are required. However, we will start by getting a general idea of the distributions of the variables.

2. #### Draw boxplots

Plot out data to get a visual assessment of the distribution of the response and explanatory variables.

The response variable and the first four of the explanatory variables all have very similar right skewed distributions whilst mean stay is more symmetrically distributed - apart from a single unusually high value (hospital #15). If only the distribution of the response variable was skewed, one would certainly consider a transformation - but under the circumstances we will proceed without any transformations (at least for now).

3. #### Examine relationships between variables

Plot out data to get a visual assessment of the relationships between variables.

If you look along the top row you can see plots of the response variable (manh) against each of the explanatory variables. When taken on their own, each appears to be (positively) related to manh. However the relationships are so similar (especially with load, xray, beds and popn), that we are likely to have problems with multicollinearity. This is most marked (not surprisingly!) with patient load and number of bed-days which show an almost perfect correlation.

We can also examine this using a cross correlation matrix.

 Using R manh load xray beds popn stay manh 1.000 0.986 0.945 0.985 0.940 0.579 load 0.986 1.000 0.907 0.999 0.936 0.671 xray 0.945 0.907 1.000 0.906 0.910 0.447 beds 0.985 0.999 0.906 1.000 0.928 0.672 popn 0.940 0.936 0.910 0.928 1.000 0.463 stay 0.579 0.671 0.447 0.672 0.463 1.000

Again very high correlations of the response variable with all of the first four explanatory variables - and similarly high correlations between those variables. We could try some constructed variables - or (more easily) throw out some of the variables which carry the same information. We will do this on the basis of each variable's variance inflation factor with (or without) simultaneous consideration of the P-value for each variable.

4. #### Examine 'full' model with variance inflation factors

 Using RResiduals: Min 1Q Median 3Q Max -706.122 -370.105 -1.370 223.879 1576.586 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 2113.97007 1016.45482 2.080 0.0617 . load 26.35950 24.58374 1.072 0.3066 xray 0.05603 0.02145 2.613 0.0241 * beds 0.24161 0.73327 0.329 0.7480 popn -5.72753 6.12260 -0.935 0.3696 stay -427.77354 194.67169 -2.197 0.0503 . --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 646.6 on 11 degrees of freedom Multiple R-squared: 0.9907, Adjusted R-squared: 0.9865 F-statistic: 234.5 on 5 and 11 DF, p-value: 8.708e-11 >vif(lm(manh~load+xray+beds+popn+stay,data=hosp)) load xray beds popn stay 599.852743 7.971065 498.226668 16.720139 3.639526

Only one of these variables (xray) is coming out significant at present, although stay is close to significance. We will first select factors entirely on the basis of having the highest variance inflation factor (VIF) if it is greater than 10, following the procedure used by Climent.

5. #### Removal of collinear variables

Below we have simply removed variables on the basis of the highest VIF, recalculating the new VIF values each time. The variable 'load' is the first to go, followed by 'beds' and 'popn', so we end up with a model containing only two explanatory variables 'xray' and 'stay'.

 Using R> mod1=lm(manh~load+xray+beds+popn+stay,data=hosp) > # REMOVE LOAD AS HAS THE LARGEST VIF > 5 > mod2=update(mod1,~.-load) > library(car) > vif(mod2) xray beds popn stay 7.969596 21.351454 11.420026 3.186518 > # REMOVE BEDS AS HAS THE LARGEST VIF >5 > mod3=update(mod2,~.-beds) > vif(mod3) xray popn stay 5.874181 5.984406 1.278709 > # REMOVE POPN AS HAS THE LARGEST VIF >5 > mod4=update(mod3,~.-popn) > vif(mod4) xray stay 1.249213 1.249213 > #ALL VIF < 5 SO EXAMINE MODEL > summary(mod4) Call: lm(formula = manh ~ xray + stay, data = hosp) Residuals: Min 1Q Median 3Q Max -2086.0 -694.2 -160.2 425.9 4926.0 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -3.136e+03 1.619e+03 -1.937 0.0732 . xray 2.242e-01 2.153e-02 10.412 5.65e-08 *** stay 6.859e+02 2.892e+02 2.372 0.0326 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 1640 on 14 degrees of freedom Multiple R-squared: 0.9239, Adjusted R-squared: 0.913 F-statistic: 85 on 2 and 14 DF, p-value: 1.476e-08

Since both variables are significant, it seems unlikely that removal of either of them would improve the model (you can check with using update followed by an ANOVA), so we will proceed with checking diagnostics.

The first plot (top left) shows the residuals plotted against the fitted values. It shows little evidence of variance increasing for larger values of the response variable, although we appear to have an outlier in observation #16. The next plot (top right) the square root of the standardized residuals plotted against the fitted values. This again suggests that observation #16 is not behaving well, and might make us concerned that variance is indeed increasing for larger values of the response variable. Both the normal QQ plot and the Cooks distance plot reinforce our concerns about observation #16.

One option is to remove observation #16 - but remember that the full procedure must be run again. Below we carry out the same analysis but without observation #16.

We clearly have a better fitting model with no clear outliers (but then this is fairly inevitable after rejecting an outlier!). Still some concerns about a variance mean relationship - but cannot readily be solved with a transformation of the response variable (either log or square root).

Now a slightly different procedure by which to eliminate collinear variables - will select on the basis of both highest variance inflation factor (VIF) and the P- value and will use border line VIF of 10 rather than 5.

This produces a more satisfying (comprehensible) model taking account of eligible population along with number of x rays and period of stay.