"It has long been an axiom of mine that the little things are infinitely the most important"
Multiple linear regression
Worked example 1
Consider nature of study
The response variable is the monthly mean person-hours. We are unsure of the purpose of the study - whether it was to obtain a predictive equation (so as to predict manpower needs) or to serve as an explanatory study of the factors that affect manpower needs. Since most of the variables (apart from population) can only be determined post hoc, we will assume it is an explanatory study. This also seems likely because, if one were trying to predict weekly manpower needs, one would certainly need to take into account climatic factors such as average temperature (hospitals tend to be much busier in winter than in summer). This means that we will not simply try to find the model with the highest r², but will instead focus on the minimum adequate model approach (whilst not being too ready to abandon variables in the interests of parsimony).
Hospitals have been ordered by monthly mean person hours. Even a cursory examination of the data reveals that the first four of the explanatory variables are fairly clearly related to the response variable and to each other. This would certainly be expected for patient load, number of x-rays and number of bed-days - and we might immediately question whether all these variables are required. However, we will start by getting a general idea of the distributions of the variables.
Plot out data to get a visual assessment of the distribution of the response and explanatory variables.
The response variable and the first four of the explanatory variables all have very similar right-skewed distributions, whilst mean stay is more symmetrically distributed - apart from a single unusually high value (hospital #15). If only the distribution of the response variable were skewed, one would certainly consider a transformation - but under the circumstances we will proceed without any transformations (at least for now).
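As a quick numerical check of the visual impression, skewness can be computed directly. This is a generic sketch - the `skewness` helper and the example values are illustrative stand-ins, not the hospital data:

```python
import numpy as np

def skewness(v):
    """Sample skewness: mean cubed deviation divided by the cubed
    standard deviation; positive values indicate a right skew."""
    v = np.asarray(v, dtype=float)
    d = v - v.mean()
    return (d ** 3).mean() / (d ** 2).mean() ** 1.5

# Illustrative right-skewed values (most observations small, one large)
print(skewness([1, 1, 2, 2, 3, 3, 4, 6, 12]))  # clearly positive
```

A value near zero would suggest a symmetric distribution, as for mean stay; a clearly positive value matches the right skew seen in the other variables.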
Examine relationships between variables
Plot out data to get a visual assessment of the relationships between variables.
If you look along the top row you can see plots of the response variable (manh) against each of the explanatory variables. When taken on their own, each appears to be (positively) related to manh. However, the relationships are so similar (especially with load, xray, beds and popn) that we are likely to have problems with multicollinearity. This is most marked (not surprisingly!) with patient load and number of bed-days, which show an almost perfect correlation.
We can also examine this using a cross correlation matrix.
Again very high correlations of the response variable with all of the first four explanatory variables - and similarly high correlations between those variables. We could try some constructed variables - or (more easily) throw out some of the variables which carry the same information. We will do this on the basis of each variable's variance inflation factor with (or without) simultaneous consideration of the P-value for each variable.
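A cross correlation matrix of this kind takes only a few lines of code. The sketch below uses synthetic stand-in data (not the hospital dataset), constructed so that `beds`, `xray` and `manh` are all strongly related to `load`, as described above; the variable names follow the text but the numbers are fabricated for illustration:

```python
import numpy as np

# Synthetic stand-in data for 17 hospitals (NOT the real dataset):
# beds, xray and manh are built from load so the matrix shows the
# strong pairwise correlations discussed in the text.
rng = np.random.default_rng(1)
load = rng.uniform(10, 100, size=17)
beds = 4.0 * load + rng.normal(0, 5, size=17)     # almost collinear with load
xray = 30.0 * load + rng.normal(0, 300, size=17)  # strongly related to load
manh = 50.0 * load + rng.normal(0, 100, size=17)  # the response

# Row/column order: manh, load, beds, xray
R = np.corrcoef(np.column_stack([manh, load, beds, xray]), rowvar=False)
print(np.round(R, 2))
```

The off-diagonal entries involving `load` and `beds` come out close to 1, mirroring the near-perfect correlation between patient load and bed-days noted above.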
Examine 'full' model with variance inflation factors
Only one of these variables (xray) comes out significant at present, although stay is close to significance. We will first select variables entirely on the basis of the variance inflation factor (VIF), removing the variable with the highest VIF provided it is greater than 10, following the procedure used by Climent.
Removal of collinear variables
Below we have simply removed variables on the basis of the highest VIF, recalculating the new VIF values each time. The variable 'load' is the first to go, followed by 'beds' and 'popn', so we end up with a model containing only two explanatory variables 'xray' and 'stay'.
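The recalculate-and-drop procedure can be sketched as follows. The helper names (`vif`, `drop_collinear`) are illustrative, not from the original analysis; the VIF is computed from first principles by regressing each predictor on all the others:

```python
import numpy as np

def vif(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j of X on all the remaining columns (with an intercept)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1.0 - resid.var() / y.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

def drop_collinear(X, names, threshold=10.0):
    """Repeatedly drop the predictor with the highest VIF, recalculating
    the VIFs each time, until none exceeds the threshold."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        worst = int(np.argmax(v))
        if v[worst] <= threshold:
            break
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names
```

Applied to two nearly collinear predictors plus an independent one, this drops one of the collinear pair and leaves the rest untouched - the same behaviour that removes 'load', 'beds' and 'popn' above.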
Since both variables are significant, it seems unlikely that removal of either of them would improve the model (you can check this using update() followed by anova()), so we will proceed to checking the diagnostics.
The first plot (top left) shows the residuals plotted against the fitted values. It shows little evidence of variance increasing for larger values of the response variable, although we appear to have an outlier in observation #16. The next plot (top right) shows the square root of the standardized residuals plotted against the fitted values. This again suggests that observation #16 is not behaving well, and might make us concerned that variance is indeed increasing for larger values of the response variable. Both the normal QQ plot and the Cook's distance plot reinforce our concerns about observation #16.
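The quantities behind these diagnostic plots - fitted values, standardized residuals, leverages and Cook's distances - can all be derived from the hat matrix. A minimal sketch (not R's plotting machinery), assuming an ordinary least squares fit with an intercept and using fabricated illustration data:

```python
import numpy as np

def lm_diagnostics(X, y):
    """Fitted values, standardized residuals and Cook's distances for an
    OLS fit with intercept; X has one column per explanatory variable."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(y)
    Z = np.column_stack([np.ones(n), X])   # design matrix with intercept
    p = Z.shape[1]
    H = Z @ np.linalg.pinv(Z)              # hat matrix
    h = np.diag(H)                         # leverages
    fitted = H @ y
    resid = y - fitted
    s2 = resid @ resid / (n - p)           # residual variance estimate
    std_resid = resid / np.sqrt(s2 * (1 - h))
    cooks = std_resid ** 2 * h / (p * (1 - h))
    return fitted, std_resid, cooks

# Fabricated illustration: a straight line with one injected outlier.
x = np.arange(10.0)
y = 2 * x + 1
y[6] += 15                                 # make observation 6 an outlier
_, sr, cd = lm_diagnostics(x[:, None], y)
print(int(np.argmax(cd)))                  # prints 6: the outlier stands out
```

The injected outlier gets both the largest standardized residual and the largest Cook's distance, which is exactly how observation #16 shows up in the plots above.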
One option is to remove observation #16 - but remember that the full procedure must be run again. Below we carry out the same analysis but without observation #16.
We clearly have a better-fitting model with no clear outliers (but then this is fairly inevitable after rejecting an outlier!). There are still some concerns about a variance-mean relationship, but these cannot readily be resolved with a transformation of the response variable (either log or square root).
Now for a slightly different procedure to eliminate collinear variables: we will select on the basis of both the highest variance inflation factor (VIF) and the P-value, and will use a borderline VIF of 10 rather than 5.
This produces a more satisfying (and more comprehensible) model, taking account of the eligible population along with the number of X-rays and the period of stay.
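The combined criterion can be sketched as follows. Since exact P-values require a t-distribution, this sketch uses |t| as the selection statistic instead, which gives the same ordering as the P-values because every slope has the same residual degrees of freedom; all helper names are illustrative:

```python
import numpy as np

def vif(X):
    """VIF for each column of X, from regressing it on the other columns."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(Z, X[:, j], rcond=None)
        resid = X[:, j] - Z @ beta
        out[j] = X[:, j].var() / resid.var()    # = 1 / (1 - R_j^2)
    return out

def t_stats(X, y):
    """|t| statistic for each slope in an OLS fit with intercept."""
    n = len(y)
    Z = np.column_stack([np.ones(n), X])
    XtX_inv = np.linalg.inv(Z.T @ Z)
    beta = XtX_inv @ Z.T @ y
    resid = y - Z @ beta
    s2 = resid @ resid / (n - Z.shape[1])
    se = np.sqrt(s2 * np.diag(XtX_inv))
    return np.abs(beta[1:] / se[1:])            # drop the intercept term

def drop_collinear_by_p(X, y, names, vif_threshold=10.0):
    """Among predictors whose VIF exceeds the threshold, drop the one with
    the smallest |t| (i.e. the largest P-value), then refit; repeat until
    no VIF exceeds the threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    names = list(names)
    while X.shape[1] > 1:
        v = vif(X)
        high = np.where(v > vif_threshold)[0]
        if high.size == 0:
            break
        t = t_stats(X, y)
        worst = high[int(np.argmin(t[high]))]
        X = np.delete(X, worst, axis=1)
        del names[worst]
    return X, names
```

Unlike selection on VIF alone, this version only ever removes a variable that is both collinear and weakly supported by the data, which is why it can retain a variable such as population and arrive at the more interpretable model described above.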