Our worked example for simple linear regression uses a very short data set from Rivas et al. (2006).
They were interested in the relationship between the somatic cell counts of cows and the phagocytic ability of macrophages and polymorphonuclear cells in the milk. The cells contained two major subpopulations, characterized by low phagocytic (LP) or high phagocytic (HP) ability. A measure of phagocytic ability was given by the ratio of HP to LP cells. The authors regressed somatic cell count on the natural log of the HP/LP ratio and obtained a significant relationship (P = 0.036), with the ratio explaining 63% of the somatic cell count variability.
The first point to note is that the X-variable is likely to be measured with substantial error, which will attenuate the slope. However, the regression is purely descriptive, with no interest in the values of the model parameters, so measurement error is not a problem. A more serious potential problem is whether the X and Y variables are independent. If the HP/LP ratio were derived by dividing the total somatic cell count into two categories, we would be regressing A + B on A/B, which would clearly have the potential for spurious correlation. However, the HP/LP ratio was determined separately by light fluorescence, so we can probably regard it as independent.
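The attenuation mentioned above can be illustrated with a small simulation. This is only a sketch on simulated data, not the Rivas et al. data: the true slope, the error variances, the seed and the helper `ols_slope` are all assumptions chosen for illustration. With equal variances for the true X and its measurement error, the fitted slope is expected to shrink to roughly half the true value.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    return s_xy / s_xx

rng = random.Random(1)          # fixed seed so the sketch is repeatable
n = 20000                       # hypothetical sample size
true_slope = 2.0                # hypothetical true slope

true_x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [true_slope * xi + rng.gauss(0.0, 0.5) for xi in true_x]

# Observed X = true X + measurement error with the same variance as true X,
# so the expected attenuation factor is 1 / (1 + 1) = 0.5.
observed_x = [xi + rng.gauss(0.0, 1.0) for xi in true_x]

slope_true_x = ols_slope(true_x, y)
slope_observed_x = ols_slope(observed_x, y)

print(f"slope using true X:     {slope_true_x:.3f}")
print(f"slope using observed X: {slope_observed_x:.3f}")
```

The slope fitted to the error-prone X is biased towards zero, which matters only if the parameter values themselves are of interest.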
An initial assessment of linearity with and without a log transformation of the ratio is shown below:
The relationship is more or less linear with the X-variable transformed. A log transformation of both axes (not shown here) did not improve linearity, and the very limited number of data points (6) precludes further investigation. We note that the distribution of a log ratio is unlikely to be normal - but we will await further examination of this until we get to the diagnostics stage.
The slope (b) of the line is estimated thus:
b = [ΣXY − (ΣX)(ΣY)/n] / [ΣX² − (ΣX)²/n] = [−1990.836 − (−483.7704)] / (2.5906 − 0.1958) = −629.3
The intercept (a) is given by:
a = 446.3333 − (−629.3 × −0.1806) = 332.68
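The arithmetic for the slope and intercept can be checked with a few lines of Python. This is a minimal sketch using only the sums and means quoted in the text. The variable names are our own, and the negative signs on the two cross-product terms are an assumption inferred from the stated slope of −629.3 and the negative mean of X.

```python
# Quantities quoted in the worked example (signs of the cross-product
# terms are assumed; see the lead-in above).
sum_xy = -1990.836               # sum of X*Y
sum_x_sum_y_over_n = -483.7704   # (sum of X)(sum of Y) / n
sum_x_sq = 2.5906                # sum of X squared
sum_x_squared_over_n = 0.1958    # (sum of X) squared / n
mean_x = -0.1806
mean_y = 446.3333

# Corrected sum of products and corrected sum of squares of X
s_xy = sum_xy - sum_x_sum_y_over_n
s_xx = sum_x_sq - sum_x_squared_over_n

b = s_xy / s_xx                  # slope
a = mean_y - b * mean_x          # intercept

print(f"b = {b:.1f}")
print(f"a = {a:.2f}")
```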
We test the significance of the regression using analysis of variance:
[Table: analysis of variance by source of variation]
The coefficient of determination is given by:
r² = SS(regression) / SS(total) = 0.705
and the adjusted coefficient of determination by:
adjusted r² = 1 − (1 − 0.705) × (6 − 1)/(6 − 2) = 0.631
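These figures can be tied together with a short calculation. The sketch below starts from the quoted r² of 0.705 and n = 6, and assumes two fitted parameters (intercept and slope). It recovers the adjusted r² and the F-ratio, then obtains the P-value from the closed-form Student's t distribution function for 4 degrees of freedom. That formula applies to this special case only, and the function name `t_cdf_4df` is our own.

```python
import math

r_squared = 0.705
n = 6
n_params = 2                     # intercept and slope (assumed)
df_regression = n_params - 1     # 1
df_residual = n - n_params       # 4

adjusted_r_squared = 1 - (1 - r_squared) * (n - 1) / df_residual

# F-ratio expressed in terms of r squared
f_ratio = (r_squared / df_regression) / ((1 - r_squared) / df_residual)

def t_cdf_4df(t):
    """Cumulative distribution function of Student's t with 4 df only."""
    u = 1 + t * t / 4
    return 0.5 + 0.375 * (t / math.sqrt(u)) * (1 - (t * t / u) / 12)

# With one regression df, F equals t squared, so the P-value for F
# is the two-sided P-value for t = sqrt(F) on the residual df.
t_value = math.sqrt(f_ratio)
p_value = 2 * (1 - t_cdf_4df(t_value))

print(f"adjusted r squared = {adjusted_r_squared:.3f}")
print(f"F(1, 4) = {f_ratio:.2f}")
print(f"P = {p_value:.4f}")
```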
We conclude that there is a significant regression relationship between somatic cell count and the chosen measure of phagocytic ability, although the significance level is rather marginal (P = 0.0365), and the regression explains only 63.1% of the variability in somatic cell count (adjusted r²).
We first plot somatic cell count against log phagocytic ability and insert the fitted line. The fit is rather poor, so it is not easy to distinguish outliers; however, the point marked in black has the largest residual (around 430).
The second plot is of residuals versus the explanatory variable (log ratio). Although there is considerable scatter around the line, there is no obvious trend of the degree of scatter being related to the X variable.
Next we examine plots of residuals against fitted Y-values - this is essential if one has several X-variables, although in this example it adds little to what we already know. In the first plot we have the residuals with their sign; in the second absolute residuals are given. The latter can make it easier to see if there is a clear increase in residual in relation to fitted y-value.
Again the scatter of the residuals in the vertical direction is more or less symmetrical around zero, with no evidence of a trend in relation to the predicted Y.
Next we examine leverages, remembering that outliers with respect to the X-variable will have a high leverage. We have inserted a line to indicate the value of 2p/n (see above).
None of the leverages exceeds our (rather arbitrary) line, although the lowest X-value (marked in red) does have a higher leverage than other points. Not surprisingly, when we combine the information on residuals and leverages, we find that the point marked in red has the highest Cook's distance, and is hence the most influential point in the regression. It is not unusual to find the most extreme reading on the X-axis is highly influential, and may reflect problems with the model - in other words the relationship is only linear over the central part of the range.
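For a single X-variable, leverages and Cook's distances can be computed directly from the residuals. The sketch below uses six hypothetical (x, y) pairs, since the original data are not reproduced here, and assumes p = 2 parameters (intercept and slope) in the 2p/n guideline. The function name `leverage_and_cooks` is our own.

```python
def leverage_and_cooks(x, y):
    """Leverages and Cook's distances for a simple linear regression."""
    n = len(x)
    n_params = 2                               # intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    a = mean_y - b * mean_x

    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    # Residual mean square
    s2 = sum(e * e for e in residuals) / (n - n_params)

    # Leverage: h_i = 1/n + (x_i - mean_x)^2 / S_xx
    leverages = [1 / n + (xi - mean_x) ** 2 / s_xx for xi in x]
    # Cook's distance: D_i = e_i^2 * h_i / (p * s^2 * (1 - h_i)^2)
    cooks = [e * e * h / (n_params * s2 * (1 - h) ** 2)
             for e, h in zip(residuals, leverages)]
    return leverages, cooks

# Hypothetical data for illustration only (not the Rivas et al. values)
x = [-0.9, -0.5, -0.2, 0.0, 0.2, 0.4]
y = [1100.0, 420.0, 650.0, 300.0, 180.0, 90.0]

leverages, cooks = leverage_and_cooks(x, y)
threshold = 2 * 2 / len(x)                     # 2p/n with p = 2
flagged = [i for i, h in enumerate(leverages) if h > threshold]

for i, (h, d) in enumerate(zip(leverages, cooks)):
    print(f"point {i}: leverage = {h:.3f}, Cook's distance = {d:.3f}")
print(f"2p/n = {threshold:.3f}, points above it: {flagged}")
```

Because Cook's distance combines the squared residual with the leverage, a point at the extreme of the X range can dominate even when its residual is not the largest.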
Lastly we assess the distribution of the residuals. One can sometimes use a histogram for this, but histograms are insensitive to departures from normality and potentially biased - as well as being useless for small samples! The first figure below shows a quantile-quantile plot of the (raw) residuals; the line joins the first and third quartiles and can be used as a guide to linearity.
The distribution obviously deviates markedly from normality. The second figure shows a quantile-quantile plot of the (externally) studentized residuals which should lie along the 45 degree line. The deviation from normality is less marked for the studentized residuals, although it would still be hard to describe the residuals as approximating to normality. This reflects the problems of using a ratio, since ratios are very rarely normally distributed even after a transformation.
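The coordinates for a quantile-quantile plot of externally studentized residuals can be sketched as follows. The data are again hypothetical and the function names are our own. Standard normal quantiles at plotting positions (i − 0.5)/n are used as the reference, which is one common convention and is assumed here for simplicity.

```python
import math
from statistics import NormalDist

def fit_line(x, y):
    """Return intercept, slope, mean of x and corrected sum of squares of x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = s_xy / s_xx
    return mean_y - b * mean_x, b, mean_x, s_xx

def externally_studentized(x, y):
    """Externally studentized residuals for a simple linear regression."""
    n = len(x)
    n_params = 2                               # intercept and slope
    a, b, mean_x, s_xx = fit_line(x, y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    leverages = [1 / n + (xi - mean_x) ** 2 / s_xx for xi in x]
    sse = sum(e * e for e in residuals)
    result = []
    for e, h in zip(residuals, leverages):
        # Residual variance with this point deleted, without refitting
        s2_deleted = (sse - e * e / (1 - h)) / (n - n_params - 1)
        result.append(e / math.sqrt(s2_deleted * (1 - h)))
    return result

# Hypothetical data for illustration only (not the Rivas et al. values)
x = [-0.9, -0.5, -0.2, 0.0, 0.2, 0.4]
y = [1100.0, 420.0, 650.0, 300.0, 180.0, 90.0]

ordered_residuals = sorted(externally_studentized(x, y))
n = len(x)
normal_quantiles = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]

# Each pair is (theoretical quantile, observed studentized residual)
qq_pairs = list(zip(normal_quantiles, ordered_residuals))
for q, r in qq_pairs:
    print(f"{q:7.3f}  {r:7.3f}")
```

Points lying close to the line y = x would suggest approximate normality, although with only six observations any such judgement is very tentative.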