"It has long been an axiom of mine that the little things are infinitely the most important"
Simple linear regression
Our worked example for simple linear regression uses a very short data set from Rivas et al.
The first point to note is that the X-variable is likely to be measured with substantial error, which will attenuate the slope. However, the regression is purely descriptive, with no interest in the values of the model parameters, so measurement error is not a problem here. A more serious potential problem is whether the X and Y variables are independent. If the HP/LP ratio had been derived by dividing the total somatic cell count into two categories, we would be regressing A + B on A/B, which would clearly have the potential for spurious correlation. However, the HP/LP ratio was determined separately by light fluorescence, so we can probably regard it as independent.
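The attenuation effect mentioned above can be illustrated with a small simulation. This is only a sketch on simulated values (not the study data): the true slope of 2, the error standard deviations and the helper name `ols_slope` are all assumptions chosen for illustration. When the measurement error in X has the same variance as X itself, the fitted slope is expected to shrink to roughly half its true value.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx

rng = random.Random(1)  # fixed seed so the sketch is reproducible
n = 5000

# Simulated (hypothetical) values: true X, and Y = 2 * X + noise
true_x = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * xi + rng.gauss(0, 0.5) for xi in true_x]

# X as actually observed: true value plus measurement error (sd = 1)
observed_x = [xi + rng.gauss(0, 1) for xi in true_x]

slope_true = ols_slope(true_x, y)          # close to the true slope of 2
slope_observed = ols_slope(observed_x, y)  # attenuated towards zero

print(f"slope using true X:     {slope_true:.3f}")
print(f"slope using observed X: {slope_observed:.3f}")
```

The expected attenuation factor is var(X) / (var(X) + var(error)), which is 0.5 under the assumptions used here.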
An initial assessment of linearity with and without a log transformation of the ratio is shown below:
The relationship is more or less linear with the X-variable transformed. A log transformation of both axes (not shown here) did not improve linearity, and the very limited number of data points (6) precludes further investigation. We note that the distribution of a log ratio is unlikely to be normal, but we defer further examination of this until the diagnostics stage.
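As a rough numerical companion to the plots, one can compare the correlation of Y with the raw and the log-transformed X. The sketch below uses hypothetical values, not the data of Rivas et al.; the names `ratio`, `cell_count` and `pearson_r` are ours. A correlation coefficient is a crude guide to linearity and does not replace looking at the scatterplot.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical illustration values, NOT the data of Rivas et al.
ratio = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
cell_count = [600.0, 1100.0, 800.0, 1500.0, 1300.0, 1700.0]

r_raw = pearson_r(ratio, cell_count)
r_log = pearson_r([math.log10(v) for v in ratio], cell_count)

print(f"r with raw ratio: {r_raw:.3f}")
print(f"r with log ratio: {r_log:.3f}")
```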
We test the significance of the regression using analysis of variance:
The coefficient of determination is given by:
and the adjusted coefficient of determination by:
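For reference, the standard definitions for a simple linear regression with n observations can be written as below. The symbols (SS for sums of squares, n for the number of observations) are our notation and are not necessarily those of the original equations.

```latex
R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}}
    = 1 - \frac{SS_{\text{error}}}{SS_{\text{total}}}

R^2_{\text{adj}} = 1 - \frac{SS_{\text{error}} / (n - 2)}{SS_{\text{total}} / (n - 1)}
                 = 1 - (1 - R^2)\,\frac{n - 1}{n - 2}
```

The adjusted value is always smaller than the unadjusted one, and the difference is largest for very small samples such as this one.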
We conclude that there is a significant regression relationship between somatic cell count and the chosen measure of phagocytic ability, although the significance level is rather marginal (P = 0.0365), and the regression explains only 63.1% of the variability in somatic cell count.
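The analysis of variance for a simple linear regression can be sketched as follows. The function name `regression_anova` and the data values are hypothetical (they are not the data of Rivas et al.), so the printed results will not match the figures quoted above. The P-value is not computed here; it would be the upper tail of the F distribution with 1 and n - 2 degrees of freedom, obtainable from a statistics library.

```python
import math

def regression_anova(x, y):
    """Partition the total sum of squares for a regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx

    ss_total = sum((yi - mean_y) ** 2 for yi in y)  # n - 1 degrees of freedom
    ss_regression = slope * sxy                     # 1 degree of freedom
    ss_error = ss_total - ss_regression             # n - 2 degrees of freedom
    df_error = n - 2

    r_squared = ss_regression / ss_total
    return {
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
        "F": ss_regression / (ss_error / df_error),
        "df": (1, df_error),
        "r_squared": r_squared,
        "adj_r_squared": 1 - (1 - r_squared) * (n - 1) / df_error,
    }

# Hypothetical illustration values, NOT the data of Rivas et al.
ratio = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
log_ratio = [math.log10(v) for v in ratio]
cell_count = [600.0, 1100.0, 800.0, 1500.0, 1300.0, 1700.0]

result = regression_anova(log_ratio, cell_count)
for name, value in result.items():
    print(name, value)
```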
We first plot somatic cell count against log phagocytic ability and add the fitted line. The fit is rather poor, so it is not easy to distinguish outliers. However, the point marked in black has the largest residual (around 430).
The second plot is of residuals versus the explanatory variable (log ratio). Although there is considerable scatter around the line, there is no obvious trend in the degree of scatter with the X-variable.
Next we examine plots of residuals against fitted Y-values. This is essential if one has several X-variables, although in this example it adds little to what we already know. The first plot shows the residuals with their sign; the second shows absolute residuals. The latter can make it easier to see whether the size of the residuals clearly increases with the fitted Y-value.
Again the scatter of the residuals in the vertical direction is more or less symmetrical around zero, with no evidence of a trend in relation to the predicted Y.
None of the leverages exceeds our (rather arbitrary) line, although the lowest X-value (marked in red) does have a higher leverage than the other points. Not surprisingly, when we combine the information on residuals and leverages, we find that the point marked in red has the highest Cook's distance, and is hence the most influential point in the regression. It is not unusual to find that the most extreme reading on the X-axis is highly influential. This may reflect problems with the model: in other words, the relationship may be linear only over the central part of the range.
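Leverages and Cook's distances for a simple linear regression can be computed as in the sketch below. The data are hypothetical (not those of Rivas et al.) and the function name `influence_measures` is ours. The cut-off of 2p/n used here is a common rule of thumb and is an assumption on our part, since the threshold behind the "arbitrary line" is not stated.

```python
import math

def influence_measures(x, y):
    """Return (leverages, Cook's distances) for a regression of y on x."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual mean square

    # Leverage depends only on how far each x lies from the mean of x
    leverages = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

    # Cook's distance combines residual size with leverage
    cooks = [e ** 2 * h / (p * s2 * (1 - h) ** 2)
             for e, h in zip(residuals, leverages)]
    return leverages, cooks

# Hypothetical illustration values, NOT the data of Rivas et al.
ratio = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
log_ratio = [math.log10(v) for v in ratio]
cell_count = [600.0, 1100.0, 800.0, 1500.0, 1300.0, 1700.0]

leverages, cooks = influence_measures(log_ratio, cell_count)
threshold = 2 * 2 / len(log_ratio)  # assumed rule of thumb: 2p/n

for xi, h, d in zip(log_ratio, leverages, cooks):
    flag = "high leverage" if h > threshold else ""
    print(f"x={xi:6.2f}  leverage={h:.3f}  Cook's D={d:.3f}  {flag}")
```

The leverages always sum to the number of fitted parameters (2 here), which provides a quick check on the calculation.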
Lastly we assess the distribution of the residuals. One can sometimes use a histogram for this, but histograms are insensitive to departures from normality and potentially biased - as well as being useless for small samples! The first figure below shows a quantile-quantile plot of the (raw) residuals; the line joins the first and third quartiles and can be used as a guide to linearity.
The distribution obviously deviates markedly from normality. The second figure shows a quantile-quantile plot of the (externally) studentized residuals which should lie along the 45 degree line. The deviation from normality is less marked for the studentized residuals, although it would still be hard to describe the residuals as approximating to normality. This reflects the problems of using a ratio, since ratios are very rarely normally distributed even after a transformation.
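The coordinates for a quantile-quantile plot of externally studentized residuals can be obtained as sketched below. The data are hypothetical (not those of Rivas et al.), and the function name `studentized_qq` and the plotting positions (i - 0.5)/n are our assumptions; other plotting positions are also in use. Strictly, externally studentized residuals follow a t distribution with n - 3 degrees of freedom in simple regression, so comparison with normal quantiles is only approximate for very small samples.

```python
import math
from statistics import NormalDist

def studentized_qq(x, y):
    """Return (normal quantile, ordered studentized residual) pairs."""
    n = len(x)
    p = 2  # number of fitted parameters: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)
    leverages = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]

    # Internally studentized: residual scaled by its own standard error
    internal = [e / math.sqrt(s2 * (1 - h))
                for e, h in zip(residuals, leverages)]

    # Externally studentized: error variance estimated without point i
    external = [r * math.sqrt((n - p - 1) / (n - p - r ** 2))
                for r in internal]

    normal = NormalDist()
    theoretical = [normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, sorted(external)))

# Hypothetical illustration values, NOT the data of Rivas et al.
ratio = [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
log_ratio = [math.log10(v) for v in ratio]
cell_count = [600.0, 1100.0, 800.0, 1500.0, 1300.0, 1700.0]

for q, t in studentized_qq(log_ratio, cell_count):
    print(f"normal quantile={q:6.3f}  studentized residual={t:6.3f}")
```

Points falling close to the line y = x suggest approximate normality; systematic curvature or isolated extreme points suggest otherwise.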