InfluentialPoints.com

# Covariance Analysis (ANCOVA)

#### Worked example 1

Our first worked example uses data from Larry Douglass (2004) (University of Maryland) on the effect of three different diets on cholesterol levels. Pre-treatment cholesterol levels were used as the covariate. Douglass does the analysis with SAS; we use R.

Comparison is instructive!

Pre- and post-treatment cholesterol levels

| A: Pre | A: Post | B: Pre | B: Post | C: Pre | C: Post |
|-------:|-------:|-------:|-------:|-------:|-------:|
| 175 | 135 | 205 | 165 | 210 | 185 |
| 175 | 145 | 175 | 195 | 170 | 170 |
| 235 | 205 | 230 | 160 | 235 | 210 |
| 215 | 175 | 190 | 155 | 185 | 210 |
| 195 | 140 | 155 | 150 | 180 | 200 |
| 195 | 190 | 185 | 170 | 220 | 240 |

1. #### Draw boxplots and assess normality

Plot out data to get a visual assessment of the treatment and block effects, and assess how appropriate (parametric) ANCOVA is for the set of data.

2. #### Plot relationship with covariate

Plot out data to get a visual assessment of the relationship between the response variable and the covariate, and assess how appropriate (parametric) ANCOVA is for the set of data.

Whilst we might get away with arguing that there is a linear relationship for treatments 1 and 3, there is no evidence of any relationship at all for treatment 2. We will continue with the analysis, but bear in mind that any 'adjustments' we make to means will be highly questionable (if not totally invalid!).

3. #### Get table of (unadjusted) means

Using R:

```
      a1       a2       a3
165.0000 165.8333 202.5000
```
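The unadjusted means are just the raw means of the 'post' values for each treatment, with no account taken of the covariate. As a sketch outside R (using the raw data from the table):

```python
# Unadjusted treatment means of post-treatment cholesterol
# (a1 = treatment A, a2 = B, a3 = C).
from statistics import mean

post = {"a1": [135, 145, 205, 175, 140, 190],
        "a2": [165, 195, 160, 155, 150, 170],
        "a3": [185, 170, 210, 210, 200, 240]}

means = {t: mean(v) for t, v in post.items()}
print(means)
```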
4. #### Carry out analysis of covariance

Calculating sums of squares manually for an analysis of covariance can only be described as a pain in the backside - and one needs to be very organized! Values for the various sums and sums of squares can be obtained with R. We first calculate the sums of squares assuming a pooled regression of Y on X:

SSTotal = 582800 − 3200²/18 = 13911.11
SSTreatment = (990² + 995² + 1215²)/6 − 3200²/18 = 5502.8
SSPooledReg = [632800 − (3530 × 3200)/18]² / [701900 − 3530²/18] = 2856.8
SSError = 13911.11 − 2856.8 − 5502.8 = 5551.5
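As a cross-check on the arithmetic, the same pooled sums of squares can be reproduced in a few lines of Python from the column totals in the sums-of-squares table (a sketch, not part of the original analysis):

```python
# Pooled sums of squares for the ANCOVA, from the summary totals:
# n = 18 observations in all, k = 6 per treatment.
n, k = 18, 6
SY, SX = 3200, 3530            # grand totals of post (Y) and pre (X)
SYY, SXX, SXY = 582800, 701900, 632800  # sums of squares and products

ss_total = SYY - SY**2 / n                                   # ~13911.11
ss_treatment = (990**2 + 995**2 + 1215**2) / k - SY**2 / n   # ~5502.8
ss_pooled_reg = (SXY - SX * SY / n)**2 / (SXX - SX**2 / n)   # ~2856.8
ss_error = ss_total - ss_pooled_reg - ss_treatment
```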

Sums of squares table

| | Σx | Σy | Σx² | Σy² | Σxy | Σx × Σy |
|---|---:|---:|---:|---:|---:|---:|
| Treatment 1 | 1190 | 990 | 238750 | 167600 | 199150 | 1178100 |
| Treatment 2 | 1140 | 995 | 219900 | 166275 | 188900 | 1134300 |
| Treatment 3 | 1200 | 1215 | 243250 | 248925 | 244750 | 1458000 |
| Overall | 3530 | 3200 | 701900 | 582800 | 632800 | 11296000 |

Next calculate the regression statistics for each treatment separately:

SSTot1 = 167600 − 990²/6 = 4250
SSXY1 = 199150 − (1190 × 990)/6 = 2800
SSX1 = 238750 − 1190²/6 = 2733.33
SSReg1 = 2800²/2733.33 = 2868.296
SSError1 = 4250 − 2868.296 = 1381.704
b1 = 2800/2733.33 = 1.0244

SSTotal2 = 166275 − 995²/6 = 1270.833
SSXY2 = 188900 − (1140 × 995)/6 = −150
SSX2 = 219900 − 1140²/6 = 3300
SSReg2 = (−150)²/3300 = 6.818
SSError2 = 1270.833 − 6.818 = 1264.015
b2 = −150/3300 = −0.04545

SSTotal3 = 248925 − 1215²/6 = 2887.5
SSXY3 = 244750 − (1200 × 1215)/6 = 1750
SSX3 = 243250 − 1200²/6 = 3250
SSReg3 = 1750²/3250 = 942.308
SSError3 = 2887.5 − 942.308 = 1945.192
b3 = 1750/3250 = 0.5385

Summed regression statistics:

SSTotal = 4250 + 1270.833 + 2887.5 = 8408.33
SSXY = 2800 − 150 + 1750 = 4400
SSX = 2733.33 + 3300 + 3250 = 9283.3
SSReg = 2868.296 + 6.818 + 942.308 = 3817.42
SSError = 1381.704 + 1264.015 + 1945.192 = 4590.91
bcommon = 4400/9283.3 = 0.474

SSReg (common slope) = 4400²/9283.3 = 2085.465

SSHeterogeneity of slopes = 3817.422 − 2085.465 = 1731.957
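The whole chain of per-treatment calculations above can be condensed into a short loop. This Python sketch (again, only a check on the arithmetic) re-derives the slopes, the common slope, and the heterogeneity-of-slopes SS from the summary totals:

```python
# Per-treatment regression statistics from the sums-of-squares table:
# each tuple holds (Σx, Σy, Σx², Σy², Σxy); n = 6 per treatment.
groups = {
    "1": (1190, 990, 238750, 167600, 199150),
    "2": (1140, 995, 219900, 166275, 188900),
    "3": (1200, 1215, 243250, 248925, 244750),
}

slopes = {}
ss_reg_separate = ssx_pooled = ssxy_pooled = 0.0
for t, (sx, sy, sxx, syy, sxy) in groups.items():
    ss_x = sxx - sx**2 / 6          # corrected SS of the covariate
    ss_xy = sxy - sx * sy / 6       # corrected sum of products
    slopes[t] = ss_xy / ss_x        # within-treatment slope b_t
    ss_reg_separate += ss_xy**2 / ss_x
    ssx_pooled += ss_x
    ssxy_pooled += ss_xy

b_common = ssxy_pooled / ssx_pooled                 # 4400/9283.3
ss_reg_common = ssxy_pooled**2 / ssx_pooled         # common-slope regression SS
ss_heterogeneity = ss_reg_separate - ss_reg_common  # heterogeneity of slopes
```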

#### Model with interaction (maximal model, different slopes)

ANOVA table

| Source of variation | df | SS | MS | F-ratio | P |
|---|---:|---:|---:|---:|---:|
| Diet | 2 | 5502.8 | | | |
| Pre | 1 | 2085.5 | | | |
| Heterogeneity of slopes | 2 | 1732.0 | 866.0 | 2.2635 | 0.1465 |
| Residual | 12 | 4590.9 | 382.6 | | |
| Total | 17 | 13911.11 | | | |

Interpretation here is somewhat problematical. At the P = 0.05 level the interaction is not significant - in other words we can accept the lines as parallel. However, this is in direct contradiction to our visual assessment of the relationship between the response variable and the covariate, where we found no evidence of any regression relationship at all between 'post' and 'pre' for treatment B - let alone a similar slope to that found for treatments A and C. The apparent contradiction may result from the lack of power of tests for interaction, or it may result from random variation in treatment B. Which is the case is a matter of judgement (or guesswork!), but for the sake of the example we will assume it results from random variation.

When we come to doing it in R, we run into the same 'challenge' that we did when analyzing unbalanced factorial designs. Because the levels of the covariate are different for each of the three treatments, we no longer have a balanced or orthogonal design. For a non-orthogonal design, it matters which variable is entered first. To be more precise, it affects sums of squares and significance levels of the main effects and all but the highest level interaction effect. It does not, however, affect the SS residual nor the parameter estimates and standard errors. If you want the results identical to the manual version, then you have to enter the (nominal) treatment factor first followed by the (continuous) covariate.

Using R:

```
> # With 'diet' first, you get the same SS as when you work it out manually:
> anova(model1)
Analysis of Variance Table

Response: post
          Df Sum Sq Mean Sq F value   Pr(>F)   
diet       2 5502.8 2751.39  7.1917 0.008853 **
pre        1 2085.5 2085.46  5.4511 0.037743 * 
diet:pre   2 1732.0  865.98  2.2635 0.146524   
Residuals 12 4590.9  382.58                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> # But with 'pre' first, the SS for diet and pre will not be the same:
> anova(model2)
Analysis of Variance Table

Response: post
          Df Sum Sq Mean Sq F value  Pr(>F)  
pre        1 2856.8 2856.75  7.4672 0.01818 *
diet       2 4731.5 2365.74  6.1837 0.01426 *
pre:diet   2 1732.0  865.98  2.2635 0.14652  
Residuals 12 4590.9  382.58
```
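To see where these sequential (Type I) sums of squares come from without relying on R's bookkeeping, the following Python sketch (using numpy's least-squares routine; not part of the original analysis) builds the design matrix term by term and records the drop in residual SS at each step:

```python
# Sequential ('Type I') sums of squares for a non-orthogonal ANCOVA design.
# Raw data as in the data table: treatments A, B, C, six subjects each.
import numpy as np

pre = np.array([175, 175, 235, 215, 195, 195,    # treatment A
                205, 175, 230, 190, 155, 185,    # treatment B
                210, 170, 235, 185, 180, 220],   # treatment C
               float)
post = np.array([135, 145, 205, 175, 140, 190,
                 165, 195, 160, 155, 150, 170,
                 185, 170, 210, 210, 200, 240], float)
diet = np.repeat([0, 1, 2], 6)

one = np.ones(18)
D = np.column_stack([diet == 1, diet == 2]).astype(float)  # treatment dummies
DX = D * pre[:, None]                                      # diet:pre columns

def rss(*blocks):
    """Residual sum of squares from an OLS fit on the given design columns."""
    X = np.column_stack(blocks)
    beta = np.linalg.lstsq(X, post, rcond=None)[0]
    return float(np.sum((post - X @ beta) ** 2))

# diet entered first (matches the manual calculation and anova(model1)):
ss_diet_first = rss(one) - rss(one, D)
ss_pre_second = rss(one, D) - rss(one, D, pre)
# pre entered first (matches anova(model2)):
ss_pre_first = rss(one) - rss(one, pre)
ss_diet_second = rss(one, pre) - rss(one, pre, D)
# the highest-order term is unaffected by order:
ss_interaction = rss(one, D, pre) - rss(one, D, pre, DX)
```

Because the covariate values differ between treatments, the SS for each main effect depends on what was fitted before it; only the interaction term (and the residual) is order-invariant.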

The alternative is to use Type II sums of squares which are calculated according to the principle of marginality. Adjustments to the sums of squares are made for the other main effects in the statistical model but not for higher-level interaction effects.

Using R:

```
> Anova(model1, type=2)
Anova Table (Type II tests)

Response: post
          Sum Sq Df F value  Pr(>F)  
diet      4731.5  2  6.1837 0.01426 *
pre       2085.5  1  5.4511 0.03774 *
diet:pre  1732.0  2  2.2635 0.14652  
Residuals 4590.9 12                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> Anova(model2, type=2)
Anova Table (Type II tests)

Response: post
          Sum Sq Df F value  Pr(>F)  
pre       2085.5  1  5.4511 0.03774 *
diet      4731.5  2  6.1837 0.01426 *
pre:diet  1732.0  2  2.2635 0.14652  
Residuals 4590.9 12
```
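The Type II values can likewise be reproduced from first principles: adjust each main effect for the other main effect, but not for the interaction. A Python sketch (same raw data and OLS machinery as before; a check, not part of the original analysis):

```python
# Type II sums of squares by hand: SS(diet) = RSS(pre) - RSS(pre + diet),
# SS(pre) = RSS(diet) - RSS(diet + pre); order of entry is then irrelevant.
import numpy as np

pre = np.array([175, 175, 235, 215, 195, 195,
                205, 175, 230, 190, 155, 185,
                210, 170, 235, 185, 180, 220], float)
post = np.array([135, 145, 205, 175, 140, 190,
                 165, 195, 160, 155, 150, 170,
                 185, 170, 210, 210, 200, 240], float)
diet = np.repeat([0, 1, 2], 6)

one = np.ones(18)
D = np.column_stack([diet == 1, diet == 2]).astype(float)

def rss(*blocks):
    X = np.column_stack(blocks)
    beta = np.linalg.lstsq(X, post, rcond=None)[0]
    return float(np.sum((post - X @ beta) ** 2))

ss_diet_II = rss(one, pre) - rss(one, pre, D)  # diet adjusted for pre
ss_pre_II = rss(one, D) - rss(one, D, pre)     # pre adjusted for diet
```

These match the diet and pre lines of the `Anova(..., type=2)` tables whichever way round the model was specified.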

Irrespective of these considerations, we still have to decide what to do about the interaction. Given that it is not significant at P=0.05, the conventional approach would be to drop the interaction and assume parallel lines. We do this below and estimate adjusted means.

#### Model without interaction (parallel lines)

Using R:

```
Analysis of Variance Table

Response: post
          Df Sum Sq Mean Sq F value  Pr(>F)  
diet       2 5502.8 2751.39  6.0921 0.01249 *
pre        1 2085.5 2085.46  4.6176 0.04962 *
Residuals 14 6322.9  451.63                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> summary(model2)

Call:
lm(formula = post ~ diet + pre, data = ancdoug)

Residuals:
    Min      1Q  Median      3Q     Max 
-24.792 -16.419  -3.594  12.702  36.276 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)   
(Intercept)  70.9964    44.5979   1.592  0.13372   
dieta2        4.7831    12.4066   0.386  0.70564   
dieta3       36.7101    12.2752   2.991  0.00973 **
pre           0.4740     0.2206   2.149  0.04962 * 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 21.25 on 14 degrees of freedom
Multiple R-squared: 0.5455,  Adjusted R-squared: 0.4481 
F-statistic: 5.601 on 3 and 14 DF,  p-value: 0.009764
```

The diet factor remains significant (P = 0.0125) but 'pre' is getting very borderline with a P-value of 0.0496. If we do a type II ANOVA using the R car package (so order has no effect) we get exactly the same P-value for 'pre' but a slightly higher value for the diet factor (P = 0.0200) (we leave you to check this).
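For completeness, the adjusted means under the parallel-lines model follow from the standard ANCOVA formula: each raw 'post' mean is moved to what it would be at the overall mean 'pre' level, using the common slope b ≈ 0.474 found earlier. A Python sketch (the results agree with the lm() coefficients, e.g. 70.9964 + 0.474 × 196.11 ≈ 163.95 for treatment A):

```python
# Adjusted means: adj_i = ybar_i - b * (xbar_i - xbar_all),
# with the common within-treatment slope b = SSXY / SSX = 4400 / 9283.33.
ybar = {"a1": 990 / 6, "a2": 995 / 6, "a3": 1215 / 6}    # raw post means
xbar = {"a1": 1190 / 6, "a2": 1140 / 6, "a3": 1200 / 6}  # pre means
b = 4400 / (2733.333 + 3300 + 3250)                      # common slope
xbar_all = 3530 / 18                                     # overall pre mean

adjusted = {t: ybar[t] - b * (xbar[t] - xbar_all) for t in ybar}
```

Note how treatment B's mean is adjusted upwards (its pre-treatment levels were below average) while A's and C's are adjusted downwards - and recall that, given the doubtful regression relationship for treatment B, these adjustments should be treated with caution.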