InfluentialPoints.com
Covariance Analysis (ANCOVA)

Worked example 1

Our first worked example uses data from Larry Douglass (2004) (University of Maryland) on the effect of three different diets on cholesterol levels. Pre-treatment levels were used as the covariate. Douglass does the analysis with SAS - we use R. The comparison is instructive!

Pre and post treatment cholesterol levels

                 Treatment A     Treatment B     Treatment C
                 Pre    Post     Pre    Post     Pre    Post
                 175    135      205    165      210    185
                 175    145      175    195      170    170
                 235    205      230    160      235    210
                 215    175      190    155      185    210
                 195    140      155    150      180    200
                 195    190      185    170      220    240

  1. Draw boxplots and assess normality

    Plot out the data to get a visual assessment of the treatment effects, and to assess how appropriate (parametric) ANCOVA is for this set of data.

    [Figure: boxplots of cholesterol levels by treatment]

  2. Plot relationship with covariate

    Plot out data to get a visual assessment of the relationship between the response variable and the covariate, and assess how appropriate (parametric) ANCOVA is for the set of data.

    [Figure: post treatment cholesterol plotted against pre treatment level for each diet]

    Whilst we might get away with arguing that there is a linear relationship for treatments A and C, there is no evidence of any relationship at all for treatment B. We will continue with the analysis, but bear in mind that any 'adjustments' we make to the means will be highly questionable (if not totally invalid!).

  3. Get table of (unadjusted) means

    Using R
          a1       a2       a3
    165.0000 165.8333 202.5000
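    As a quick cross-check, independent of R, the unadjusted group means can be verified in a few lines of Python (data transcribed from the table above; the dictionary layout is our own):

```python
# Post-treatment cholesterol values, transcribed from the data table above
post = {
    "a1": [135, 145, 205, 175, 140, 190],
    "a2": [165, 195, 160, 155, 150, 170],
    "a3": [185, 170, 210, 210, 200, 240],
}
# Unadjusted mean for each diet
means = {diet: sum(vals) / len(vals) for diet, vals in post.items()}
print(means)  # a1: 165.0, a2: 165.833..., a3: 202.5
```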
     

  4. Carry out analysis of covariance

    Calculating sums of squares manually for an analysis of covariance can only be described as a pain in the backside - and one needs to be very organized! Values for the various sums and sums of squares can be obtained with R. We first calculate the sums of squares assuming a pooled regression of Y on X:

    SSTotal   =   582800 − 3200²/18   =   13911.11
    SSTreatment   =   [990² + 995² + 1215²]/6 − 3200²/18   =   5502.8
    SSPooledReg   =   [632800 − (3530 × 3200)/18]² / [701900 − (3530²/18)]   =   2856.8
    SSError   =   13911.11 − 2856.8 − 5502.8   =   5551.5
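    These pooled sums of squares can be cross-checked from the raw data. The sketch below is in Python rather than the page's R, purely as an arithmetic check; the variable names are our own:

```python
# Raw data for 18 animals: treatments A, B, C in blocks of six (from the table above)
pre  = [175,175,235,215,195,195, 205,175,230,190,155,185, 210,170,235,185,180,220]
post = [135,145,205,175,140,190, 165,195,160,155,150,170, 185,170,210,210,200,240]
n = len(post)  # 18

# Total SS of the response, corrected for the grand mean
ss_total = sum(y*y for y in post) - sum(post)**2 / n
# Treatment SS from the three group totals
treat_sums = [sum(post[i:i+6]) for i in (0, 6, 12)]
ss_treat = sum(t*t / 6 for t in treat_sums) - sum(post)**2 / n
# SS for the regression of post on pre, ignoring treatments
sxy = sum(x*y for x, y in zip(pre, post)) - sum(pre) * sum(post) / n
sxx = sum(x*x for x in pre) - sum(pre)**2 / n
ss_pooled_reg = sxy**2 / sxx

print(round(ss_total, 2), round(ss_treat, 2), round(ss_pooled_reg, 2))
# 13911.11 5502.78 2856.75
```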

    Sums of squares table

                    Σx      Σy      Σx²      Σy²      Σxy      (Σx)(Σy)
    Treatment A     1190    990     238750   167600   199150   1178100
    Treatment B     1140    995     219900   166275   188900   1134300
    Treatment C     1200    1215    243250   248925   244750   1458000
    Overall         3530    3200    701900   582800   632800   11296000

    Next calculate the regression statistics for each treatment separately:

    SSTot1   =   167600 − 990²/6   =   4250
    SSXY1   =   199150 − (1190 × 990)/6   =   2800
    SSX1   =   238750 − (1190²/6)   =   2733.33
    SSReg1   =   2800² / 2733.33   =   2868.296
    SSError1   =   4250 − 2868.296   =   1381.704
    b1   =   2800/2733.33   =   1.0244

    SSTotal2   =   166275 − 995²/6   =   1270.833
    SSXY2   =   188900 − (1140 × 995)/6   =   −150
    SSX2   =   219900 − (1140²/6)   =   3300
    SSReg2   =   (−150)² / 3300   =   6.818
    SSError2   =   1270.833 − 6.818   =   1264.015
    b2   =   −150/3300   =   −0.04545

    SSTotal3   =   248925 − 1215²/6   =   2887.5
    SSXY3   =   244750 − (1200 × 1215)/6   =   1750
    SSX3   =   243250 − (1200²/6)   =   3250
    SSReg3   =   1750² / 3250   =   942.308
    SSError3   =   2887.5 − 942.308   =   1945.192
    b3   =   1750/3250   =   0.5385
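    The within-treatment regressions can also be verified numerically. A minimal Python sketch (our own helper function, purely as an arithmetic check; small rounding differences from the hand figures are expected):

```python
# Pre and post values for each treatment (from the data table above)
groups = {
    "A": ([175,175,235,215,195,195], [135,145,205,175,140,190]),
    "B": ([205,175,230,190,155,185], [165,195,160,155,150,170]),
    "C": ([210,170,235,185,180,220], [185,170,210,210,200,240]),
}

def reg_stats(x, y):
    """Return SSTotal, SSXY, SSX, SSReg, SSError and slope b for one treatment."""
    n = len(x)
    ss_tot = sum(v*v for v in y) - sum(y)**2 / n
    sxy = sum(a*b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    sxx = sum(v*v for v in x) - sum(x)**2 / n
    ss_reg = sxy**2 / sxx
    return ss_tot, sxy, sxx, ss_reg, ss_tot - ss_reg, sxy / sxx

for name, (x, y) in groups.items():
    print(name, [round(v, 3) for v in reg_stats(x, y)])
```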

    Summed regression statistics:

    SSTotal   =   4250 + 1270.833 + 2887.5   =   8408.333
    SSXY   =   2800 − 150 + 1750   =   4400
    SSX   =   2733.33 + 3300 + 3250   =   9283.3
    SSReg   =   2868.296 + 6.818 + 942.308   =   3817.422
    SSError   =   1381.704 + 1264.015 + 1945.192   =   4590.911
    bcommon   =   4400/9283.3   =   0.4740

    SSReg (common slope) = 4400²/9283.3 = 2085.465

    SSHeterogeneity of slopes = 3817.422 − 2085.465 = 1731.957.
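    The common-slope and heterogeneity-of-slopes steps reduce to a few lines; a Python sketch using the within-treatment sums just calculated (small rounding differences from the hand figures are expected):

```python
# Within-treatment cross-products and covariate sums of squares from the steps above
sxy = [2800.0, -150.0, 1750.0]     # SSXY for treatments A, B, C
sxx = [2733.333, 3300.0, 3250.0]   # SSX for treatments A, B, C

ss_reg_separate = sum(s * s / x for s, x in zip(sxy, sxx))  # separate slopes
b_common = sum(sxy) / sum(sxx)                              # pooled within-treatment slope
ss_reg_common = sum(sxy)**2 / sum(sxx)                      # common-slope regression SS
ss_heterogeneity = ss_reg_separate - ss_reg_common          # heterogeneity of slopes

print(round(b_common, 4), round(ss_reg_common, 2), round(ss_heterogeneity, 2))
# 0.474 2085.46 1731.96
```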

    Model with interaction (maximal model, different slopes)

    ANOVA table

    Source of variation        df    SS         MS       F-ratio   P
    Diet                        2    5502.8
    Pre                         1    2085.5
    Heterogeneity of slopes     2    1732.0     866.0    2.2635    0.1465
    Residual                   12    4590.9     382.6
    Total                      17    13911.11

    Interpretation here is somewhat problematical. At the P = 0.05 level the interaction is not significant - in other words we can accept the lines as parallel. However, this is in direct contradiction to our visual assessment of the relationship between the response variable and the covariate, where we found no evidence of any regression relationship at all between 'post' and 'pre' for treatment B - let alone a similar slope to that found for treatments A and C. The apparent contradiction may result from the lack of power of tests for interaction, or it may result from random variation in treatment B. Which is the case is a matter of judgement (or guesswork!), but for the sake of the example we will assume it results from random variation.

    When we come to doing it in R, we run into the same 'challenge' that we did when analyzing unbalanced factorial designs. Because the levels of the covariate are different for each of the three treatments, we no longer have a balanced or orthogonal design. For a non-orthogonal design, it matters which variable is entered first. To be more precise, it affects sums of squares and significance levels of the main effects and all but the highest level interaction effect. It does not, however, affect the SS residual nor the parameter estimates and standard errors. If you want the results identical to the manual version, then you have to enter the (nominal) treatment factor first followed by the (continuous) covariate.

    Using R
    > # With 'diet' first, you get the same SS as when you work it out manually:
    > anova(model1)
    Analysis of Variance Table
    
    Response: post
              Df Sum Sq Mean Sq F value   Pr(>F)
    diet       2 5502.8 2751.39  7.1917 0.008853 **
    pre        1 2085.5 2085.46  5.4511 0.037743 *
    diet:pre   2 1732.0  865.98  2.2635 0.146524
    Residuals 12 4590.9  382.58
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    > #But with 'pre' first, the SS for diet and pre will not be the same
    > anova(model2)
    Analysis of Variance Table
    
    Response: post
              Df Sum Sq Mean Sq F value  Pr(>F)
    pre        1 2856.8 2856.75  7.4672 0.01818 *
    diet       2 4731.5 2365.74  6.1837 0.01426 *
    pre:diet   2 1732.0  865.98  2.2635 0.14652
    Residuals 12 4590.9  382.58
    
    

    The alternative is to use Type II sums of squares which are calculated according to the principle of marginality. Adjustments to the sums of squares are made for the other main effects in the statistical model but not for higher-level interaction effects.

    Using R
    > library(car)  # provides Anova() for type II tests
    > Anova(model1, type=2)
    Anova Table (Type II tests)
    
    Response: post
              Sum Sq Df F value  Pr(>F)
    diet      4731.5  2  6.1837 0.01426 *
    pre       2085.5  1  5.4511 0.03774 *
    diet:pre  1732.0  2  2.2635 0.14652
    Residuals 4590.9 12
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    > Anova(model2,type=2)
    Anova Table (Type II tests)
    
    Response: post
              Sum Sq Df F value  Pr(>F)
    pre       2085.5  1  5.4511 0.03774 *
    diet      4731.5  2  6.1837 0.01426 *
    pre:diet  1732.0  2  2.2635 0.14652
    Residuals 4590.9 12
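    The principle behind the Type II sum of squares for diet can be verified directly: it is the drop in residual SS when diet is added to a model that already contains pre. Below is a sketch using numpy's least squares (our own coding of the design matrices, purely as a check; this is not how car::Anova is implemented):

```python
import numpy as np

# Raw data from the table above; treatments A, B, C in blocks of six
pre  = np.array([175,175,235,215,195,195, 205,175,230,190,155,185,
                 210,170,235,185,180,220], dtype=float)
post = np.array([135,145,205,175,140,190, 165,195,160,155,150,170,
                 185,170,210,210,200,240], dtype=float)
d2 = np.repeat([0.0, 1.0, 0.0], 6)   # dummy variable for diet B
d3 = np.repeat([0.0, 0.0, 1.0], 6)   # dummy variable for diet C
one = np.ones(18)

def rss(X, y):
    """Residual sum of squares from an ordinary least squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

# Type II SS for diet = RSS(pre) - RSS(pre + diet)
ss_diet_type2 = rss(np.column_stack([one, pre]), post) - \
                rss(np.column_stack([one, pre, d2, d3]), post)
print(round(ss_diet_type2, 1))   # 4731.5
```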
    

    Irrespective of these considerations, we still have to decide what to do about the interaction. Given that it is not significant at P=0.05, the conventional approach would be to drop the interaction and assume parallel lines. We do this below and estimate adjusted means.

    Model without interaction (parallel lines)

    Using R
    Analysis of Variance Table
    
    Response: post
              Df Sum Sq Mean Sq F value  Pr(>F)
    diet       2 5502.8 2751.39  6.0921 0.01249 *
    pre        1 2085.5 2085.46  4.6176 0.04962 *
    Residuals 14 6322.9  451.63
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    > summary(model2)
    
    Call:
    lm(formula = post ~ diet + pre, data = ancdoug)
    
    Residuals:
        Min      1Q  Median      3Q     Max
    -24.792 -16.419  -3.594  12.702  36.276
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  70.9964    44.5979   1.592  0.13372
    dieta2        4.7831    12.4066   0.386  0.70564
    dieta3       36.7101    12.2752   2.991  0.00973 **
    pre           0.4740     0.2206   2.149  0.04962 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    Residual standard error: 21.25 on 14 degrees of freedom
    Multiple R-squared: 0.5455,     Adjusted R-squared: 0.4481
    F-statistic: 5.601 on 3 and 14 DF,  p-value: 0.009764
    

    The diet factor remains significant (P = 0.0125) but 'pre' is getting very borderline with a P-value of 0.0496. If we do a type II ANOVA using the R car package (so order has no effect) we get exactly the same P-value for 'pre' but a slightly higher value for the diet factor (P = 0.0200) (we leave you to check this).

    Adjusted means

    The adjusted means (ȳ′) are given by ȳ′ᵢ = ȳᵢ − b (x̄ᵢ − x̄), where x̄ is the overall mean of the covariate:
    ȳ′1 = 165 − 0.474 (198.333 − 196.111) = 163.947
    ȳ′2 = 165.833 − 0.474 (190 − 196.111) = 168.730
    ȳ′3 = 202.5 − 0.474 (200 − 196.111) = 200.657
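    The adjustment simply shifts each unadjusted mean along the common slope to the overall mean covariate value. A Python check using the sums obtained earlier (dictionary names are our own):

```python
# Unadjusted post means and pre means per diet, plus the overall pre mean
post_mean = {"a1": 990/6, "a2": 995/6, "a3": 1215/6}
pre_mean  = {"a1": 1190/6, "a2": 1140/6, "a3": 1200/6}
grand_pre = 3530/18
b = 4400 / (2733.333 + 3300 + 3250)   # common within-treatment slope, ~0.474

# Adjusted mean = unadjusted mean - b * (group pre mean - overall pre mean)
adjusted = {d: post_mean[d] - b * (pre_mean[d] - grand_pre) for d in post_mean}
for d, m in adjusted.items():
    print(d, round(m, 4))
# a1 163.9467
# a2 168.7298
# a3 200.6568
```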

    Using R
    #Calculated from coefficients
     A1;A2;A3
    [1] 163.9467
    [1] 168.7298
    [1] 200.6568
    #Obtained using package effects
    diet
          a1       a2       a3
    163.9467 168.7298 200.6568
    
    

    These then are the adjusted means post ANCOVA. Post hoc comparison of the means reveals that the level with diet a3 is significantly higher than with either of the other two diets.

    So, in conclusion, was it worthwhile (or even valid) to do a covariance analysis? In this case we would say no - simply because the authors failed to demonstrate a clear relationship between pre and post values, which is a precondition for using covariance analysis. In the event it had little effect on the outcome - other than creating a great deal more work!