Analysis of covariance (ANCOVA) combines the techniques of analysis of variance and regression by incorporating both nominal variables (factors) and continuous measurement variables (covariates) into a single model. We will only look in detail at its use in the completely randomized design, but it can be used with all the designs that we have covered including the randomized block, Latin square, factorial and repeated measures designs.
The primary use of ANCOVA is to increase precision in randomized experiments. A covariate X is measured on each experimental unit before treatment is applied. That covariate may be the baseline level of the response variable, or it may be some other characteristic of the experimental unit that is expected to affect the outcome. The treatment means are then adjusted to remove these initial differences, reducing the experimental error and permitting a more precise comparison among treatments. This adjustment depends on the parallel slopes assumption - namely that the slope of the relationship between the covariate and the response variable is the same at each treatment level.
The other main use of ANCOVA is to model relationships, especially where one wants to compare regression relationships at different levels of some treatment variable - for example growth rates (size against age). In this use the aim is to assess which model is most appropriate to describe the data - whether separate slopes for each treatment level, a common slope but different intercepts, or a common intercept but different slopes. This use clearly does not depend on the parallel slopes assumption - that is simply one of the model simplifications that can be made if justified - and because of this some authorities do not include such an analysis under the name 'analysis of covariance'.
A third somewhat controversial use of ANCOVA is to adjust for sources of bias in observational studies.
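The adjustment of treatment means described in the first use above can be sketched in Python. The data below are invented purely for illustration: each treatment mean is corrected by a common within-group slope for the distance of its group's mean covariate value from the grand covariate mean.

```python
import numpy as np

# invented example data: 3 treatment groups, 4 replicates each
X = np.array([[1.0, 2.0, 3.0, 4.0],   # covariate (e.g. baseline measurement)
              [2.0, 3.0, 4.0, 5.0],
              [1.5, 2.5, 3.5, 4.5]])
Y = np.array([[3.1, 4.0, 5.2, 6.1],   # response
              [5.0, 6.2, 7.1, 8.0],
              [4.0, 5.1, 6.0, 7.2]])

# common (pooled within-group) slope:
# sum of within-group SSXY divided by sum of within-group SSX
ss_xy = sum(((x - x.mean()) * (y - y.mean())).sum() for x, y in zip(X, Y))
ss_x = sum(((x - x.mean()) ** 2).sum() for x in X)
b_common = ss_xy / ss_x

# adjusted means: remove the part of each group mean explained by that
# group's covariate mean differing from the grand covariate mean
adjusted = Y.mean(axis=1) - b_common * (X.mean(axis=1) - X.mean())
print(b_common, adjusted)
```

A group whose units started out with above-average covariate values has its mean adjusted downwards, and vice versa, so the comparison among treatments is made at a common covariate value.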
One way ANCOVA
Maximal model (separate regression lines)
Parallel lines model
We will take a completely randomized experimental design with 'a' group (= treatment) levels, each
replicated n times. A response variable Y and a covariate X are measured on each experimental unit.
Group (treatment) totals are denoted as T1 to Ta, and the grand total as G.
Step 1. Overall Pooled Regression
The total, treatment, regression and residual sums of squares using a pooled overall regression are
calculated as follows:
SSTotal = ΣYij² − G²/N

- SSTotal is the total sums of squares, Yij are the individual observations, G is the grand total or ΣY, and N is the total number of observations.

SSTreatment = ΣTi²/n − G²/N

- SSTreatment is the treatment sums of squares, Ti are the treatment totals, n is the number of replicates per treatment, G is the grand total or ΣY, and N is the total number of observations.
SSRegression = [ΣXY − (ΣX)(ΣY)/N]² / [ΣX² − (ΣX)²/N]

- SSRegression is the sums of squares explained by the pooled regression, and N is the total number of observations.

SSError = SSTotal − SSRegression

- SSError is the error or residual sums of squares.
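The Step 1 formulas can be checked with a minimal Python sketch; the data are invented, with a = 3 treatment levels and n = 4 replicates per level:

```python
import numpy as np

# invented data: rows are treatment levels (a = 3), columns replicates (n = 4)
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 3.0, 4.0, 5.0],
              [1.5, 2.5, 3.5, 4.5]])
Y = np.array([[3.1, 4.0, 5.2, 6.1],
              [5.0, 6.2, 7.1, 8.0],
              [4.0, 5.1, 6.0, 7.2]])
a, n = Y.shape
x, y = X.ravel(), Y.ravel()   # pooled observations
N = y.size
G = y.sum()                   # grand total

ss_total = (y ** 2).sum() - G ** 2 / N
ss_treatment = (Y.sum(axis=1) ** 2).sum() / n - G ** 2 / N
ss_regression = ((x @ y - x.sum() * y.sum() / N) ** 2
                 / ((x ** 2).sum() - x.sum() ** 2 / N))
ss_error = ss_total - ss_regression
```

Each line is a direct transcription of the corresponding formula, using the pooled (overall) regression of Y on X.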
Step 2. Individual regressions
Regression statistics for each treatment level are calculated as follows.
SSTotal (1) = ΣY1² − G1²/n

- SSTotal (1) is the total sums of squares for the first treatment level, Y1 are the individual observations, G1 is the treatment total (or ΣY1) and n is the number of observations for the first treatment level.

SSXY (1) = ΣX1Y1 − (ΣX1)(ΣY1)/n

- SSXY (1) is the covariance sums of squares, and n is the number of observations for the first treatment level.

SSX (1) = ΣX1² − (ΣX1)²/n

- SSX (1) is the sums of squares for X1, and n is the number of observations for the first treatment level.

SSRegression (1) = (SSXY (1))² / SSX (1)

- SSRegression (1) is the regression sums of squares for the first treatment level.

SSError (1) = SSTotal (1) − SSRegression (1)

- SSError (1) is the error or residual sums of squares for the first treatment level.

b1 = SSXY (1) / SSX (1)

- b1 is the slope of the regression line for the first treatment level.
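The per-level calculations of Step 2 are the same for every treatment level, so they can be written as a loop; a sketch with invented data:

```python
import numpy as np

# invented data: one row per treatment level, n = 4 replicates each
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 3.0, 4.0, 5.0],
              [1.5, 2.5, 3.5, 4.5]])
Y = np.array([[3.1, 4.0, 5.2, 6.1],
              [5.0, 6.2, 7.1, 8.0],
              [4.0, 5.1, 6.0, 7.2]])
n = Y.shape[1]

stats = []
for xi, yi in zip(X, Y):
    ss_total = (yi ** 2).sum() - yi.sum() ** 2 / n      # SSTotal (i)
    ss_xy = xi @ yi - xi.sum() * yi.sum() / n           # SSXY (i)
    ss_x = (xi ** 2).sum() - xi.sum() ** 2 / n          # SSX (i)
    ss_reg = ss_xy ** 2 / ss_x                          # SSRegression (i)
    stats.append({"ss_total": ss_total, "ss_xy": ss_xy, "ss_x": ss_x,
                  "ss_reg": ss_reg, "ss_error": ss_total - ss_reg,
                  "slope": ss_xy / ss_x})                # b_i
```

Each dictionary in `stats` holds the regression statistics for one treatment level, ready to be summed in Step 3.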
Step 3. Summed regression statistics
The regression statistics for each treatment level are summed as follows.
SSTotal = SSTotal (1) + SSTotal (2) + ... + SSTotal (a)
SSXY = SSXY (1) + SSXY (2) + ... + SSXY (a)
SSX = SSX (1) + SSX (2) + ... + SSX (a)
SSReg (summed) = SSReg (1) + SSReg (2) + ... + SSReg (a)
SSError = SSError (1) + SSError (2) + ... + SSError (a)
bcommon = SSXY / SSX
SSReg (common slope) = (SSXY)² / SSX
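In code, Step 3 amounts to summing the within-level quantities before forming the common slope; a self-contained sketch with invented data for a = 3 levels:

```python
import numpy as np

# invented data: one row per treatment level (a = 3), n = 4 replicates
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 3.0, 4.0, 5.0],
              [1.5, 2.5, 3.5, 4.5]])
Y = np.array([[3.1, 4.0, 5.2, 6.1],
              [5.0, 6.2, 7.1, 8.0],
              [4.0, 5.1, 6.0, 7.2]])
n = Y.shape[1]

# within-level sums of squares and cross-products, summed over levels
ss_xy = sum(xi @ yi - xi.sum() * yi.sum() / n for xi, yi in zip(X, Y))
ss_x = sum((xi ** 2).sum() - xi.sum() ** 2 / n for xi in X)
ss_reg_summed = sum((xi @ yi - xi.sum() * yi.sum() / n) ** 2
                    / ((xi ** 2).sum() - xi.sum() ** 2 / n)
                    for xi, yi in zip(X, Y))

b_common = ss_xy / ss_x            # bcommon
ss_reg_common = ss_xy ** 2 / ss_x  # SSReg (common slope)
```

Note that SSReg (summed) sums each level's own regression sums of squares, whereas SSReg (common slope) pools the cross-products first and then forms a single regression sum of squares.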
Step 4. Assess heterogeneity of slopes
If the slope of the relationship between X and Y were the same at each treatment level, SSReg (common slope) would be the same as SSReg (summed). The difference between the two is a measure of the heterogeneity of slopes.
Hence SSHeterogeneity of slopes = SSReg (summed) − SSReg (common slope)
Mean squares are obtained by dividing sums of squares by their respective degrees of freedom. For an individual regression the significance test is carried out by dividing the mean square regression by the mean square error; under a null hypothesis of a zero slope this F-ratio will be distributed as F with 1 and n − 2 degrees of freedom.

|Source of variation|df|SS|MS|
|Regression|1|SSRegression|SSRegression|
|Error|n − 2|SSError|SSError / (n − 2)|
|Total|n − 1|SSTotal| |

The proportion of variation explained by each line is r² = 1 − SSError / SSTotal, and b is its estimated slope.
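The heterogeneity-of-slopes test of Step 4 can be sketched as follows (invented data): the heterogeneity mean square, on a − 1 degrees of freedom, is compared with the summed error mean square, on a(n − 2) degrees of freedom.

```python
import numpy as np
from scipy.stats import f

# invented data: a = 3 treatment levels, n = 4 replicates each
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 3.0, 4.0, 5.0],
              [1.5, 2.5, 3.5, 4.5]])
Y = np.array([[3.1, 4.0, 5.2, 6.1],
              [5.0, 6.2, 7.1, 8.0],
              [4.0, 5.1, 6.0, 7.2]])
a, n = Y.shape

# within-level sums of squares and cross-products
ss_xy = np.array([xi @ yi - xi.sum() * yi.sum() / n for xi, yi in zip(X, Y)])
ss_x = np.array([(xi ** 2).sum() - xi.sum() ** 2 / n for xi in X])
ss_tot = np.array([(yi ** 2).sum() - yi.sum() ** 2 / n for yi in Y])
ss_reg = ss_xy ** 2 / ss_x

ss_het = ss_reg.sum() - ss_xy.sum() ** 2 / ss_x.sum()  # SSReg (summed) − SSReg (common slope)
ss_err = (ss_tot - ss_reg).sum()                       # SSError (summed)

F = (ss_het / (a - 1)) / (ss_err / (a * (n - 2)))
p = f.sf(F, a - 1, a * (n - 2))                        # upper-tail p-value
```

A non-significant result (large p) is consistent with parallel slopes and would justify simplifying to the common-slope model.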
Dealing with non-parallel lines
When the lines are not parallel there is no single treatment effect: the effect of treatment depends on the value of the covariate, so it must be estimated at one or more chosen covariate values. For example, if we have a treatment (A) with two levels and a continuous covariate (B), the treatment effect at a covariate value θ is estimated as:

Effect size (b1|B=θ) = b1 + b3θ

where b1 is the coefficient for the treatment (A), b3 is the coefficient for the interaction (A × B), and θ is the chosen value of the covariate (B). Then estimate the standard error of the effect size:

Standard error of effect size (sb1|B=θ) = √(s²b1 + 2θsb1b3 + θ²s²b3)

where s²b1 and s²b3 are the variances of b1 and b3, and sb1b3 is the covariance between the two estimates.
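Given the fitted coefficients and their variance-covariance matrix, both formulas are simple arithmetic; the numbers below are hypothetical, chosen only to illustrate the calculation:

```python
import numpy as np

# hypothetical estimates from a model with a two-level treatment A,
# a covariate B, and their interaction A × B
b1 = 2.0            # coefficient for A
b3 = 0.5            # coefficient for the A × B interaction
var_b1 = 0.04       # variance of b1
var_b3 = 0.01       # variance of b3
cov_b1_b3 = -0.015  # covariance between b1 and b3
theta = 3.0         # covariate value at which the effect is evaluated

effect = b1 + b3 * theta
se = np.sqrt(var_b1 + 2 * theta * cov_b1_b3 + theta ** 2 * var_b3)
```

Because the standard error depends on θ, the effect is typically reported at several covariate values (for example the minimum, mean and maximum of B).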
- ANOVA assumptions
Observations must be independent of one another, residuals must be randomly and normally distributed, and variances must be homogeneous between groups.
- Regression assumptions
The relationship between Y and X must be linear for each treatment group (although some forms of nonlinearity can be dealt with by including a polynomial term as an extra covariate). In addition, errors (deviations from the fitted lines) must be independent of the values of X and normally distributed.
- Specific ANCOVA assumptions
The model assumes that the covariate is independent of the treatment effect. In other words, the distribution of covariate values should be the same at each treatment level or (more importantly) the (parametric) mean value of the covariate should be the same for each group. A further specific (but optional) assumption is homogeneity of slopes. It is optional because it is only required to simplify the model for estimating adjusted means.
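The assumption that the covariate is unrelated to treatment can be checked with a one-way ANOVA of the covariate itself across treatment levels; a sketch with invented data:

```python
import numpy as np
from scipy.stats import f

# invented covariate values: a = 3 treatment levels, n = 4 replicates
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [2.0, 3.0, 4.0, 5.0],
              [1.5, 2.5, 3.5, 4.5]])
a, n = X.shape

# between-group and within-group sums of squares for the covariate
ss_between = n * ((X.mean(axis=1) - X.mean()) ** 2).sum()
ss_within = ((X - X.mean(axis=1, keepdims=True)) ** 2).sum()

F = (ss_between / (a - 1)) / (ss_within / (a * (n - 1)))
p = f.sf(F, a - 1, a * (n - 1))  # large p: no evidence covariate differs by group
```

In a properly randomized experiment this test should rarely be significant; in observational data a significant result warns that covariate adjustment may be confounded with the treatment effect itself.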