Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)




Measuring agreement between observers

"Ah! don't say that you agree with me. When people agree with me I always feel that I must be wrong."
Oscar Wilde 1854-1900.
The Critic as Artist. Part 2


There is one type of contingency table which we haven't considered yet. Sometimes the frequencies of a categorical variable are measured more than once, for example by two different people or by using two or more different methods. This may also arise when checking the results of a questionnaire - the question may be asked again to the respondent at a later date, or a researcher may compare his own assessment with that of the respondent.

Let's take the example of assessing alcohol consumption for a cohort study on the effect of alcohol on heart disease. The problem here is that people tend to underestimate how much they drink. One way round this is to ask the wife to assess the number of drinks her husband takes each day. The data are arranged in a square contingency table as below. If there is a high level of agreement, most observations will lie along the diagonal. In this case 106 of the 120 observations lie along the diagonal. We can get a measure of the level of agreement by calculating the proportion of identical ratings between wives and husbands. For this example the proportion is 106/120 = 0.883.


                              Husband's own assessment of his no. drinks/day
Wife's assessment of
husband's no. drinks/day      Less than 2    2 or more
Less than 2                   57             10
2 or more                     4              49
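
The proportion of identical ratings is simply the sum of the diagonal cells divided by the total. A minimal Python sketch, assuming the table is entered with the wife's assessment as rows and the husband's as columns (the function name `observed_agreement` is ours, chosen only for illustration):

```python
def observed_agreement(table):
    """Proportion of cases lying on the main diagonal of a square table."""
    total = sum(sum(row) for row in table)
    diagonal = sum(table[i][i] for i in range(len(table)))
    return diagonal / total


# Rows: wife's assessment; columns: husband's own assessment.
first_table = [[57, 10],
               [4, 49]]

p_obs = observed_agreement(first_table)
print(round(p_obs, 3))  # 0.883
```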


You might wonder why we cannot just assess association between the two sets of categories in the usual way, say with a chi square test. A Pearson's chi square test on the table above gives a value for X² of 71.2, which is highly significant (P < 0.001). However, significance alone is no indicator of agreement.

Consider this second set of data. Here the wife tends to disagree with the husband on one of the categories. This still produces a strong association (X² = 11.0, P < 0.001), but the proportion of identical ratings is now only 0.667. Agreement is about whether the two ratings are the same, not whether they are related in some way, which is all the chi square value tells us.
                              Husband's own assessment of his no. drinks/day
Wife's assessment of
husband's no. drinks/day      Less than 2    2 or more
Less than 2                   60             7
2 or more                     33             20

We have compared agreement above just on the basis of the proportion of identical ratings. But this is not an ideal measure of agreement, since we would expect a certain number of identical ratings just by chance. We get a better measure if we correct for chance agreement, using a measure that we first encountered in Unit 2, namely Cohen's Kappa. The only difference here is that neither rating is accepted as a gold standard.

Expected chance frequencies are worked out from the marginal frequencies in the same way as for Pearson's chi square test. For the first table above, this gives expected frequencies of 34.1 (67 × 61/120) and 26.1 (53 × 59/120) in the diagonal cells. These give us the expected proportion of agreements as 0.501, compared to an observed proportion of 0.883. Kappa is obtained as before by subtracting the expected proportion of agreements from the observed proportion, and standardizing the difference by dividing it by the maximum possible agreement beyond chance (that is, 1 − 0.501):

Algebraically speaking -

κ  =  (pO − pE) / (1 − pE)  =  (0.883 − 0.501) / (1 − 0.501)  =  0.766
  • κ is kappa;
  • pO is the observed proportion that agree
  • pE is the expected proportion that agree, assuming no association
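
The calculation can be sketched in a few lines of Python. This is only an illustration working directly from the raw counts of the two tables above; the function name `cohen_kappa` is ours. From the marginal totals of the first table, the expected diagonal frequencies are 67 × 61/120 ≈ 34.1 and 53 × 59/120 ≈ 26.1.

```python
def cohen_kappa(table):
    """Unweighted Cohen's kappa for a square table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    # Observed proportion of agreement: cases on the diagonal.
    p_obs = sum(table[i][i] for i in range(k)) / n
    # Expected count in diagonal cell (i, i) is row_total * col_total / n,
    # so the expected proportion is the sum of those counts divided by n.
    p_exp = sum(row_totals[i] * col_totals[i] for i in range(k)) / n**2

    return (p_obs - p_exp) / (1 - p_exp)


first_table = [[57, 10], [4, 49]]
second_table = [[60, 7], [33, 20]]

print(round(cohen_kappa(first_table), 3))   # 0.766
print(round(cohen_kappa(second_table), 3))  # 0.288
```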

There is an accepted convention for describing the level of agreement at various values of kappa, shown in the table below. However, it should be remembered that kappa only indicates the level of agreement between two methods or two observers. It cannot tell you which (if either) is closest to reality!
κ  >  0.75         excellent agreement
κ  =  0.4 - 0.75   fair to good agreement
κ  <  0.4          poor agreement

If we have more than two categories, the kappa coefficient can be worked out in exactly the same way as long as we are not dealing with an ordered categorical variable. If we are, such as with our example on the number of drinks, we need a slight modification.

In the table below with three categories, the ratings can either differ by one category (the gold cells, next to the diagonal) or by two categories (the red cells, in the corners). Expected frequencies for no association are shown as subscripts in brackets. In this table the 24 cases off the diagonal are concentrated in the gold cells; if they were instead concentrated in the red cells, this should indicate a lower level of agreement. But the simple kappa measure for these data would take no account of what is happening in the cells off the diagonal, and would give us a kappa value of 0.668 irrespective of how these values were distributed.

Husband's own assessment of his no. drinks/day (columns) against wife's assessment of husband's no. drinks/day (rows), each in three categories: <1, 1 - 2, >2

This problem can be circumvented with the weighted kappa measure. A weight, wi, is assigned to cases for which the two raters differ by i categories. In the case of agreement the weight is set at 1. If there are k categories, the maximum disagreement is of k − 1 categories, so this is given weight zero (the red cells above). Intermediate values are usually given equally spaced weights, so we would give the gold cells a weight of 0.5. We then add the weighted frequencies for the gold cells to those on the diagonal to obtain the observed and expected proportions.

Hence our observed proportion of agreement is:
(1 x 96 + 0.5 x 20)/120  =   0.883
Our expected proportion of agreement by chance alone is:
(1 x 47.7 + 0.5 x 52.6)/120  =   0.617

This gives a weighted kappa measure of 0.695, slightly higher than the unweighted kappa of 0.668. In this case it is higher because most of the discrepancies were by only one category. If they had instead been concentrated in the red cells, we would have found the weighted kappa was less than the unweighted version.
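
The weighting scheme can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive implementation: the function name `weighted_kappa` is ours, and it uses the equally spaced weights described above (1 on the diagonal, falling to 0 at the maximum disagreement of k − 1 categories). The second half of the block only reuses the summary totals quoted in the text (96 and 20 observed, 47.7 and 52.6 expected), which are themselves rounded.

```python
def weighted_kappa(table):
    """Cohen's kappa with equally spaced (linear) agreement weights."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(table[i][j] for i in range(k)) for j in range(k)]

    p_obs = 0.0
    p_exp = 0.0
    for i in range(k):
        for j in range(k):
            # Weight 1 for agreement, 0 for a difference of k - 1 categories.
            w = 1 - abs(i - j) / (k - 1)
            p_obs += w * table[i][j] / n
            p_exp += w * row_totals[i] * col_totals[j] / n**2
    return (p_obs - p_exp) / (1 - p_exp)


# With only two categories the weights are 1 and 0, so the result
# is the same as the unweighted kappa.
first_table = [[57, 10], [4, 49]]
print(f"{weighted_kappa(first_table):.3f}")  # 0.766

# The three-category example, from the summary totals quoted in the text.
n = 120
p_obs_w = (1 * 96 + 0.5 * 20) / n
p_exp_w = (1 * 47.7 + 0.5 * 52.6) / n
kappa_w = (p_obs_w - p_exp_w) / (1 - p_exp_w)
kappa_unweighted = (96 / n - 47.7 / n) / (1 - 47.7 / n)

print(f"{kappa_unweighted:.3f}")  # 0.668
print(f"{kappa_w:.3f}")           # 0.696
```

Carrying the unrounded proportions through gives 0.696 rather than 0.695; the difference is only rounding, since (0.883 − 0.617)/(1 − 0.617) = 0.695.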

The kappa measure is not without its detractors (for example Maclure & Willett (1987)), and care must be taken in its interpretation:

  • Do not interpret the absolute value of kappa too rigidly. For ordinal data the value obtained is dependent on the number of categories - generally the fewer the categories, the higher will be the kappa value.
  • Use weighted kappa when dealing with ordered categories in tables larger than 2x2.
  • Do not use kappa for measurement variables - other methods are available for these.

Significance tests & confidence intervals for Kappa

Since one is most interested in the absolute value of kappa, a more appropriate alternative to a significance test is to attach a 95% confidence interval and see whether it includes zero. An approximate interval is given by

Algebraically speaking -

95% Confidence Interval (κ)  =  κ   ±  1.96 SE (κ)

  • SE is the standard error of kappa

For our first example, with a κ value of 0.766, we get a 95% confidence interval of 0.588 to 0.944, which clearly excludes zero. For the second example κ is only 0.288 (95% CI: 0.129 to 0.446), which is still significantly greater than zero but indicates a much poorer level of agreement.
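
The text does not say which standard error is used. As an assumption, the Python sketch below uses a large-sample standard error calculated under the hypothesis of chance agreement only; with the two tables given earlier this choice reproduces the intervals quoted above, but other standard error formulas exist and would give somewhat different limits. The function name `kappa_with_interval` is ours.

```python
import math


def kappa_with_interval(table, z=1.96):
    """Kappa and an approximate interval kappa +/- z * SE for a square table.

    Assumption: SE is the large-sample standard error under chance agreement.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_p = [sum(row) / n for row in table]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    p_obs = sum(table[i][i] for i in range(k)) / n
    p_exp = sum(row_p[i] * col_p[i] for i in range(k))
    kappa = (p_obs - p_exp) / (1 - p_exp)

    # Standard error built from the marginal proportions only.
    correction = sum(row_p[i] * col_p[i] * (row_p[i] + col_p[i])
                     for i in range(k))
    se = math.sqrt(p_exp + p_exp**2 - correction) / ((1 - p_exp) * math.sqrt(n))

    return kappa, kappa - z * se, kappa + z * se


first_table = [[57, 10], [4, 49]]
second_table = [[60, 7], [33, 20]]

for table in (first_table, second_table):
    kappa, lower, upper = kappa_with_interval(table)
    print(f"{kappa:.3f} ({lower:.3f} to {upper:.3f})")
# 0.766 (0.588 to 0.944)
# 0.288 (0.129 to 0.446)
```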