"It has long been an axiom of mine that the little things are infinitely the most important"
Measuring agreement between observers
There is one type of contingency table which we haven't considered yet. Sometimes the frequencies of a categorical variable are measured more than once, for example by two different people or by using two or more different methods. This may also arise when checking the results of a questionnaire - the question may be asked again to the respondent at a later date, or a researcher may compare his own assessment with that of the respondent.
Let's take the example of assessing alcohol consumption for a cohort study on the effect of alcohol on heart disease. The problem here is that people tend to underestimate how much they drink, so one check is to compare each respondent's own assessment with an independent rating - say that of his or her spouse.
You might wonder why we cannot just assess association between the two sets of categories in the usual way, say with a chi square test. A Pearson's chi square test on the table above gives a value for X² of 71.2, which is highly significant (P < 0.001). However, a significant association is no indicator of agreement.
Consider this second set of data.
Here the wife tends to disagree with the husband on one of the categories. This still produces a strong association (X² = 11.0, P < 0.001), but the proportion of identical ratings is now only 0.667. Agreement means the two ratings are the same - not merely that they are related in some way, which is all the chi square value tells us.
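To see the distinction numerically, here is a minimal sketch in Python using a hypothetical 2 × 2 table of ratings (the counts are invented for illustration, not taken from the tables above). The two raters almost always choose opposite categories, so the chi square statistic is large even though the proportion of identical ratings is tiny.

```python
# Hypothetical 2x2 table: rows = husband's rating, cols = wife's rating.
# The raters systematically disagree, yet the association is very strong.
obs = [[5, 45],
       [45, 5]]

n = sum(sum(row) for row in obs)
row_totals = [sum(row) for row in obs]
col_totals = [sum(col) for col in zip(*obs)]

# Pearson's chi square: sum of (O - E)^2 / E, with E from the marginals
chi2 = sum((obs[i][j] - row_totals[i] * col_totals[j] / n) ** 2
           / (row_totals[i] * col_totals[j] / n)
           for i in range(2) for j in range(2))

# Proportion of identical ratings (the diagonal cells)
agreement = sum(obs[i][i] for i in range(2)) / n

print(chi2)       # 64.0 - a highly significant association
print(agreement)  # 0.1  - yet the raters almost never agree
```

A large chi square here reflects the very consistent pattern of disagreement, not agreement.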
We have compared agreement above just on the basis of the proportion of identical ratings. But this is not an ideal measure of agreement, since we would expect a certain number of identical ratings just by chance. We can get a better measure if we use a chance-corrected measure of agreement, the kappa coefficient:

kappa = (Po − Pe) / (1 − Pe)

where Po is the observed proportion of identical ratings and Pe is the proportion expected by chance.
Expected chance frequencies are worked out from the marginal frequencies in the same way as for Pearson's chi square test.
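As a sketch of the whole calculation, the following Python snippet computes the expected frequencies from the marginals and then kappa itself. The 2 × 2 counts are hypothetical, chosen only so the arithmetic is easy to follow - they are not the data from the tables above.

```python
# Hypothetical ratings table: rows = rater 1, cols = rater 2
obs = [[40, 10],
       [10, 40]]

n = sum(sum(row) for row in obs)
row_totals = [sum(row) for row in obs]
col_totals = [sum(col) for col in zip(*obs)]

# Expected frequencies under chance agreement, from the marginal totals
expected = [[row_totals[i] * col_totals[j] / n for j in range(2)]
            for i in range(2)]

p_o = sum(obs[i][i] for i in range(2)) / n        # observed agreement
p_e = sum(expected[i][i] for i in range(2)) / n   # chance agreement
kappa = (p_o - p_e) / (1 - p_e)

print(round(kappa, 3))  # 0.6 for these invented counts
```

With these counts the observed agreement is 0.8 and the chance agreement 0.5, giving kappa = (0.8 − 0.5)/(1 − 0.5) = 0.6.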
For the first table above, this gives expected frequencies of
There is an accepted convention on how the level of agreement of various levels of kappa is described, shown in the table below. However, it should be remembered that kappa is only indicating the level of agreement between two methods or two observers. It cannot tell you which (if either) is closest to reality!
If we have more than two categories, the kappa coefficient can be worked out in exactly the same way as long as we are not dealing with an ordered categorical variable. If we are, such as with our example on the number of drinks, we need a slight modification.
In the table shown below with three categories, the ratings can either differ by one category (shown in gold on the right) or by two categories (shown in red).
Expected frequencies for no association are shown as subscripts in brackets. In this table the 24 cases off the diagonal are concentrated in the gold cells, but if they were instead concentrated in the red cells this should indicate a lower level of agreement. The simple kappa measure, however, takes no account of what is happening in the cells off the diagonal, and would give us a kappa value of 0.668 irrespective of how those values were distributed.
This problem can be circumvented with the weighted kappa measure. A weight, wi, is assigned to cases for which the two raters differ by i categories. In the case of agreement the weight is set at 1. If there are k categories the maximum disagreement is of k − 1 categories, so this is given weight zero (the red cells above). Intermediate values are usually given equally spaced weights, wi = 1 − i/(k − 1), so with three categories we would give the gold cells a weight of 0.5.
Hence our observed proportion of agreement is the weighted sum Po(w) = Σ wi pi, where pi is the proportion of pairs whose ratings differ by i categories; the expected proportion Pe(w) is weighted in the same way.
This gives a weighted kappa measure of 0.695, slightly higher than we obtained with the unweighted kappa. In this case it is higher because most of the discrepancies were by only one category. If they had instead been concentrated in the red boxes, we would have found the weighted kappa was less than the unweighted version.
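The weighted calculation can be sketched in the same style. The 3 × 3 table below is again hypothetical (not the drinks data), with linear weights wi = 1 − i/(k − 1), so that exact agreement scores 1, a one-category difference scores 0.5, and a two-category difference scores 0.

```python
# Hypothetical 3x3 table of ordered ratings: rows = rater 1, cols = rater 2
obs = [[30, 5, 0],
       [5, 30, 5],
       [0, 5, 20]]
k = 3
n = sum(sum(row) for row in obs)
row_totals = [sum(row) for row in obs]
col_totals = [sum(col) for col in zip(*obs)]

# Linear weights: 1 on the diagonal, 0 for maximum disagreement
w = [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]

# Weighted observed and chance-expected proportions of agreement
p_o = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k)) / n
p_e = sum(w[i][j] * row_totals[i] * col_totals[j] / n
          for i in range(k) for j in range(k)) / n

weighted_kappa = (p_o - p_e) / (1 - p_e)
print(round(weighted_kappa, 3))  # 0.759 for these invented counts
```

Because all 20 disagreements in these invented counts differ by only one category, the weighted kappa (0.759) comes out above the unweighted value (0.695 for the same table) - the same pattern as in the example above.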
Significance tests & confidence intervals for Kappa
Since one is most interested in the absolute value of kappa, an alternative and more appropriate way to test it is to attach a 95% confidence interval and see if it overlaps zero. An approximate large-sample interval is given by κ ± 1.96 SE(κ), where SE(κ) ≈ √[Po(1 − Po) / (n(1 − Pe)²)].
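A minimal sketch of such an interval in Python, using the simple large-sample standard error SE(κ) ≈ √[Po(1 − Po)/(n(1 − Pe)²)]; the values of Po, Pe and n here are invented for illustration, not the figures from the worked examples.

```python
import math

# Invented illustrative values - not the data from the examples above
p_o, p_e, n = 0.8, 0.5, 100

kappa = (p_o - p_e) / (1 - p_e)

# Approximate large-sample standard error of kappa
se = math.sqrt(p_o * (1 - p_o) / (n * (1 - p_e) ** 2))

lower = kappa - 1.96 * se
upper = kappa + 1.96 * se
print(round(lower, 3), round(upper, 3))  # 0.443 0.757
```

Since the whole interval lies above zero, this (hypothetical) kappa of 0.6 differs significantly from chance agreement.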
For our first example with a κ value of 0.762 we get a 95% confidence interval of 0.588 to 0.944, which clearly differs significantly from zero. For the second example κ is only