"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



z-test for independent proportions: Use & misuse

(independent proportions, risk difference, confidence interval of difference, critical ratio test, chi square test)

Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...

Use and Misuse

The purpose of the z-test for independent proportions is to compare two independent proportions. It is also known as the t-test for independent proportions, and as the critical ratio test. In medical research the difference between proportions is commonly referred to as the risk difference. The test statistic is the standardized normal deviate (z). The standard test uses the common pooled proportion to estimate the variance of the difference between two proportions. It is identical to the chi square test, except that we estimate the standard normal deviate (z). The square of the test statistic (z²) is identical to Pearson's chi square statistic X².
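The identity between z² and Pearson's X² can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names and the counts (30 of 100 versus 15 of 100) are made up for illustration, and no continuity correction is applied.

```python
from math import sqrt
from statistics import NormalDist


def pooled_z_test(x1, n1, x2, n2):
    """Two-sided z-test for two independent proportions (pooled variance)."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # common pooled proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


def pearson_chi_square(x1, n1, x2, n2):
    """Pearson's X^2 for the 2 x 2 table, without continuity correction."""
    observed = [[x1, n1 - x1], [x2, n2 - x2]]
    n = n1 + n2
    row_totals = [n1, n2]
    col_totals = [x1 + x2, n - x1 - x2]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Illustrative (made-up) counts: 30/100 successes versus 15/100 successes.
z, p_value = pooled_z_test(30, 100, 15, 100)
chi2 = pearson_chi_square(30, 100, 15, 100)
print(f"z = {z:.4f}, z^2 = {z * z:.4f}, X^2 = {chi2:.4f}, P = {p_value:.4f}")
```

The two routes differ only in presentation: z keeps the sign of the difference, whereas X² does not.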

It is sometimes preferred to the chi square test if the interest is in the size of the difference between the two proportions. A confidence interval can be attached to that difference using either the normal approximation or a variety of exact or small sample methods. Because different estimates of the variance are used, it is possible that the results of the test may not be consistent with the confidence interval. In other words, the confidence interval of the difference may include zero (indicating no significant difference), yet the test indicates a significant difference. As a result an alternative critical ratio test was devised that gives identical results to the confidence interval. This estimates the standard error of the difference as the square root of the sum of the individual variances. When the test is used, it should therefore always be specified whether the variance of the difference is based on the pooled estimate of the common proportion (identical to Pearson's chi square test) or on the variance of the difference from the sum of the two individual variances (the more liberal alternative critical ratio test).
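The disagreement between the pooled test and the confidence interval can be sketched as follows. This is an illustration under stated assumptions, not a recommended analysis: the function names are hypothetical, the counts (10 of 50 versus 50 of 500) are made up, and the interval is the simple normal-approximation (Wald) interval built from the unpooled standard error, that is the square root of the sum of the two individual variances.

```python
from math import sqrt
from statistics import NormalDist


def pooled_z(x1, n1, x2, n2):
    """Standard test: variance from the common pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)
    return (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))


def unpooled_se(x1, n1, x2, n2):
    """Square root of the sum of the two individual variances."""
    p1, p2 = x1 / n1, x2 / n2
    return sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)


def wald_interval(x1, n1, x2, n2, level=0.95):
    """Normal-approximation interval for the difference p1 - p2."""
    d = x1 / n1 - x2 / n2
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit * unpooled_se(x1, n1, x2, n2)
    return d - half_width, d + half_width


# Illustrative (made-up) counts with very unequal sample sizes.
counts = (10, 50, 50, 500)
z_pooled = pooled_z(*counts)
z_unpooled = (10 / 50 - 50 / 500) / unpooled_se(*counts)
lower, upper = wald_interval(*counts)

print(f"pooled z   = {z_pooled:.3f}")    # exceeds 1.96
print(f"unpooled z = {z_unpooled:.3f}")  # below 1.96
print(f"95% CI     = ({lower:.3f}, {upper:.3f})")  # includes zero
```

The unequal sample sizes are deliberate. With equal sample sizes the unpooled standard error can never exceed the pooled one, so there the alternative test is the more liberal of the two, and any disagreement runs in the opposite direction.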

Not surprisingly the most common misuses of the z-test are the same as for Pearson's chi square test. Lack of independence is the commonest factor invalidating the test. In observational studies this may result from the use of cluster sampling. The effect is likely to be greatest when individuals within a cluster are much more similar to each other than to individuals in other clusters - for example where the school was the unit of study. Multistage sampling is almost inevitable in some ecological studies where it is almost impossible to select a genuinely random sample. We do give one example, a telemetry study on rabbits, where standard errors were adjusted to take account of the more complex sampling design. The same issue arises with cluster randomized trials - we look at a veterinary trial where treatment was randomized to animals but the unit of analysis was mammary infections per quarter. In another study treatment was allocated to farms, but the unit of analysis was clutches of birds.
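A rough feel for how much clustering matters can be had from the standard design effect, 1 + (m - 1) × ICC, where m is the average cluster size and ICC is the intracluster correlation. The sketch below is not the adjustment used in any of the studies mentioned above; the cluster size, the ICC value, the counts and the function names are all assumptions chosen for illustration.

```python
from math import sqrt


def design_effect(mean_cluster_size, icc):
    """Variance inflation for equal-sized clusters: 1 + (m - 1) * ICC."""
    return 1 + (mean_cluster_size - 1) * icc


def pooled_z(x1, n1, x2, n2, deff=1.0):
    """Pooled z statistic, with the variance inflated by a design effect."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)
    se = sqrt(deff * p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se


# Assumed values: clusters of 20 individuals, intracluster correlation 0.1.
deff = design_effect(mean_cluster_size=20, icc=0.1)

# Made-up counts: 30/100 versus 15/100 individuals.
z_naive = pooled_z(30, 100, 15, 100)
z_adjusted = pooled_z(30, 100, 15, 100, deff=deff)
print(f"design effect = {deff:.2f}")
print(f"naive z = {z_naive:.3f}, adjusted z = {z_adjusted:.3f}")
```

With these assumed values an apparently significant difference no longer reaches the conventional 1.96 threshold once the clustering is allowed for.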

Use of paired samples also conflicts with the independence assumption - we give examples of before and after studies where the same sampling units are assessed and paired studies where different diagnostic tests are tested on the same samples. In some of these cases McNemar's test for significance of change would have been more appropriate. Pooling results from five different experiments may also invalidate the independence assumption especially if the data are heterogeneous. Certainly pooling across factors to reduce data to a 2 × 2 table (such as in a study on the conception rate of water buffaloes) is very unwise.
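For paired data, McNemar's test uses only the discordant pairs. A minimal sketch, assuming b pairs that changed in one direction and c pairs that changed in the other (the counts 15 and 5 are made up, and the function name is hypothetical):

```python
from math import comb, sqrt
from statistics import NormalDist


def mcnemar(b, c):
    """McNemar's test from the two discordant-pair counts (b + c > 0).

    Returns the uncorrected chi square statistic (1 df), its asymptotic
    P-value, and a two-sided exact binomial P-value.
    """
    chi2 = (b - c) ** 2 / (b + c)
    # A chi square variable with 1 df is the square of a standard normal.
    p_asymptotic = 2 * (1 - NormalDist().cdf(sqrt(chi2)))
    # Exact version: under H0 the smaller count is Binomial(b + c, 0.5).
    n, k = b + c, min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    p_exact = min(1.0, 2 * tail)
    return chi2, p_asymptotic, p_exact


# Illustrative (made-up) discordant counts.
chi2, p_asymptotic, p_exact = mcnemar(15, 5)
print(f"X^2 = {chi2:.2f}, asymptotic P = {p_asymptotic:.4f}, exact P = {p_exact:.4f}")
```

The concordant pairs do not enter the calculation at all, which is why treating the two sets of results as independent samples gives a different, and inappropriate, answer.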

Despite the fact that the test lends itself to estimating a confidence interval of the difference, it is rare to see the interval calculated - this is a pity as it is a much more informative approach than just quoting a P-value. Use of small samples is not uncommon in which case exact tests based on the distribution of X² would be more appropriate. Another misuse is to use multiple z-tests to compare proportions in repeated measures designs. Lastly we note there is a strange predilection for always using one-tailed tests in survival studies whether of rabbits, foxes or wild birds. The reason for this is debatable but it remains true that one should always justify a one-tailed test a priori.

What the statisticians say

Lui (2004) gives a comprehensive account of the statistical estimation of epidemiological risk including the confidence interval for the risk difference. Fleiss et al. (2003) and Agresti (2002) also cover tests between proportions and confidence intervals of the difference between proportions. Conover (1999) looks at the analysis of contingency tables in Chapter 4. All the various different models (= applications) are assessed. Woodward (1999) covers the z-test for proportions and the confidence interval of the difference in Chapter 2. Snedecor & Cochran (1989) cover the z-test and its identity with Pearson's chi square test in Chapter 8. Fleiss (1981) provides a discussion of the merits of the continuity correction and of the alternative critical ratio test.

Santner et al. (2007) carry out small-sample comparisons of various confidence intervals for the difference of two independent binomial proportions. Zou & Donner (2004) propose a simple alternative confidence interval for the difference between two proportions. Agresti & Min (2001) construct an interval for the risk difference by inverting a two sided test. Agresti & Caffo (2000) propose simple and effective confidence intervals for differences of proportions by adding two successes and two failures. Newcombe (2000) details the method for attaching a confidence interval to the risk difference based on the Wilson score interval. Newcombe (1998) compares eleven methods for interval estimation for the difference between independent proportions; he recommends that based on the Wilson score interval. Paul & Zaihra (2008) describe estimation of the confidence interval of the risk difference for data sampled from clusters. Eberhardt & Fligner (1977) compare the type I error rate and power of the standard z-test for independent proportions with those of the alternative critical ratio test.
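Two of the intervals mentioned above are simple enough to sketch with the Python standard library. The code below is our reading of the usual textbook forms, not a transcription from the papers: the Wilson-score-based interval combines the separate Wilson limits for each proportion, and the Agresti & Caffo interval adds one success and one failure to each sample (two of each in total) before applying the normal approximation. Function names and counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def wilson_limits(x, n, z):
    """Wilson score limits for a single proportion x / n."""
    p = x / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


def newcombe_wilson_interval(x1, n1, x2, n2, level=0.95):
    """Interval for p1 - p2 built from the two Wilson score intervals."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = wilson_limits(x1, n1, z)
    l2, u2 = wilson_limits(x2, n2, z)
    d = p1 - p2
    lower = d - sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return lower, upper


def agresti_caffo_interval(x1, n1, x2, n2, level=0.95):
    """Wald interval after adding one success and one failure per sample."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p1, p2 = (x1 + 1) / (n1 + 2), (x2 + 1) / (n2 + 2)
    se = sqrt(p1 * (1 - p1) / (n1 + 2) + p2 * (1 - p2) / (n2 + 2))
    d = p1 - p2
    # Clip to the range a difference of proportions can take.
    return max(-1.0, d - z * se), min(1.0, d + z * se)


# Illustrative (made-up) counts: 56/70 versus 48/80.
nw_lower, nw_upper = newcombe_wilson_interval(56, 70, 48, 80)
ac_lower, ac_upper = agresti_caffo_interval(56, 70, 48, 80)
print(f"Wilson-score based: ({nw_lower:.4f}, {nw_upper:.4f})")
print(f"Agresti-Caffo:      ({ac_lower:.4f}, {ac_upper:.4f})")
```

For these counts the two intervals are very similar, and both exclude zero. Larger differences between methods are to be expected with small samples or proportions near 0 or 1.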

Wikipedia uses the term absolute risk reduction for risk difference but does not give its confidence interval. The R package 'epitools' does not give the confidence interval for a risk difference, but it can be found in R packages provided by Michael A Rotondi and Mark Stevenson.