The chi square test, and the other tests for comparing proportions covered within this unit, are frequently misapplied by analysing pooled 2×2 tables as if they were derived from one study. Pooling data where the prevalence of the characteristic varies between replicates can quite simply give the wrong answer, a phenomenon known as Simpson's paradox. It is best understood by looking at the following hypothetical data sets.
Example 1
Note that the fall rates for the two treatments A and B were very similar within each centre, but differed between centres. In the first centre the fall rate was 81-83%, whilst in the second centre it was only 33-36%.
So what happens if we just pool the data from the two trials?
Now we find that the control has a significantly higher fall rate than the treatment (69% compared to 60%), with a risk ratio for treatment versus control of 0.87. Clearly this is a very misleading result. It arises from pooling data with unequal proportions (overall fall rates for the two centres are 0.82 versus 0.34) and unequal ratios of sample sizes (120:210 and 105:75). We have also lost the valuable information that the fall rates of both treatments are highly dependent on which centre is involved (we may have a major problem with patient care in centre 1!).
To analyse the data properly we need to return to the full information given in the two tables.
Multicentre trial

| Centre | Treatment | Outcome: falls | Outcome: no falls | Risk ratio |
|---|---|---|---|---|
| 1 | Treatment | 100 (a_{1}) | 20 (b_{1}) | 1.03 |
| 1 | Control | 170 (c_{1}) | 40 (d_{1}) | |
| 2 | Treatment | 35 (a_{2}) | 70 (b_{2}) | 0.93 |
| 2 | Control | 27 (c_{2}) | 48 (d_{2}) | |
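The effect of pooling can be checked directly from the counts in the multicentre table. The following is only a minimal sketch: the dictionary layout and the function names are illustrative choices, not part of the original analysis. It computes the risk ratio within each centre and then again after simply adding the cells together.

```python
# Counts from the multicentre trial table, as (falls, no falls) per arm.
# The data layout and helper names are illustrative assumptions.
centres = {
    1: {"treatment": (100, 20), "control": (170, 40)},
    2: {"treatment": (35, 70), "control": (27, 48)},
}


def risk(falls, no_falls):
    """Proportion of patients who fell."""
    return falls / (falls + no_falls)


def risk_ratio(treatment, control):
    """Risk in the treatment arm divided by risk in the control arm."""
    return risk(*treatment) / risk(*control)


# Risk ratio within each centre.
centre_rr = {
    c: risk_ratio(arms["treatment"], arms["control"])
    for c, arms in centres.items()
}

# Pool by adding the corresponding cells across centres, then recompute.
pooled = {
    arm: tuple(sum(centres[c][arm][k] for c in centres) for k in range(2))
    for arm in ("treatment", "control")
}
pooled_rr = risk_ratio(pooled["treatment"], pooled["control"])

for c, rr in centre_rr.items():
    print(f"centre {c}: risk ratio = {rr:.2f}")
print(f"pooled:   risk ratio = {pooled_rr:.2f}")
```

Both centre-level risk ratios come out close to 1, whereas the pooled ratio falls well below 1, which is the paradox described above.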



Example 2
Again we have a binary outcome variable, but here we believe that the outcome of the study has been affected by a confounding factor such as age. Hence we stratify the results to investigate that factor and adjust for its effects.
Our example here is from a case-control study on risk factors for a cattle disease. The data originate from 75 cases of all ages compared with 75 randomly drawn controls. We stratify the results so we can examine results for adults and calves separately. We use odds ratios to assess the importance of the risk factor in each stratum. If we pooled the data we would get a crude odds ratio of 2.98. This is not as obviously misleading as in our first example, but we might justifiably question its relevance given the apparent difference between the groups.
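Risk factors for cattle disease

| Age | Risk factor | Affected: yes | Affected: no | Odds ratio |
|---|---|---|---|---|
| Adults | + | 25 | 12 | 2.50 |
| Adults | − | 15 | 18 | |
| Calves | + | 30 | 24 | 5.25 |
| Calves | − | 5 | 21 | |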
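The stratum and crude odds ratios can be reproduced from the table with a few lines of code. This is a minimal sketch in which the data layout and function name are assumptions made for illustration.

```python
# Cattle disease counts as (affected, not affected), for animals with (+)
# and without (-) the risk factor. Layout and names are illustrative.
strata = {
    "adults": {"+": (25, 12), "-": (15, 18)},
    "calves": {"+": (30, 24), "-": (5, 21)},
}


def odds_ratio(exposed, unexposed):
    """Cross-product ratio ad / bc for one 2x2 table."""
    a, b = exposed
    c, d = unexposed
    return (a * d) / (b * c)


# Odds ratio within each age group.
stratum_or = {age: odds_ratio(t["+"], t["-"]) for age, t in strata.items()}

# Crude odds ratio after adding the two tables together cell by cell.
pooled_exposed = tuple(sum(t["+"][k] for t in strata.values()) for k in range(2))
pooled_unexposed = tuple(sum(t["-"][k] for t in strata.values()) for k in range(2))
crude_or = odds_ratio(pooled_exposed, pooled_unexposed)

for age, value in stratum_or.items():
    print(f"{age}: odds ratio = {value:.2f}")
print(f"crude:  odds ratio = {crude_or:.2f}")
```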


The best way to approach such data is first to obtain a common effect estimate (either risk ratio or odds ratio as appropriate) with the appropriate confidence interval. The data are then tested for homogeneity. If the data are homogeneous, the common effect estimate can be tested for significance. If not, analysis reverts to considering each stratum separately.
There are several approaches to carrying out this sort of analysis. The most popular approach is to use what are called Mantel-Haenszel methods and we will concentrate on this approach here. An alternative approach is to combine the logarithms of the odds ratios. This method works satisfactorily when there are only a few strata and the sample sizes within each are large. There is also a maximum likelihood method known as the Cornfield-Gart method. These procedures are described by Gart (1970) and are summarized in Fleiss (2003).
Common risk ratio and odds ratio
The Mantel-Haenszel common risk ratio is obtained by simply weighting the contribution of each individual risk ratio by a measure of its precision. This is done by taking the numerator and denominator of the risk ratio for each square separately and dividing each by the number of observations in that square. The components from each square are then summed, and the numerator is divided by the denominator to obtain the common risk ratio:
Algebraically speaking
$$\lambda_{MH} = \frac{\sum_{i} a_{i} (c_{i} + d_{i}) / n_{i}}{\sum_{i} c_{i} (a_{i} + b_{i}) / n_{i}}$$
Where:
 λ_{MH} is the Mantel-Haenszel common risk ratio;
 a_{i}, b_{i}, c_{i}, and d_{i} are the observed frequencies in each cell as shown in the examples above;
 n_{i} is the total number of observations in each table.

The value will be biased towards the risk ratio of the squares containing most observations. Hence using the data from our first example above, we get a common risk ratio of 1.01, rather larger than the arithmetic mean of 0.98.
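The calculation for the multicentre trial can be sketched as follows. The (a, b, c, d) tuple layout and the function name are assumptions made for this illustration; the arithmetic follows the formula for λ_{MH} given above.

```python
# Each table is (a, b, c, d): treatment falls, treatment no falls,
# control falls, control no falls. The tuple layout is an illustrative choice.
trial = [
    (100, 20, 170, 40),  # centre 1
    (35, 70, 27, 48),    # centre 2
]


def mh_risk_ratio(tables):
    """Mantel-Haenszel common risk ratio over a list of 2x2 tables."""
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        numerator += a * (c + d) / n
        denominator += c * (a + b) / n
    return numerator / denominator


rr_mh = mh_risk_ratio(trial)
print(f"common risk ratio = {rr_mh:.2f}")  # prints 1.01
```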
As before, the asymptotic confidence interval (the estimate ± 1.96 times the standard error) is worked out for the logarithm of the relative risk, and then detransformed to obtain the interval for the relative risk itself. The standard error is given by Greenland & Robins (1985).
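The standard error itself is not reproduced here, so the sketch below assumes the form in which the Greenland & Robins variance of the log risk ratio is usually quoted: the sum over tables of [(a+b)(c+d)(a+c) − a·c·n] / n², divided by the product of the numerator and denominator sums of λ_{MH}. Treat that formula, the function name and the tuple layout as assumptions to be checked against the original paper before relying on them.

```python
import math

# (a, b, c, d) for each centre of the multicentre trial; layout is illustrative.
trial = [
    (100, 20, 170, 40),
    (35, 70, 27, 48),
]


def mh_risk_ratio_ci(tables, z=1.96):
    """Common risk ratio with an asymptotic interval on the log scale.

    ASSUMPTION: the variance of log(RR) is taken as P / (R * S), the form
    usually attributed to Greenland & Robins (1985).
    """
    r = s = p = 0.0
    for a, b, c, d in tables:
        n1, n0 = a + b, c + d          # treatment and control arm sizes
        n = n1 + n0
        r += a * n0 / n                # numerator sum of the common risk ratio
        s += c * n1 / n                # denominator sum of the common risk ratio
        p += (n1 * n0 * (a + c) - a * c * n) / n ** 2
    rr = r / s
    se_log = math.sqrt(p / (r * s))
    lower = math.exp(math.log(rr) - z * se_log)
    upper = math.exp(math.log(rr) + z * se_log)
    return rr, lower, upper


rr, lower, upper = mh_risk_ratio_ci(trial)
print(f"RR = {rr:.2f}, interval {lower:.2f} to {upper:.2f}")
```

Under these assumptions the interval for the multicentre trial comfortably spans 1.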
A similar approach is followed to get the common odds ratio. Again the contribution of each square to the common odds ratio is weighted by the number of observations in that square:
Algebraically speaking
$$\omega_{MH} = \frac{\sum_{i} a_{i} d_{i} / n_{i}}{\sum_{i} b_{i} c_{i} / n_{i}}$$
Where:
 ω_{MH} is the Mantel-Haenszel common odds ratio;
 a_{i}, b_{i}, c_{i}, and d_{i} are the observed frequencies in each cell as shown in the examples above;
 n_{i} is the total number of observations in each table.

Using the data from our second example above, we get a common odds ratio of 3.51. The asymptotic confidence interval is worked out for the logarithm of the odds ratio, and then detransformed to obtain the interval for the odds ratio itself. The standard error is given by Robins (1986). Exact confidence intervals are preferable when sample sizes are small.
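The common odds ratio for the cattle data can be sketched in the same way. The tuple layout and function name are again illustrative assumptions; the arithmetic follows the formula for ω_{MH}.

```python
# Each table is (a, b, c, d): risk factor present and affected, present and
# not affected, absent and affected, absent and not affected.
cattle = [
    (25, 12, 15, 18),  # adults
    (30, 24, 5, 21),   # calves
]


def mh_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio over a list of 2x2 tables."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


or_mh = mh_odds_ratio(cattle)
print(f"common odds ratio = {or_mh:.2f}")  # prints 3.51
```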
Testing for homogeneity/interaction
In our first example we obtained centre risk ratios of 1.03 and 0.93, with a common risk ratio of 1.01. In this situation the common risk ratio does seem to be an appropriate summary effect measure for our data. But in the second example, is the common odds ratio of 3.51 really appropriate to describe odds ratios of 2.50 for adults and 5.25 for calves?
It would appear in this latter case we might have an interaction between the confounding factor (age) and the risk factor. In other words the effect of the risk factor is dependent on the level of the confounding factor. Putting it another way, our different 2 × 2 tables may not be homogeneous. How do we assess the importance of this interaction or heterogeneity?
Essentially we compare the observed values with the expected values assuming a common risk or odds ratio. In a 2×2 table, if row and column totals are known, knowledge of one cell fixes the other three cells. Hence we base the test of homogeneity on just one value in each 2×2 square, usually the top left hand cell. The only difficulty is in working out what the expected values should be. This is straightforward but rather tedious!
For calculating the Mantel-Haenszel interaction chi square statistic we go back to the basic form of Pearson's chi square statistic, namely that X^{2} is equal to the square of the deviations divided by the parametric variance under the null hypothesis:
Algebraically speaking
$$X^{2}_{\text{MH interaction}} = \sum_{i} \frac{(a_{i} - \hat{a}_{i})^{2}}{s^{2}_{a_{i}}}$$
Where:
 X^{2}_{MH interaction} is the Mantel-Haenszel interaction chi square statistic;
 a_{i} is the observed frequency in the top left hand cell for the ith table;
 â_{i} is the expected frequency in the top left hand cell for the ith table assuming a common risk or odds ratio (the method of estimating â_{i} depends on whether the risk ratio or the odds ratio is used);
 s^{2}_{ai} is the variance of the expected frequencies. This is given by:
$$s^{2}_{a_{i}} = \frac{1}{\dfrac{1}{\hat{a}_{i}} + \dfrac{1}{\hat{b}_{i}} + \dfrac{1}{\hat{c}_{i}} + \dfrac{1}{\hat{d}_{i}}}$$
 where the hatted quantities are the expected frequencies in the four cells of the ith table.


So how do our examples work out in the test for interaction?
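As a rough sketch of the calculation for the cattle data, the code below estimates the expected cells under the common odds ratio and sums the squared deviations. It rests on an assumption not spelled out above: the expected top left hand cell is taken to be the value that keeps the observed margins and makes the table's odds ratio equal to ω_{MH}, which means solving a quadratic (the approach usually associated with the Breslow-Day test). The function names and tuple layout are illustrative.

```python
import math

# (a, b, c, d) for each age group of the cattle study; layout is illustrative.
cattle = [
    (25, 12, 15, 18),  # adults
    (30, 24, 5, 21),   # calves
]


def mh_odds_ratio(tables):
    """Mantel-Haenszel common odds ratio."""
    numerator = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    denominator = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return numerator / denominator


def expected_cells(table, psi):
    """Cells with the observed margins whose odds ratio equals psi.

    ASSUMPTION: a_hat solves a_hat * d_hat = psi * b_hat * c_hat, a quadratic
    in a_hat; the root inside the admissible range is used.
    """
    a, b, c, d = table
    r1, c1, n = a + b, a + c, a + b + c + d
    if psi == 1.0:
        a_hat = r1 * c1 / n
    else:
        qa = 1.0 - psi
        qb = (n - r1 - c1) + psi * (r1 + c1)
        qc = -psi * r1 * c1
        root = math.sqrt(qb * qb - 4.0 * qa * qc)
        low, high = max(0, r1 + c1 - n), min(r1, c1)
        candidates = [(-qb + root) / (2.0 * qa), (-qb - root) / (2.0 * qa)]
        a_hat = next(x for x in candidates if low <= x <= high)
    return a_hat, r1 - a_hat, c1 - a_hat, n - r1 - c1 + a_hat


def interaction_chi_square(tables, psi):
    """Sum over tables of (a - a_hat)^2 divided by the variance of a_hat."""
    total = 0.0
    for table in tables:
        expected = expected_cells(table, psi)
        variance = 1.0 / sum(1.0 / e for e in expected)
        total += (table[0] - expected[0]) ** 2 / variance
    return total


psi = mh_odds_ratio(cattle)
x2_interaction = interaction_chi_square(cattle, psi)
print(f"common OR = {psi:.2f}, interaction chi square = {x2_interaction:.2f}")
```

With two tables the statistic would normally be referred to a chi square distribution with one degree of freedom (number of tables minus one).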
Mantel-Haenszel association test
All that remains is to assess the significance or otherwise of the common risk or odds ratio, assuming that we have demonstrated homogeneity above. As before we base the test just on the observed and expected values in cell a of each table. Now, however, expected values are estimated on the basis of no association, rather than on the basis of a common risk or odds ratio. If applied to just one square the formula is algebraically identical to Pearson's chi square, except that it is multiplied by the factor (n_{i} − 1)/n_{i}. This is close to 1 except for small sample sizes.
Algebraically speaking
$$X^{2}_{\text{MH association}} = \frac{\left(\sum_{i} a_{i} - \sum_{i} \hat{a}_{i}\right)^{2}}{\sum_{i} s^{2}_{a_{i}}}$$
Where:
 X^{2}_{MH association} is the Mantel-Haenszel chi square statistic for significance of the common risk or odds ratio;
 a_{i} and â_{i} are the observed and expected frequencies in cell 'a' in square 'i' assuming no association;
 s^{2}_{ai} is the variance of the expected frequencies for square i, which is given by the product of the four margin totals (a_{i}+b_{i})(c_{i}+d_{i})(a_{i}+c_{i})(b_{i}+d_{i}) divided by n_{i}^{2}(n_{i} − 1).
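The association statistic can be sketched for the cattle data as follows. The function name and tuple layout are assumptions for illustration, no continuity correction is applied, and the P value is obtained from the chi square distribution with one degree of freedom through the complementary error function.

```python
import math

# (a, b, c, d) for each age group of the cattle study; layout is illustrative.
cattle = [
    (25, 12, 15, 18),  # adults
    (30, 24, 5, 21),   # calves
]


def mh_association_chi_square(tables):
    """Mantel-Haenszel chi square for association (no continuity correction)."""
    observed = expected = variance = 0.0
    for a, b, c, d in tables:
        n = a + b + c + d
        r1, r2 = a + b, c + d      # row totals
        c1, c2 = a + c, b + d      # column totals
        observed += a
        expected += r1 * c1 / n    # expected cell 'a' under no association
        variance += r1 * r2 * c1 * c2 / (n ** 2 * (n - 1))
    return (observed - expected) ** 2 / variance


x2_association = mh_association_chi_square(cattle)
# Upper tail probability of a chi square variate with 1 degree of freedom.
p_value = math.erfc(math.sqrt(x2_association / 2.0))
print(f"chi square = {x2_association:.2f}, P = {p_value:.4f}")
```

For these data the sketch gives a statistic of about 12.0 with P close to 0.0005.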

Important point
Improperly pooled data from 2×2 tables can produce misleading (that is, wrong) conclusions. Pooling can create apparent treatment effects where none exist, and can equally conceal important treatment effects.

{Fig. 1}
If we apply this test to the first data set on a multicentre trial with a common MH risk ratio of 1.01 (shown in red on the figure), we obtain a very low Mantel-Haenszel chi square value of only 0.0005 (P = 0.982). From this we clearly have no evidence of association between treatment and outcome.
But just think back to the result of the Pearson's chi square test on the crude risk ratio (shown in green), which gave a P value of 0.03! Using this approach would have led us to wrongly conclude that treatment was effective in reducing the incidence of falls. Improper pooling of data is one of the commoner reasons for incorrect statistical analysis of data in the literature.
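The misleading pooled result can be reproduced by adding the two centre tables together and applying Pearson's chi square without a continuity correction. This is a minimal sketch; the function name is an illustrative assumption, and the P value uses the chi square distribution with one degree of freedom.

```python
import math

# Pooled multicentre counts: treatment (falls, no falls), control (falls, no falls),
# obtained by adding centre 1 and centre 2 cell by cell.
pooled = (100 + 35, 20 + 70, 170 + 27, 40 + 48)


def pearson_chi_square(a, b, c, d):
    """Pearson's chi square for a single 2x2 table (no continuity correction)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


x2_pooled = pearson_chi_square(*pooled)
# Upper tail probability of a chi square variate with 1 degree of freedom.
p_pooled = math.erfc(math.sqrt(x2_pooled / 2.0))
print(f"pooled chi square = {x2_pooled:.2f}, P = {p_pooled:.2f}")  # P = 0.03
```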
{Fig. 2}
Moving to the second example with a common MH odds ratio of 3.51 (again shown in red on the figure), we get a Mantel-Haenszel chi square value for association of 12.03 (P = 0.0005). We can therefore be confident that the risk factor is associated with the occurrence of the disease.
Note that, since this was a 'traditional' case-control study, our odds ratio of 3.51 can only be equated to relative risk if the disease is 'rare'. In this case adjusting for the confounding factor of age has increased the odds ratio from the crude value of 2.98 to 3.51.