"It has long been an axiom of mine that the little things are infinitely the most important."
The two-sample t-test: Use & misuse
Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
The two-sample t-test is used to compare the means of two independent samples. Under the nil hypothesis of zero difference, the observed t-value equals the difference between the two sample means divided by the standard error of that difference. If the two population variances can be assumed equal, the standard error of the difference is estimated from the pooled (weighted) variance about the means. If the variances cannot be assumed equal, the standard error of the difference between means is taken as the square root of the sum of the individual variances, each divided by its sample size. In the latter case the resulting t' statistic must either be tested with modified degrees of freedom, or be tested against different critical values. A weighted t-test must be used if the unit of analysis comprises percentages or means based on different sample sizes. Use of the t-test assumes that estimates are based on probability sampling or random allocation, that observations are independent, that the means are of measurement variables, and that observations are drawn from normally distributed populations.
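The two standard errors described above can be checked by hand. The sketch below, using small hypothetical samples, computes both the pooled (equal-variance) t and Welch's unequal-variance t' directly from the formulae, and confirms that scipy's `ttest_ind` (with its `equal_var` switch) reproduces each:

```python
import numpy as np
from scipy import stats

# Two small illustrative samples (hypothetical data)
x = np.array([4.2, 5.1, 5.8, 4.9, 6.0, 5.3])
y = np.array([5.9, 6.4, 7.1, 6.8, 5.7, 7.3])
nx, ny = len(x), len(y)

# Pooled (equal-variance) t: SE from the weighted (pooled) variance
sp2 = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
t_pooled = (x.mean() - y.mean()) / np.sqrt(sp2 * (1 / nx + 1 / ny))

# Welch's t': SE is the square root of the summed variance/sample-size
# terms; it is referred to modified (Welch-Satterthwaite) degrees of freedom
se2 = x.var(ddof=1) / nx + y.var(ddof=1) / ny
t_welch = (x.mean() - y.mean()) / np.sqrt(se2)

# scipy agrees with both hand calculations
t1, p1 = stats.ttest_ind(x, y, equal_var=True)
t2, p2 = stats.ttest_ind(x, y, equal_var=False)
print(f"pooled: hand {t_pooled:.4f}  scipy {t1:.4f}")
print(f"Welch : hand {t_welch:.4f}  scipy {t2:.4f}")
```

Note that the two t-values differ only through the standard error; with equal sample sizes and similar variances they coincide closely, and the versions diverge as the variances (or sample sizes) become more unequal.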
The two-sample t-test is probably the most widely used (and misused) statistical test. Comparing means based on convenience sampling or non-random allocation is meaningless. If, for any reason, one is forced to use haphazard rather than probability sampling, then every effort must be made to minimize selection bias. Non-independence of replicates (pseudoreplication) is widespread, and we give examples of the two-sample test being used for paired observations and for observations in a time series. The same problem arises where conditions post-treatment do not maintain independence of observations, or in cluster randomized trials when the wrong unit of analysis is used. Another misuse is applying the t-test to an ordinal variable such as a parasite score: ordinal variables often cannot even approximate a normal distribution, with the added disadvantage that the arithmetic mean provides an inappropriate measure of location.
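The cost of treating paired observations as independent can be shown with a small simulation. In this hypothetical sketch, ten subjects are each measured before and after a treatment that adds a true shift; between-subject variation is deliberately large, so the two-sample test (which pools that variation into its error term) can miss the effect that the correct paired test detects:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical paired data: 10 subjects measured before and after
# treatment, with strong between-subject variation (SD 3) and a
# true treatment shift of +1
baseline = rng.normal(10, 3, size=10)            # subject-level means
before = baseline + rng.normal(0, 0.5, size=10)
after = baseline + 1.0 + rng.normal(0, 0.5, size=10)

# Misuse: two-sample test ignores the pairing, so between-subject
# variation swamps the treatment effect
t_ind, p_ind = stats.ttest_ind(before, after)

# Correct: paired test works on the within-subject differences
t_rel, p_rel = stats.ttest_rel(before, after)

print(f"independent: p = {p_ind:.3f}")
print(f"paired:      p = {p_rel:.4f}")
```

With these settings the paired p-value is far smaller, because differencing removes the subject-level variation entirely.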
Whilst the t-test is relatively robust as regards the normality assumption, this robustness is often taken to extremes. We give examples of where the test is used on small numbers of untransformed proportions, and for data with large numbers of zeroes or other extreme values. There is a strong tendency to rely blindly on the central limit theorem, when often a simple logarithmic transformation would both normalize distributions and homogenize variances. When it comes to the equality of variances assumption, one should not forget that the F-ratio test also requires the data to be normally distributed - and that use of the unequal variance t-test reduces the power of the test if the variances are indeed equal. We give several examples of where the outcome of the study is reversed if the correct version of the test is used. In general there is far too much focus on differences (or otherwise) of mean values when an increase in variability may be the main effect of treatment.
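The effect of a logarithmic transformation is easy to demonstrate. In the hypothetical sketch below, two right-skewed lognormal samples (think parasite counts or antibody titres) differ in location; on the raw scale the variances are badly unequal, but on the log scale the samples are normal with (in expectation) equal variances, so the ordinary equal-variance t-test applies:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical right-skewed data: lognormal samples whose locations
# differ; on the raw scale the variances are also unequal, because for
# lognormal data the variance grows with the mean
a = rng.lognormal(mean=1.0, sigma=0.8, size=15)
b = rng.lognormal(mean=1.6, sigma=0.8, size=15)

# t-test on the raw, skewed values (assumptions badly violated)
t_raw, p_raw = stats.ttest_ind(a, b)

# t-test after log transformation: distributions normalized and
# variances homogenized in one step
t_log, p_log = stats.ttest_ind(np.log(a), np.log(b))

print(f"raw scale: t = {t_raw:.2f}, p = {p_raw:.4f}")
print(f"log scale: t = {t_log:.2f}, p = {p_log:.4f}")
print(f"variance ratio raw {a.var(ddof=1)/b.var(ddof=1):.2f}, "
      f"log {np.log(a).var(ddof=1)/np.log(b).var(ddof=1):.2f}")
```

The conclusions from a test on the log scale refer to ratios of geometric means rather than differences of arithmetic means, which is often the more natural summary for such data anyway.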
As with the paired t-test, inadequate sample size and consequent low power are common problems. A result based on a sample size of less than 5-10 will never be very convincing - however 'significant' that result may be. Just because the t-test can be used on very small samples does not justify using very small samples unless larger sample sizes are impossible. The t-test should also not be used for multiple comparisons - carrying out dozens (or hundreds) of t-tests means that Type I errors are inevitable. Using a modified alpha level reduces the problem, but it is usually far better to focus on just a few key comparisons. We give examples of two of the more specialist uses of the t-test - for analysing cross-over designs and cluster randomized trials. In cross-over designs, the analysis is only valid if there is no period × treatment interaction. Weighted t-tests are still much underused and should always be utilised if cluster random sampling / allocation is used rather than simple random methods.
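The inevitability of Type I errors under multiple testing, and the effect of a modified alpha level, can both be seen in a simulation. This sketch runs 100 two-sample t-tests in which the nil hypothesis is true throughout (both samples always come from the same population), so every 'significant' result is a false positive; the Bonferroni correction simply divides alpha by the number of tests:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 100 two-sample t-tests where the nil hypothesis is true throughout:
# both samples are always drawn from the same N(0, 1) population
n_tests, alpha = 100, 0.05
pvals = np.array([
    stats.ttest_ind(rng.normal(size=20), rng.normal(size=20)).pvalue
    for _ in range(n_tests)
])

# Uncorrected: roughly alpha * n_tests false positives are expected
print("uncorrected 'significant':", int((pvals < alpha).sum()))

# Bonferroni: each p-value is compared against alpha / n_tests
print("Bonferroni 'significant': ", int((pvals < alpha / n_tests).sum()))
```

Around five spurious 'significant' results are expected without correction; the Bonferroni threshold removes them, at the cost of power for any real effects - which is why concentrating on a few pre-specified comparisons is usually the better strategy.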
What the statisticians say
Armitage & Berry (2002) cover t-tests in Chapter 4. Jones & Kenward (2003) provide a detailed account of how to analyse cross-over trials using t-tests. Woodward (1999) gives a useful summary of the two-sample t-test in Chapter 2, and a more detailed account of its use in analysing cross-over trials in Chapter 7. Zar (1999) gives the formulation of the t-test for simple random sampling, as well as information on the limits of robustness of the test, the unequal variance t-test, and sample size determination. He argues against use of the F-ratio test to test for equality of variances, although it is unclear on what basis the equal or unequal variance version of the test should then be chosen. Bart et al. (1998) look at the two-sample t-test in Chapter 3, again considering how large a sample is required when normality assumptions are not met. They conclude (somewhat controversially) that providing populations have no extreme outliers, sample sizes as low as five are adequate. Underwood (1997) provides extensive coverage of the two-sample t-test for hypothesis testing in Chapters 5 and 6. Contrary to most authors he argues against any use of the unequal variance version of the test.
Zimmerman & Zumbo (2009) and Zimmerman (2004) argue that optimum protection is assured by using a separate-variances test unconditionally whenever sample sizes are unequal. Zimmerman (2004) notes that the equal variance two-sample t-test is not robust to variance heterogeneity for skewed distributions even when sample sizes are equal. Neuhauser (2002) discusses which two-sample tests are appropriate when variances are unequal; see also Moser & Stevens (1992). Conover et al. (1981) compare various tests for homogeneity of variances. Markowski & Markowski (1990) conclude that the F-test is not an effective preliminary test of homogeneity of variances.
Chuang (2002), Kerry & Bland (1998), Bland & Kerry (1998), Yudkin & Moher (2001), and Bennett et al. (2002) look at the use of the weighted t-test for comparison of means derived from cluster randomized clinical trials. Donner (1993) addresses the same issues in the veterinary context, where the litter is the unit of analysis.
Diaz-Uriarte (2002) looks at how a series of two-sample t-tests can be used to correctly analyse data from cross-over trials. Johnson (1995) and Smith (1995) take issue with Potvin & Roff (1993) over their general advocacy of non-parametric tests. Stewart-Oaten et al. (1986) look at the use of the two-sample t-test for analysing 'before and after control impact' studies.