InfluentialPoints.com Biology, images, analysis, design... 

"It has long been an axiom of mine that the little things are infinitely the most important" 

The two-sample t-test: Use & misuse

Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...

Use and Misuse

The two-sample t-test is used to compare the means of two independent samples. For the nil hypothesis, the observed t-value is equal to the difference between the two sample means divided by the standard error of the difference between the sample means. If the two population variances can be assumed equal, the standard error of the difference is estimated from the pooled (weighted) variance about the means. If the variances cannot be assumed equal, then the standard error of the difference between means is taken as the square root of the sum of the individual variances, each divided by its sample size. In the latter case the estimated t' statistic must either be tested with modified degrees of freedom, or be tested against different critical values. A weighted t-test must be used if the unit of analysis comprises percentages or means based on different sample sizes.

Use of the t-test assumes that estimates are based on probability sampling or random allocation, that observations are independent, that the means are of measurement variables, and that observations are drawn from normally distributed populations.

The two-sample t-test is probably the most widely used (and misused) statistical test. Comparing means based on convenience sampling or non-random allocation is meaningless. If, for any reason, one is forced to use haphazard rather than probability sampling, then every effort must be made to minimize selection bias. Non-independence of replicates (pseudoreplication) is widespread, and we give examples of the two-sample test being used for paired observations and for observations in a time series.
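To make the difference between these analyses concrete, here is a minimal sketch in plain Python of the equal-variance statistic, the unequal-variance t' statistic, and the paired statistic, all applied to the same data. The data are invented for illustration, the function names are hypothetical, and the Welch-Satterthwaite formula is assumed as one common choice for the "modified degrees of freedom"; the text above does not say which adjustment is intended.

```python
import math
import statistics


def pooled_t(x, y):
    """Equal-variance two-sample t: returns (t, degrees of freedom)."""
    nx, ny = len(x), len(y)
    # Pooled variance: each sample variance weighted by its degrees of freedom.
    pooled_var = ((nx - 1) * statistics.variance(x)
                  + (ny - 1) * statistics.variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled_var * (1 / nx + 1 / ny))
    return (statistics.mean(y) - statistics.mean(x)) / se, nx + ny - 2


def welch_t(x, y):
    """Unequal-variance t': returns (t', Welch-Satterthwaite df) (assumed adjustment)."""
    vx = statistics.variance(x) / len(x)
    vy = statistics.variance(y) / len(y)
    se = math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return (statistics.mean(y) - statistics.mean(x)) / se, df


def paired_t(x, y):
    """Paired t on within-pair differences: returns (t, degrees of freedom)."""
    diffs = [b - a for a, b in zip(x, y)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se, len(diffs) - 1


# Invented data: the same five subjects measured before and after treatment.
before = [10.0, 20.0, 30.0, 40.0, 50.0]
after = [12.0, 21.0, 33.0, 42.0, 52.0]

t_pooled, df_pooled = pooled_t(before, after)
t_welch, df_welch = welch_t(before, after)
t_paired, df_paired = paired_t(before, after)

print(f"two-sample, equal variances:   t  = {t_pooled:.3f}, df = {df_pooled}")
print(f"two-sample, unequal variances: t' = {t_welch:.3f}, df = {df_welch:.2f}")
print(f"paired:                        t  = {t_paired:.3f}, df = {df_paired}")
```

With these made-up values both two-sample statistics are about 0.2, while the paired statistic is about 6.3. The large variation between subjects dominates the two-sample standard error, so which test is applied to paired observations can change the conclusion entirely.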
The same problem arises where conditions post-treatment do not maintain independence of observations, or in cluster randomized trials when the wrong unit of analysis is used.

Another misuse is to use the t-test for an ordinal variable such as a parasite score. Ordinal variables often cannot even approximate to a normal distribution, with the added disadvantage that the arithmetic mean provides an inappropriate measure of location.

Whilst the t-test is relatively robust as regards the normality assumption, this is often taken to extremes. We give examples of where the test is used on small numbers of untransformed proportions, and for data with large numbers of zeroes or other extreme values. There is a strong tendency to rely blindly on the central limit theorem, when often a simple logarithmic transformation would both normalize distributions and homogenize variances.

When it comes to the equality of variances assumption, one should not forget that the F-ratio test also requires the data to be normally distributed, and that use of the unequal variance t-test reduces the power of the test if the variances are indeed equal. We give several examples of where the outcome of the study is reversed if the correct version of the test is used. In general there is far too much focus on differences (or otherwise) of mean values when an increase in variability may be the main effect of treatment.

As with the paired t-test, inadequate sample size and consequent low power are common problems. A result based on a sample size of less than 5 to 10 will never be very convincing, however 'significant' that result may be. The fact that the t-test can be used on very small samples does not justify the use of very small samples unless larger sample sizes are impossible.

The t-test should also not be used for multiple comparisons: carrying out dozens (or hundreds) of t-tests means that Type I errors are inevitable.
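The arithmetic behind that warning can be sketched briefly. Assuming the tests are independent and each is run at the same alpha (a simplification, since real comparisons are often correlated), the chance of at least one Type I error across k tests is 1 - (1 - alpha)^k. The function names below are hypothetical, and the Bonferroni adjustment is shown only as one commonly used way of modifying alpha; the text does not name a specific correction.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one Type I error across n_tests independent tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_alpha(alpha, n_tests):
    """Per-test alpha under a Bonferroni adjustment (one common choice)."""
    return alpha / n_tests


for k in (1, 10, 100):
    unadjusted = familywise_error(0.05, k)
    adjusted = familywise_error(bonferroni_alpha(0.05, k), k)
    print(f"{k:>3} tests: unadjusted {unadjusted:.3f}, Bonferroni {adjusted:.3f}")
```

Under these assumptions, ten tests at alpha = 0.05 already give roughly a 40% chance of at least one false positive, and a hundred tests make one all but certain. Dividing alpha by the number of tests keeps the overall rate just under 0.05, at the cost of power for each individual comparison.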
Using a modified alpha level reduces the problem, but it is usually far better to focus on just a few key comparisons.

We give examples of two of the more specialist uses of the t-test: for analysing crossover designs and randomized cluster trials. In crossover designs, the analysis is only valid if there is no period × treatment interaction. Weighted t-tests are still much underused and should always be utilised if cluster random sampling or allocation is used rather than simple random methods.

What the statisticians say

Armitage & Berry (2002) cover t-tests in Chapter 4. Jones & Kenward (2003) provide a detailed account of how to analyse crossover trials using t-tests. Woodward (1999) gives a useful summary of the two-sample t-test in Chapter 2, and a more detailed account of its use in analysing crossover trials in Chapter 7. Zar (1999) gives the formulation of the t-test for simple random sampling, as well as information on the limits of robustness of the test, the unequal variance t-test, and sample size determination. He argues against use of the F-ratio test to test for equality of variances, although it is unclear on what basis the equal or unequal variance version of the test should then be chosen. Bart et al. (1998) look at the two-sample t-test in Chapter 3, again considering how large a sample is required when normality assumptions are not met. They conclude (somewhat controversially) that, provided populations have no extreme outliers, sample sizes as low as five are adequate. Underwood (1997) provides extensive coverage of the two-sample t-test for hypothesis testing in Chapters 5 and 6. Contrary to most authors, he argues against any use of the unequal variance version of the test.

Zimmerman & Zumbo (2009) and Zimmerman (2004) argue that optimum protection is assured by using a separate-variances test unconditionally whenever sample sizes are unequal.
Zimmerman (2004) notes that the equal variance two-sample t-test is not robust to variance heterogeneity for skewed distributions even when sample sizes are equal. Neuhauser (2002) discusses which two-sample tests are appropriate when variances are unequal. See also Moser & Stevens (1992). Conover et al. (1992) compare various tests for homogeneity of variances. Markowski & Markowski (1990) conclude that the F-test is not an effective preliminary test of homogeneity of variances.

Chuang (2002), Kerry & Bland (1998), Bland & Kerry (1998), Yudkin & Moher (2001), and Bennett et al. (2002) look at the use of the weighted t-test for comparison of means derived from cluster randomized clinical trials. Donner (1993) addresses the same issues in the veterinary situation when the litter is the unit of analysis. Diaz-Uriarte (2002) looks at how a series of two-sample t-tests can be used to correctly analyze data from crossover trials.

Johnson (1995) and Smith (1995) take issue with Potvin and Roff (1993) in their general advocacy of non-parametric tests. Stewart-Oaten et al. (1986) look at the use of the two-sample t-test for analyzing 'before and after control impact' studies. Wikipedia has sections on Student's t-test and the F-test.
