Statistics courses, especially for biologists, assume formulae = understanding and teach how to do
statistics, but largely ignore what those procedures assume,
and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...
Use and Misuse
The paired t-test is widely used in all disciplines. We include a few examples of its correct use - and rather more of its misuse. One example of the latter is using the paired t-test to test for equality between two measurement techniques. The test may tell you whether the means are significantly different - but it cannot test for equality! A non-significant difference may simply reflect low power. Another common misuse is to set up a parallel group randomized trial, yet test for a treatment effect with paired t-tests on a before-treatment versus after-treatment basis. This converts a (strong inference) randomized trial to a (weak inference) observational study. It is much better to correct for differences in baseline values between treatment groups in other ways. Use of the paired t-test for observational before-after studies has other pitfalls, including confounding variables and regression to the mean because of measurement error.
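Regression to the mean is easy to underestimate, so a small simulation may help. The sketch below is illustrative only: the population values, the enrolment cut-off of 110 and the helper `paired_t` are all assumptions made for the example, not taken from any study. No treatment is applied at all, yet units enrolled because of a high first measurement show a large apparent 'improvement'.

```python
import math
import random
import statistics


def paired_t(before, after):
    """Return (mean difference, t statistic, degrees of freedom) for paired data."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_d = statistics.fmean(diffs)
    se_d = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return mean_d, mean_d / se_d, n - 1


rng = random.Random(1)  # fixed seed so the sketch is reproducible

# Hypothetical population: stable true values, noisy measurements, NO treatment.
true_values = [rng.gauss(100, 10) for _ in range(10_000)]
first = [t + rng.gauss(0, 10) for t in true_values]
second = [t + rng.gauss(0, 10) for t in true_values]

# Enrol only units whose first measurement was high (assumed cut-off of 110).
enrolled = [(f, s) for f, s in zip(first, second) if f > 110]
before = [f for f, _ in enrolled]
after = [s for _, s in enrolled]

mean_change, t_stat, df = paired_t(before, after)
print(f"enrolled = {len(enrolled)}, mean change = {mean_change:.2f}, "
      f"t = {t_stat:.1f}, df = {df}")
```

The mean change comes out strongly negative purely because extreme first readings contain, on average, a positive measurement error that is not repeated. A randomized control group selected in the same way would show the same drift, which is why it is needed.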
Inadequate sample size and consequent low power are common problems. This is often 'resolved' by pseudoreplication - for example, using corresponding quarters of the pre- and post-intervention years rather than a single measure for each village; using the same participants to do multiple before and after studies; and using repeated measures over time as paired replicates. Perhaps a more forgivable error is to use paired samples when the pairing is unjustified. If the pairing accounts for too little variation, one simply loses power as a result (pairing halves the degrees of freedom), especially with small sample sizes. The other downside of pairing is the risk of contamination between units.
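To see why the village example counts as pseudoreplication, compare the two analyses on made-up numbers. The sketch below is hypothetical throughout: six villages, four quarterly after-minus-before differences each, with values invented so that most of the variation lies between villages. The helper `one_sample_t` simply computes the paired t statistic from a list of differences.

```python
import math
import statistics


def one_sample_t(diffs):
    """Paired t statistic and degrees of freedom from a list of differences."""
    n = len(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.fmean(diffs) / se, n - 1


# Hypothetical after-minus-before differences: 6 villages x 4 quarters.
village_effect = [4.0, -2.0, 3.0, -1.0, 5.0, -3.0]
quarter_noise = [0.5, -0.5, 0.25, -0.25]
quarterly = [[v + q for q in quarter_noise] for v in village_effect]

# Pseudoreplicated analysis: every quarter treated as an independent pair.
pooled = [d for village in quarterly for d in village]
t_pseudo, df_pseudo = one_sample_t(pooled)

# Analysis at the level of the true replicate: one mean difference per village.
village_means = [statistics.fmean(village) for village in quarterly]
t_village, df_village = one_sample_t(village_means)

print(f"quarters as replicates: t = {t_pseudo:.2f}, df = {df_pseudo}")
print(f"villages as replicates: t = {t_village:.2f}, df = {df_village}")
```

The mean difference is identical in both analyses. Only the apparent precision changes: the t statistic roughly doubles and the degrees of freedom rise from 5 to 23, even though no new villages were observed.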
We come last to the specific requirements for the paired t-test. The first is that observations are obtained either by probability sampling or random allocation. Comparison of two groups of convenience-sampled units is meaningless, yet still widely done. Fortunately, random allocation is becoming more common in experimental work. As for normality of the mean difference, it is true that the t-test is fairly robust to departures from it, except when there are large numbers of zeros. But the 'robustness' of the t-test is often pushed beyond all reasonable limits. If cluster sampling or cluster randomization is being used and the number of units varies between clusters, it is important to use the correct weighted standard error in the t-test. Lastly, there is the practice of carrying out multiple comparisons using t-tests. The t-test should only be used for pairwise comparisons, with other approaches used for multiple comparisons.
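The cost of running many t-tests can be shown with simple arithmetic. The sketch below assumes, purely for illustration, five groups compared in all possible pairs at a 5% significance level, and treats the tests as independent. That is only an approximation, because pairwise comparisons share groups, but it gives the right order of magnitude. The function names are hypothetical.

```python
def n_pairwise(k):
    """Number of distinct pairwise comparisons among k groups."""
    return k * (k - 1) // 2


def familywise_error(m, alpha=0.05):
    """Chance of at least one false positive in m tests (independence assumed)."""
    return 1 - (1 - alpha) ** m


k_groups = 5                      # assumed number of groups
m = n_pairwise(k_groups)          # 10 comparisons
fwer = familywise_error(m)        # roughly 0.40 rather than 0.05
bonferroni_alpha = 0.05 / m       # per-test level under a Bonferroni correction

print(f"{m} comparisons, familywise error rate = {fwer:.3f}, "
      f"Bonferroni per-test alpha = {bonferroni_alpha:.4f}")
```

With ten uncorrected tests, the chance of at least one spurious 'significant' result is about 40%. The Bonferroni adjustment shown is one simple remedy among several.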
What the statisticians say
Armitage & Berry (2002) 
cover the paired t-test in Chapter 4.
Bart et al. (1998) 
provide a useful account of the analysis of paired data and partially paired data in Chapter 3, along with an assessment of how large a sample size is needed for skewed distributions to be normalized.
Zar (1999) 
covers paired designs in Chapter 9. Use of the paired t-test is discussed, along with a test for difference between variances from two correlated populations (although note that the paired t-test does not assume equality of variances).
Underwood (1997) 
covers paired comparisons in Chapter 6 and emphasizes the observational nature of before-after studies.
Wright (2006)
examines why the paired t-test and ANCOVA can produce different results when comparing groups in a before-after design, the so-called Lord's paradox. Tu et al. (2008)
and Wainer (1991)
give more on Lord's paradox. Menke & Martinez (2004)
describe how a permutation test can be used to provide an exact test for the two-sample paired situation, rather than relying on Student's t-distribution.
Zimmerman (2005)
notes that power in paired-samples designs can be improved by correcting the two-sample test for correlation rather than using a paired t-test. Zimmerman (1997, 2004)
stresses the importance of taking non-independence of samples into account even when the correlation is small. Box (1987)
provides a fascinating account of the development of Student's t-test by W.S. Gosset working in the Guinness brewery.
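As a rough illustration of the permutation idea mentioned above, here is a generic exact sign-flip randomization test for paired differences. It is offered as an assumption about the general kind of procedure, not as a reproduction of the method in Menke & Martinez (2004). The function name and the eight differences are invented for the example.

```python
from itertools import product


def sign_flip_p_value(diffs):
    """Exact two-sided p-value from all 2**n sign assignments of the differences.

    Under the null hypothesis each paired difference is equally likely to be
    positive or negative, so every sign pattern is equally probable.
    """
    observed = abs(sum(diffs))
    n_extreme = 0
    n_total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for floating-point ties
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total


# Hypothetical paired differences (after minus before) for 8 units.
diffs = [1.2, 0.8, 1.5, 0.3, 2.1, 0.9, 1.1, 0.4]
p_value = sign_flip_p_value(diffs)
print(f"exact two-sided p = {p_value:.4f}")
```

Because all eight differences are positive, only the all-positive and all-negative sign patterns are as extreme as the observed one, giving p = 2/256. Full enumeration is feasible only for small samples. Larger ones would need random sampling of sign patterns.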
Bennett et al. (2002)
review the use of both the paired and unpaired t-test in cluster randomized designs with reference to previous papers on the topic. These include Klar & Donner (1997)
who advocate the use of stratified designs with more than two clusters in each stratum and Diehr et al. (1995)
who advocate performing an unpaired analysis on paired data. Other contributions on the topic include Donner (1987)
and Donner & Donald (1982).
Diaz-Uriarte (2002)
examines the incorrect use of the paired t-test to analyze crossover trials in animal behaviour research. Burridge & Robins (2000)
compare paired designs (analysed with the paired t-test) with the Latin square design for assessing the performance of bycatch reduction devices in fisheries research.
Arthur et al. (1996)
look at use of the paired t-test for assessing habitat selection when availability changes, whilst Horton (1995)
reviews use of the paired t-test for analyzing paired-choice assays.
Wikipedia has sections on the paired difference test
and Student's t-test.
The NIST/SEMATECH e-Handbook of Statistical Methods
covers analysis of paired observations. GraphPad
have a useful section on interpreting the paired t-test.