"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



Null hypothesis significance testing: Use & misuse

(P = 0.05 syndrome, naked P-values, proving the null hypothesis, statistical significance, biological importance, one and two-tailed tests)

Statistics courses, especially for biologists, assume formulae = understanding and teach how to do statistics, but largely ignore what those procedures assume, and how their results mislead when those assumptions are unreasonable. The resulting misuse is, shall we say, predictable...

Use and Misuse

A null hypothesis significance test estimates how consistent an observed statistic is with a hypothetical population of similarly obtained statistics - known as the test, or 'null', distribution.

Note that, when the test statistic is calculated from more than one sample, constructing the null distribution commonly assumes all the samples were randomly selected from the same (fixed) population of values - given which the null hypothesis might be better described as a nil hypothesis, and the null distribution as a nil distribution. That assumption is seldom plausible for real data. Testing for a difference between means where the samples differ in other obvious ways (such as skew) upsets conventional t-tests and ANOVA but, since authors seldom show how their individual sample data are distributed, we cannot confirm this.

The null hypothesis assumes the statistic's observed value is randomly selected from that null distribution. The further the observed statistic diverges from that test population's median, the less compatible it is with that null population, and the less probable it is that such a divergent statistic would be obtained by simple chance. That compatibility is quantified as a P-value - where a low P-value indicates your observed statistic is an extreme quantile of the distribution it is being tested against. Note that a P-value is not the probability the alternative hypothesis is true, nor is it the probability the null hypothesis is false!
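The logic above - pool the observations as if they came from one fixed population, build the nil distribution by resampling, and locate the observed statistic within it - can be sketched as a simple permutation test. The data below are invented purely for illustration:

```python
import random
from statistics import mean

def permutation_test(a, b, n_perm=10_000, seed=1):
    """Two-sided permutation test for a difference between two means.

    Builds the null ('nil') distribution by repeatedly pooling and
    re-splitting the observations - i.e. behaving as if both samples
    were randomly selected from the same fixed population - then asks
    how extreme the observed difference is within that distribution.
    """
    random.seed(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    count = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if abs(diff) >= abs(observed):
            count += 1
    # P-value: the proportion of null-distribution statistics at least
    # as extreme as the one observed (with a +1 continuity adjustment)
    return (count + 1) / (n_perm + 1)

a = [4.1, 5.2, 6.3, 5.8, 4.9]   # hypothetical sample 1
b = [3.0, 3.6, 4.2, 3.3, 3.9]   # hypothetical sample 2
p = permutation_test(a, b)
```

Note the P-value is simply a quantile of the nil distribution: it says how surprising the observed difference would be if the pooling assumption held, and nothing about the probability of either hypothesis.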

Disciplines differ greatly in their adherence (or otherwise) to null hypothesis significance testing. Medical statisticians have long campaigned against only quoting P-values, and most medical journals now advocate the use of effect sizes with confidence intervals, with or without the associated P-value. Outside of medical research it is still usually a matter of "P-values rule OK", although in ecological journals information criteria have been used much more extensively in recent years. Most disciplines seem to have firmly decided to stick with the 0.05 level to define 'significance', despite its arbitrary nature. This often degenerates to the P = 0.05 syndrome, where a P-value of 0.049 is regarded as 'significant' and hence worthy of reporting, and a P-value of 0.051 is 'not significant' and can therefore be ignored. Aside from a select few journals, completely naked P-values (without any estimate of effect size or precision) are (fortunately) now quite rare, but it is still common to find only means with standard errors presented, rather than the treatment effect (difference between the means) with its attached confidence interval.
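Reporting the treatment effect with its attached confidence interval, rather than a naked P-value, is straightforward. This sketch uses invented data and a large-sample normal approximation (a t-based interval would be a little wider for samples this small):

```python
import math
from statistics import NormalDist, mean, stdev

def effect_with_ci(a, b, level=0.95):
    """Treatment effect (difference between means) with a confidence
    interval, using a large-sample normal approximation.

    Unlike a bare P-value, this reports both the size of the effect
    and the precision with which it has been estimated.
    """
    d = mean(a) - mean(b)
    # standard error of the difference (unequal variances allowed)
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = NormalDist().inv_cdf(0.5 + level / 2)   # e.g. 1.96 for 95%
    return d, (d - z * se, d + z * se)

a = [4.1, 5.2, 6.3, 5.8, 4.9]   # hypothetical treatment sample
b = [3.0, 3.6, 4.2, 3.3, 3.9]   # hypothetical control sample
d, (lo, hi) = effect_with_ci(a, b)
```

If the interval excludes zero the result is 'significant' at the corresponding level, but the interval also shows how large (or trivially small) the effect could plausibly be - information a P-value alone cannot convey.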

One common misuse of null hypothesis significance testing is to take a high P-value as proof of the null hypothesis, or of a zero treatment effect. We give several examples of this, including the assessment of distance from the river as a risk factor for a disease. You can never prove the null hypothesis by carrying out a significance test! This misuse occurs even when sample sizes are very small, which invariably leads to unsafe conclusions. It is also a misuse to quote a low (significant) P-value based on a very small sample, simply because it is impossible to argue that such samples can be representative of the populations from which they were drawn. Another very common misuse is to carry out a test but ignore all the assumptions of that test - especially with regard to independence of observations.
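A short simulation makes the first misuse concrete. Here two populations genuinely differ, yet with tiny samples most tests return P ≥ 0.05 - so a high P-value mainly reflects low power, not a true null. (The figures are invented, and the normal-reference P-value is an illustrative stand-in for a proper Welch's t-test.)

```python
import math
import random
from statistics import NormalDist, mean, stdev

def approx_p(a, b):
    """Approximate two-sided P-value for a difference between means,
    using a normal reference distribution (illustrative only; a real
    analysis of samples this small would use Welch's t-test)."""
    se = math.sqrt(stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b))
    z = abs(mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(z))

random.seed(42)
n, trials = 5, 2000
nonsig = 0
for _ in range(trials):
    a = [random.gauss(0.0, 1.0) for _ in range(n)]
    b = [random.gauss(0.8, 1.0) for _ in range(n)]  # a real 0.8 SD effect
    if approx_p(a, b) >= 0.05:
        nonsig += 1
nonsig_rate = nonsig / trials   # the majority of trials 'miss' the effect
```

Since well over half of these underpowered tests are 'not significant' despite a genuine and quite substantial difference, a single high P-value from a small sample plainly cannot prove the null hypothesis.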

Statistical significance is often confused with practical (biological) importance. We give an example of significant differences in the human sex ratio at birth between the USA and Europe - given the host of confounding factors it would be very surprising if the ratios were identical, and with enormous sample sizes a significant difference is inevitable. But whether such a difference has any importance is another matter. However, beware of this being used as an excuse to ignore unwanted or politically unacceptable differences - such as we suspect happened with a study looking at possible adverse effects of rinderpest vaccination. The choice of a one-tailed or two-tailed test continues to be abused, and all too often one gets the impression that a one-tailed test was used simply to 'make it significant'. Lastly, there remains a certain lack of understanding of the general principles of significance testing. Contradictory comments like "there was no significant difference between groups (P < 0.05)" and "the experimental data do not provide evidence to reject the null hypothesis (P < 0.0001)" are not uncommon.
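The point about enormous samples is easy to demonstrate. In this sketch the proportions and sample sizes are invented (they are not the actual USA/Europe sex-ratio figures), but they show how a difference of a fraction of a percentage point becomes highly 'significant' once millions of births are counted:

```python
import math
from statistics import NormalDist

def two_proportion_p(p1, n1, p2, n2):
    """Two-sided z-test for a difference between two proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical proportions of male births (illustrative figures only)
p_usa, p_eur = 0.5120, 0.5135
n = 2_000_000                    # births counted in each region
p = two_proportion_p(p_usa, n, p_eur, n)
diff = p_usa - p_eur             # only 0.15 of a percentage point
```

The test duly declares this minuscule difference 'significant', which says a great deal about the sample size and almost nothing about whether the difference matters biologically.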


What the statisticians say

Harlow et al. (1997) is a multi-authored review of the arguments for and against null hypothesis significance testing. Kline (2004) looks beyond significance testing in behavioural sciences and focuses on confidence intervals for effect sizes. Rothman & Greenland (1997) give a less than enthusiastic review of null hypothesis statistical testing in medical research, and focus more on statistical estimation with confidence intervals and P-value functions. Sokal & Rohlf (1995) and Griffiths et al. (1998) both give traditional introductions to the subject - many will find the latter rather more comprehensible.

In medical research the use of confidence intervals has largely replaced null hypothesis significance testing, but Curran-Everett (2009) provides a recent overview of the practice. Reese (2004) focuses on the 'sacred' 0.05 level, whilst Denis (2003) looks at alternative approaches. Sterne (2002), (2003) calls for significant change in the teaching of hypothesis tests, and questions whether there has been any improvement in the interpretation of significance tests over the last 60 years. Weinberg (2001) makes a spirited call to rehabilitate the politically incorrect P-value, in contrast to Sterne & Smith (2001) who look at what's wrong with it. Ludbrook & Dudley (1998) look at why permutation tests are more appropriate than standard parametric tests in biomedical research.

Cohen (1994) attacks the principles of null hypothesis significance testing in a classic paper entitled 'The earth is round (p < 0.05)'. McPherson (1989) concentrates on the overemphasis of P-values, whilst Goodman & Royall (1980) focus on the need for an evidential approach using likelihood ratios. Carver (1978) reviews some of the earlier critiques of the null hypothesis significance test including those by Rozeboom (1960), and concludes by recommending the abandonment of all statistical testing.

In other disciplines null hypothesis significance testing still has its strong adherents. Hurlbert & Lombardi (2009) and Robinson & Wainer (2002) promote a neo-Fisherian agenda with other approaches used as an adjunct. However, Nakagawa & Cuthill (2007) argue that all researchers (not just the medics) should give effect sizes and confidence intervals. Fidler et al. (2004) compare progress in medicine, psychology and ecology. Di Stefano (2004) advocates a confidence interval approach to data analysis in forestry studies, whilst Guthery et al. (2001) criticize the emphasis on likelihood and information theory as a replacement for null hypothesis testing, and promote instead the idea of multiple research hypotheses. Stephens et al. (2007), (2007), (2005) argue that by marginalizing the use of null-hypothesis testing, ecologists risk rejecting a powerful, informative and well-established analytical tool, a view contested by Lukacs et al. (2007). Fidler et al. (2006) assess how resistant conservation biology has been to improved statistical reporting practices. Hobbs & Hilborn (2006) provide a guide to alternatives to statistical hypothesis testing in ecology.

Anderson et al. (2000) look at the problems of null hypothesis testing in wildlife research, and suggest some alternatives. Stoehr (1999) questions whether significance thresholds are appropriate for the study of animal behaviour. Johnson (1999), (2002) attacks the use of 'naked' P-values and advocates more use of information theoretic approaches. Loehle (1987) explains how confirmation bias and theory tenacity often interfere with testing of alternate hypotheses. A rather indecisive position on the issue of one and two-tailed tests is taken by Ruxton & Neuhäuser (2010), in contrast to Lombardi & Hurlbert (2009) who argue against most use of one-tailed tests in science. Rice & Gaines (1994) propose directional tests with an asymmetrical pair of critical regions as one solution to the dilemma of one-sided testing.

Wikipedia provides sections on statistical hypothesis testing, statistical significance, the null hypothesis, Type I and Type II errors, and effect size. The NIST/SEMATECH e-Handbook of Statistical Methods gives a brief account of interval estimation and hypothesis tests. Jerry Dallal has a fun practical demonstration of the Type I error rate. Hubbard & Bayarri highlight the incompatibility of Fisher's evidential P-value with the Type I error rate α of Neyman-Pearson statistical orthodoxy.