Definitions
A statistical test estimates how consistent an observed statistic is with a hypothetical population of similarly obtained statistics - known as the test, or 'null', distribution.
The further the observed statistic diverges from that test population's median, the less compatible it is with that population, and the less probable it is that such a divergent statistic would be obtained by simple chance. That compatibility is quantified as a P-value - where a low P-value indicates your observed statistic is an extreme quantile of the distribution it is being tested against.
As pointed out in the Quantiles more info page, there are two ways of obtaining a P-value.
- By convention a P-value may be defined as the estimated long-run probability of obtaining a statistic whose value is as extreme as, or more extreme than, the one which was observed.
Conventional P-values work well for statistics representing normal populations - but run into difficulties with small populations or highly discrete statistics.
- Mid P-values define quantiles in terms of the mean rank. In other words they estimate what proportion of the test population are expected to have a more divergent rank than the observed statistic.
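As a rough illustration of the two definitions, the sketch below computes an upper-tail conventional P-value and the corresponding mid-P-value for a binomial statistic. The function names, and the example of 8 successes in 10 trials under a null proportion of 0.5, are assumptions made for illustration and are not taken from the text.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def upper_tail_p_values(k_obs, n, p0):
    """Return (conventional P, mid-P) for the upper tail of a binomial null."""
    more_extreme = sum(binom_pmf(k, n, p0) for k in range(k_obs + 1, n + 1))
    as_extreme = binom_pmf(k_obs, n, p0)
    conventional = more_extreme + as_extreme   # as extreme or more extreme
    mid_p = more_extreme + 0.5 * as_extreme    # ties count for only half
    return conventional, mid_p

# Hypothetical example: 8 successes in 10 trials, null proportion 0.5
conventional, mid_p = upper_tail_p_values(k_obs=8, n=10, p0=0.5)
print(f"conventional P = {conventional:.4f}, mid-P = {mid_p:.4f}")
```

Because the statistic is highly discrete, the two values differ noticeably here. For a statistic with many possible values the tie term becomes small and the two definitions converge.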
A P-value is not the probability the alternate hypothesis is true, nor is it the probability the null hypothesis is false!
When the probability of obtaining such a divergent value is smaller than a predefined value (α, known as the significance level), usually 0.05, the statistic under test can be said to differ 'significantly' from the hypothetical or 'null' population. In that case the null hypothesis may be rejected in favour of an untested, but plausible, alternate hypothesis (H_{A} or H_{1}). Notice that, because α is commonly predefined (a priori), it is a fixed significance level, unlike the P-value, which is an observed significance level. Accordingly α is said to define the rejection region of that statistic's null distribution, whereas 1-α defines the acceptance region.
Notice that, under strict (binary) decision-making rules, some statisticians believe you should not give the observed P-value. Thus, by convention, the observed value of a statistic is only reported when P < 0.05 (hence the actual P-value is redundant) - if, exceptionally, that rule is violated, the result's non-significance must be made explicit. Since this convention causes publication bias, many statisticians now suggest that you instead report all treatment effects along with their precise observed P-values - or confidence limits. The issue is discussed in the core text.
Assumptions
The most crucial, and most frequently violated, assumption is that sampling (or allocation to treatment) is random and observations are independent. This is equally true for hypothesis tests, confidence limits and likelihood comparisons - the key difference between them is not their assumptions but their models.
The second most important assumption is that the statistic chosen, and the test applied to it, tells you something useful about the biological situation you are investigating. Similarly, aside from the treatment effect under test, both the null model and the alternate model are assumed to be plausible.
The null model (H_{0}) comes in one of two forms:
- All samples are assumed to represent the same population, and any statistics calculated from them are assumed to represent a single population. Where there is more than one sample, their statistics are therefore pooled, to obtain the best estimate of that population's median - or, if distributed normally, its population mean.
Because there is no difference between the sample populations, this is sometimes known as the 'nil hypothesis'.
- Less commonly, samples are assumed to represent a different population or populations - but any statistics calculated from them are assumed to represent a single 'test' population.
Under the null hypothesis any difference between the statistic's observed value and its median (or its expected value) is assumed to have arisen by chance. In an experimental setting this observed difference (d) is known as the 'treatment effect'. The hypothetical null population distribution is estimated using a null model, constructed under the null hypothesis, which predicts how estimates of the treatment effect vary due to chance. In most tests the null hypothesis assumes the true treatment effect (δ) is zero. Irrespective of what value of δ is used to construct the null model, that value is the parameter under test.
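One way to construct such a null model, sketched below, is to enumerate every possible allocation of the observed units to treatment and control, and to record the difference in means each allocation would have produced if δ = 0. This randomization approach is only one of many possible null models and is chosen here as an assumption. The data values and the function name are hypothetical.

```python
from itertools import combinations
from statistics import mean

def randomization_distribution(treated, control):
    """All differences in means (treated - control) obtainable by re-allocating units."""
    pooled = treated + control
    diffs = []
    for idx in combinations(range(len(pooled)), len(treated)):
        chosen = set(idx)
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        diffs.append(mean(group_a) - mean(group_b))
    return diffs

treated = [12.0, 14.0, 15.0]   # hypothetical responses, not from the text
control = [8.0, 9.0, 11.0]
d_obs = mean(treated) - mean(control)            # observed treatment effect d
null_diffs = randomization_distribution(treated, control)

# Proportion of allocations giving a difference at least as large as d_obs
p_upper = sum(d >= d_obs - 1e-12 for d in null_diffs) / len(null_diffs)
print(f"d = {d_obs:.3f}, upper-tail P = {p_upper:.3f}")
```

With three units per group there are only 20 possible allocations, so the smallest attainable upper-tail P-value is 1/20. This is one reason small samples yield highly discrete null distributions.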
There are several types of alternate hypothesis:
- Any difference between the observed treatment effect and that expected under the null hypothesis is not due to chance. In which case, the true value of d ≠ δ.
To accept this hypothesis you must allow for the probability of both positive and negative differences - the test is therefore known as a 'two-tailed test'.
- A one tailed test assumes the true difference is positive, and any negative difference that is observed is due to chance - or if testing the opposite tail, that the converse is true. In other words, if testing the upper tail H_{0} is only rejected when d > δ or, for the lower tail, when d < δ.
If a one-tailed test is employed it is assumed the treatment effect, for some inherent and clear reason, can only be positive - or be negative.
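The relationship between the tails can be sketched for the simple case where the standardized statistic z = (d - δ) / SE has a standard normal null distribution. That distributional form, the function name and the value z = 1.8 are assumptions made for illustration.

```python
from math import erfc, sqrt

def normal_tail_p_values(z):
    """Return (upper-tail, lower-tail, two-tailed) P-values for a standard normal z."""
    upper = 0.5 * erfc(z / sqrt(2))     # P(Z >= z), tests d > delta
    lower = 0.5 * erfc(-z / sqrt(2))    # P(Z <= z), tests d < delta
    two_tailed = min(1.0, 2 * min(upper, lower))
    return upper, lower, two_tailed

# Hypothetical standardized treatment effect
upper, lower, two_tailed = normal_tail_p_values(1.8)
print(f"upper = {upper:.4f}, lower = {lower:.4f}, two-tailed = {two_tailed:.4f}")
```

For a symmetric null distribution the two-tailed P-value is twice the smaller one-tailed value, so the choice of tail can by itself move a result across the 0.05 boundary, as it does for this value of z.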
Conventional P-values assume the statistic under test represents a continuous distribution. Mid-P-values do not make that assumption - and, when applied to samples of a genuinely continuous distribution, yield identical results to conventional P-values. When conventional P-values are used for discrete statistics they yield conservative inferences - even then mid-P-values are seldom used, and correspondingly controversial.
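The conservatism of conventional P-values for discrete statistics can be checked directly by computing how often a true null hypothesis would be rejected. The sketch below does this for an upper-tail binomial test. The choice of n = 10, a null proportion of 0.5 and α = 0.05 is an assumption made for illustration, as is the function name.

```python
from math import comb

def rejection_rate(n, p0, alpha, use_mid_p):
    """Chance of rejecting a true binomial null with an upper-tail test at level alpha."""
    pmf = [comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(n + 1)]
    weight = 0.5 if use_mid_p else 1.0
    rate = 0.0
    for k in range(n + 1):
        p_value = sum(pmf[k + 1:]) + weight * pmf[k]
        if p_value <= alpha:
            rate += pmf[k]          # this outcome leads to rejection
    return rate

conventional_rate = rejection_rate(n=10, p0=0.5, alpha=0.05, use_mid_p=False)
mid_p_rate = rejection_rate(n=10, p0=0.5, alpha=0.05, use_mid_p=True)
print(f"conventional: {conventional_rate:.4f}, mid-P: {mid_p_rate:.4f}")
```

In this example the conventional test rejects a true null hypothesis in only about 1% of trials, well below the nominal 5%, whereas the mid-P test rejects in about 5.5% of trials, slightly above it. That small excess is part of why mid-P-values remain controversial.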
Pros and cons of significance tests
Advantages
- They provide a logical framework for hypothesis testing in biology
Much fundamental research progresses by testing hypotheses, rather than simply estimating the magnitude of treatment effects. Conventional significance tests provide a logical framework for hypothesis testing.
- They provide an accepted convention for statistical analysis
It is valuable to have a common approach across different disciplines for analysing data and testing hypotheses. For example, in the field of epidemiology it has been argued that epidemiologists need to agree by consensus on prespecified criteria so that the basis for decisions is explicit. The conventions of significance testing (such as the 0.05 level for significance) then provide a reasonable basis for facilitating scientific decision making. Null hypothesis significance tests are still widely used, and are often insisted upon by referees and journal editors.
- The techniques are tried and tested
Appropriate tests have been devised for a variety of statistics, statistical techniques and statistical models - including many 'pre-cooked' experimental and sampling designs. Formulae and software packages are readily available, as is copious documentation.
- The alternative hypothesis can be rather vague
Although the null model has to be specified with some care, the alternate model can be relatively hazy. This has its down side as well, but some may see it as a plus.
- They reflect the same underlying statistical reasoning as confidence intervals
Significance tests and confidence intervals are in fact based on exactly the same underlying theory. Tests not only shed light upon confidence intervals, but also enable some of the more awkward ones to be estimated.
Disadvantages
- They are commonly misunderstood and misinterpreted
The main misinterpretations are:
- A high value of P is taken as evidence in favour of the null hypothesis, or worse as proof of the null hypothesis. This is wrong because the P-value is not equal to the probability that the null hypothesis is true. It is only a measure of the degree of consistency of the data with the null hypothesis - and a very poor measure at that if the sample size is small!
- A low value of P is taken as evidence in favour of the alternative hypothesis, or worse as proof of the alternative hypothesis. This is also wrong because the P-value does not tell you anything directly about your chosen alternative hypothesis. It only tells us about the degree of consistency of the data with the null hypothesis. In many situations there may be other alternative hypotheses that you have not considered. As above there is also the problem of reliability if the sample size is small.
- If in one trial the null hypothesis is rejected at P = 0.05, it is thought that repeating the experiment many times will produce a significant result on 95% of occasions. Again this is wrong and is known as the 'replication fallacy'. In fact for the usual levels of power in ecological and veterinary research (< 0.5), repetition is unlikely to produce a significant result on even 50% of occasions.
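The replication fallacy can be illustrated with a short calculation. The sketch below assumes a normally distributed test statistic with unit standard error, and (optimistically) that the true standardized effect is exactly the one observed in the first trial. Both assumptions, and the function name, are introduced here for illustration.

```python
from statistics import NormalDist

def replication_probability(p_first, alpha=0.05):
    """Chance that an identical replicate gives two-tailed P < alpha, assuming the
    true standardized effect equals the effect observed in the first trial."""
    std_normal = NormalDist()
    z_first = std_normal.inv_cdf(1 - p_first / 2)   # |z| implied by the first P-value
    z_crit = std_normal.inv_cdf(1 - alpha / 2)      # critical |z| at level alpha
    # Replicate statistic ~ Normal(z_first, 1); significant if it lands in either tail
    upper = 1 - std_normal.cdf(z_crit - z_first)
    lower = std_normal.cdf(-z_crit - z_first)
    return upper + lower

prob = replication_probability(0.05)
print(f"chance a replicate is significant: {prob:.3f}")
```

Under these assumptions a first result of P = 0.05 implies that a replicate has only about an even chance of being significant, not 95%. If the first trial overestimated the true effect, the chance is lower still.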
- Use of a rigid 0.05 level forces a false dichotomy into significant or not significant.
The P = 0.05 syndrome is characterized by a slavish adherence to comparing a P-value - that is subject to sampling error like any other statistic - to a fixed significance level - that is entirely arbitrary. If the sample size is small, the null hypothesis is accepted too readily. If the sample size is large, then biologically unimportant differences are accepted as significant. Nester (1996) commented that because (most) biologists always want important differences to be significant and unimportant differences to be non-significant, the biologist is therefore reduced to one of the following states of mind:
How biologists view significance tests (columns give the statistical significance of the difference)

| Importance of observed difference | Not significant | Significant |
| --- | --- | --- |
| Not important | Happy | Annoyed |
| Important | Frustrated | Elated |
What the biologist should be doing instead is interpreting the result in the light of the experimental design (designs differ in the strength of inference possible from the results) and other research results. In other words they should be thinking about their results!
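The dependence of significance on sample size, rather than on importance, can be sketched with a two-sample z-test. The known common standard deviation, the effect sizes, the group sizes and the function name below are all assumptions chosen for illustration.

```python
from math import erfc, sqrt

def two_sample_p(diff, sd, n_per_group):
    """Two-tailed P for a difference in means, assuming a known common SD (z-test)."""
    std_error = sd * sqrt(2 / n_per_group)
    z = diff / std_error
    return erfc(abs(z) / sqrt(2))

# A difference of one whole SD, but only 5 units per group
p_small_n = two_sample_p(diff=1.0, sd=1.0, n_per_group=5)
# A difference of one twentieth of an SD, with 10000 units per group
p_large_n = two_sample_p(diff=0.05, sd=1.0, n_per_group=10000)
print(f"large effect, small n: P = {p_small_n:.3f}")
print(f"tiny effect, large n:  P = {p_large_n:.5f}")
```

Here the large difference is 'not significant' and the tiny one is 'significant' at the 0.05 level, which corresponds to the 'Frustrated' and 'Annoyed' cells of the table.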
- The P-value is uninformative compared to the confidence interval
Most journals now balk at accepting 'naked' P-values - in other words where neither the size of the effect, nor its precision are specified. There is a strong case to be made for always estimating the magnitude and the precision of the effect (using the confidence interval or better still the P-value function) along with the precise P-value. Confidence intervals are not fundamentally different from P-values - but they do provide useful additional information.
Unfortunately many (if not most) researchers who use confidence intervals only see them as surrogate null hypothesis significance tests. In other words, if the interval overlaps zero (for a difference) or one (for a ratio), then the effect is dismissed as non-significant - and one is no further forward in a rational approach to evidence. Confidence intervals should be seen as providing additional information on which to base your inferences and conclusions.
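Reporting the effect, its precision and the precise P-value together might look something like the sketch below, which uses a normal approximation. The estimate of 2.0, the standard error of 1.1 and the function name are hypothetical values introduced for illustration.

```python
from statistics import NormalDist

def effect_summary(estimate, std_error, null_value=0.0, level=0.95):
    """Return (lower limit, upper limit, two-tailed P) using a normal approximation."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(0.5 + level / 2)
    lower = estimate - z_crit * std_error
    upper = estimate + z_crit * std_error
    z = (estimate - null_value) / std_error
    p_value = 2 * (1 - std_normal.cdf(abs(z)))
    return lower, upper, p_value

# Hypothetical difference between two treatment means and its standard error
lower, upper, p_value = effect_summary(estimate=2.0, std_error=1.1)
print(f"difference = 2.0, 95% limits = ({lower:.2f}, {upper:.2f}), P = {p_value:.3f}")
```

In this example the interval includes zero, so a surrogate significance test would dismiss the effect. The interval also includes differences roughly twice the size of the estimate, so the data are compatible with a substantial effect as well as with none, which is more than 'not significant' conveys.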
- The null hypothesis is nearly always false
It is true to say that nearly all null hypotheses are false on a priori grounds, at least for measurement variables. For example, if we are putting fertilizer on a crop, it would be very surprising if that fertilizer had no effect at all. In this situation we have no interest in disproving the null hypothesis that fertilizer has no effect - what we are interested in is the size of the effect.
Unfortunately the journals are full of tests based on obviously false null hypotheses, such as one spotted by Johnson (1999): "the density of large trees was greater in unlogged forest stands than in logged stands (P = 0.02)". It would indeed be truly amazing if there were no difference! If one looks at a null hypothesis and says - this cannot possibly be true - then it is probably a waste of time trying to disprove it.
- The one-tailed or two-tailed decision is profoundly subjective
Nearly all statistics textbooks give the usual bland explanation that whether one uses a one-tailed or two-tailed test depends on your initial hypotheses. If a difference is only considered possible in one direction, then it is considered legitimate to use a one-tailed test, which halves the P-value. But in practice it seems that one-tailed tests are mainly used to push the P-value below the magic 0.05 level. These issues are generally avoided in medical research by a (largely unspoken) convention to always use two-tailed tests. But occasionally, for good reason, that convention is broken - which invariably leads to disputes in journals.
- P-values take no account of any hypothesis other than the null.
This is the first of the more fundamental objections to significance tests. If an observation is rare under the null hypothesis, does it necessarily mean we should accept the alternative? Improbable events do happen - people do actually win the lottery on occasions. Do we therefore assume that the lottery has been 'fixed' because an improbable event has happened? Well, no - but if Tony wins the lottery, and we know that Tony's brother runs the lottery, we might feel differently. Now we have a viable alternative hypothesis.
The problem with P-values is that they take no account of any hypothesis other than the null. In other words, only negative non-relative evidence is being used to evaluate evidence. The philosopher Karl Popper might have supported this approach - but many others disagree!
- P-values include all values more extreme than the observed result
When we work out a P-value we are not just asking how unlikely this result is - we are asking how unlikely a result as extreme as, or more extreme than, this result is. But as we have never seen such results, we just have to imagine they exist. Some statisticians argue that P-values therefore overstate the degree of conflict with the null hypothesis. Others disagree, but it remains a controversial aspect of P-values.
- The null distribution of the test statistic may not match the actual sampling distribution of the test statistic
Calculation of the P-value assumes that the null distribution of the test statistic closely matches the actual sampling distribution of the test statistic. Whilst this may be true of some randomized experiments, it is likely to be much less true in observational studies - where all sorts of confounding factors are liable to be operating. Indeed some statisticians argue that significance tests should not be applied in observational studies at all!
Synthesis
- Null hypothesis significance testing will undoubtedly continue to play a role for many years to come, especially where it is being used to provide a logical framework for hypothesis testing. However, great care should be taken not to misinterpret the results of a test.
- The arbitrary 0.05 significance level has no place in science. It should be replaced by interpretation of precise P-values in the light of sample size, design and previous knowledge.
- Confidence intervals should always be given for the effect size, or better still P-value functions - which despite having been proposed years ago are still scarcely used in any discipline.
- Other approaches including likelihood ratios, Bayesian techniques and (especially) information criteria will continue to gain ground, especially for deciding between alternative models. But there is no 'perfect' alternative to significance tests - the important thing is for scientists to really think about their results, and what is being assumed when analyzing them.
Related topics:
Parametric or nonparametric
Permutation tests
The Z test