"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

P-values of tests

When the outcomes of statistical tests are reported, they are commonly, though not uniformly, accompanied by a P-value: for example, "Flagwaggit's test was significant, P < 0.01", "the odds-ratio was not significant, P = 0.51", or "chi-square was not significant". In the last case, by current convention, the P-value is assumed to have been greater than 0.05.

At first sight these statements would seem to have nothing to do with the Pth quantiles of samples, or population quantiles, discussed on the quantiles More Information page. Nevertheless they do, because exactly the same principles and problems apply to the P-values of statistical tests.


When calculated for a sample, the value of a quantile is either descriptive - or, if the sample was taken at random from some larger 'population' of values, a sample quantile (such as the median, or the upper quartile) provides an estimate of the corresponding quantile of that population.

    If you are uncertain as to what 'population' it is your data might represent, think of it as a much larger set of (potential) observations - a superset, if you like - of which your sample is (hopefully) a representative subset.

The essential difference between the P-values of a sample and the P-values that accompany statistical tests is that the latter refer to the estimated distribution of the statistic under test. Thus a "chi-square test" compares the observed value of its statistic to a "chi-square" distribution, and Flagwaggit's test compares Flagwaggit's statistic to its distribution - and (nearly always) both of those distributions will be estimated from the data at hand, in order to test some hypothesis.

In other words, the P-value of a test is an estimate of the corresponding P-value of whatever population of values the statistic under test was assumed to represent - assuming the hypothesis under test was correct, or approximately so. Specifically, these P-values tell you what proportion of results would yield statistics as extreme as, or more extreme than, the observed value of that test's statistic. For a conventional 1-tailed test, the 'critical value' (P = 0.05) is usually the 95% quantile. A 2-tailed test has 2 critical values, the 2.5% and 97.5% quantiles, which enclose 95% of the statistic's distribution.
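The idea of comparing an observed statistic with the quantiles of its estimated distribution can be sketched as follows. This is a minimal illustration only: it uses a simulated normal distribution as a stand-in for a test statistic's distribution, not the machinery of any particular test.

```python
import random

random.seed(1)

# Stand-in for a test statistic's estimated distribution: here we simply
# simulate 100,000 standard-normal values (a real test would use the
# distribution of its own statistic, e.g. chi-square).
ref = sorted(random.gauss(0, 1) for _ in range(100_000))

def quantile(sorted_vals, p):
    """The p-th quantile of a sorted list (simple rank-based method)."""
    return sorted_vals[min(int(p * len(sorted_vals)), len(sorted_vals) - 1)]

# One-tailed critical value (P = 0.05): the 95% quantile
crit_one_tail = quantile(ref, 0.95)

# Two-tailed critical values: the 2.5% and 97.5% quantiles,
# which enclose 95% of the statistic's distribution
crit_lower, crit_upper = quantile(ref, 0.025), quantile(ref, 0.975)

# One-tailed P-value of an observed statistic: the proportion of the
# distribution as extreme as, or more extreme than, the observation
observed = 1.9
p_value = sum(v >= observed for v in ref) / len(ref)
```

With a standard-normal stand-in, `crit_one_tail` comes out near 1.645 and the two-tailed critical values near -1.96 and +1.96, as expected.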


If this sounds like gibberish, do not worry for now. We explore the detailed reasoning, assumptions, properties and problems of statistical tests in Unit 5. At this juncture, the crucial point to bear in mind is that, because P-values arise from a branch of maths called "set theory", in which no two values of a "set" can be identical, conventional P-values can be horribly misleading when applied to discrete, heavily-tied variables (which have many identical values). This is why we pay so much attention to ranks, and rank-based quantiles.

Therefore, whilst thinking about how to calculate quantiles may seem wholly academic and without merit, not understanding their properties can have serious and very practical consequences. Moreover, because the problem and its solutions are controversial, many statistics textbooks prefer to ignore them. We do not.


One approach, instead of applying conventional P-values, is to use what are known as mid-P-values. To understand the difference between these two types of P-value, let us consider a value t, which is a member of a collection of such values, called T.

    Now, assuming t is an 'extreme' quantile of T, you can quantify how extreme it is using a P-value.
  • A conventional P-value is the proportion of T that is more extreme than t, plus the proportion of T that is as extreme as t.
  • A mid-P-value is the proportion of T that is more extreme than t, plus half the proportion of T that is as extreme as t.
    If you prefer to think in terms of ranks rather than proportions: if T has N different values, and t has rank R when those values are ranked from the most extreme inwards, the conventional P-value is R/N and the mid-P-value is R/N - 0.5/N.
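The two definitions above can be sketched directly as code. The distribution below is entirely hypothetical - a heavily-tied set of 100 values, chosen so that the two P-values fall on opposite sides of the conventional 5% boundary:

```python
def conventional_p(T, t):
    """Proportion of T more extreme than t, plus the whole
    proportion of T that is as extreme as (i.e. equal to) t.
    'More extreme' here means larger (an upper-tail test)."""
    more = sum(v > t for v in T)
    ties = sum(v == t for v in T)
    return (more + ties) / len(T)

def mid_p(T, t):
    """As above, but adding only HALF the tied proportion."""
    more = sum(v > t for v in T)
    ties = sum(v == t for v in T)
    return (more + ties / 2) / len(T)

# A heavily-tied, discrete distribution of 100 hypothetical values:
# 1 of them exceeds t = 7, and 7 of them equal it.
T = [8] * 1 + [7] * 7 + [5] * 92
p_conv = conventional_p(T, 7)  # (1 + 7)/100   = 0.08  -> 'not significant'
p_mid  = mid_p(T, 7)           # (1 + 3.5)/100 = 0.045 -> 'significant'
```

Notice that with this many ties the choice of definition decides which side of the 5% boundary the result falls on - precisely the situation discussed below.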

Now imagine that T contains many values (say a million). If every value is different, the proportion exactly equal to t cannot exceed 1/1000000. In that case, if 5% of T is as extreme as, or more extreme than, t, then R/N = 0.05 and R/N - 0.5/N = 0.0499995. However, as we shall see in later units, a million values is an unusually tiny population for a statistic under test. So when T is infinitely large, as for instance the chi-square distribution is, conventional and mid-P-values are to all practical purposes identical.

If, however, T has a strongly discrete distribution, the proportion of T equal to t may not be negligible. In that situation, among those who are strict about the 5% boundary between significance and non-significance, the difference between conventional and mid-P-values can be noticeable. Since the 5% criterion is a legal requirement in some fields of medicine, and lawyers get rich from such discrepancies, this is no small matter.