Definitions
A statistical test estimates how consistent an observed statistic is with a hypothetical population of similarly obtained statistics, known as the test, or 'null', distribution.
The further the observed statistic diverges from that test population's median, the less compatible it is with that population, and the less probable it is that such a divergent statistic would be obtained by simple chance. That compatibility is quantified as a P-value, where a low P-value indicates your observed statistic is an extreme quantile of the distribution it is being tested against.
As pointed out in the Quantiles more info page, there are two ways of obtaining a P-value.
 By convention, a P-value may be defined as the estimated probability, in the long run, of obtaining a statistic whose value is as extreme as or more extreme than the one which was observed.
Conventional P-values work well for statistics representing normal populations, but run into difficulties with small populations or highly discrete statistics.
 Mid-P-values define quantiles in terms of the mean rank. In other words, they estimate what proportion of the test population is expected to have a more divergent rank than the observed statistic.
A P-value is not the probability that the alternate hypothesis is true, nor is it the probability that the null hypothesis is false!
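The idea of a P-value as an extreme quantile of a null distribution can be sketched in code. This is a minimal illustration, not any particular named test: it simulates a null distribution of sample means (assuming, purely for illustration, a standard normal population and samples of size 20) and counts what proportion of that distribution is as extreme as or more extreme than the observed statistic.

```python
import random

random.seed(1)

# Simulate a null distribution of a test statistic: here, the means of
# samples drawn under H0 (a normal population with mean 0 and SD 1).
def null_distribution(n_reps=10_000, n=20):
    return [sum(random.gauss(0, 1) for _ in range(n)) / n
            for _ in range(n_reps)]

def p_value(observed, null_stats):
    """Two-tailed conventional P-value: the proportion of the null
    distribution as extreme as or more extreme than the observed statistic."""
    extreme = sum(1 for s in null_stats if abs(s) >= abs(observed))
    return extreme / len(null_stats)

null_stats = null_distribution()
print(p_value(0.6, null_stats))   # far from the null median: small P
print(p_value(0.05, null_stats))  # near the null median: large P
```

A statistic far out in the tails of the null distribution yields a small P-value; one near the median yields a large one.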
When the probability of obtaining such a divergent value is smaller than a predefined value (α, known as the significance level), usually 0.05, the statistic under test can be said to differ 'significantly' from the hypothetical, or 'null', population. In that case the null hypothesis may be rejected in favour of an untested but plausible alternate hypothesis (H_{A} or H_{1}). Notice that, because α is commonly predefined (a priori), it is a fixed significance level, unlike the P-value, which is an observed significance level. Accordingly α is said to define the rejection region of that statistic's null distribution, whereas 1 − α defines the acceptance region.
Notice that, under strict (binary) decision-making rules, some statisticians believe you should not give the observed P-value. Thus, by convention, the observed value of a statistic is only reported when P < 0.05 (hence the actual P-value is redundant); if, exceptionally, that rule is violated, the result's non-significance must be made explicit. Since this convention causes publication bias, many statisticians now suggest that, instead, you report all treatment effects along with their precise observed P-values, or confidence limits. The issue is discussed in the core text.
Assumptions
The most crucial, and most frequently violated, assumption is that sampling (or allocation to treatment) is random and observations are independent. This is equally true for hypothesis tests, confidence limits and likelihood comparisons; the key difference between them is not their assumptions but their models.
The second most important assumption is that the statistic chosen, and the test applied to it, tells you something useful about the biological situation you are investigating. Similarly, aside from the treatment effect under test, both the null model and the alternate model are assumed to be plausible.
The null model (H_{0}) comes in one of two forms:
 All samples are assumed to represent the same population, and any statistics calculated from them are assumed to represent a single population. Where there is more than one sample, their statistics are therefore pooled to obtain the best estimate of that population's median or, if it is distributed normally, its mean.
Because it assumes no difference between the sampled populations, this is sometimes known as the 'nil hypothesis'.
 Less commonly, samples are assumed to represent different populations, but any statistics calculated from them are assumed to represent a single 'test' population.
Under the null hypothesis, any difference between the statistic's observed value and its median (or its expected value) is assumed to have arisen by chance. In an experimental setting this observed difference (d) is known as the 'treatment effect'. The hypothetical null population distribution is estimated using a null model, constructed under the null hypothesis, which predicts how estimates of the treatment effect vary due to chance. In most tests the null hypothesis assumes the true treatment effect (δ) is zero. Irrespective of what value of δ is used to construct the null model, that value is the parameter under test.
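One concrete way to construct such a null model is a permutation test: if δ = 0 the treatment labels are arbitrary, so repeatedly shuffling them generates the null distribution of the observed treatment effect d. The data below are invented purely for illustration.

```python
import random

random.seed(2)

# Hypothetical yields for control and treated plots (illustrative only).
control = [4.1, 3.8, 4.4, 3.9, 4.2, 4.0]
treated = [4.6, 4.9, 4.3, 5.0, 4.7, 4.5]

def mean(xs):
    return sum(xs) / len(xs)

observed_d = mean(treated) - mean(control)  # the observed treatment effect, d

# Under H0 (delta = 0) the group labels are arbitrary, so shuffling them
# generates the null distribution of d by chance alone.
pooled = control + treated
n_reps = 10_000
count_extreme = 0
for _ in range(n_reps):
    random.shuffle(pooled)
    d = mean(pooled[:len(treated)]) - mean(pooled[len(treated):])
    # small tolerance so the observed arrangement itself always counts
    if abs(d) >= abs(observed_d) - 1e-12:
        count_extreme += 1

p = count_extreme / n_reps  # two-tailed permutation P-value
print(observed_d, p)
```

Here d = 0.6 and very few of the 10,000 shuffles produce so large a difference, so P is small and the nil hypothesis would be rejected.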
There are several types of alternate hypothesis:
 Any difference between the observed treatment effect and that expected under the null hypothesis is not due to chance. In which case, the true value of d ≠ δ.
To accept this hypothesis you must allow for the probability of both positive and negative differences; such a test is therefore known as a 'two-tailed test'.
 A one-tailed test assumes the true difference is positive, and any negative difference that is observed is due to chance; or, if testing the opposite tail, that the converse is true. In other words, if testing the upper tail, H_{0} is only rejected when d > δ or, for the lower tail, when d < δ.
If a one-tailed test is employed, it is assumed that the treatment effect, for some inherent and clear reason, can only be positive, or can only be negative.
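The relationship between one- and two-tailed P-values is easy to see for a statistic with a standard normal null distribution, using Python's statistics.NormalDist. The z-score of 1.8 is an arbitrary illustrative value.

```python
from statistics import NormalDist

# Observed standardized treatment effect (z-score); illustrative value.
z = 1.8
std_normal = NormalDist()

p_upper = 1 - std_normal.cdf(z)             # one-tailed (upper): P(Z >= z)
p_lower = std_normal.cdf(z)                 # one-tailed (lower): P(Z <= z)
p_two = 2 * (1 - std_normal.cdf(abs(z)))    # two-tailed: both directions

print(round(p_upper, 4), round(p_two, 4))
```

Here the upper-tailed P (about 0.036) falls below 0.05 while the two-tailed P (about 0.072) does not, which is exactly why the choice of tails must be justified in advance.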
Conventional P-values assume the statistic under test represents a continuous distribution. Mid-P-values do not make that assumption and, when applied to samples of a genuinely continuous distribution, yield identical results to conventional P-values. When conventional P-values are used for discrete statistics they yield conservative inferences; even then, mid-P-values are seldom used, and remain correspondingly controversial.
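For a discrete statistic the difference between the two definitions is easy to compute exactly. The sketch below uses an assumed example, 9 successes in 10 trials under H0: p = 0.5: the conventional one-tailed P includes the whole probability of the observed value, whereas the mid-P includes only half of it.

```python
from math import comb

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, observed = 10, 9  # e.g. 9 successes in 10 trials under H0: p = 0.5

# Conventional one-tailed P-value: P(X >= 9)
p_conventional = sum(binom_pmf(k, n) for k in range(observed, n + 1))

# Mid-P-value: half the probability of the observed value, plus P(X > 9)
p_mid = 0.5 * binom_pmf(observed, n) + sum(
    binom_pmf(k, n) for k in range(observed + 1, n + 1))

print(p_conventional, p_mid)
```

The conventional P is 11/1024 ≈ 0.0107 and the mid-P is 6/1024 ≈ 0.0059; the conventional value is the more conservative of the two.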
Pros and cons of significance tests
Advantages
 They provide a logical framework for hypothesis testing in biology
Much fundamental research progresses by testing hypotheses, rather than simply estimating the magnitude of treatment effects. Conventional significance tests provide a logical framework for hypothesis testing.
 They provide an accepted convention for statistical analysis
It is valuable to have a common approach across different disciplines for analysing data and testing hypotheses. For example, in the field of epidemiology it has been argued that epidemiologists need to agree by consensus on prespecified criteria so that the basis for decisions is explicit. The conventions of significance testing (such as the 0.05 level for significance) then provide a reasonable basis for facilitating scientific decision making. Null hypothesis significance tests are still widely used, and are often insisted upon by referees and journal editors.
 The techniques are tried and tested
Appropriate tests have been devised for a variety of statistics, statistical techniques and statistical models, including many 'pre-cooked' experimental and sampling designs. Formulae and software packages are readily available, as is copious documentation.
 The alternative hypothesis can be rather vague
Although the null model has to be specified with some care, the alternate model can be relatively hazy. This has its downside as well, but some may see it as a plus.
 They reflect the same underlying statistical reasoning as confidence intervals
Significance tests and confidence intervals are in fact based on exactly the same underlying theory. Tests not only shed light upon confidence intervals, but also enable some of the more awkward ones to be estimated.
Disadvantages
 They are commonly misunderstood and misinterpreted
The main misinterpretations are:
 A high value of P is taken as evidence in favour of the null hypothesis, or worse as proof of the null hypothesis. This is wrong because the P-value is not equal to the probability that the null hypothesis is true. It is only a measure of the degree of consistency of the data with the null hypothesis, and a very poor measure at that if the sample size is small!
 A low value of P is taken as evidence in favour of the alternative hypothesis, or worse as proof of the alternative hypothesis. This is also wrong because the P-value does not tell you anything directly about your chosen alternative hypothesis. It only tells you about the degree of consistency of the data with the null hypothesis. In many situations there may be other alternative hypotheses that you have not considered. As above, there is also the problem of reliability if the sample size is small.
 If in one trial the null hypothesis is rejected at P = 0.05, it is thought that repeating the experiment many times will produce a significant result on 95% of occasions. Again this is wrong, and is known as the 'replication fallacy'. In fact, for the usual levels of power in ecological and veterinary research (< 0.5), repetition is unlikely to produce a significant result on even 50% of occasions.
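A quick simulation illustrates the replication fallacy. All values here are invented for illustration: a two-sample design with a true effect chosen so that power is below 0.5, and a simple z-approximation rather than a proper t-test. Repeating the trial gives P < 0.05 only about as often as the power, nowhere near 95% of the time.

```python
import random
from statistics import NormalDist, mean

random.seed(3)

def one_trial(n=10, delta=0.8):
    """One two-sample trial with true effect delta; returns a two-tailed
    P-value from a simple z-approximation (SD assumed known to be 1),
    so this is an illustration rather than a proper t-test."""
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(delta, 1) for _ in range(n)]
    se = (1 / n + 1 / n) ** 0.5
    z = (mean(b) - mean(a)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Power is the long-run proportion of trials giving P < 0.05. Here it is
# well under 0.95, even though the null hypothesis really is false.
reps = 2000
power = sum(one_trial() < 0.05 for _ in range(reps)) / reps
print(power)
```

With these illustrative settings the power is roughly 0.4, so a 'significant' first trial is more likely than not to be followed by a non-significant replication.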
 Use of a rigid 0.05 level forces a false dichotomy into significant or not significant.
The P = 0.05 syndrome is characterized by a slavish adherence to comparing a P-value (which is subject to sampling error like any other statistic) to a fixed significance level (which is entirely arbitrary). If the sample size is small, the null hypothesis is accepted too readily. If the sample size is large, then biologically unimportant differences are declared significant. Nester (1996) commented that, because (most) biologists always want important differences to be significant and unimportant differences to be non-significant, the biologist is reduced to one of the following states of mind:
How biologists view significance tests

Importance of         Statistical significance of difference
observed difference   Not significant        Significant
Not important         Happy                  Annoyed
Important             Frustrated             Elated
What the biologist should be doing instead is interpreting the result in the light of the experimental design (designs differ in the strength of inference possible from the results) and other research results. In other words they should be thinking about their results!
 The P-value is uninformative compared to the confidence interval
Most journals now balk at accepting 'naked' P-values, in other words where neither the size of the effect nor its precision is specified. There is a strong case for always estimating the magnitude and the precision of the effect (using the confidence interval, or better still the P-value function) along with the precise P-value. Confidence intervals are not fundamentally different from P-values, but they do provide useful additional information.
Unfortunately many (if not most) researchers who use confidence intervals only see them as surrogate null hypothesis significance tests. In other words, if the interval overlaps zero (for a difference) or one (for a ratio), then the effect is dismissed as non-significant, and one is no further forward in a rational approach to evidence. Confidence intervals should be seen as providing additional information on which to base your inferences and conclusions.
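The extra information a confidence interval carries can be seen from the fact that both it and the P-value come from the same normal theory. In this sketch (with an invented difference d = 0.42 and standard error 0.18, using a normal approximation) the 95% interval delivers the same binary verdict as the test, but also shows the size and precision of the effect.

```python
from statistics import NormalDist

# Hypothetical observed difference and its standard error (illustrative).
d, se = 0.42, 0.18

z_crit = NormalDist().inv_cdf(0.975)  # about 1.96 for a 95% interval
ci = (d - z_crit * se, d + z_crit * se)

# The same theory gives the two-tailed P-value for H0: delta = 0, so
# "interval excludes 0" and "P < 0.05" are the same binary verdict --
# but the interval also shows the magnitude and precision of the effect.
p = 2 * (1 - NormalDist().cdf(abs(d / se)))
print(ci, p)
```

Here the interval excludes zero and, consistently, P < 0.05; but unlike the bare verdict, the interval shows the effect could plausibly be anywhere from trivially small to nearly twice the estimate.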
 The null hypothesis is nearly always false
It is true to say that nearly all null hypotheses are false on a priori grounds, at least for measurement variables. For example, if we are putting fertilizer on a crop, it would be very surprising if that fertilizer had no effect at all. In this situation we have no interest in disproving the null hypothesis that fertilizer has no effect; what we are interested in is the size of the effect.
Unfortunately the journals are full of tests based on obviously false null hypotheses, such as one spotted by Johnson (1999): "the density of large trees was greater in unlogged forest stands than in logged stands (P = 0.02)". It would indeed be truly amazing if there were no difference! If one looks at a null hypothesis and concludes it cannot possibly be true, then it is probably a waste of time trying to disprove it.
 The onetailed or twotailed decision is profoundly subjective
Nearly all statistics textbooks give the usual bland explanation that whether one uses a one-tailed or two-tailed test depends on your initial hypotheses. If a difference is only considered possible in one direction, then it is considered legitimate to use a one-tailed test, which halves the P-value. But in practice it seems that one-tailed tests are mainly used to push the P-value below the magic 0.05 level. These issues are generally avoided in medical research by a (largely unspoken) convention to always use two-tailed tests. But occasionally, for good reason, that convention is broken, and that invariably leads to disputes in journals.
 P-values take no account of any hypothesis other than the null.
This is the first of the more fundamental objections to significance tests. If an observation is rare under the null hypothesis, does it necessarily mean we should accept the alternative? Improbable events do happen: people do actually win the lottery on occasion. Do we therefore assume that the lottery has been 'fixed' because an improbable event has happened? Well, no; but if Tony wins the lottery, and we know that Tony's brother runs the lottery, we might feel differently. Now we have a viable alternative hypothesis.
The problem with P-values is that they take no account of any hypothesis other than the null. In other words, only negative, non-relative evidence is being used to evaluate the evidence. The philosopher Karl Popper might have supported this approach, but many others disagree!
 P-values include all values more extreme than the observed result
When we work out a P-value we are not just asking how unlikely this result is; we are asking how unlikely a result as extreme as or more extreme than this one is. But as we have never seen such results, we just have to imagine they exist. Some statisticians argue that P-values therefore overstate the degree of conflict with the null hypothesis. Others disagree, but it remains a controversial aspect of P-values.
 The null distribution of the test statistic may not match the actual sampling distribution of the test statistic
Calculation of the P-value assumes that the null distribution of the test statistic closely matches its actual sampling distribution. Whilst this may be true of some randomized experiments, it is likely to be much less true in observational studies, where all sorts of confounding factors are liable to be operating. Indeed, some statisticians argue that significance tests should not be applied in observational studies at all!
Synthesis
 Null hypothesis significance testing will undoubtedly continue to play a role for many years to come, especially where it is being used to provide a logical framework for hypothesis testing. However, great care should be taken not to misinterpret the results of a test.
 The arbitrary 0.05 significance level has no place in science. It should be replaced by interpretation of precise P-values in the light of sample size, design and previous knowledge.
 Confidence intervals should always be given for the effect size or, better still, P-value functions, which despite having been proposed years ago are still scarcely used in any discipline.
 Other approaches, including likelihood ratios, Bayesian techniques and (especially) information criteria, will continue to gain ground, especially for deciding between alternative models. But there is no 'perfect' alternative to significance tests; the important thing is for scientists to really think about their results, and about what is being assumed when analysing them.
Related topics:
 Parametric or non-parametric
 Permutation tests
 The Z test