Biology, images, analysis, design...
Use/Abuse Principles How To Related
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)



"There are but three events of a man's life: birth, life and death. He is not conscious of being born, he dies in pain, and he forgets to live."
Jean de La Bruyère 1645-1696
Les Caractères (1688). De l'Homme


Survival analysis refers to analysis of data where we have recorded the time period from a defined time origin up to a certain event for a number of individuals. That event is often termed a 'failure', and the length of time the failure time. This type of analysis is used most heavily in medical and veterinary research where the time origin is the start of a clinical trial and the event is the death of the patient. However the event need not be death - it could for example be the development of a specific side effect of treatment. Survival analysis is now being used increasingly in ecological research both for pesticide testing (where it is more powerful than probit analysis) and for life history studies.

We looked at how to estimate survival rates in Unit 1. For example we looked at the results of a randomised clinical trial (data are based on but not identical to study of Burri et al.) comparing two drugs for treatment of sleeping sickness. The time origin is the start of treatment and the event is the occurrence of a dangerous side effect - encephalitis - following treatment with the drug melarsoprol. In Unit 1 we only considered results for one of the treatment groups - below we show results for both groups.

{Fig. 1}

Standard errors and confidence intervals

We have already covered the variance for the interval/time specific mortality rates back in Unit 4 when looking at simple proportions. Since the time specific survival rate (p) is derived from the mortality rate (q) using p = 1 − q, the variance of each of these is pq/n. The standard error of these rates is the square root of pq/n, so you can get a 95% confidence interval using the normal approximation of 1.96 times the standard error. Two rates can be compared using any of the methods for comparing two proportions.

Usually one is more interested in attaching confidence intervals to the cumulative survival probability (S), the hazard function (h) or the density probability function (P). We will look at how to estimate the standard error of the survivorship function here.
The approach used depends on whether there has been any censoring:
  • If there is no censoring, then the standard error and confidence interval of S are estimated using the binomial distribution in the same way as for the time specific rates.
  • If there is censoring, we must take into account the error in all the previous interval survival probabilities. This is done fairly simply as shown here:

    An approximate 95% confidence interval can then be attached to each survival estimate using 1.96 SE.


Algebraically speaking -

SE (St)   =   St
n't − et
  • SE (St) is the standard error of the cumulative survival probability up to time t,
  • qt is the proportion dying by time t,
  • n't is the corrected number alive up to time t,
  • et is the number dying at time t

For examples we return to the development of encephalopathy in patients treated for sleeping sickness with melarsoprol. Those survival curves are shown here with (approximate) 95% confidence intervals:

{Fig. 3}

Remember that the width of these approximate intervals are determined solely by the proportion surviving and the sample size. Proportions close to 1 or 0 will have the smallest standard error, but the confidence intervals will be unreliable. The rule we adopted previously in Unit 5 was that the normal approximation was only valid if pqn is greater than 5. For the early part of the curve this will be the case when S (=p) is very close to 1. Examination of the intervals in the figure will show the upper limit of the first interval to be greater than 1.0, clearly an impossible value. The problem gets much more serious for the lower part of the curve, as the proportion surviving (=q) decreases below 0.3, and few survivors (=n) remain.

In the past these problems were sometimes minimised by using a double log transformation. This will always give values within the permissible range, but does little for the accuracy of the intervals when numbers are low. Various score intervals are available, but probably the best approach is to use bootstrap sampling. (see Akritas (1986))

We can get a rough comparison of survival curves by attaching confidence intervals to each cumulative survival estimate and comparing the two step plots visually. Even with this simple method we can see there is no evidence for any difference between the two survival plots shown above. But we need a more rigorous way to compare survival curves. This is done using the Mantel-Haenszel method for combining results from multiple 2×2 contingency tables.


Mantel-Haenszel survival tests

Group 1 Group 2
Time (t) No. at time t (nt) No. of events (et) Time (t) No. at time t (nt) No. of events (et)
3 250 1 3 250 1
4 249 2 5 249 2
7 247 1 6 247 2
8 246 3 7 245 3
9 243 1 8 242 1
10 242 4 9 241 1
11 238 1 22 240 2
15 237 1 25 238 1
      26 237 1
The essence of a non-parametric approach is that it does not assume that your data follow any specified theoretical distribution. Survival times tend not to be distributed normally, so non-parametric Mantel-Haenszel tests are one way round this. However, that does not mean that the tests have no assumptions - we will consider those once we have considered the test. The approach is the same for both standard and Kaplan-Meier life tables, but (confusingly) when it is applied to the latter it is instead called the log-rank test.
The first step is to combine the two separate group life tables (shown to the right) into a single combined table (shown below).

The first events are recorded on day 3, when there is one event out of 250 patients for each group. These are entered into the combined table as the first row. Then on day 4 there were two events out of 249 patients for group 1, but no events out of 249 patients for group 2. These results therefore comprise the next row in the combined data table. For day 5 the position is reversed and there are no events out of 247 patients in group 1 and two events out of 249 patients for group 2. This combined data table is shown below. Each row can then be displayed as a 2x2 contingency table - those containing the data from the first three rows are shown to the right of the combined data table:

Combined data
Time (t) Group 1 Group 2 Expected no. events in Group 1 Variance
No. at time t (n1) No. of events (e1) No. at time t (n2) No. of events (e2)
3 250 1 250 1 1.0000 0.498998
4 249 2 249 0 1.0000 0.498994
5 247 0 249 2 0.9960 0.498992
6 247 0 247 2 1.0000 0.498986
7 247 1 245 3 2.0081 0.993874
8 246 3 242 1 2.0164 0.993773
9 243 1 241 1 1.0041 0.498956
10 242 4 240 0 2.0083 0.993746
11 238 1 240 0 0.4979 0.249996
15 237 1 240 0 0.4969 0.249990
22 237 0 240 2 0.9937 0.498930
25 237 0 238 1 0.4989 0.249999
26 237 0 237 1 0.5000 0.250000
Sum   14     14.0203 6.975223
Time =3 days
 en − en
Group 11249250
Group 21249250
Time =4 days
 en − en
Group 12247249
Group 20249249
Time =5 days
 en − en
Group 10247247
Group 22247249

We then deal with the series of thirteen 2x2 contingency tables with Mantel Haenszel methods. The number of events in the top left hand cell of each 2x2 table is highlighted because this is the cell for which we calculate the expected frequency. For this we assume the null hypothesis of no difference between the groups. Thus the expected number of events in group 1 is obtained by multiplying the proportion of patients in group 1 by the total number of events (e1 + e2) . The variance is given by the product of the four margin totals divided by N2(N − 1).

The Mantel-Haenszel chi square statistic is obtained by summing the observed number of events (14), the expected number of events (14.0203) and the variances (6.9572); squaring the difference between observed and expected totals; and dividing this by the summed variance to give a chi square value. In this case we get a value of 0.00006 (df=1) for which P = 0.994. Hence there is no evidence for any difference between these two survival curves, and we can accept the null hypothesis.

This test can be extended to other more complex situations. For example three or more (k) groups can be compared in exactly the same way except that the final MH statistic is tested with k − 1 degrees of freedom. Alternatively a stratified test can be used. Data can be split into strata defined by the level of some confounding variable such as age, or by different sites in a multicentre study.

Before we look at the assumptions of this test we will briefly consider another test which is very similar to the log-rank test - this is known as the Wilcoxon test.

When discussing confidence intervals, we noted that the normal approximation should not be used when numbers have dropped very low. In addition, of course, the intervals will tend to be much wider - in other words we have less confidence in our estimate of the true value of S. But the log-rank test gives the same weight to each observation irrespective of the value of n. As a result chance differences when there are few survivors may bias the result of the test.

The Wilcoxon test is simply a weighted log rank test, where the contribution each 2x2 table makes to the total is weighted by the total number at risk (N = n1 + n2) at the start of the interval.

Algebraically speaking -

X2Wilcoxon   =    (ΣNi(ai  −  Σi))2

One then has to ask which of these tests is the more appropriate for the data being analysed. The log rank statistic weights each event equally. The Wilcoxon statistic places more weight on the early events and so is less sensitive to events that occur later on. Some argue that which test is to be used should be specified in advance, as there is otherwise a risk of bias in choosing that statistic most likely to give a significant result. An alternative view is that the choice of tests depends on whether certain assumptions are met. The most important of those assumptions - and one that we shall meet again when we look at parametric approaches to comparing survival curves - is that of proportional hazards. If the proportional hazards assumption is met then the most powerful test is the log-rank test; if not then one of the weighted log rank tests is more appropriate.


The proportional hazards assumption

For the log-rank test to be valid it is assumed that the relative probability of an event between groups remains constant over time. In other words if an event is twice as likely to occur in group one as in group two in the first time interval, it should also be twice as likely in all other time intervals. Note that this assumption is identical to the homogeneity or 'no interaction' assumption we made before when we used the Mantel-Haenszel test of association.

We can see below two hypothetical survival curves of numbers against time. Numbers in the 'new treatment' group decline more slowly than those in the 'standard treatment' group, and the curves do not cross. We have set the hazard functions to steadily increase over time with the hazard function in the 'new treatment' group (h1) set at half that in the standard treatment group (h0). If we plot the natural log of the hazard function against time, we get two parallel curves separated by the log of the ratio of the two hazards. Because the curves are parallel, the proportional hazards assumption is met.

{Fig. 4}

{Fig. 5}
In contrast in this example numbers in the 'new treatment' group initially decline more sharply than those in the 'standard treatment' group, but then level off and the curves cross. Here as before the hazard function for the 'standard treatment' group is increasing steadily over time. But the hazard function for the 'new treatment' group is high at first, then drops sharply only to increase again later. Clearly the hazard functions are not remotely parallel, and the proportional hazards assumption is not met.

In real life, plots of the hazard function can be very difficult to assess because of random variation in the number of events at any particular point in time. It makes more sense to instead use cumulative functions which show much less variability. One option is to plot the cumulative survival function fore each group against time. If they do not cross when plotted against time, we can probably assume that hazards are proportional.

The problem with a graphical approach is that it cannot cope with random variation in small samples. It is, for example, possible for the lines to cross even when the assumption is met. Look at our example of encephalopathy events after administration of melarsoprol. There is very little difference between the two survival curves, yet they do cross over. Is this just the result of random variation, or does it mean that the assumptions for our statistical test are not met?

One possible way of testing our two survival curves for homogeneity over time would be to use the Mantel-Haenszel test for interaction. However this test is seldom used because it lacks power with so many zeros in the table cells. We meet other ways to test the proportional hazards assumption in unit 14.