If there is censoring, we must take into account the error in all the previous interval survival probabilities. This is done fairly simply as shown here:
An approximate 95% confidence interval can then be attached to each survival estimate using 1.96 SE.
Algebraically speaking:

SE(S_{t})  =  S_{t} √( Σ q_{t} / (n'_{t} − e_{t}) )

where
 SE(S_{t}) is the standard error of the cumulative survival probability up to time t,
 q_{t} is the proportion dying in each interval up to time t,
 n'_{t} is the corrected number alive in each interval,
 e_{t} is the number dying in each interval
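As a rough numerical sketch of this running calculation (the interval counts below are invented for illustration, not taken from the example data):

```python
import math

# Hypothetical interval data: (corrected number alive n'_t, number dying e_t)
intervals = [(10, 1), (9, 2), (7, 1)]

S, greenwood_sum = 1.0, 0.0
for n, e in intervals:
    q = e / n                     # proportion dying in the interval
    S *= 1 - q                    # cumulative survival probability S_t
    greenwood_sum += q / (n - e)  # running sum of q_t / (n'_t - e_t)
    se = S * math.sqrt(greenwood_sum)
    lo, hi = S - 1.96 * se, S + 1.96 * se   # approximate 95% interval
    print(f"S={S:.3f}  SE={se:.3f}  95% CI ({lo:.3f}, {hi:.3f})")
```

Note that the error term for each interval is carried forward into every later estimate, which is what "taking into account the error in all the previous interval survival probabilities" amounts to in practice.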

For an example we return to the development of encephalopathy in patients treated for sleeping sickness with melarsoprol. Those survival curves are shown here with (approximate) 95% confidence intervals:
{Fig. 3}
Remember that the width of these approximate intervals is determined solely by the proportion surviving and the sample size. Proportions close to 1 or 0 will have the smallest standard errors, but the confidence intervals will be unreliable. The rule we adopted previously in Unit 5 was that the normal approximation is only valid if pqn is greater than 5. For the early part of the curve this condition fails because S (=p) is very close to 1. Examination of the intervals in the figure shows the upper limit of the first interval to be greater than 1.0, clearly an impossible value. The problem becomes much more serious for the lower part of the curve, as the proportion surviving decreases below 0.3 and few patients remain at risk (small n).
In the past these problems were sometimes minimised by using a double log transformation. This always gives values within the permissible range, but does little for the accuracy of the intervals when numbers are low. Various score intervals are available, but probably the best approach is to use bootstrap sampling (see Akritas, 1986).
We can get a rough comparison of survival curves by attaching confidence intervals to each cumulative survival estimate and comparing the two step plots visually. Even with this simple method we can see there is no evidence of any difference between the two survival plots shown above. But we need a more rigorous way to compare survival curves. This is done using the Mantel-Haenszel method for combining results from multiple 2×2 contingency tables.
Mantel-Haenszel survival tests
The essence of a nonparametric approach is that it does not assume that your data follow any specified theoretical distribution. Survival times tend not to be distributed normally, so nonparametric Mantel-Haenszel tests are one way round this. However, that does not mean that the tests have no assumptions; we will consider those once we have looked at the test itself. The approach is the same for both standard and Kaplan-Meier life tables, but (confusingly) when it is applied to the latter it is instead called the log-rank test.
The first step is to combine the two separate group life tables (shown to the right) into a single combined table (shown below).
The first events are recorded on day 3, when there is one event out of 250 patients for each group. These are entered into the combined table as the first row. Then on day 4 there were two events out of 249 patients for group 1, but no events out of 249 patients for group 2. These results therefore comprise the next row in the combined data table. For day 5 the position is reversed and there are no events out of 247 patients in group 1 and two events out of 249 patients for group 2. This combined data table is shown below. Each row can then be displayed as a 2×2 contingency table; those containing the data from the first three rows are shown below the combined data table:
Time = 3 days

                e    n − e      n
    Group 1     1      249    250
    Group 2     1      249    250
    Totals      2      498    500

Time = 4 days

                e    n − e      n
    Group 1     2      247    249
    Group 2     0      249    249
    Totals      2      496    498

Time = 5 days

                e    n − e      n
    Group 1     0      247    247
    Group 2     2      247    249
    Totals      2      494    496

We then apply Mantel-Haenszel methods to the series of thirteen 2×2 contingency tables. The top left-hand cell of each 2×2 table (the number of events in group 1) is the cell for which we calculate the expected frequency. For this we assume the null hypothesis of no difference between the groups. Thus the expected number of events in group 1 is obtained by multiplying the proportion of patients in group 1 by the total number of events (e_{1} + e_{2}). The variance is given by the product of the four marginal totals divided by N^{2}(N − 1).
The Mantel-Haenszel chi square statistic is obtained by summing, across the tables, the observed numbers of events (total 14), the expected numbers of events (total 14.0203) and the variances (total 6.9572); squaring the difference between the observed and expected totals; and dividing this by the summed variance to give a chi square value. In this case we get a value of 0.00006 (df=1), for which P = 0.994. Hence there is no evidence of any difference between these two survival curves, and we cannot reject the null hypothesis.
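The arithmetic can be sketched as follows. The first three tables are those shown above; since the full set of thirteen tables is not reproduced here, the final statistic is computed from the totals quoted in the text:

```python
import math

# Each 2x2 table: (e1, n1, e2, n2) = events and numbers at risk in each group
tables = [(1, 250, 1, 250),   # day 3
          (2, 249, 0, 249),   # day 4
          (0, 247, 2, 249)]   # day 5

O = E = V = 0.0
for e1, n1, e2, n2 in tables:
    N, e = n1 + n2, e1 + e2
    O += e1                                        # observed events in group 1
    E += n1 * e / N                                # expected events under H0
    V += n1 * n2 * e * (N - e) / (N**2 * (N - 1))  # product of margins / N^2(N-1)

# Using the totals over all thirteen tables quoted in the text:
chi2 = (14 - 14.0203)**2 / 6.9572
p = math.erfc(math.sqrt(chi2 / 2))   # upper tail of chi square with 1 df
print(round(chi2, 5), round(p, 3))   # 0.00006 0.994
```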
This test can be extended to other more complex situations. For example three or more (k) groups can be compared in exactly the same way except that the final MH statistic is tested with k − 1 degrees of freedom. Alternatively a stratified test can be used. Data can be split into strata defined by the level of some confounding variable such as age, or by different sites in a multicentre study.
Before we look at the assumptions of this test we will briefly consider another test which is very similar to the log-rank test: the Wilcoxon test.
When discussing confidence intervals, we noted that the normal approximation should not be used when numbers have dropped very low. In addition, of course, the intervals will tend to be much wider; in other words we have less confidence in our estimate of the true value of S. But the log-rank test gives the same weight to each observation irrespective of the value of n. As a result, chance differences when there are few survivors may bias the result of the test.
The Wilcoxon test is simply a weighted log-rank test, where the contribution each 2×2 table makes to the total is weighted by the total number at risk (N = n_{1} + n_{2}) at the start of the interval.
Algebraically speaking:

X^{2}_{Wilcoxon}  =  ( Σ N_{i}(a_{i} − ê_{i}) )^{2} / Σ N_{i}^{2} s^{2}_{a_{i}}

where a_{i} is the observed number of events in group 1 for the ith table, ê_{i} is its expected value under the null hypothesis, s^{2}_{a_{i}} is its variance, and N_{i} is the total number at risk at the start of the interval.
One then has to ask which of these tests is the more appropriate for the data being analysed. The log-rank statistic weights each event equally. The Wilcoxon statistic places more weight on early events, and so is less sensitive to events that occur later on. Some argue that the test to be used should be specified in advance, as there is otherwise a risk of bias in choosing the statistic most likely to give a significant result. An alternative view is that the choice of test depends on whether certain assumptions are met. The most important of those assumptions, and one that we shall meet again when we look at parametric approaches to comparing survival curves, is that of proportional hazards. If the proportional hazards assumption is met, the most powerful test is the log-rank test; if not, one of the weighted log-rank tests is more appropriate.
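As a sketch of the weighted calculation (again using the three illustrative tables from above rather than the full data), the only change from the unweighted Mantel-Haenszel arithmetic is the factor N_{i}:

```python
# Gehan-Wilcoxon sketch: weight each table's (observed - expected) and
# variance by N_i, the total number at risk at the start of the interval.
# Each table: (e1, n1, e2, n2) = events and numbers at risk in each group.
tables = [(1, 250, 1, 250), (2, 249, 0, 249), (0, 247, 2, 249)]

num = den = 0.0
for e1, n1, e2, n2 in tables:
    N, e = n1 + n2, e1 + e2
    expected = n1 * e / N                              # expected events, group 1
    variance = n1 * n2 * e * (N - e) / (N**2 * (N - 1))
    num += N * (e1 - expected)   # weight (a_i - expected) by N_i
    den += N**2 * variance       # weight the variance by N_i squared
chi2_wilcoxon = num**2 / den
```

Setting every weight to 1 instead of N would recover the ordinary log-rank statistic, which makes the relationship between the two tests explicit.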
The proportional hazards assumption
For the log-rank test to be valid it is assumed that the relative probability of an event between groups remains constant over time. In other words, if an event is twice as likely to occur in group one as in group two in the first time interval, it should also be twice as likely in all other time intervals. Note that this assumption is identical to the homogeneity or 'no interaction' assumption we made before when we used the Mantel-Haenszel test of association.
We can see below two hypothetical survival curves of numbers against time. Numbers in the 'new treatment' group decline more slowly than those in the 'standard treatment' group, and the curves do not cross. We have set the hazard functions to steadily increase over time with the hazard function in the 'new treatment' group (h_{1}) set at half that in the standard treatment group (h_{0}). If we plot the natural log of the hazard function against time, we get two parallel curves separated by the log of the ratio of the two hazards. Because the curves are parallel, the proportional hazards assumption is met.
{Fig. 4}
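A quick numerical sketch of this idea (with made-up hazard values, not those used for the figure): if h_{1} is always half of h_{0}, the gap between the log hazards is constant at log 2, whatever the shape of h_{0}.

```python
import math

times = [1, 2, 5, 10, 20]
h0 = [0.02 * t for t in times]   # hazard rising steadily with time
h1 = [0.5 * h for h in h0]       # new treatment: hazard always halved
gaps = [math.log(a) - math.log(b) for a, b in zip(h0, h1)]
# every gap equals log(2) ~ 0.693, so the log-hazard curves are parallel
```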
In contrast in this example numbers in the 'new treatment' group initially decline more sharply than those in the 'standard treatment' group, but then level off and the curves cross. Here as before the hazard function for the 'standard treatment' group is increasing steadily over time. But the hazard function for the 'new treatment' group is high at first, then drops sharply only to increase again later. Clearly the hazard functions are not remotely parallel, and the proportional hazards assumption is not met.
{Fig. 5}
In real life, plots of the hazard function can be very difficult to assess because of random variation in the number of events at any particular point in time. It makes more sense to instead use cumulative functions, which show much less variability. One option is to plot the cumulative survival function for each group against time. If the curves do not cross when plotted against time, we can probably assume that hazards are proportional.
The problem with a graphical approach is that it cannot cope with random variation in small samples. It is, for example, possible for the lines to cross even when the assumption is met. Look at our example of encephalopathy events after administration of melarsoprol. There is very little difference between the two survival curves, yet they do cross over. Is this just the result of random variation, or does it mean that the assumptions for our statistical test are not met?
One possible way of testing our two survival curves for homogeneity over time would be to use the Mantel-Haenszel test for interaction. However this test is seldom used because it lacks power with so many zeros in the table cells. We meet other ways of testing the proportional hazards assumption in Unit 14.