




What is survival analysis?

"There are but three events of a man's life: birth, life and death. He is not conscious of being born, he dies in pain, and he forgets to live."
Jean de La Bruyère 1645-1696
Les Caractères (1688). De l'Homme


Survival analysis refers to the analysis of data where we have recorded, for a number of individuals, the time period from a defined time of origin up to a certain event. That event is often termed a 'failure', and the length of time is termed the failure time. This type of analysis is used most heavily in medical and veterinary research, where the time origin is the start of a clinical trial and the event is the death of the patient. However, the event need not be death - it could for example be the development of a specific side effect of treatment. Survival analysis is now being used increasingly in ecological research, both for pesticide testing (where it is more powerful than probit analysis) and for life history studies.

Below are two examples of the type of data on which survival analysis is carried out. The first uses data from Dohoo & Martin (1984) on the survival of dairy cattle from first calving (the time origin) to the removal of the cow from the herd for any reason (culling, sale, disease etc). The second is loosely based on the results of a randomized clinical trial comparing two drugs for treatment of sleeping sickness carried out by Burri et al. (2000). The time origin is the start of treatment and the event is the occurrence of a dangerous side effect - encephalitis - following treatment with the drug melarsoprol.

{Fig. 1}

Medical researchers (and of course the life insurance industry) have long used survival analysis because of the long life span of humans. Other applied biologists such as veterinarians and ecologists have tended to just use the percentage mortality over a fixed time period.

What is wrong with just using the percentage mortality?

The problem is that we all die sometime. If we give treatment to very elderly patients and assess 30 years on, all will probably be dead for both treatments. But it makes a big difference if a patient survives for ten years rather than one year. Hence it makes more sense to compare survival - the time to death. So why not just use the same methods we use for comparing (for example) mean weights to compare survival times? One problem, as we shall see below, is the distribution of those times.
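The point about percentage mortality can be made concrete with a minimal sketch in Python. The survival times below are made up purely for illustration (they are not taken from either study above), and the function name `percent_mortality` is our own. Both hypothetical treatments show the same percentage mortality when assessed at 30 years, yet the times to death are very different.

```python
from statistics import mean

# Hypothetical times to death (years) for two treatments; illustrative values only.
treatment_a = [1, 1, 2, 2, 3]
treatment_b = [8, 9, 10, 11, 12]


def percent_mortality(times, assessment_time):
    """Percentage of individuals that have died by assessment_time."""
    dead = sum(1 for t in times if t <= assessment_time)
    return 100.0 * dead / len(times)


# Assessed 30 years on, the two treatments look identical...
mortality_a = percent_mortality(treatment_a, 30)
mortality_b = percent_mortality(treatment_b, 30)

# ...but the mean times to death differ considerably.
mean_a = mean(treatment_a)
mean_b = mean(treatment_b)

print("Mortality at 30 years (%):", mortality_a, mortality_b)
print("Mean time to death (years):", mean_a, mean_b)
```

Comparing means works here only because every individual has died and the example is tiny; the sections below explain why this simple approach usually breaks down for real failure times.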


Why do we need special methods of analysis?

There are in fact two major problems in using the methods of analysis we have considered so far on survival times:

  1. Failure times are not normally distributed

    As you can see below, failure times tend not to be normally distributed. Typically they may be uniform (as in the first example) or have a long tail to the right (as in one of the treatments in the second example).

{Fig. 2}

    This makes it difficult to compare treatments, such as the two drug schedules shown above, using standard parametric statistical tests. Moreover, there is usually no simple transformation that can make failure times normally distributed.

    We have two options when dealing with this sort of data:

    1. We can use non-parametric tests
    2. We can use parametric tests, but assume samples are not from a normal distribution but instead from an exponential or other long-tailed distribution.

  2. Censored data points

    We introduced the term censored in Unit 1 to describe individuals that either survived beyond the end point of the study or withdrew from the study for a reason unrelated to treatment.

    Most commonly, both survivors and withdrawals are referred to as censored data points. For example, all observations after day 500 for the cattle example and after day 26 for the drug treatment example are censored.

    Why are censored points such a problem? A censored observation is NOT just a missing data point. Consider two cohorts, one treated and one control. If all individuals in each group have died over the period of the study, there is no problem. Otherwise we must take account of those that survived, either up to their withdrawal (assumed to be for a reason unrelated to treatment) or up to the end of the study. What do we do with these data points? Omitting them will clearly bias downwards the survival time for that treatment. But we cannot simply include them as if they were failure times either, because we do not know how long those individuals would have survived.
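The bias caused by censoring can be illustrated with a small simulation. This is only a sketch under stated assumptions: failure times are drawn from an exponential distribution, the study length (`STUDY_END`) and true mean (`TRUE_MEAN`) are invented values, and the only censoring modelled is survival beyond the end of the study (withdrawals are ignored). Because the data are simulated, we know each individual's true failure time, which a real researcher would not.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is repeatable

STUDY_END = 10.0  # hypothetical length of the study
TRUE_MEAN = 8.0   # hypothetical true mean failure time

# True failure times (unknown to the researcher in practice).
true_times = [random.expovariate(1.0 / TRUE_MEAN) for _ in range(10_000)]

# What is actually observed: (time, failed) pairs, where individuals still
# alive at STUDY_END are recorded as censored at STUDY_END.
observed = [(min(t, STUDY_END), t <= STUDY_END) for t in true_times]

# Naive option 1: omit the censored individuals altogether.
mean_omitting = mean(t for t, failed in observed if failed)

# Naive option 2: treat each censoring time as if it were a failure time.
mean_as_failures = mean(t for t, _ in observed)

# The quantity we would like to estimate.
mean_true = mean(true_times)

print("Omitting censored points:     ", round(mean_omitting, 2))
print("Censoring times as failures:  ", round(mean_as_failures, 2))
print("True mean failure time:       ", round(mean_true, 2))
```

Both naive estimates fall below the true mean: omitting censored individuals discards exactly those that survived longest, while treating a censoring time as a failure time understates how long those individuals actually survived.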