 InfluentialPoints.com
Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

# Cox's proportional hazards regression

#### Worked example 1

These are hypothetical data on the ten-year survival of children born with Down syndrome; they are loosely based on a recent study carried out in Ireland. We have focused on two factors known to affect survival of children suffering from this disease: serious heart defects (CAVD) and leukemia. Note that the researchers also recorded data on various other possible explanatory variables, such as birth weight of the child and age of the mother. We have grouped the data according to presence or absence of CAVD.

Survival times (days) of children born with Down syndrome in relation to occurrence of heart defects (CAVD) and leukemia. Times given as >3650 are children still alive at the end of the ten-year follow-up.

| No. (with CAVD) | Time | Leuk | No. (without CAVD) | Time | Leuk |
|---|---|---|---|---|---|
| 1 | 37 | N | 31 | 28 | N |
| 2 | 55 | N | 32 | 61 | N |
| 3 | 73 | Y | 33 | 113 | Y |
| 4 | 110 | N | 34 | 135 | N |
| 5 | 146 | N | 35 | 146 | N |
| 6 | 164 | N | 36 | 153 | N |
| 7 | 219 | Y | 37 | 183 | N |
| 8 | 310 | N | 38 | 256 | N |
| 9 | 329 | N | 39 | 292 | N |
| 10 | 475 | N | 40 | 336 | N |
| 11 | 730 | N | 41 | 365 | N |
| 12 | 949 | N | 42 | 548 | N |
| 13 | 1095 | N | 43 | 694 | N |
| 14-30 | >3650 | N | 44 | 803 | N |
| | | | 45 | 913 | N |
| | | | 46 | 1497 | N |
| | | | 47 | 1643 | N |
| | | | 48-130 | >3650 | N |

Certain points are apparent from a careful examination of the data:

We have fitted these data to the Cox regression model using several software packages. Some give slightly different results (possibly because they use different methods to deal with ties), but we present only the results obtained using R. CAVD and leukemia were included as main factors, together with an interaction term between them. If we decide to keep the interaction term in the model, it means that having both CAVD and leukemia disproportionately affects survival. In most accounts that we have seen, the interaction is not tested for, but it would seem rather important because it is at least possible (if not probable) that the hazard from having both conditions could be higher than the sum of the two effects.

We first summarize the output from R, giving the fitted parameters for the model along with hazard ratios and P-values obtained from the Wald statistics.

| Variable | β | Standard error | Hazard ratio | Wald statistic | P-value |
|---|---|---|---|---|---|
| Leukemia | 3.570 | 1.103 | 35.516 | 10.478 | 0.001 |
| CAVD | 1.048 | 0.392 | 2.851 | 7.140 | 0.008 |
| Leuk x CAVD | -1.459 | 1.312 | 0.233 | 1.237 | 0.266 |

The output from R is then given below. Note that the hazard ratio is labelled exp(coef) and the test statistic is given as a z value (so we square these values to get the Wald statistic).

Using R:

    Call:
    coxph(formula = Surv(leu$time, leu$cens) ~ leu$leuk * leu$cavd)

      n= 130, number of events= 30

                           coef exp(coef) se(coef)      z Pr(>|z|)
    leu$leukY            3.5701   35.5189   1.1030  3.237  0.00121 **
    leu$cavdY            1.0482    2.8526   0.3920  2.674  0.00750 **
    leu$leukY:leu$cavdY -1.4599    0.2323   1.3117 -1.113  0.26573
    ---
    Rsquare= 0.127   (max possible= 0.888 )
    Likelihood ratio test= 17.73  on 3 df,   p=0.0004999
    Wald test            = 25.27  on 3 df,   p=1.355e-05
    Score (logrank) test = 46.34  on 3 df,   p=4.804e-10
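For readers who want to try this fit themselves, the sketch below enters the data from the table as a data frame and fits the interaction model. It assumes the `survival` package is installed. The data frame name `leu` and its column names follow the R call above, but the coding is our assumption: `cens` is 1 for a death and 0 for a censored child, and survivors are entered as censored at 3650 days. The object name `fit_int` is purely illustrative. Results may differ slightly from those shown, depending on the package version and the method used for ties.

```r
library(survival)  # provides Surv() and coxph()

# Death times (days) taken from the data table.
time_cavd   <- c(37, 55, 73, 110, 146, 164, 219, 310, 329, 475, 730, 949, 1095)
time_nocavd <- c(28, 61, 113, 135, 146, 153, 183, 256, 292, 336, 365, 548,
                 694, 803, 913, 1497, 1643)

# Children 1-30 have CAVD, children 31-130 do not.
# Survivors (">3650") are coded as censored at 3650 days (assumed coding).
leu <- data.frame(
  time = c(time_cavd, rep(3650, 17), time_nocavd, rep(3650, 83)),
  cens = c(rep(1, 13), rep(0, 17), rep(1, 17), rep(0, 83)),  # 1 = died
  cavd = factor(rep(c("Y", "N"), times = c(30, 100)), levels = c("N", "Y")),
  leuk = factor(rep("N", 130), levels = c("N", "Y"))
)

# Children 3, 7 and 33 are the ones recorded with leukemia.
leu$leuk[c(3, 7, 33)] <- "Y"

# Main effects plus the leukemia x CAVD interaction.
fit_int <- coxph(Surv(time, cens) ~ leuk * cavd, data = leu)
summary(fit_int)
```

Passing `data = leu` rather than writing `leu$time` and so on gives shorter coefficient labels (`leukY`, `cavdY`), but the model is the same.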

The hazard ratios and P-values suggest that whilst CAVD and leukemia are significant risk factors, the interaction between the two factors is not significant. So should we drop the interaction term from the model? Well, in this particular case (as we shall see) this would be the right thing to do, but Wald tests should in general not be used as an aid to model selection when a model contains several explanatory variables. This is because the individual estimates of the regression coefficients are not independent of one another. Hence the P-value for each coefficient will change depending on which other variables are in the model. Instead we should use likelihood ratio tests to decide which variables should be included in the model.

This is done with the proviso that comparisons can only be made between nested models. One model is said to be nested within another if the latter contains all the variables of the former plus at least one other. So, for our analysis, we can compare the fit of a model containing the variable CAVD with one containing both CAVD and leukemia, but we cannot directly compare the fit of a model containing only CAVD with one containing only leukemia. As before, the degrees of freedom for the likelihood ratio test are given by the difference in the number of β-parameters in the two models. Hence the comparison of a model containing the variable CAVD with one containing both CAVD and leukemia has 1 degree of freedom.

We can readily obtain the log likelihoods for the different models using R. For each fit, the first log likelihood is for the null model (-142.3934) and the second is for the particular model under test.

Using R:

    > coxph.fit1=coxph(Surv(leu$time,leu$cens)~leu$leuk*leu$cavd)
    > coxph.fit1$loglik
    [1] -142.3934 -133.5282     MODEL5
    > coxph.fit2=coxph(Surv(leu$time,leu$cens)~leu$leuk+leu$cavd)
    > coxph.fit2$loglik
    [1] -142.3934 -134.0526     MODEL4
    > coxph.fit3=coxph(Surv(leu$time,leu$cens)~leu$leuk)
    > coxph.fit3$loglik
    [1] -142.3934 -136.8067     MODEL3
    > coxph.fit4=coxph(Surv(leu$time,leu$cens)~leu$cavd)
    > coxph.fit4$loglik
    [1] -142.3934 -138.1809     MODEL2

Model testing proceeds as follows:

| Model # | Variables in model | -2LogL |
|---|---|---|
| 1 | Null model | 284.787 |
| 2 | CAVD | 276.362 |
| 3 | leukemia | 273.614 |
| 4 | CAVD, leukemia | 268.105 |
| 5 | CAVD, leukemia, interaction | 267.056 |

1. We first compare model 5 with model 4 to assess whether there is a significant interaction between CAVD and leukemia. The log likelihood ratio statistic, $-2\log L_{\text{reduced model}} - (-2\log L_{\text{full model}})$, is 268.105 - 267.056 = 1.049 with 1 degree of freedom, for which P = 0.306. This is not significant, so we can eliminate the interaction term from the model.

2. We then compare model 4 with models 2 and 3. The log likelihood ratio statistics are 8.257 and 5.509 respectively, for which the P-values are 0.004 and 0.019 respectively. Hence including leukemia in a model containing CAVD improves the model, as does including CAVD in a model containing leukemia. Both of these factors should therefore be in the model, and we accept model 4 as the best-fit model.

3. The overall significance level for the fit of the model is obtained by comparing model 4 with model 1. The log likelihood ratio statistic (with 2 df) is 16.682, for which the P-value is 0.0002.
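The arithmetic behind these comparisons can be sketched in base R. The log likelihoods are those reported by `coxph()` above. The names `loglik`, `n_par`, `lr_test` and `comparisons` are illustrative rather than part of any package, and because the inputs are rounded the results may differ in the last decimal place from figures quoted in the text.

```r
# Log likelihoods reported by coxph()$loglik for each model.
loglik <- c(null        = -142.3934,
            cavd        = -138.1809,
            leuk        = -136.8067,
            additive    = -134.0526,   # CAVD + leukemia
            interaction = -133.5282)   # CAVD + leukemia + interaction

# Number of beta-parameters in each model.
n_par <- c(null = 0, cavd = 1, leuk = 1, additive = 2, interaction = 3)

# Likelihood ratio test of a reduced model nested within a fuller model.
lr_test <- function(reduced, full) {
  stat <- 2 * (loglik[[full]] - loglik[[reduced]])
  df   <- n_par[[full]] - n_par[[reduced]]
  c(statistic = stat, df = df, p = pchisq(stat, df, lower.tail = FALSE))
}

comparisons <- rbind(
  "5 vs 4 (interaction)"  = lr_test("additive", "interaction"),
  "4 vs 2 (add leukemia)" = lr_test("cavd", "additive"),
  "4 vs 3 (add CAVD)"     = lr_test("leuk", "additive"),
  "4 vs 1 (overall fit)"  = lr_test("null", "additive")
)
print(round(comparisons, 4))
```

If the fitted `coxph` objects are available, calling `anova()` on two nested fits should give the same kind of comparison directly.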

The standard errors and confidence intervals of the hazard ratios for the best fit model are obtained from the analysis for that model. Note that we have a fairly narrow confidence interval for the CAVD hazard ratio, but a much wider one for the leukemia hazard ratio. This is because the risk estimate for leukemia is based on a very small number of deaths.

| Variable | β | Standard error | Hazard ratio | 95% CI |
|---|---|---|---|---|
| Leukemia | 2.442 | 0.685 | 11.493 | 3.001-44.012 |
| CAVD | 0.947 | 0.387 | 2.579 | 1.208-5.505 |

Using R:

    Call:
    coxph(formula = Surv(leu$time, leu$cens) ~ leu$leuk + leu$cavd)

      n= 130, number of events= 30

                coef exp(coef) se(coef)     z Pr(>|z|)
    leu$leukY 2.4418   11.4933   0.6851 3.564 0.000365 ***
    leu$cavdY 0.9474    2.5790   0.3869 2.449 0.014333 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

              exp(coef) exp(-coef) lower .95 upper .95
    leu$leukY    11.493     0.0870     3.001    44.012
    leu$cavdY     2.579     0.3878     1.208     5.505

    Rsquare= 0.12   (max possible= 0.888 )
    Likelihood ratio test= 16.68  on 2 df,   p=0.0002386
    Wald test            = 25.32  on 2 df,   p=3.17e-06
    Score (logrank) test = 44.73  on 2 df,   p=1.933e-10
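As a check on where the intervals come from, the sketch below recomputes them from the coefficients and standard errors in the R output, assuming the usual Wald form exp(β ± z × SE) with z taken from the standard normal distribution. The object names are illustrative, and small differences in the last digit are to be expected because the inputs are rounded.

```r
# Coefficients and standard errors from the fitted additive model.
beta <- c(leukemia = 2.4418, cavd = 0.9474)
se   <- c(leukemia = 0.6851, cavd = 0.3869)

# Two-sided 95% interval: z is the 97.5% point of the standard normal.
z <- qnorm(0.975)

# Build the interval on the log-hazard scale, then exponentiate.
ci <- cbind(hazard_ratio = exp(beta),
            lower        = exp(beta - z * se),
            upper        = exp(beta + z * se))
print(round(ci, 3))
```

The interval is symmetric on the log scale, which is why it looks so lopsided around the leukemia hazard ratio once exponentiated.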

We should then embark on a careful process of checking model diagnostics. First note the rather small value of Rsquare (0.12) compared to its maximum possible value (0.89). This should warn us that we are only explaining a rather small proportion of the variability, suggesting that there are important explanatory variables missing from our model. We next check the proportional hazards assumption.

Plots of beta(t) for leukemia and CAVD against time are shown below:

These reveal no obvious trend in beta(t) over time, so we can safely accept the proportional hazards assumption. This decision is reinforced by the P-values from the test of proportional hazards for leukemia and CAVD (0.612 and 0.968 respectively) and an overall P-value of 0.875.
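The text does not say which function produced these plots and P-values. One common route in R, shown here as a sketch rather than as the authors' method, is `cox.zph()` from the `survival` package, which tests for a trend in the scaled Schoenfeld residuals over time. The data coding repeats the assumptions made when entering the table (deaths coded 1, survivors censored at 3650 days), and the names `fit_add` and `zph` are illustrative. The P-values obtained depend on the package version and the time transformation used, so they need not match those quoted above.

```r
library(survival)  # provides Surv(), coxph() and cox.zph()

# Rebuild the data with the same assumed coding: children 1-30 have CAVD,
# children 31-130 do not; survivors are censored at 3650 days.
leu <- data.frame(
  time = c(37, 55, 73, 110, 146, 164, 219, 310, 329, 475, 730, 949, 1095,
           rep(3650, 17),
           28, 61, 113, 135, 146, 153, 183, 256, 292, 336, 365, 548, 694,
           803, 913, 1497, 1643,
           rep(3650, 83)),
  cens = c(rep(1, 13), rep(0, 17), rep(1, 17), rep(0, 83)),  # 1 = died
  cavd = factor(rep(c("Y", "N"), times = c(30, 100)), levels = c("N", "Y")),
  leuk = factor(rep("N", 130), levels = c("N", "Y"))
)
leu$leuk[c(3, 7, 33)] <- "Y"  # children recorded with leukemia

# Best-fit model: main effects only.
fit_add <- coxph(Surv(time, cens) ~ leuk + cavd, data = leu)

# Test of proportional hazards for each term, plus a global test.
zph <- cox.zph(fit_add)
print(zph)

# One plot of beta(t) against time per term; a roughly flat smooth
# is consistent with proportional hazards.
plot(zph)
```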

We leave you to carry out the remaining checks, namely for influential points and for nonlinearity in the relationship between the log hazard and the covariates (see Fox (2002)).

 Except where otherwise specified, all text and images on this page are copyright InfluentialPoints, all rights reserved. Images not copyright InfluentialPoints credit their source on web-pages attached via hypertext links from those images.