Cox's proportional hazards regression
Worked example 1
These are hypothetical data on the ten-year survival of children born with Down
Certain points are apparent from a careful examination of the data:
We fitted these data to the Cox regression model using several software packages. Some give slightly different results (possibly because they use different methods to deal with ties), but we present only the results from R. CAVD and leukemia were included as main effects, together with an interaction term between them. Keeping the interaction term in the model implies that having both CAVD and leukemia affects survival disproportionately. In most accounts that we have seen the interaction is not tested for, but it would seem rather important because it is at least possible (if not probable) that the hazard from having both conditions could be higher than expected from the two separate effects combined. We first summarize the output from R, giving the fitted parameters for the model along with hazard ratios and P-values obtained from the Wald statistics.
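To make the meaning of the interaction term concrete, here is a minimal Python sketch. It assumes the usual Cox parameterization, in which coefficients are log hazard ratios that add on the log scale, so the separate effects multiply on the hazard scale. The function name and the coefficient values are hypothetical and are not the fitted values for these data.

```python
import math

def combined_hazard_ratio(b_cavd, b_leuk, b_interaction):
    """Hazard ratio for a child with both conditions, relative to a child
    with neither, given log hazard ratios for each term in the model."""
    return math.exp(b_cavd + b_leuk + b_interaction)

# Hypothetical coefficients chosen only for illustration.
b_cavd, b_leuk = 0.8, 1.2

# No interaction: the combined effect is just the product of the two hazard ratios.
no_interaction = combined_hazard_ratio(b_cavd, b_leuk, 0.0)

# Positive interaction: having both conditions is worse than that product.
with_interaction = combined_hazard_ratio(b_cavd, b_leuk, 0.5)

print(f"HR, both conditions, no interaction:   {no_interaction:.2f}")
print(f"HR, both conditions, with interaction: {with_interaction:.2f}")
```

An interaction coefficient of zero leaves the two main effects acting independently of one another; testing the interaction asks whether the data support a departure from that.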
The output from R is then given below. Note that the hazard ratio is labelled exp(coef) and the test statistic is given as a z value (so we square these values to get the Wald statistic).
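The relationship between a coefficient, its z value, the Wald statistic and the hazard ratio can be sketched in a few lines of Python. This is an illustration only: the function name and the input values are hypothetical, and the P-value uses the fact that a squared standard normal variate follows a chi-square distribution with 1 degree of freedom.

```python
import math

def wald_summary(coef, se):
    """Return hazard ratio, z value, Wald statistic and two-sided P-value
    for one regression coefficient and its standard error."""
    z = coef / se
    wald = z ** 2  # Wald statistic is the square of the z value
    # Two-sided normal tail area, equal to the chi-square (1 df) tail of z**2.
    p = math.erfc(abs(z) / math.sqrt(2))
    return {"hazard_ratio": math.exp(coef), "z": z, "wald": wald, "p": p}

# Hypothetical coefficient and standard error, not taken from the R output.
summary = wald_summary(coef=1.0, se=0.5)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```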
The hazard ratios and P-values suggest that whilst CAVD and leukemia are significant risk factors, the interaction between the two factors is not significant. So should we drop the interaction term from the model? Well, in this particular case (as we shall see) this would be the right thing to do, but Wald tests should in general not be used as an aid in model selection in multivariate analyses. This is because the individual estimates of the regression coefficients are not independent of one another. Hence the P-values for each will change depending on which particular combination is being considered. Instead we should use the likelihood ratios to decide on which variables should be included in the model.
This is done with the proviso that comparisons can only be made between nested models. One model is said to be nested within another if the latter contains all the variables of the former plus at least one other. So, for our analysis, we can compare the fit of a model containing the variable CAVD with one containing both CAVD and leukemia, but we cannot directly compare the fit of a model containing only CAVD with one containing only leukemia. As before, the degrees of freedom for the likelihood ratio test are given by the difference in the number of β-parameters in the two models. Hence the comparison of a model containing the variable CAVD with one containing both CAVD and leukemia has 1 degree of freedom.
We can readily obtain the log likelihoods for the different models using R. The first log likelihood is for the null model (-142.3934); the second is for the particular model under test.
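A likelihood ratio comparison of two nested models differing by one parameter can be sketched as follows. The null log likelihood (-142.3934) is the value quoted above; the log likelihood of the larger model (-140.0) is a hypothetical number used purely to show the arithmetic, and the function name is our own.

```python
import math

def lr_test_1df(loglik_reduced, loglik_full):
    """Likelihood ratio test for nested models that differ by one parameter.

    Returns the test statistic G and its P-value from the chi-square
    distribution with 1 degree of freedom."""
    g = 2.0 * (loglik_full - loglik_reduced)
    if g < 0:
        raise ValueError("the fuller model cannot have the lower log likelihood")
    # Chi-square (1 df) upper tail area expressed through the error function.
    p = math.erfc(math.sqrt(g / 2.0))
    return g, p

loglik_null = -142.3934       # null model, as quoted in the text
loglik_one_variable = -140.0  # hypothetical value for a one-variable model

g, p = lr_test_1df(loglik_null, loglik_one_variable)
print(f"G = {g:.4f}, P = {p:.4f}")
```

The closed form used for the P-value holds only for 1 degree of freedom; comparisons involving more than one extra parameter need the general chi-square distribution function.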
Model testing proceeds as follows:
The standard errors and confidence intervals of the hazard ratios for the best fit model are obtained from the analysis for that model. Note that we have a fairly narrow confidence interval for the CAVD hazard ratio, but a much wider one for the leukemia hazard ratio. This is because the risk estimate for leukemia is based on a very small number of deaths.
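The link between the number of deaths and the width of the interval can be illustrated with a short Python sketch. It assumes the usual Wald-type interval, computed on the log scale and then exponentiated. The coefficients and standard errors are hypothetical; they are chosen only so that one estimate is precise and the other is not.

```python
import math

def hazard_ratio_ci(coef, se, z_crit=1.96):
    """Hazard ratio with an approximate 95% Wald confidence interval."""
    lower = math.exp(coef - z_crit * se)
    upper = math.exp(coef + z_crit * se)
    return math.exp(coef), lower, upper

# Hypothetical values: same coefficient, different standard errors.
precise = hazard_ratio_ci(coef=1.0, se=0.25)    # many events, small standard error
imprecise = hazard_ratio_ci(coef=1.0, se=0.75)  # few events, large standard error

for label, (hr, lo, hi) in (("precise", precise), ("imprecise", imprecise)):
    print(f"{label}: HR = {hr:.2f}, 95% CI {lo:.2f} to {hi:.2f}")
```

Because the interval is symmetric on the log scale, a larger standard error stretches the upper limit of the hazard ratio far more than it lowers the lower limit.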
We should then embark on a careful process of checking model diagnostics. First note the rather small value of Rsquare (0.12) compared to its maximum possible value (0.89). This should warn us that we are only explaining a rather small proportion of the variability, suggesting there are important explanatory variables missing from our model. We next check the proportional hazards assumption.
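For readers wondering where Rsquare and its maximum come from, the following Python sketch uses the likelihood-based definition that, as far as we are aware, older versions of R's survival package reported: Rsquare = 1 - exp(-G/n) and maximum = 1 - exp(2 × null log likelihood / n), where G is the likelihood ratio statistic against the null model. The null log likelihood is the value quoted earlier. The number of observations (n = 129) and the model log likelihood (-134.0) are assumptions: n was chosen only because it gives a maximum close to the 0.89 quoted above, and neither value is taken from the data.

```python
import math

def likelihood_r_squared(loglik_null, loglik_model, n):
    """Likelihood-based R-squared and its maximum possible value."""
    g = 2.0 * (loglik_model - loglik_null)  # likelihood ratio statistic
    r2 = 1.0 - math.exp(-g / n)
    r2_max = 1.0 - math.exp(2.0 * loglik_null / n)
    return r2, r2_max

loglik_null = -142.3934  # quoted in the text
loglik_model = -134.0    # hypothetical
n = 129                  # hypothetical number of observations

r2, r2_max = likelihood_r_squared(loglik_null, loglik_model, n)
print(f"Rsquare = {r2:.2f} (max possible = {r2_max:.2f})")
```

The maximum is below 1 because, with a discrete set of event times, even a perfectly fitting model cannot push the partial likelihood beyond a fixed ceiling.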
Plots of beta(t) for leukemia and CAVD against time are shown below:
These plots suggest that we can safely accept the proportional hazards assumption, a decision reinforced by the P-values for leukemia and CAVD (0.612 and 0.968 respectively) and an overall P-value of 0.875.