InfluentialPoints.com
Biology, images, analysis, design...
 Use/Abuse Principles How To Related
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

# Logistic regression

#### Worked example 1

Our first worked example uses raw binary data where each binary observation has unique values of one or more explanatory variables for each individual case. In this situation we must feed in each observation with a 1 or a 0 along with its values for the explanatory variables. We use data from Crawleyon costs of reproduction in a perennial plant. The first column (State) indicates survival through the winter (1) or death between growing seasons (0). Explanatory variables are seed production this summer (Flowers) and the size of its rootstock (Root).

Plant winter survival

1. #### Examine relationships between variables

Plot out data to get a visual assessment of the relationships between variables.

As you can see the raw binary data (black circles) are not very easy to interpret in this form as they are all either 0 or 1. However, one can observe that none of the plants that produced the largest number of flowers survived (there is none in the top right hand corner of the left hand plot). On the contrary not many of the plants with the biggest roots died (bottom right of the right hand plot).

We can get a much better idea of what is going on if we fit simple univariate logistic regressions to each graph, and the look at the fitted relationships (shown as red points in the flip). The probability of survival is negatively related to flower production and positively related to root size.

2. #### Fit full model with interaction

Below we have fitted the full model which contains both flower production and root size and their interaction.

 Using RCall: glm(formula = flowering\$state ~ flowering\$flowers * flowering\$root, family = binomial) Deviance Residuals: Min 1Q Median 3Q Max -1.74546 -0.44295 -0.03424 0.46458 2.69443 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.96010 1.53257 -1.931 0.05343 . flowering\$flowers -0.07889 0.04197 -1.880 0.06013 . flowering\$root 25.14934 7.84162 3.207 0.00134 ** flowering\$flowers:flowering\$root -0.20905 0.08930 -2.341 0.01924 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 78.903 on 58 degrees of freedom Residual deviance: 37.128 on 55 degrees of freedom AIC: 45.128

As expected we obtain a positive relationship with root size and a negative relationship with flower production. But in addition we have a significant interaction between these two factors. Since this complicates interpretation, we check whether we can drop the interaction (albeit its significance level suggests it cannot be dropped).

3. #### Compare models with and without interaction

We use anova to provide an analysis of deviance table for the two models. Note that we have used both chi square and Mallow's Cp as test statistics.

 Using R> anova(model1,model2,test="Chi") Analysis of Deviance Table Model 1: flowering\$state ~ flowering\$flowers * flowering\$root Model 2: flowering\$state ~ flowering\$flowers + flowering\$root Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 55 37.128 2 56 54.068 -1 -16.94 3.858e-05 *** --- > anova(model1,model2,test="Cp") Analysis of Deviance Table Model 1: flowering\$state ~ flowering\$flowers * flowering\$root Model 2: flowering\$state ~ flowering\$flowers + flowering\$root Resid. Df Resid. Dev Df Deviance Cp 1 55 37.128 45.128 2 56 54.068 -1 -16.94 60.068

Clearly we cannot drop the interaction term - the difference between the two models is highly significant using chi square, and the model without the interaction has a much larger value for Cp. Examination of the model coefficients shows that a given level of flowering has a bigger impact on reducing survival on plants with small roots than on larger plants.

Note that we cannot use the residual deviance to tell us anything about goodness of fit of the model because we have entered the data in raw binary format.

#### Worked example 2

Our second example uses a result from a cross-sectional survey on the prevalence of dystocia in cats. We previously looked at this work in relation to the confidence intervals attached to the odds and risk ratios in relation to breed of cat.

 Prevalence of dystocia in cats in relation to breed Breed No. positive No negative % positive Manx 1 17 5.55 Colony 3 800 0.37

1. #### Fit logistic regression model

 Using RCall: glm(formula = cbind(inf, nif) ~ breed, family = binomial, data = cats) Deviance Residuals: [1] 0 0 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -5.5860 0.5784 -9.657 <2e-16 *** breedManx 2.7528 1.1804 2.332 0.0197 * --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 3.3229e+00 on 1 degrees of freedom Residual deviance: 1.1546e-14 on 0 degrees of freedom AIC: 8.9315 Number of Fisher Scoring iterations: 4

We have highlighted the key parameters in red. Note that we cannot use the residual deviance to tell us anything about goodness of fit of the model because the residual deviance is (effectively) equal to zero, as are its degrees of freedom. This is because we are fitting the saturated model. The natural log of the odds ratio is 2.7528 with a standard error of 1.1804, exactly as we obtained previously. We can detransform the odds ratio as exp(2.7528) to give 15.686.

2. #### Obtain likelihood confidence interval

The P-value of 0.0197 for breedManx suggests we can accept this odds ratio differs significantly from 1. But this should not be accepted at face value. Instead work out the confidence interval for the odds ratio. You cannot use a normal approximation confidence interval because this is invalid when some frequencies are very low. You need to estimate a likelihood confidence interval:

 Using RWaiting for profiling to be done... 2.5 % 97.5 % (Intercept) -6.9790561 -4.630517 breedManx -0.2808132 4.866370

The (detransformed) interval for the odds ratio is 0.755 - 129.85 - in other words it overlaps 1, so you cannot be confident that the effect is real. You may note that if you just plugged in your result to a logistic regression package (such as here) you get no warning about the problem of very small frequencies. Just because a regression package will do a logistic regression on your 2 × 2 table (or any other set of data), does not mean that the result will necessarily be reliable.

#### Worked example 3

Probability of developing AIDS symptoms

We use data from Agresti on the proportion of people developing AIDS symptoms (given as number positive and number negative) in relation to taking the antiretroviral drug azt (azt = yes or no) and race (col = black or white). Hence we have two binary explanatory variables and a binary response variable.

1. #### Fit full model

We first fitted the full model with the interaction between race and drug use. The interaction proved non-significant (in fact all terms were non-significant), so we dropped the interaction term to give the following 'simple main effects' model.

 Using RCall: glm(formula = cbind(pos, neg) ~ col + azt, family = binomial, data = aids) Deviance Residuals: 1 2 3 4 -0.5547 0.4253 0.7035 -0.6326 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.07357 0.26294 -4.083 4.45e-05 *** colw 0.05548 0.28861 0.192 0.84755 azty -0.71946 0.27898 -2.579 0.00991 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 8.3499 on 3 degrees of freedom Residual deviance: 1.3835 on 1 degrees of freedom AIC: 24.86 Number of Fisher Scoring iterations: 4

This suggests that taking AZT significantly reduces the chance of developing AIDS symptoms (P = 0.0099). The effect of race (col) (P = 0.848) is not significant and can possibly be dropped from the model. Note that the residual deviance (at 1.3835) is similar to the degrees of freedom (1), suggesting there is no problem of overdispersion.

2. #### Test effect of dropping race from model

We drop race (col) from the model using the function 'update' in base R, and test whether the change in deviance is significant using 'anova' to compare the two models.

 Using RAnalysis of Deviance Table Model 1: cbind(pos, neg) ~ azt Model 2: cbind(pos, neg) ~ col + azt Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 2 1.4206 2 1 1.3835 1 0.037084 0.8473

The change in deviance is not significant (P = 0.8473) so race is dropped from the model.

3. #### Summarize final model and obtain likelihood CI

The likelihood CI to the log(odds ratio) for developing AIDS symptoms for taking versus not taking AZT is obtained using the function 'confint' (in the package MASS but now usually included in base R).

 Using Rglm(formula = cbind(pos, neg) ~ azt, family = binomial, data = aids) Deviance Residuals: 1 2 3 4 -0.4813 0.5102 0.6026 -0.7521 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.0361 0.1755 -5.904 3.54e-09 *** aztyes -0.7218 0.2787 -2.590 0.00961 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 8.3499 on 3 degrees of freedom Residual deviance: 1.4206 on 2 degrees of freedom AIC: 22.897 Number of Fisher Scoring iterations: 4 > confint(model2) Waiting for profiling to be done... 2.5 % 97.5 % (Intercept) -1.390358 -0.7006773 aztyes -1.279159 -0.1827049

The (detransformed) odds ratio for developing AIDS symptoms for taking versus not taking AZT is therefore e-0.7218 = 0.486 (95% CI: 0.278 to 0.833). The residual deviance is 1.4206 on 2 degrees of freedom which is not significant suggesting a good fit to the model.

One could proceed to estimate probabilities for each of the four predictor combinations (see p87 in Laura Thompson (2009) for this example) but this seems unjustified if one is dropping race from the model.

We have used P-values in the model simplification process, but better approach would be to use AIC. We have done this below using the R 'step' function on the full model (including the interaction) to demonstrate that in this case it gives the same result (key values in red).

 Using RStart: AIC=25.48 cbind(pos, neg) ~ col * azt Df Deviance AIC - col:azt 1 1.3835 24.860 0.0000 25.476 Step: AIC=24.86 cbind(pos, neg) ~ col + azt Df Deviance AIC - col 1 1.4206 22.897 1.3835 24.860 - azt 1 8.2544 29.731 Step: AIC=22.9 cbind(pos, neg) ~ azt Df Deviance AIC 1.4206 22.897 - azt 1 8.3499 27.826 Call: glm(formula = cbind(pos, neg) ~ azt, family = binomial, data = aids) Coefficients: (Intercept) aztyes -1.0361 -0.7218 Degrees of Freedom: 3 Total (i.e. Null); 2 Residual Null Deviance: 8.35 Residual Deviance: 1.421 AIC: 22.9

4. #### Check goodness of fit

We have already been given the deviance residual goodness of fit statistic above as the residual deviance (1.421) with 2 degrees of freedom, but we will print out this again along with the Pearson goodness of fit statistic, and (for good measure) two measures of pseudo-R-square, the Cox & Snell R square and Nagelkerke's R square (thanks to Sören Vogel.)

 Using Rmodel2\$deviance [1] 1.420614 > sum(residuals(model2, type = "pearson")^2) [1] 1.414068 > Rcsnagel(model2) Rcs Rnagel 0.8231287 0.939643

The deviance goodness of fit statistic is 1.4206 with 2 df, so P = 0.491. The Pearson goodness of fit statistic is 1.414 with 2 df, so P = 0.493. Both these are compatible with a model which explains most of the variation. The Cox & Snell R square (at 0.823) and Nagelkerke's R square (at 0.940) also indicate a very acceptable model.