InfluentialPoints.com
Biology, images, analysis, design...
"It has long been an axiom of mine that the little things are infinitely the most important" (Sherlock Holmes)

Logistic regression

Worked example 1

Plant winter survival

Our first worked example uses raw binary data, where each binary observation has its own values of one or more explanatory variables. In this situation we must enter each observation as a 1 or a 0, along with its values for the explanatory variables. We use data from Crawley on the costs of reproduction in a perennial plant. The first column (State) indicates survival through the winter (1) or death between growing seasons (0). The explanatory variables are seed production this summer (Flowers) and the size of the plant's rootstock (Root).

  1. Examine relationships between variables

    Plot out data to get a visual assessment of the relationships between variables.

    [Figure: winter survival (0 or 1, black circles) plotted against flower production (left-hand plot) and root size (right-hand plot)]

    As you can see, the raw binary data (black circles) are not easy to interpret in this form, since they are all either 0 or 1. However, one can observe that none of the plants producing the largest numbers of flowers survived (there are no points in the top right-hand corner of the left-hand plot). Conversely, few of the plants with the biggest roots died (bottom right of the right-hand plot).

    We get a much better idea of what is going on if we fit a simple univariate logistic regression to each graph and then look at the fitted relationships (shown as red points). The probability of survival is negatively related to flower production and positively related to root size.
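    Such univariate fits can be sketched in R as follows. Crawley's original data are not reproduced on this page, so the snippet simulates a stand-in 'flowering' data frame with the column names used in the model output below; the simulated effect sizes are arbitrary.

```r
# Sketch of the univariate fits using a simulated stand-in for Crawley's data
set.seed(42)
n <- 59                                        # 59 plants, as in the output below
flowering <- data.frame(flowers = rpois(n, 8),
                        root    = runif(n, 0.2, 1))
eta <- -3 - 0.1 * flowering$flowers + 8 * flowering$root   # arbitrary effects
flowering$state <- rbinom(n, 1, plogis(eta))   # 1 = survived, 0 = died

# One simple logistic regression per explanatory variable
m.flowers <- glm(state ~ flowers, family = binomial, data = flowering)
m.root    <- glm(state ~ root,    family = binomial, data = flowering)

# Overlay the fitted probabilities (the red points) on the raw data
plot(state ~ flowers, data = flowering)
points(flowering$flowers, fitted(m.flowers), col = "red")
```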

  2. Fit full model with interaction

    Below we have fitted the full model, which contains flower production, root size, and their interaction.

    Using R
    Call:
    glm(formula = flowering$state ~ flowering$flowers * flowering$root,
        family = binomial)
    
    Deviance Residuals:
         Min        1Q    Median        3Q       Max
    -1.74546  -0.44295  -0.03424   0.46458   2.69443
    
    Coefficients:
                                     Estimate Std. Error z value Pr(>|z|)
    (Intercept)                      -2.96010    1.53257  -1.931  0.05343 .
    flowering$flowers                -0.07889    0.04197  -1.880  0.06013 .
    flowering$root                   25.14934    7.84162   3.207  0.00134 **
    flowering$flowers:flowering$root -0.20905    0.08930  -2.341  0.01924 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 78.903  on 58  degrees of freedom
    Residual deviance: 37.128  on 55  degrees of freedom
    AIC: 45.128
    

    As expected, we obtain a positive relationship with root size and a negative relationship with flower production. In addition, there is a significant interaction between these two variables. Since this complicates interpretation, we check whether we can drop the interaction (although its significance level suggests it cannot be dropped).

  3. Compare models with and without interaction

    We use anova to provide an analysis of deviance table for the two models. Note that we have used both the chi-square statistic and Mallows' Cp as test criteria.

    Using R
    
    > anova(model1,model2,test="Chi")
    Analysis of Deviance Table
    Model 1: flowering$state ~ flowering$flowers * flowering$root
    Model 2: flowering$state ~ flowering$flowers + flowering$root
      Resid. Df Resid. Dev Df Deviance P(>|Chi|)
    1        55     37.128
    2        56     54.068 -1   -16.94 3.858e-05 ***
    ---
    > anova(model1,model2,test="Cp")
    Analysis of Deviance Table
    Model 1: flowering$state ~ flowering$flowers * flowering$root
    Model 2: flowering$state ~ flowering$flowers + flowering$root
      Resid. Df Resid. Dev Df Deviance     Cp
    1        55     37.128             45.128
    2        56     54.068 -1   -16.94 60.068
    

    Clearly we cannot drop the interaction term: the difference between the two models is highly significant on the chi-square test, and the model without the interaction has a much larger value of Cp. Examination of the model coefficients shows that a given level of flowering reduces survival more for plants with small roots than for plants with large roots.

    Note that we cannot use the residual deviance to tell us anything about goodness of fit of the model because we have entered the data in raw binary format.

Worked example 2

Our second example uses a result from a cross-sectional survey on the prevalence of dystocia in cats. We previously looked at this work when attaching confidence intervals to the odds and risk ratios for breed of cat.

Prevalence of dystocia in cats in relation to breed
Breed     No. positive   No. negative   % positive
Manx      1              17             5.55
Colony    3              800            0.37
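The table can be entered directly into R as a two-row data frame; the column names inf, nif and breed below are taken from the glm call shown in the output that follows:

```r
# Prevalence of dystocia by breed, entered from the table above
cats <- data.frame(
  breed = c("Colony", "Manx"),   # "Colony" becomes the reference level
  inf   = c(3, 1),               # number positive
  nif   = c(800, 17)             # number negative
)
model <- glm(cbind(inf, nif) ~ breed, family = binomial, data = cats)
summary(model)   # the breedManx estimate is the log odds ratio, 2.7528
```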

  1. Fit logistic regression model

    Using R
    Call:
    glm(formula = cbind(inf, nif) ~ breed, family = binomial, data = cats)
    
    Deviance Residuals:
    [1]  0  0
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -5.5860     0.5784  -9.657   <2e-16 ***
    breedManx     2.7528     1.1804   2.332   0.0197 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 3.3229e+00  on 1  degrees of freedom
    Residual deviance: 1.1546e-14  on 0  degrees of freedom
    AIC: 8.9315
    
    Number of Fisher Scoring iterations: 4

    Note that we cannot use the residual deviance to tell us anything about goodness of fit of the model, because the residual deviance is (effectively) zero, as are its degrees of freedom: we are fitting the saturated model. The natural log of the odds ratio is 2.7528 with a standard error of 1.1804, exactly as we obtained previously. Detransforming with exp(2.7528) gives an odds ratio of 15.686.

  2. Obtain likelihood confidence interval

    The P-value of 0.0197 for breedManx suggests that this odds ratio differs significantly from 1. But this should not be accepted at face value; instead, work out the confidence interval for the odds ratio. You cannot use a normal approximation confidence interval, because that is invalid when some frequencies are very low. You need to estimate a likelihood confidence interval:
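    The command is simply confint() on the fitted model; the snippet below re-enters the data from the table so that it stands alone.

```r
# Refit the model from step 1, then profile the likelihood
cats  <- data.frame(breed = c("Colony", "Manx"), inf = c(3, 1), nif = c(800, 17))
model <- glm(cbind(inf, nif) ~ breed, family = binomial, data = cats)

# Profile-likelihood CI (method originally in MASS, in base R from R 4.4)
ci <- confint(model)
exp(ci["breedManx", ])   # detransformed to the odds-ratio scale
```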

    Using R
    Waiting for profiling to be done...
                     2.5 %    97.5 %
    (Intercept) -6.9790561 -4.630517
    breedManx   -0.2808132  4.866370
    

    The (detransformed) interval for the odds ratio is 0.755 to 129.85; in other words it overlaps 1, so you cannot be confident that the effect is real. Note that a logistic regression package will typically give you no warning about the problem of very small frequencies. Just because a regression package will run a logistic regression on your 2 × 2 table (or any other set of data) does not mean that the result will be reliable.

Worked example 3

Probability of developing AIDS symptoms
We use data from Agresti on the proportion of people developing AIDS symptoms (given as number positive and number negative) in relation to taking the antiretroviral drug azt (azt = yes or no) and race (col = black or white). Hence we have two binary explanatory variables and a binary response variable.

  1. Fit full model

    We first fitted the full model with the interaction between race and drug use. The interaction proved non-significant (in fact all terms were non-significant), so we dropped the interaction term to give the following 'simple main effects' model.
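    The commands that produce this output might look as follows. The counts themselves are not reproduced on this page, so those below are the standard values from Agresti's AZT example, and the object names model1 and model2 are assumptions:

```r
# AIDS symptoms in relation to race (col) and AZT use (azt), after Agresti
aids <- data.frame(
  col = c("w", "w", "b", "b"),
  azt = c("y", "n", "y", "n"),
  pos = c(14, 32, 11, 12),   # number developing symptoms
  neg = c(93, 81, 52, 43)    # number not developing symptoms
)
model1 <- glm(cbind(pos, neg) ~ col * azt, family = binomial, data = aids)  # full model
model2 <- glm(cbind(pos, neg) ~ col + azt, family = binomial, data = aids)  # main effects
summary(model2)
```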

    Using R
    Call:
    glm(formula = cbind(pos, neg) ~ col + azt, family = binomial,
        data = aids)
    
    Deviance Residuals:
          1        2        3        4
    -0.5547   0.4253   0.7035  -0.6326
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept) -1.07357    0.26294  -4.083 4.45e-05 ***
    colw         0.05548    0.28861   0.192  0.84755
    azty        -0.71946    0.27898  -2.579  0.00991 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 8.3499  on 3  degrees of freedom
    Residual deviance: 1.3835  on 1  degrees of freedom
    AIC: 24.86
    
    Number of Fisher Scoring iterations: 4

    This suggests that taking AZT significantly reduces the chance of developing AIDS symptoms (P = 0.0099). The effect of race (col) is not significant (P = 0.848) and could potentially be dropped from the model. Note that the residual deviance (1.3835) is similar to its degrees of freedom (1), suggesting there is no problem of overdispersion.

  2. Test effect of dropping race from model

    We drop race (col) from the model using the function 'update' in base R, and test whether the change in deviance is significant using 'anova' to compare the two models.
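    In code this might read as follows (the data frame and model2 are re-created so the snippet stands alone; counts after Agresti):

```r
aids <- data.frame(col = c("w", "w", "b", "b"), azt = c("y", "n", "y", "n"),
                   pos = c(14, 32, 11, 12), neg = c(93, 81, 52, 43))
model2 <- glm(cbind(pos, neg) ~ col + azt, family = binomial, data = aids)

model3 <- update(model2, . ~ . - col)    # drop race from the model
anova(model3, model2, test = "Chisq")    # change in deviance on 1 df
```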

    Using R
    Analysis of Deviance Table
    
    Model 1: cbind(pos, neg) ~ azt
    Model 2: cbind(pos, neg) ~ col + azt
      Resid. Df Resid. Dev Df Deviance P(>|Chi|)
    1         2     1.4206
    2         1     1.3835  1 0.037084    0.8473

    The change in deviance is not significant (P = 0.8473) so race is dropped from the model.

  3. Summarize final model and obtain likelihood CI

    The likelihood CI for the log(odds ratio) for developing AIDS symptoms when taking versus not taking AZT is obtained using the function 'confint' (originally in the package MASS, but now usually included in base R).

    Using R
    glm(formula = cbind(pos, neg) ~ azt, family = binomial, data = aids)
    
    Deviance Residuals:
          1        2        3        4
    -0.4813   0.5102   0.6026  -0.7521
    
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -1.0361     0.1755  -5.904 3.54e-09 ***
    aztyes       -0.7218     0.2787  -2.590  0.00961 **
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    
    (Dispersion parameter for binomial family taken to be 1)
    
        Null deviance: 8.3499  on 3  degrees of freedom
    Residual deviance: 1.4206  on 2  degrees of freedom
    AIC: 22.897
    
    Number of Fisher Scoring iterations: 4
    
    > confint(model2)
    Waiting for profiling to be done...
                    2.5 %     97.5 %
    (Intercept) -1.390358 -0.7006773
    aztyes      -1.279159 -0.1827049
    

    The (detransformed) odds ratio for developing AIDS symptoms for taking versus not taking AZT is therefore e^(−0.7218) = 0.486 (95% CI: 0.278 to 0.833). The residual deviance is 1.4206 on 2 degrees of freedom, which is not significant, suggesting a good fit of the model.

    One could proceed to estimate probabilities for each of the four predictor combinations (see p. 87 in Laura Thompson (2009) for this example), but this seems unjustified if race is being dropped from the model.

    We have used P-values in the model simplification process, but a better approach would be to use AIC. We have done this below using the R 'step' function on the full model (including the interaction) to demonstrate that in this case it gives the same result.

    Using R
    Start:  AIC=25.48
    cbind(pos, neg) ~ col * azt
    
              Df Deviance    AIC
    - col:azt  1   1.3835 24.860
    <none>         0.0000 25.476
    
    Step:  AIC=24.86
    cbind(pos, neg) ~ col + azt
    
           Df Deviance    AIC
    - col   1   1.4206 22.897
    <none>      1.3835 24.860
    - azt   1   8.2544 29.731
    
    Step:  AIC=22.9
    cbind(pos, neg) ~ azt
    
           Df Deviance    AIC
    <none>      1.4206 22.897
    - azt   1   8.3499 27.826
    
    Call:  glm(formula = cbind(pos, neg) ~ azt, family = binomial, data = aids)
    
    Coefficients:
    (Intercept)       aztyes
        -1.0361      -0.7218
    
    Degrees of Freedom: 3 Total (i.e. Null);  2 Residual
    Null Deviance:      8.35
    Residual Deviance: 1.421        AIC: 22.9

  4. Check goodness of fit

    We have already been given the deviance goodness of fit statistic above, as the residual deviance (1.421) with 2 degrees of freedom, but we print it out again along with the Pearson goodness of fit statistic and, for good measure, two measures of pseudo-R-square: the Cox & Snell R square and Nagelkerke's R square (function courtesy of Sören Vogel).
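    The function Rcsnagel is not part of base R, and its definition is not given on this page; a minimal sketch that reproduces the values below, applying the standard Cox & Snell and Nagelkerke formulas to the four binomial rows, might be:

```r
# Assumed reconstruction of a pseudo-R-square helper like Rcsnagel
Rcsnagel <- function(m) {
  n   <- length(fitted(m))                             # number of binomial rows
  rcs <- 1 - exp((m$deviance - m$null.deviance) / n)   # Cox & Snell R square
  c(Rcs = rcs, Rnagel = rcs / (1 - exp(-m$null.deviance / n)))  # Nagelkerke
}

# Demonstration on the final model (counts after Agresti's AZT example)
aids <- data.frame(azt = c("y", "n", "y", "n"),
                   pos = c(14, 32, 11, 12), neg = c(93, 81, 52, 43))
model2 <- glm(cbind(pos, neg) ~ azt, family = binomial, data = aids)
Rcsnagel(model2)
```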

    Using R
    > model2$deviance
    [1] 1.420614
    > sum(residuals(model2, type = "pearson")^2)
    [1] 1.414068
    > Rcsnagel(model2)
          Rcs    Rnagel
    0.8231287  0.939643

    The deviance goodness of fit statistic is 1.4206 with 2 df, giving P = 0.491; the Pearson goodness of fit statistic is 1.414 with 2 df, giving P = 0.493. Both are compatible with a model that explains most of the variation. The Cox & Snell R square (0.823) and Nagelkerke's R square (0.940) also indicate a very acceptable model.