
3.8 Regression Diagnostics for Binary Data

We now consider regression diagnostics for binary data, focusing on logistic regression models. We will work with the additive model of contraceptive use by age, education, and desire for more children, which we know to be inadequate.

Stata offers several diagnostic tools as part of the predict and estat post-estimation commands, which are available after issuing a logit or logistic command. We continue to use glm, which offers many of the same options.

When working with individual data Stata relies heavily on the concept of covariate patterns, grouping together all observations that share the same values of the covariates. In particular, it defines as saturated a model that has a separate parameter for each covariate pattern, not for each observation. We consider a model fit to individual data below.
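
If you had a one-row-per-woman version of these data you could count the covariate patterns yourself. Here is a minimal sketch, assuming individual-level variables age, educ and nomore (pattern is our own variable name):

. egen pattern = group(age educ nomore)  // one number per distinct combination
. quietly sum pattern
. di "covariate patterns: " r(max) ", observations: " r(N)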

The Additive Model

Here’s the additive model for the contraceptive use data.

. use https://grodri.github.io/datasets/cusew, clear
(Contraceptive Use Data (Fiji, 1976))

. gen n = users + nonusers

. label define nomore 0 "more" 1 "nomore", modify // shorter

. label define education 0 "lower primary-" 1 "upper primary+", modify

. glm users i.age educ nomore, family(binomial n)

Iteration 0:   log likelihood = -50.793156  
Iteration 1:   log likelihood = -50.712573  
Iteration 2:   log likelihood = -50.712565  
Iteration 3:   log likelihood = -50.712565  

Generalized linear models                         Number of obs   =         16
Optimization     : ML                             Residual df     =         10
                                                  Scale parameter =          1
Deviance         =  29.91722173                   (1/df) Deviance =   2.991722
Pearson          =  28.28833641                   (1/df) Pearson  =   2.828834

Variance function: V(u) = u*(1-u/n)               [Binomial]
Link function    : g(u) = ln(u/(n-u))             [Logit]

                                                  AIC             =   7.089071
Log likelihood   =  -50.7125647                   BIC             =   2.191335

─────────────┬────────────────────────────────────────────────────────────────
             │                 OIM
       users │ Coefficient  std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
         age │
      25-29  │   .3893816   .1758501     2.21   0.027     .0447219    .7340414
      30-39  │   .9086135   .1646211     5.52   0.000     .5859621    1.231265
      40-49  │   1.189239     .21443     5.55   0.000     .7689639    1.609514
             │
        educ │   .3249947   .1240355     2.62   0.009     .0818894    .5680999
      nomore │   .8329548   .1174705     7.09   0.000     .6027169    1.063193
       _cons │  -1.966169   .1720307   -11.43   0.000    -2.303343   -1.628995
─────────────┴────────────────────────────────────────────────────────────────

The additive model has a deviance of 29.92 on 10 d.f. when we define the saturated model in terms of the 16 groups of women.

Deviance and Pearson Residuals

The predict command can be used to obtain predicted probabilities, deviance residuals, and Pearson residuals. These residuals are the signed square roots of the contributions of each group to the model deviance and Pearson chi-squared statistics.
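
Explicitly, if y_i denotes the number of users and μ̂_i the fitted count in a group of n_i women, the residuals are

$$ p_i = \frac{y_i - \hat\mu_i}{\sqrt{\hat\mu_i\,(1 - \hat\mu_i/n_i)}}, \qquad
d_i = \operatorname{sign}(y_i - \hat\mu_i)\sqrt{2\left[\,y_i \log\frac{y_i}{\hat\mu_i} + (n_i - y_i)\log\frac{n_i - y_i}{n_i - \hat\mu_i}\right]}, $$

with the denominator of p_i matching the variance function V(u) = u*(1-u/n) shown in the output above. Squaring and summing gives Pearson's chi-squared and the deviance, respectively.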

. predict fv, mu      // fitted value

. gen pfit = fv/n     // probability

. predict dr, dev     // deviance residual

. predict pr, pear    // Pearson residual

We will verify that if we square these residuals and sum them over the groups we get the deviance and Pearson chi-squared statistics.

. gen drsq = dr^2

. quietly sum drsq

. di r(sum)
29.917221

. gen prsq = pr^2

. quietly sum prsq

. di r(sum)
28.288336

So the deviance is 29.9 as noted at the outset, and Pearson’s chi-squared is 28.3. We now list all groups with squared Pearson residuals above 3.84 (equivalent to absolute values above 1.96).

. gen pobs = users/n

. format %5.3f pobs pfit

. format %6.2f pr dr

. list age educ nomore pobs pfit pr dr if pr^2 > 3.84

     ┌─────────────────────────────────────────────────────────────────┐
     │   age             educ   nomore    pobs    pfit      pr      dr │
     ├─────────────────────────────────────────────────────────────────┤
  4. │   <25   upper primary+   nomore   0.167   0.308   -2.38   -2.51 │
  8. │ 25-29   upper primary+   nomore   0.293   0.397   -2.03   -2.07 │
 13. │ 40-49   lower primary-     more   0.146   0.315   -2.32   -2.49 │
     └─────────────────────────────────────────────────────────────────┘

We see that a substantial part of the deviance of 29.92 comes from three groups where the model overestimates the probability of using contraception: young women (under age 25 and aged 25-29) with upper primary or more, who want no more children; and women in their forties with lower primary or less who want more children.

Leverage and Influence

Pregibon (1981) extended regression diagnostics to GLMs and introduced a weighted hat matrix, with diagonal elements representing the leverages. These can be calculated with the hat option of the predict command. (The corresponding option after a logit command is lev.)
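
In matrix form the weighted hat matrix (following Pregibon) is

$$ H = W^{1/2} X (X'WX)^{-1} X' W^{1/2}, $$

where W is the diagonal matrix of iterative weights, w_i = n_i π̂_i (1 - π̂_i) for a logit model. The leverages h_i are the diagonal elements of H; because they sum to the number of parameters, their mean here is 6/16 = 0.375, as the summary below confirms.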

. predict lev, hat

. sum lev

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         lev │         16        .375    .1788555   .0902827   .6696332

. gsort -lev

. format %5.3f lev

. list n age educ nomore pobs pfit lev in 1/3

     ┌───────────────────────────────────────────────────────────────┐
     │   n     age             educ   nomore    pobs    pfit     lev │
     ├───────────────────────────────────────────────────────────────┤
  1. │ 264     <25   upper primary+     more   0.197   0.162   0.670 │
  2. │ 209   25-29   upper primary+     more   0.258   0.222   0.577 │
  3. │  94   40-49   lower primary-   nomore   0.511   0.514   0.560 │
     └───────────────────────────────────────────────────────────────┘

The three cells with potentially the largest influence on the fit are young women with upper primary or more who want more children, and older women with lower primary or less who want no more children.

The elements of the hat matrix can be used to standardize Pearson (or deviance) residuals and to compute influence statistics, all easily done “by hand”. (The rs option of predict after logit calculates standardized Pearson residuals.)

. gen ps = pr/sqrt(1-lev)

. gen ds = dr/sqrt(1-lev)

. gen sc = ps^2

. gsort -sc

. format %6.2f ps ds

. list n age educ nomore pobs pfit ps ds in 1/3

     ┌──────────────────────────────────────────────────────────────────────┐
     │  n     age             educ   nomore    pobs    pfit      ps      ds │
     ├──────────────────────────────────────────────────────────────────────┤
  1. │ 60     <25   upper primary+   nomore   0.167   0.308   -2.89   -3.06 │
  2. │ 41   40-49   lower primary-     more   0.146   0.315   -2.72   -2.92 │
  3. │ 92   25-29   upper primary+   nomore   0.293   0.397   -2.69   -2.74 │
     └──────────────────────────────────────────────────────────────────────┘

We identify the same three observations picked up by the unstandardized residuals, but the absolute values are now closer to three, highlighting the lack of fit to these groups.

The cook option of the predict command after glm computes the one-step approximation of Cook’s distance. (The corresponding option after logit is dbeta.) This statistic is called Pregibon’s influence statistic in the Stata documentation, and its calculation differs from the formula on page 49 of the notes in that it leaves out the number of parameters p. This happens after logit but not after glm.
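
Assuming the formula in the notes, the one-step approximation computed by glm is

$$ D_i = \frac{h_i\, r_{s_i}^2}{p\,(1 - h_i)}, $$

where r_{s_i} is the standardized Pearson residual, h_i the leverage, and p the number of parameters; the version after logit would then be p D_i, with no division by p.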

. predict cook, cook

. sum cook

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
        cook │         16    .4712637    .6402413   .0020549   2.385867

. gsort -cook

. format %6.2f cook

. list n age educ nomore pobs pfit lev cook in 1/3

     ┌──────────────────────────────────────────────────────────────────────┐
     │   n     age             educ   nomore    pobs    pfit     lev   cook │
     ├──────────────────────────────────────────────────────────────────────┤
  1. │ 264     <25   upper primary+     more   0.197   0.162   0.670   2.39 │
  2. │ 157   30-39   lower primary-   nomore   0.510   0.444   0.542   1.18 │
  3. │  92   25-29   upper primary+   nomore   0.293   0.397   0.432   0.92 │
     └──────────────────────────────────────────────────────────────────────┘

The group with the highest leverage turned out to also be the most influential: women under 25, with upper primary or more, who want more children.
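
As a quick check, we can reproduce the cook statistic from the leverages and standardized Pearson residuals using the formula above with p = 6. This is a sketch (cookhand is our own variable name):

. gen cookhand = lev*ps^2/(6*(1 - lev))  // one-step Cook's distance, p = 6
. list cook cookhand in 1/3              // should agree up to rounding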

Goodness of Fit

With grouped data, or even with individual data where the number of covariate patterns is small, the deviance provides a goodness-of-fit test. But what if you have truly individual data with many covariate patterns?

Hosmer and Lemeshow (1980) proposed a goodness-of-fit test for logistic regression models that can be used with individual data. The basic idea is to create groups using predicted probabilities, and then compare observed and fitted counts of successes and failures in those groups using a chi-squared statistic. Simulation has shown that with g groups the large-sample distribution of the test statistic is approximately chi-squared with g-2 degrees of freedom.
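
With observed and expected counts of successes (o_1g, e_1g) and failures (o_0g, e_0g) in group g, the statistic is

$$ C = \sum_{g=1}^{G} \left[ \frac{(o_{1g} - e_{1g})^2}{e_{1g}} + \frac{(o_{0g} - e_{0g})^2}{e_{0g}} \right], $$

referred to a chi-squared distribution with G-2 degrees of freedom, where G is the number of groups.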

Let us apply this test to the Hosmer and Lemeshow low birth weight data, which happen to be available on the Stata website.

. use https://www.stata-press.com/data/r13/lbw, clear
(Hosmer & Lemeshow data)

. desc, s

Contains data from https://www.stata-press.com/data/r13/lbw.dta
 Observations:           189                  Hosmer & Lemeshow data
    Variables:            11                  15 Jan 2013 05:01
Sorted by: 

The outcome is an indicator of low birth weight (< 2500 grams). The predictors include mother’s age, her weight at the last menstrual period, and indicators of ethnicity, smoking during pregnancy, history of premature labor, history of hypertension, and presence of uterine irritability. Here’s the fit:

. logit low age lwt i.race smoke ptl ht ui

Iteration 0:   log likelihood =   -117.336  
Iteration 1:   log likelihood = -101.28644  
Iteration 2:   log likelihood = -100.72617  
Iteration 3:   log likelihood =   -100.724  
Iteration 4:   log likelihood =   -100.724  

Logistic regression                                     Number of obs =    189
                                                        LR chi2(8)    =  33.22
                                                        Prob > chi2   = 0.0001
Log likelihood = -100.724                               Pseudo R2     = 0.1416

─────────────┬────────────────────────────────────────────────────────────────
         low │ Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
         age │  -.0271003   .0364504    -0.74   0.457    -.0985418    .0443412
         lwt │  -.0151508   .0069259    -2.19   0.029    -.0287253   -.0015763
             │
        race │
      black  │   1.262647   .5264101     2.40   0.016     .2309024    2.294392
      other  │   .8620792   .4391532     1.96   0.050     .0013548    1.722804
             │
       smoke │   .9233448   .4008266     2.30   0.021      .137739    1.708951
         ptl │   .5418366    .346249     1.56   0.118     -.136799    1.220472
          ht │   1.832518   .6916292     2.65   0.008     .4769494    3.188086
          ui │   .7585135   .4593768     1.65   0.099    -.1418484    1.658875
       _cons │   .4612239    1.20459     0.38   0.702    -1.899729    2.822176
─────────────┴────────────────────────────────────────────────────────────────

. estat gof

Goodness-of-fit test after logistic model
Variable: low

      Number of observations =    189
Number of covariate patterns =    182
           Pearson chi2(173) = 179.24
                 Prob > chi2 = 0.3567

The sample of 189 observations has 182 different covariate patterns, so we can’t test goodness of fit this way. We try instead the Hosmer-Lemeshow test, asking for 10 groups. We also specify the table option to get more detailed information.

. estat gof, group(10) table
note: obs collapsed on 10 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: low

  Table collapsed on quantiles of estimated probabilities
  ┌───────┬────────┬───────┬───────┬───────┬───────┬───────┐
  │ Group │   Prob │ Obs_1 │ Exp_1 │ Obs_0 │ Exp_0 │ Total │
  ├───────┼────────┼───────┼───────┼───────┼───────┼───────┤
  │     1 │ 0.0827 │     0 │   1.2 │    19 │  17.8 │    19 │
  │     2 │ 0.1276 │     2 │   2.0 │    17 │  17.0 │    19 │
  │     3 │ 0.2015 │     6 │   3.2 │    13 │  15.8 │    19 │
  │     4 │ 0.2432 │     1 │   4.3 │    18 │  14.7 │    19 │
  │     5 │ 0.2792 │     7 │   4.9 │    12 │  14.1 │    19 │
  ├───────┼────────┼───────┼───────┼───────┼───────┼───────┤
  │     6 │ 0.3138 │     7 │   5.6 │    12 │  13.4 │    19 │
  │     7 │ 0.3872 │     6 │   6.5 │    13 │  12.5 │    19 │
  │     8 │ 0.4828 │     7 │   8.2 │    12 │  10.8 │    19 │
  │     9 │ 0.5941 │    10 │  10.3 │     9 │   8.7 │    19 │
  │    10 │ 0.8391 │    13 │  12.8 │     5 │   5.2 │    18 │
  └───────┴────────┴───────┴───────┴───────┴───────┴───────┘

 Number of observations =    189
       Number of groups =     10
Hosmer–Lemeshow chi2(8) =   9.65
            Prob > chi2 = 0.2904

We get a Hosmer-Lemeshow chi-squared value of 9.65 on 8 d.f. with a p-value of 0.2904, and thus no evidence of lack of fit.
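
To see the mechanics, here is a rough by-hand version of the same calculation. This is only a sketch (phat and decile are our own variable names): xtile’s treatment of ties at the quantile boundaries need not coincide exactly with estat gof’s, so the result may differ slightly.

. predict phat, pr              // predicted probabilities
. xtile decile = phat, nq(10)   // ten groups on quantiles of phat
. preserve
. collapse (sum) obs1=low exp1=phat (count) total=low, by(decile)
. gen obs0 = total - obs1
. gen exp0 = total - exp1
. gen chi = (obs1-exp1)^2/exp1 + (obs0-exp0)^2/exp0
. quietly sum chi
. di "HL chi2(8) = " %5.2f r(sum) "  p-value = " %6.4f chi2tail(8, r(sum))
. restore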

Paul Allison (2013) has noted that the Hosmer-Lemeshow test is sensitive to the number of groups used. He provides an example where a test with 10 groups yields a p-value just below 0.05, but working with 9 or 11 groups raises it to 0.11 and 0.64, respectively. In our example the p-values also vary with the number of groups, but the conclusion does not change.

References

Allison, P. (2013). Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression. Online at https://statisticalhorizons.com/hosmer-lemeshow/.

Hosmer, D.W. and Lemeshow, S. (1980). A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics, A10:1043-1069.

Pregibon, D. (1981). Logistic Regression Diagnostics. The Annals of Statistics, 9(4):705-724.

Updated fall 2022