We now consider regression diagnostics for binary data, focusing on logistic regression models. We will work with the additive model of contraceptive use by age, education, and desire for more children, which we know to be inadequate.
Stata offers several tools as part of the predict and
estat post-estimation commands. These are available after
issuing a logit or logistic command. We
continue to use glm, which offers many options.
When working with individual data Stata relies heavily on the concept of covariate patterns, grouping together all observations that share the same values of the covariates. In particular, it defines as saturated a model that has a separate parameter for each covariate pattern, not for each observation. We will consider a model with individual data soon.
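To make the idea concrete, here is a small Python sketch (with made-up records, not the Fiji data) that collapses individual observations into covariate patterns, the grouping Stata uses behind the scenes:

```python
from collections import Counter

# Hypothetical individual-level records: (age_group, educ, nomore, y),
# where y = 1 if the woman uses contraception. Values are illustrative.
records = [
    ("<25", 1, 0, 1), ("<25", 1, 0, 0), ("<25", 1, 0, 1),
    ("25-29", 0, 1, 0), ("25-29", 0, 1, 1), ("<25", 1, 0, 0),
]

# Collapse to covariate patterns: one entry per unique (age, educ, nomore),
# carrying the count of users and the group size n.
users, totals = Counter(), Counter()
for age, educ, nomore, y in records:
    key = (age, educ, nomore)
    users[key] += y
    totals[key] += 1

patterns = {k: (users[k], totals[k]) for k in totals}
# Six observations collapse to two covariate patterns; a saturated model
# would have one parameter per pattern, not per observation.
print(patterns)
```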
So here’s the additive model for the contraceptive use data.
. use https://grodri.github.io/datasets/cusew, clear
(Contraceptive Use Data (Fiji, 1976))
. gen n = users + nonusers
. label define nomore 0 "more" 1 "nomore", modify // shorter
. label define education 0 "lower primary-" 1 "upper primary+", modify
. glm users i.age educ nomore, family(binomial n)
Iteration 0: log likelihood = -50.793156
Iteration 1: log likelihood = -50.712573
Iteration 2: log likelihood = -50.712565
Iteration 3: log likelihood = -50.712565
Generalized linear models Number of obs = 16
Optimization : ML Residual df = 10
Scale parameter = 1
Deviance = 29.91722173 (1/df) Deviance = 2.991722
Pearson = 28.28833641 (1/df) Pearson = 2.828834
Variance function: V(u) = u*(1-u/n) [Binomial]
Link function : g(u) = ln(u/(n-u)) [Logit]
AIC = 7.089071
Log likelihood = -50.7125647 BIC = 2.191335
─────────────┬────────────────────────────────────────────────────────────────
│ OIM
users │ Coefficient std. err. z P>|z| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
age │
25-29 │ .3893816 .1758501 2.21 0.027 .0447219 .7340414
30-39 │ .9086135 .1646211 5.52 0.000 .5859621 1.231265
40-49 │ 1.189239 .21443 5.55 0.000 .7689639 1.609514
│
educ │ .3249947 .1240355 2.62 0.009 .0818894 .5680999
nomore │ .8329548 .1174705 7.09 0.000 .6027169 1.063193
_cons │ -1.966169 .1720307 -11.43 0.000 -2.303343 -1.628995
─────────────┴────────────────────────────────────────────────────────────────
The additive model has a deviance of 29.92 on 10 d.f. when we define the saturated model in terms of the 16 groups of women.
The predict command can be used to obtain predicted
probabilities, deviance residuals and Pearson residuals. These residuals
are the signed square roots of the contributions to the model deviance
or Pearson chi-squared statistic.
. predict fv, mu // fitted value
. gen pfit = fv/n // probability
. predict dr, dev // deviance residual
. predict pr, pear // Pearson residual
We will verify that if we square these residuals and sum them over the groups we get the deviance and Pearson chi-squared statistics.
. gen drsq = dr^2
. quietly sum drsq
. di r(sum)
29.917221
. gen prsq = pr^2
. quietly sum prsq
. di r(sum)
28.288336
So the deviance is 29.9 as noted at the outset, and Pearson’s chi-squared is 28.3. We now list all groups with squared Pearson residuals above 3.84 (equivalent to absolute values above 1.96).
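The underlying formulas are easy to check outside Stata. A minimal Python sketch (the function name is ours, not a Stata command) computes both residuals for one binomial group:

```python
from math import log, sqrt

def binom_residuals(y, n, p):
    """Pearson and deviance residuals for one binomial group with y
    successes in n trials and fitted probability p (a sketch of the
    formulas, not Stata's implementation)."""
    mu = n * p
    pearson = (y - mu) / sqrt(n * p * (1 - p))
    # x*log(x/m) terms, taken as 0 when x = 0 so the limits are right
    def xlogx(x, m):
        return x * log(x / m) if x > 0 else 0.0
    contribution = 2 * (xlogx(y, mu) + xlogx(n - y, n - mu))
    deviance = (1 if y >= mu else -1) * sqrt(contribution)
    return pearson, deviance

# First listed group: y = 10 users out of n = 60, fitted probability
# 0.308; reproduces the tabulated -2.38 and -2.51 up to rounding.
print(binom_residuals(10, 60, 0.308))
```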
. gen pobs = users/n
. format %5.3f pobs pfit
. format %6.2f pr dr
. list age educ nomore pobs pfit pr dr if pr^2 > 3.84
┌─────────────────────────────────────────────────────────────────┐
│ age educ nomore pobs pfit pr dr │
├─────────────────────────────────────────────────────────────────┤
4. │ <25 upper primary+ nomore 0.167 0.308 -2.38 -2.51 │
8. │ 25-29 upper primary+ nomore 0.293 0.397 -2.03 -2.07 │
13. │ 40-49 lower primary- more 0.146 0.315 -2.32 -2.49 │
└─────────────────────────────────────────────────────────────────┘
We see that a substantial part of the deviance of 29.92 comes from three groups where the model overestimates the probability of using contraception: young women (under age 25 and aged 25-29) with upper primary or more, who want no more children; and women in their forties with lower primary or less who want more children.
Pregibon (1981) extended regression diagnostics to GLMs and
introduced a weighted hat matrix, with diagonal elements representing
the leverages. These can be calculated with the hat option
of the predict command. (The corresponding option after a
logit command is lev.)
. predict lev, hat
. sum lev
Variable │ Obs Mean Std. dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
lev │ 16 .375 .1788555 .0902827 .6696332
. gsort -lev
. format %5.3f lev
. list n age educ nomore pobs pfit lev in 1/3
┌───────────────────────────────────────────────────────────────┐
│ n age educ nomore pobs pfit lev │
├───────────────────────────────────────────────────────────────┤
1. │ 264 <25 upper primary+ more 0.197 0.162 0.670 │
2. │ 209 25-29 upper primary+ more 0.258 0.222 0.577 │
3. │ 94 40-49 lower primary- nomore 0.511 0.514 0.560 │
└───────────────────────────────────────────────────────────────┘
The three cells with potentially the largest influence in the fit are young women with upper primary or more who want more children, and older women with lower primary or less who want no more children.
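Pregibon’s weighted hat matrix is H = W^{1/2} X (X'WX)^{-1} X' W^{1/2}, with W = diag{n_i p_i (1 - p_i)} for binomial data. A Python sketch with made-up data (not the contraceptive-use fit) shows the computation; a useful check is that the leverages add up to the number of parameters:

```python
import numpy as np

def leverages(X, n, p):
    """Diagonal of the weighted hat matrix for a binomial GLM:
    H = W^{1/2} X (X'WX)^{-1} X' W^{1/2}, W = diag(n_i p_i (1 - p_i))."""
    w = n * p * (1 - p)
    Xw = X * np.sqrt(w)[:, None]            # W^{1/2} X
    H = Xw @ np.linalg.solve(Xw.T @ Xw, Xw.T)
    return np.diag(H)

# Small made-up design: intercept plus one covariate, four groups.
X = np.array([[1, 0], [1, 1], [1, 2], [1, 3]], dtype=float)
n = np.array([50, 60, 40, 30], dtype=float)
p = np.array([0.2, 0.35, 0.5, 0.65])
h = leverages(X, n, p)
print(h.sum())   # trace of H equals the number of parameters, here 2
```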
The elements of the hat matrix can be used to standardize Pearson (or
deviance) residuals and to compute influence statistics. These are
easily done “by hand”. (The rs option
of predict after logit calculates standardized
Pearson residuals.)
. gen ps = pr/sqrt(1-lev)
. gen ds = dr/sqrt(1-lev)
. gen sc = ps^2
. gsort -sc
. format %6.2f ps ds
. list n age educ nomore pobs pfit ps ds in 1/3
┌──────────────────────────────────────────────────────────────────────┐
│ n age educ nomore pobs pfit ps ds │
├──────────────────────────────────────────────────────────────────────┤
1. │ 60 <25 upper primary+ nomore 0.167 0.308 -2.89 -3.06 │
2. │ 41 40-49 lower primary- more 0.146 0.315 -2.72 -2.92 │
3. │ 92 25-29 upper primary+ nomore 0.293 0.397 -2.69 -2.74 │
└──────────────────────────────────────────────────────────────────────┘
We identify the same three observations picked up by the unstandardized residuals, but the absolute values are now closer to three, highlighting the lack of fit to these groups.
The cook option of the predict command
after glm computes the one-step approximation of Cook’s
distance. (The corresponding option after logit
is db.) The Stata documentation calls this statistic
Pregibon’s influence statistic. Its calculation differs from
the formula on page 49 of the notes in that it leaves out the number of
parameters p; this happens after logit but not
after glm.
. predict cook, cook
. sum cook
Variable │ Obs Mean Std. dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
cook │ 16 .4712637 .6402413 .0020549 2.385867
. gsort -cook
. format %6.2f cook
. list n age educ nomore pobs pfit lev cook in 1/3
┌──────────────────────────────────────────────────────────────────────┐
│ n age educ nomore pobs pfit lev cook │
├──────────────────────────────────────────────────────────────────────┤
1. │ 264 <25 upper primary+ more 0.197 0.162 0.670 2.39 │
2. │ 157 30-39 lower primary- nomore 0.510 0.444 0.542 1.18 │
3. │ 92 25-29 upper primary+ nomore 0.293 0.397 0.432 0.92 │
└──────────────────────────────────────────────────────────────────────┘
The group with the highest leverage turned out to be also the most influential: women under 25, with upper primary or more, who want more children.
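The one-step approximation in the notes, C = r_s² h / (p(1 − h)) with r_s the standardized Pearson residual and h the leverage, can be checked by hand from the rounded values in the tables above. A Python one-liner for the group of women aged 25-29 with upper primary or more who want no more children, using p = 6 parameters for the additive model:

```python
# Rounded values from the tables above: standardized Pearson residual,
# leverage, and number of parameters in the additive model.
ps, h, p = -2.69, 0.432, 6
cook = ps**2 * h / (p * (1 - h))
print(round(cook, 2))   # 0.92, matching the value listed for this group
```

Dropping the division by p, as predict does after logit, would scale this value up by a factor of six.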
With grouped data, or even with individual data where the number of covariate patterns is small, the deviance provides a goodness-of-fit test. But what if you have truly individual data with many covariate patterns?
Hosmer and Lemeshow (1980) proposed a goodness-of-fit test for logistic regression models that can be used with individual data. The basic idea is to create groups using the predicted probabilities, and then compare observed and fitted counts of successes and failures in those groups using a chi-squared statistic. Simulation has shown that with g groups the large-sample distribution of the test statistic is approximately chi-squared with g-2 degrees of freedom.
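The computation is straightforward; a simplified Python sketch (the function name is ours, and unlike Stata’s estat gof it does not handle ties in the quantiles of the predicted probabilities):

```python
def hosmer_lemeshow(y, phat, g=10):
    """Hosmer-Lemeshow statistic: sort observations by predicted
    probability, split them into g near-equal groups, and sum
    (observed - expected)^2 / expected over the success and failure
    cells. Compare the result to chi-squared with g - 2 df."""
    pairs = sorted(zip(phat, y))
    size = len(pairs) / g
    chi2 = 0.0
    for k in range(g):
        chunk = pairs[round(k * size):round((k + 1) * size)]
        obs1 = sum(yi for _, yi in chunk)       # observed successes
        exp1 = sum(pi for pi, _ in chunk)       # expected successes
        obs0, exp0 = len(chunk) - obs1, len(chunk) - exp1
        chi2 += (obs1 - exp1) ** 2 / exp1 + (obs0 - exp0) ** 2 / exp0
    return chi2
```

When the fitted probabilities match the observed proportions in every group the statistic is zero; lack of fit inflates it.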
Let us apply this test to the Hosmer and Lemeshow low birth weight data, which happen to be available on the Stata website.
. use https://www.stata-press.com/data/r13/lbw, clear
(Hosmer & Lemeshow data)
. desc, s
Contains data from https://www.stata-press.com/data/r13/lbw.dta
Observations: 189 Hosmer & Lemeshow data
Variables: 11 15 Jan 2013 05:01
Sorted by:
The outcome is an indicator of low birth weight (< 2500 grams). The predictors include mother’s age, her weight at the last menstrual period, and indicators of ethnicity, smoking during pregnancy, history of premature labor, history of hypertension, and presence of uterine irritability. Here’s the fit
. logit low age lwt i.race smoke ptl ht ui
Iteration 0: log likelihood = -117.336
Iteration 1: log likelihood = -101.28644
Iteration 2: log likelihood = -100.72617
Iteration 3: log likelihood = -100.724
Iteration 4: log likelihood = -100.724
Logistic regression Number of obs = 189
LR chi2(8) = 33.22
Prob > chi2 = 0.0001
Log likelihood = -100.724 Pseudo R2 = 0.1416
─────────────┬────────────────────────────────────────────────────────────────
low │ Coefficient Std. err. z P>|z| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
age │ -.0271003 .0364504 -0.74 0.457 -.0985418 .0443412
lwt │ -.0151508 .0069259 -2.19 0.029 -.0287253 -.0015763
│
race │
black │ 1.262647 .5264101 2.40 0.016 .2309024 2.294392
other │ .8620792 .4391532 1.96 0.050 .0013548 1.722804
│
smoke │ .9233448 .4008266 2.30 0.021 .137739 1.708951
ptl │ .5418366 .346249 1.56 0.118 -.136799 1.220472
ht │ 1.832518 .6916292 2.65 0.008 .4769494 3.188086
ui │ .7585135 .4593768 1.65 0.099 -.1418484 1.658875
_cons │ .4612239 1.20459 0.38 0.702 -1.899729 2.822176
─────────────┴────────────────────────────────────────────────────────────────
. estat gof
Goodness-of-fit test after logistic model
Variable: low
Number of observations = 189
Number of covariate patterns = 182
Pearson chi2(173) = 179.24
Prob > chi2 = 0.3567
The sample of 189 observations has 182 different covariate patterns,
so we can’t test goodness of fit this way. We try instead the
Hosmer-Lemeshow test, asking for 10 groups. We also specify the
table option to get more detailed information.
. estat gof, group(10) table
note: obs collapsed on 10 quantiles of estimated probabilities.
Goodness-of-fit test after logistic model
Variable: low
Table collapsed on quantiles of estimated probabilities
┌───────┬────────┬───────┬───────┬───────┬───────┬───────┐
│ Group │ Prob │ Obs_1 │ Exp_1 │ Obs_0 │ Exp_0 │ Total │
├───────┼────────┼───────┼───────┼───────┼───────┼───────┤
│ 1 │ 0.0827 │ 0 │ 1.2 │ 19 │ 17.8 │ 19 │
│ 2 │ 0.1276 │ 2 │ 2.0 │ 17 │ 17.0 │ 19 │
│ 3 │ 0.2015 │ 6 │ 3.2 │ 13 │ 15.8 │ 19 │
│ 4 │ 0.2432 │ 1 │ 4.3 │ 18 │ 14.7 │ 19 │
│ 5 │ 0.2792 │ 7 │ 4.9 │ 12 │ 14.1 │ 19 │
├───────┼────────┼───────┼───────┼───────┼───────┼───────┤
│ 6 │ 0.3138 │ 7 │ 5.6 │ 12 │ 13.4 │ 19 │
│ 7 │ 0.3872 │ 6 │ 6.5 │ 13 │ 12.5 │ 19 │
│ 8 │ 0.4828 │ 7 │ 8.2 │ 12 │ 10.8 │ 19 │
│ 9 │ 0.5941 │ 10 │ 10.3 │ 9 │ 8.7 │ 19 │
│ 10 │ 0.8391 │ 13 │ 12.8 │ 5 │ 5.2 │ 18 │
└───────┴────────┴───────┴───────┴───────┴───────┴───────┘
Number of observations = 189
Number of groups = 10
Hosmer–Lemeshow chi2(8) = 9.65
Prob > chi2 = 0.2904
We get a Hosmer-Lemeshow chi-squared value of 9.65 on 8 d.f. with a p-value of 0.2904, and thus no evidence of lack of fit.
Paul Allison (2013) has noted that the Hosmer-Lemeshow test is sensitive to the number of groups used. He provides an example where a test with 10 groups yields a p-value just below 0.05, but working with 9 or 11 groups raises it to 0.11 and 0.64 respectively. In our example the p-values also vary with the number of groups, but the conclusion does not change.
Allison, P. (2013). Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression. Online at https://statisticalhorizons.com/hosmer-lemeshow/.
Hosmer, D.W. and Lemeshow, S. (1980). A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics, A10:1043-1069.
Pregibon, D. (1981) Logistic Regression Diagnostics. The Annals of Statistics, 9(4): 705-724. JSTOR.
Updated fall 2022