We now consider regression diagnostics for binary data, focusing on logistic regression models. We will work with the additive model of contraceptive use by age, education, and desire for more children, which we know to be inadequate.
Stata offers several tools as part of the predict and estat post-estimation commands. These are available after issuing a logit or logistic command. We continue to use glm, which offers many options.
When working with individual data, Stata relies heavily on the concept of covariate patterns, grouping together all observations that share the same values of the covariates. In particular, it defines as saturated a model that has a separate parameter for each covariate pattern, not for each observation. We will consider a model with individual data soon.
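With individual data it is easy to count the covariate patterns directly. Here is a minimal sketch, assuming the three predictors used below (age, educ, and nomore); pattern is a variable name introduced just for this check:

. * count distinct covariate patterns (sketch; predictor names assumed)
. egen pattern = group(age educ nomore)
. quietly sum pattern
. di r(max) " covariate patterns"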
So here’s the additive model for the contraceptive use data.
. use https://grodri.github.io/datasets/cusew, clear
(Contraceptive Use Data (Fiji, 1976))

. gen n = users + nonusers

. label define nomore 0 "more" 1 "nomore", modify // shorter
. label define education 0 "lower primary-" 1 "upper primary+", modify

. glm users i.age educ nomore, family(binomial n)

Iteration 0:  log likelihood = -50.793156
Iteration 1:  log likelihood = -50.712573
Iteration 2:  log likelihood = -50.712565
Iteration 3:  log likelihood = -50.712565

Generalized linear models                        Number of obs   =         16
Optimization     : ML                            Residual df     =         10
                                                 Scale parameter =          1
Deviance         =  29.91722173                  (1/df) Deviance =   2.991722
Pearson          =  28.28833641                  (1/df) Pearson  =   2.828834

Variance function: V(u) = u*(1-u/n)              [Binomial]
Link function    : g(u) = ln(u/(n-u))            [Logit]

                                                 AIC             =   7.089071
Log likelihood   = -50.7125647                   BIC             =   2.191335

─────────────┬────────────────────────────────────────────────────────────────
             │                 OIM
       users │ Coefficient  std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
         age │
      25-29  │   .3893816   .1758501     2.21   0.027     .0447219    .7340414
      30-39  │   .9086135   .1646211     5.52   0.000     .5859621    1.231265
      40-49  │   1.189239     .21443     5.55   0.000     .7689639    1.609514
             │
        educ │   .3249947   .1240355     2.62   0.009     .0818894    .5680999
      nomore │   .8329548   .1174705     7.09   0.000     .6027169    1.063193
       _cons │  -1.966169   .1720307   -11.43   0.000    -2.303343   -1.628995
─────────────┴────────────────────────────────────────────────────────────────
The additive model has a deviance of 29.92 on 10 d.f. when we define the saturated model in terms of the 16 groups of women.
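Since the 5% critical value of a chi-squared distribution with 10 d.f. is 18.31, this deviance is clearly significant. A quick check of the p-value:

. di chi2tail(10, 29.917)    // just under 0.001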
The predict command can be used to obtain predicted probabilities, deviance residuals, and Pearson residuals. These residuals are the signed square roots of the contributions to the model deviance or Pearson chi-squared statistic.
. predict fv, mu      // fitted value
. gen pfit = fv/n     // probability
. predict dr, dev     // deviance residual
. predict pr, pear    // Pearson residual
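As a sanity check, the Pearson residuals can be reproduced “by hand” from the variance function shown in the glm header, V(u) = u*(1-u/n); prcheck below is a name introduced just for this verification:

. * Pearson residual = (observed - fitted)/sqrt(variance)
. gen prcheck = (users - fv)/sqrt(fv*(1 - fv/n))
. assert abs(pr - prcheck) < 1e-5    // agrees up to float precision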
We will verify that if we square these residuals and sum them over the groups we get the deviance and Pearson chi-squared statistics.
. gen drsq = dr^2
. quietly sum drsq
. di r(sum)
29.917221

. gen prsq = pr^2
. quietly sum prsq
. di r(sum)
28.288336
So the deviance is 29.9 as noted at the outset, and Pearson’s chi-squared is 28.3. We now list all groups with squared Pearson residuals above 3.84 (equivalently, absolute values above 1.96).
. gen pobs = users/n
. format %5.3f pobs pfit
. format %6.2f pr dr
. list age educ nomore pobs pfit pr dr if pr^2 > 3.84

     ┌─────────────────────────────────────────────────────────────────┐
     │   age             educ   nomore    pobs    pfit      pr      dr │
     ├─────────────────────────────────────────────────────────────────┤
  4. │   <25   upper primary+   nomore   0.167   0.308   -2.38   -2.51 │
  8. │ 25-29   upper primary+   nomore   0.293   0.397   -2.03   -2.07 │
 13. │ 40-49   lower primary-     more   0.146   0.315   -2.32   -2.49 │
     └─────────────────────────────────────────────────────────────────┘
We see that a substantial part of the deviance of 29.92 comes from three groups where the model overestimates the probability of using contraception: young women (under age 25 and aged 25-29) with upper primary or more, who want no more children; and women in their forties with lower primary or less who want more children.
Pregibon (1981) extended regression diagnostics to GLMs and introduced a weighted hat matrix, with diagonal elements representing the leverages. These can be calculated with the hat option of the predict command. (The corresponding option after a logit command is lev.)
. predict lev, hat

. sum lev

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         lev │         16        .375    .1788555   .0902827   .6696332

. gsort -lev
. format %5.3f lev
. list n age educ nomore pobs pfit lev in 1/3

     ┌───────────────────────────────────────────────────────────────┐
     │   n     age             educ   nomore    pobs    pfit     lev │
     ├───────────────────────────────────────────────────────────────┤
  1. │ 264     <25   upper primary+     more   0.197   0.162   0.670 │
  2. │ 209   25-29   upper primary+     more   0.258   0.222   0.577 │
  3. │  94   40-49   lower primary-   nomore   0.511   0.514   0.560 │
     └───────────────────────────────────────────────────────────────┘
The three cells with potentially the largest influence in the fit are young women with upper primary or more who want more children, and older women with lower primary or less who want no more children.
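A quick sanity check on these values: the leverages add up to the number of parameters in the model, here six, which is consistent with the mean of 0.375 over 16 groups shown above:

. quietly sum lev
. di r(sum)    // trace of the hat matrix; should be 6, the number of parameters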
The elements of the hat matrix can be used to standardize Pearson (or deviance) residuals and to compute influence statistics. These are easily done “by hand”. (The rs option of predict after logit calculates standardized Pearson residuals.)
. gen ps = pr/sqrt(1-lev)
. gen ds = dr/sqrt(1-lev)
. gen sc = ps^2

. gsort -sc
. format %6.2f ps ds
. list n age educ nomore pobs pfit ps ds in 1/3

     ┌──────────────────────────────────────────────────────────────────────┐
     │   n     age             educ   nomore    pobs    pfit      ps     ds │
     ├──────────────────────────────────────────────────────────────────────┤
  1. │  60     <25   upper primary+   nomore   0.167   0.308   -2.89  -3.06 │
  2. │  41   40-49   lower primary-     more   0.146   0.315   -2.72  -2.92 │
  3. │  92   25-29   upper primary+   nomore   0.293   0.397   -2.69  -2.74 │
     └──────────────────────────────────────────────────────────────────────┘
We identify the same three observations picked up by the unstandardized residuals, but the absolute values are now closer to three, highlighting the lack of fit to these groups.
The cook option of the predict command after glm computes the one-step approximation of Cook’s distance. (The option is called db in predict after logit.) This statistic is called Pregibon’s influence statistic in the Stata documentation, and the calculation after logit, but not after glm, differs from the formula on page 49 of the notes in that it leaves out the number of parameters p.
. predict cook, cook

. sum cook

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
        cook │         16    .4712637    .6402413   .0020549   2.385867

. gsort -cook
. format %6.2f cook
. list n age educ nomore pobs pfit lev cook in 1/3

     ┌──────────────────────────────────────────────────────────────────────┐
     │   n     age             educ   nomore    pobs    pfit     lev   cook │
     ├──────────────────────────────────────────────────────────────────────┤
  1. │ 264     <25   upper primary+     more   0.197   0.162   0.670   2.39 │
  2. │ 157   30-39   lower primary-   nomore   0.510   0.444   0.542   1.18 │
  3. │  92   25-29   upper primary+   nomore   0.293   0.397   0.432   0.92 │
     └──────────────────────────────────────────────────────────────────────┘
The group with the highest leverage also turned out to be the most influential: women under 25, with upper primary or more, who want more children.
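We can also reproduce glm’s one-step approximation from quantities we already have, since Cook’s distance here equals the squared standardized Pearson residual times lev/(p(1-lev)), with p = 6 parameters. A sketch, with cookcheck a name introduced just for the check:

. gen cookcheck = sc*lev/(6*(1 - lev))    // sc = ps^2 from above; p = 6
. list cook cookcheck in 1/3              // should agree up to rounding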
With grouped data, or even with individual data where the number of covariate patterns is small, the deviance provides a goodness-of-fit test. But what if you have truly individual data with many covariate patterns?
Hosmer and Lemeshow (1980) proposed a goodness-of-fit test for logistic regression models that can be used with individual data. The basic idea is to create groups based on the predicted probabilities, and then compare observed and fitted counts of successes and failures in those groups using a chi-squared statistic. Simulation has shown that with g groups the large-sample distribution of the test statistic is approximately chi-squared with g-2 degrees of freedom.
Let us apply this test to the Hosmer and Lemeshow low birth weight data, which happen to be available on the Stata website.
. use https://www.stata-press.com/data/r13/lbw, clear
(Hosmer & Lemeshow data)

. desc, s

Contains data from https://www.stata-press.com/data/r13/lbw.dta
 Observations:         189                  Hosmer & Lemeshow data
    Variables:          11                  15 Jan 2013 05:01
Sorted by:
The outcome is an indicator of low birth weight (< 2500 grams). The predictors include the mother’s age, her weight at the last menstrual period, and indicators of ethnicity, smoking during pregnancy, history of premature labor, history of hypertension, and presence of uterine irritability. Here’s the fit:
. logit low age lwt i.race smoke ptl ht ui

Iteration 0:  log likelihood =   -117.336
Iteration 1:  log likelihood = -101.28644
Iteration 2:  log likelihood = -100.72617
Iteration 3:  log likelihood =   -100.724
Iteration 4:  log likelihood =   -100.724

Logistic regression                                     Number of obs =    189
                                                        LR chi2(8)    =  33.22
                                                        Prob > chi2   = 0.0001
Log likelihood = -100.724                               Pseudo R2     = 0.1416

─────────────┬────────────────────────────────────────────────────────────────
         low │ Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
         age │  -.0271003   .0364504    -0.74   0.457    -.0985418    .0443412
         lwt │  -.0151508   .0069259    -2.19   0.029    -.0287253   -.0015763
             │
        race │
      black  │   1.262647   .5264101     2.40   0.016     .2309024    2.294392
      other  │   .8620792   .4391532     1.96   0.050     .0013548    1.722804
             │
       smoke │   .9233448   .4008266     2.30   0.021      .137739    1.708951
         ptl │   .5418366    .346249     1.56   0.118     -.136799    1.220472
          ht │   1.832518   .6916292     2.65   0.008     .4769494    3.188086
          ui │   .7585135   .4593768     1.65   0.099    -.1418484    1.658875
       _cons │   .4612239    1.20459     0.38   0.702    -1.899729    2.822176
─────────────┴────────────────────────────────────────────────────────────────

. estat gof

Goodness-of-fit test after logistic model
Variable: low

 Number of observations       =    189
 Number of covariate patterns =    182
 Pearson chi2(173)            = 179.24
 Prob > chi2                  = 0.3567
The sample of 189 observations has 182 different covariate patterns, so we can’t test goodness of fit this way. We try instead the Hosmer-Lemeshow test, asking for 10 groups. We also specify the table option to get more detailed information.
. estat gof, group(10) table
note: obs collapsed on 10 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: low

Table collapsed on quantiles of estimated probabilities

┌───────┬────────┬───────┬───────┬───────┬───────┬───────┐
│ Group │   Prob │ Obs_1 │ Exp_1 │ Obs_0 │ Exp_0 │ Total │
├───────┼────────┼───────┼───────┼───────┼───────┼───────┤
│     1 │ 0.0827 │     0 │   1.2 │    19 │  17.8 │    19 │
│     2 │ 0.1276 │     2 │   2.0 │    17 │  17.0 │    19 │
│     3 │ 0.2015 │     6 │   3.2 │    13 │  15.8 │    19 │
│     4 │ 0.2432 │     1 │   4.3 │    18 │  14.7 │    19 │
│     5 │ 0.2792 │     7 │   4.9 │    12 │  14.1 │    19 │
├───────┼────────┼───────┼───────┼───────┼───────┼───────┤
│     6 │ 0.3138 │     7 │   5.6 │    12 │  13.4 │    19 │
│     7 │ 0.3872 │     6 │   6.5 │    13 │  12.5 │    19 │
│     8 │ 0.4828 │     7 │   8.2 │    12 │  10.8 │    19 │
│     9 │ 0.5941 │    10 │  10.3 │     9 │   8.7 │    19 │
│    10 │ 0.8391 │    13 │  12.8 │     5 │   5.2 │    18 │
└───────┴────────┴───────┴───────┴───────┴───────┴───────┘

 Number of observations  =    189
 Number of groups        =     10
 Hosmer–Lemeshow chi2(8) =   9.65
 Prob > chi2             = 0.2904
We get a Hosmer-Lemeshow chi-squared value of 9.65 on 8 d.f. with a p-value of 0.2904, and thus no evidence of lack of fit.
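To see the mechanics of the test, one can compute a rough version of the statistic by hand. This is a sketch only: estat gof is more careful about ties in the estimated probabilities when forming the groups, so the result may differ slightly. The names phat and decile, and the variables created by collapse, are introduced just for this illustration:

. preserve
. predict phat, pr                        // predicted probabilities
. xtile decile = phat, nq(10)             // ten groups of roughly equal size
. collapse (sum) obs1=low exp1=phat (count) total=low, by(decile)
. gen obs0 = total - obs1
. gen exp0 = total - exp1
. gen chi = (obs1 - exp1)^2/exp1 + (obs0 - exp0)^2/exp0
. quietly sum chi
. di "H-L chi2 = " %5.2f r(sum)           // compare with 9.65 above
. restore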
Paul Allison (2013) has noted that the Hosmer-Lemeshow test is sensitive to the number of groups used. He provides an example where a test with 10 groups yields a p-value just below 0.05, but working with 9 or 11 groups raises it to 0.11 and 0.64, respectively. In our example the p-values also vary with the number of groups, but the conclusion does not change.
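It is easy to probe this sensitivity by rerunning the test with different numbers of groups, as in the do-file sketch below, which assumes estat gof stores its results in r(chi2) and r(df):

forvalues g = 9/11 {
    quietly estat gof, group(`g')
    di "groups = `g': chi2 = " %5.2f r(chi2) ", p = " %6.4f chi2tail(r(df), r(chi2))
}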
Allison, P. (2013). Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression. Online at https://statisticalhorizons.com/hosmer-lemeshow/.
Hosmer, D. W. and Lemeshow, S. (1980). A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics, A10: 1043-1069.
Pregibon, D. (1981). Logistic Regression Diagnostics. The Annals of Statistics, 9(4): 705-724.
Updated fall 2022