Germán Rodríguez
Generalized Linear Models, Princeton University

3.8 Regression Diagnostics for Binary Data

We now consider regression diagnostics for binary data, focusing on logistic regression models. We will work with the additive model of contraceptive use by age, education, and desire for more children, which we know to be inadequate.

The Additive Model

So here’s the additive model for the contraceptive use data.

> library(haven)
> library(dplyr)
> cuse <- read_dta("https://grodri.github.io/datasets/cusew.dta")
> cuse <- mutate(cuse, age=as_factor(age), educ=as.numeric(educ), 
+   nomore=as.numeric(nomore), Y=cbind(users, nonusers) )
> m <- glm(Y ~ age + educ + nomore, family=binomial, data=cuse) 
> m

Call:  glm(formula = Y ~ age + educ + nomore, family = binomial, data = cuse)

Coefficients:
(Intercept)     age25-29     age30-39     age40-49         educ       nomore  
    -1.9662       0.3894       0.9086       1.1892       0.3250       0.8330  

Degrees of Freedom: 15 Total (i.e. Null);  10 Residual
Null Deviance:      165.8 
Residual Deviance: 29.92    AIC: 113.4

The additive model has a deviance of 29.92 on 10 d.f. when we define the saturated model in terms of the 16 groups of women.
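
Because the saturated model is defined over the 16 groups, the deviance can be compared directly with a chi-squared distribution on the residual d.f. as a test of goodness of fit. A quick check using the model object m; the small p-value confirms the lack of fit noted above.

> # Goodness-of-fit test for grouped data: refer the residual deviance
> # to a chi-squared distribution on the residual d.f.
> pchisq(deviance(m), df.residual(m), lower.tail=FALSE)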

Deviance and Pearson Residuals

The residuals() function can be used to obtain (among other statistics) deviance and Pearson residuals, by specifying type="deviance" (the default) or type="pearson". These residuals are the signed square roots of the contributions of each group to the model deviance or Pearson chi-squared statistic.

> cuse <- mutate(cuse, pobs=users/(users+nonusers), pfit=fitted(m), 
+   dr=residuals(m, type="deviance"), pr = residuals(m, type="pearson"))
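
As a check on the definition, here is a minimal sketch (assuming the cuse columns just created) that recomputes the Pearson residuals by hand for grouped binomial data; the deviance residuals could be rebuilt in the same spirit from the deviance contributions.

> # Pearson residual by hand: observed minus expected successes, divided by
> # the binomial standard deviation, with n the number of women in each group
> mutate(cuse, n = users + nonusers,
+     pr.hand = (users - n*pfit)/sqrt(n*pfit*(1 - pfit))) |>
+   summarize(max.diff = max(abs(pr.hand - pr)))  # should be essentially zero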

We will verify that if we square these residuals and sum them over the groups we get the deviance and Pearson chi-squared statistics.

> summarize(cuse, sdrsq=sum(dr^2), sprsq=sum(pr^2))
# A tibble: 1 × 2
  sdrsq sprsq
  <dbl> <dbl>
1  29.9  28.3

So the deviance is 29.9 as noted at the outset, and Pearson’s chi-squared is 28.3. We now list all groups with squared deviance residuals above 3.84 (same as absolute values above 1.96).

> select(cuse, age, educ, nomore, pobs, pfit, pr, dr) |> 
+ filter(dr^2 > 3.84)
# A tibble: 3 × 7
  age    educ nomore  pobs  pfit    pr    dr
  <fct> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 <25       1      1 0.167 0.308 -2.38 -2.51
2 25-29     1      1 0.293 0.397 -2.03 -2.07
3 40-49     0      0 0.146 0.315 -2.32 -2.49

We see that a substantial part of the deviance of 29.92 comes from three groups where the model overestimates the probability of using contraception: young women (under age 25 and aged 25-29) with upper primary or more, who want no more children; and women in their forties with lower primary or less who want more children.

Leverage and Influence

Pregibon (1981) extended regression diagnostics to GLMs and introduced a weighted hat matrix, with diagonal elements representing the leverages. These are calculated using the hatvalues() function.

> cuse <- mutate(cuse, lev=hatvalues(m))
> select(cuse, age, educ, nomore, pobs, pfit, lev) |>  
+   arrange(desc(lev)) |> head(3)
# A tibble: 3 × 6
  age    educ nomore  pobs  pfit   lev
  <fct> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 <25       1      0 0.197 0.162 0.670
2 25-29     1      0 0.258 0.222 0.577
3 40-49     0      1 0.511 0.514 0.560

The three cells with potentially the largest influence in the fit are young women with upper primary or more who want more children, and older women with lower primary or less who want no more children.
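
For the record, the leverages can be reproduced from first principles as the diagonal of Pregibon's weighted hat matrix, with the model matrix weighted by the working weights from the last IRLS iteration. A minimal sketch, assuming the fitted model m above:

> # Pregibon's weighted hat matrix by hand; its diagonal gives the leverages
> X <- model.matrix(m)           # design matrix
> w <- m$weights                 # working weights from the last IRLS iteration
> WX <- sqrt(w) * X              # scale each row by the square root of its weight
> H <- WX %*% solve(crossprod(WX)) %*% t(WX)
> max(abs(diag(H) - hatvalues(m)))   # should be essentially zero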

The elements of the hat matrix can be used to standardize Pearson (or deviance) residuals and to compute influence statistics. These are easily done “by hand”. (In Stata, the rs option of predict after logit calculates standardized Pearson residuals; here we compute them directly.)

> cuse <- mutate(cuse, ps = pr/sqrt(1-lev), ds = dr/sqrt(1-lev), sc=ps^2)
> select(cuse, age, educ, nomore, pobs, pfit, ps, ds, sc) |> 
+   arrange(desc(sc)) |> head(3)
# A tibble: 3 × 8
  age    educ nomore  pobs  pfit    ps    ds    sc
  <fct> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 <25       1      1 0.167 0.308 -2.89 -3.06  8.34
2 40-49     0      0 0.146 0.315 -2.72 -2.92  7.40
3 25-29     1      1 0.293 0.397 -2.69 -2.74  7.22

We identify the same three observations picked up by the unstandardized residuals, but the absolute values are now closer to three, highlighting the lack of fit to these groups.

Cook’s distance can be calculated with cooks.distance(). Don’t be surprised if things like residuals() and cooks.distance() look like the same functions used in linear models. Like many other R functions, these are generic functions. R looks at the class of the object and calls the appropriate method, depending on whether the object is a linear model fitted by lm() or a generalized linear model fitted by glm().

> cuse <- mutate(cuse, cook = cooks.distance(m))
> select(cuse, age, educ, nomore, pobs, pfit, lev, cook) |> 
+   arrange(desc(cook)) |> head(3)
# A tibble: 3 × 7
  age    educ nomore  pobs  pfit   lev  cook
  <fct> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 <25       1      0 0.197 0.162 0.670 2.39 
2 30-39     0      1 0.510 0.444 0.542 1.18 
3 25-29     1      1 0.293 0.397 0.432 0.915

The group with the highest leverage turned out to be also the most influential: women under 25, with upper primary or more, who want more children.
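
Cook's distances can also be reproduced by hand; for a binomial model (dispersion one) they equal the squared standardized Pearson residual times lev/(p(1 - lev)), where p is the number of estimated coefficients. A quick check, assuming the ps, lev and cook columns created above:

> # Cook's distance by hand (dispersion is one for the binomial family)
> p <- length(coef(m))    # number of estimated coefficients
> with(cuse, max(abs(ps^2 * lev/(p*(1 - lev)) - cook)))   # essentially zero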

Goodness of Fit

With grouped data, or even with individual data where the number of covariate patterns is small, the deviance provides a goodness-of-fit test. But what if you have truly individual data with many covariate patterns?

Hosmer and Lemeshow (1980) proposed a goodness-of-fit test for logistic regression models that can be used with individual data. The basic idea is to create groups using the predicted probabilities, and then compare observed and fitted counts of successes and failures in those groups using a chi-squared statistic. Simulation has shown that with g groups the large-sample distribution of the test statistic is approximately chi-squared with g-2 degrees of freedom.

Let us apply this test to the Hosmer and Lemeshow low birth weight data, which happen to be available on the Stata website.

> lbw <- read_dta("https://www.stata-press.com/data/r13/lbw.dta")
> lbw <- mutate(lbw, race=as_factor(race))
> names(lbw)
 [1] "id"    "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"    "ftv"  
[11] "bwt"  

The outcome is an indicator of low birth weight (< 2500 grams). The predictors include mother’s age, her weight at the last menstrual period, and indicators of ethnicity, smoking during pregnancy, history of premature labor, history of hypertension, and presence of uterine irritability. Here’s the fit:

> model <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui, 
+   family=binomial, data=lbw)
> model

Call:  glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui, 
    family = binomial, data = lbw)

Coefficients:
(Intercept)          age          lwt    raceblack    raceother        smoke  
    0.46122     -0.02710     -0.01515      1.26265      0.86208      0.92334  
        ptl           ht           ui  
    0.54184      1.83252      0.75851  

Degrees of Freedom: 188 Total (i.e. Null);  180 Residual
Null Deviance:      234.7 
Residual Deviance: 201.4    AIC: 219.4
> # count covariate patterns
> group_by(lbw, age, lwt, race, smoke, ptl, ht, ui) |> 
+   summarize(dummy=1) |> nrow()
`summarise()` has grouped output by 'age', 'lwt', 'race', 'smoke', 'ptl', 'ht'. You can
override using the `.groups` argument.
[1] 182

The sample of 189 observations has 182 different covariate patterns, so we can’t test goodness of fit this way. Instead we will compute predicted probabilities and create ten groups of approximately equal size by breaking at the deciles of the predicted probabilities. It pays to encapsulate the calculations in a function that can be reused:

> hosmer <- function(y, fv, groups=10, table=TRUE, type=2) {
+ # A simple implementation of the Hosmer-Lemeshow test
+   q <- quantile(fv, seq(0,1,1/groups), type=type)
+   fv.g <- cut(fv, breaks=q, include.lowest=TRUE)
+   obs <- xtabs( ~ fv.g + y)
+   fit <- cbind( e.0 = tapply(1-fv, fv.g, sum), e.1 = tapply(fv, fv.g, sum))
+   if(table) print(cbind(obs,fit))  
+   chi2 <- sum((obs-fit)^2/fit)
+   pval <- pchisq(chi2, groups-2, lower.tail=FALSE)
+   data.frame(test="Hosmer-Lemeshow",groups=groups,chi.sq=chi2,pvalue=pval)
+ }

We calculate quantiles, defaulting to deciles, and use these to create groups, taking care to include all values. We then tabulate the observed and expected outcomes, using the sum of the predicted probabilities as the expected number of “successes” in a group. There is an option to print a table of observed and expected counts. The function returns the number of groups, the chi-squared statistic, and the p-value, which is based on groups - 2 d.f.

Here’s the result of applying our function with all the defaults:

> hosmer(lbw$low, fitted(model))
                 0  1       e.0       e.1
[0.0273,0.0827] 19  0 17.822223  1.177777
(0.0827,0.128]  17  2 16.973902  2.026098
(0.128,0.201]   13  6 15.828544  3.171456
(0.201,0.243]   18  1 14.695710  4.304290
(0.243,0.279]   12  7 14.106205  4.893795
(0.279,0.314]   12  7 13.360124  5.639876
(0.314,0.387]   13  6 12.462805  6.537195
(0.387,0.483]   12  7 10.824166  8.175834
(0.483,0.594]    9 10  8.690142 10.309858
(0.594,0.839]    5 13  5.236180 12.763820
             test groups   chi.sq    pvalue
1 Hosmer-Lemeshow     10 9.650683 0.2904041

We get a Hosmer-Lemeshow chi-squared value of 9.65 on 8 d.f. with a p-value of 0.2904, and thus no evidence of lack of fit.

Statistical packages differ in how they calculate quantiles. R implements nine types; the default is type 7, for compatibility with S and versions of R before 2.0, but the documentation recommends type 8. Type ?quantile to learn more. We used type 2, the inverse of the empirical distribution function with averaging at discontinuities, for compatibility with Stata, but our function lets you try other types. A test using R’s recommended definition of deciles yields a chi-squared of 10.55 on the same 8 d.f., with a p-value of 0.2283.
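
For instance, the calculation based on R's recommended definition of the deciles can be reproduced by calling our function with type=8 (suppressing the table of counts); this should give the chi-squared of 10.55 and p-value of 0.2283 quoted above.

> hosmer(lbw$low, fitted(model), type=8, table=FALSE)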

Paul Allison (2013) has noted that the Hosmer-Lemeshow test is sensitive to the number of groups used. He provides an example where a test with 10 groups yields a p-value just below 0.05, but working with 9 or 11 groups raises it to 0.11 and 0.64, respectively. In our example the p-values from the two decile definitions are also quite different, but the conclusion does not change.

References

Allison, P. (2013). Why I Don’t Trust the Hosmer-Lemeshow Test for Logistic Regression. Online at https://statisticalhorizons.com/hosmer-lemeshow/.

Hosmer, D.W. and Lemeshow, S. (1980). A goodness-of-fit test for the multiple logistic regression model. Communications in Statistics, A10: 1043-1069.

Pregibon, D. (1981). Logistic Regression Diagnostics. The Annals of Statistics, 9(4): 705-724. JSTOR.

Updated fall 2022