Germán Rodríguez, Generalized Linear Models, Princeton University

## 3.8 Regression Diagnostics for Binary Data

We now consider regression diagnostics for binary data, focusing on logistic regression models. We will work with the additive model of contraceptive use by age, education, and desire for more children, which we know to be inadequate.

So here’s the additive model for the contraceptive use data.

```
> library(haven)
> library(dplyr)
> cuse <- mutate(cuse, age=as_factor(age), educ=as.numeric(educ),
+   nomore=as.numeric(nomore), Y=cbind(users, nonusers))
> m <- glm(Y ~ age + educ + nomore, family=binomial, data=cuse)
> m

Call:  glm(formula = Y ~ age + educ + nomore, family = binomial, data = cuse)

Coefficients:
(Intercept)     age25-29     age30-39     age40-49         educ       nomore
    -1.9662       0.3894       0.9086       1.1892       0.3250       0.8330

Degrees of Freedom: 15 Total (i.e. Null);  10 Residual
Null Deviance:      165.8
Residual Deviance: 29.92    AIC: 113.4
```

The additive model has a deviance of 29.92 on 10 d.f. when we define the saturated model in terms of the 16 groups of women.

### Deviance and Pearson Residuals

The `residuals()` function can be used to obtain (among other statistics) deviance and Pearson residuals, by specifying `type="deviance"` (the default) or `type="pearson"`. These residuals are the signed square roots of each group's contribution to the model deviance or Pearson chi-squared statistic.

```
> cuse <- mutate(cuse, pobs=users/(users+nonusers), pfit=fitted(m),
+   dr=residuals(m, type="deviance"), pr=residuals(m, type="pearson"))
```
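For intuition, the two residual definitions are easy to verify by hand. The Pearson residual divides the raw residual by the estimated binomial standard deviation, and the deviance residual is the signed square root of each group's contribution to the deviance. Here is a quick check on a small synthetic grouped-binomial fit (toy data, not the cuse dataset):

```r
# Toy grouped-binomial data (hypothetical, for illustration only)
d <- data.frame(y = c(10, 20, 15, 8), n = c(40, 50, 30, 25), x = c(0, 1, 2, 3))
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = d)
p <- fitted(fit)                                   # fitted probabilities
pr <- (d$y - d$n * p) / sqrt(d$n * p * (1 - p))    # Pearson residual by hand
dr <- sign(d$y - d$n * p) *                        # deviance residual by hand
  sqrt(2 * (d$y * log(d$y / (d$n * p)) +
            (d$n - d$y) * log((d$n - d$y) / (d$n * (1 - p)))))
all.equal(as.numeric(residuals(fit, type = "pearson")), as.numeric(pr))
all.equal(as.numeric(residuals(fit, type = "deviance")), as.numeric(dr))
```

Both comparisons return `TRUE` (the by-hand deviance formula assumes no group has zero successes or zero failures, as is the case here).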

We will verify that if we square these residuals and sum them over the groups we get the deviance and Pearson chi-squared statistics.

```
> summarize(cuse, sdrsq=sum(dr^2), sprsq=sum(pr^2))
# A tibble: 1 × 2
  sdrsq sprsq
  <dbl> <dbl>
1  29.9  28.3
```

So the deviance is 29.9 as noted at the outset, and Pearson’s chi-squared is 28.3. We now list all groups with squared deviance residuals above 3.84 (same as absolute values above 1.96).

```
> select(cuse, age, educ, nomore, pobs, pfit, pr, dr) |>
+   filter(dr^2 > 3.84)
# A tibble: 3 × 7
  age    educ nomore  pobs  pfit    pr    dr
  <fct> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 <25       1      1 0.167 0.308 -2.38 -2.51
2 25-29     1      1 0.293 0.397 -2.03 -2.07
3 40-49     0      0 0.146 0.315 -2.32 -2.49
```

We see that a substantial part of the deviance of 29.92 comes from three groups where the model overestimates the probability of using contraception: young women (under age 25 and aged 25-29) with upper primary or more, who want no more children; and women in their forties with lower primary or less who want more children.

### Leverage and Influence

Pregibon (1981) extended regression diagnostics to GLMs and introduced a weighted hat matrix, with diagonal elements representing the leverages. These are calculated using the `hatvalues()` function.

```
> cuse <- mutate(cuse, lev=hatvalues(m))
> select(cuse, age, educ, nomore, pobs, pfit, lev) |>
+   arrange(desc(lev)) |> head(3)
# A tibble: 3 × 6
  age    educ nomore  pobs  pfit   lev
  <fct> <dbl>  <dbl> <dbl> <dbl> <dbl>
1 <25       1      0 0.197 0.162 0.670
2 25-29     1      0 0.258 0.222 0.577
3 40-49     0      1 0.511 0.514 0.560
```

The three cells with potentially the largest influence on the fit are young women with upper primary or more who want more children, and older women with lower primary or less who want no more children.
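As in the linear model, the leverages are the diagonal elements of a hat matrix (here weighted by the IRLS working weights), so each lies between 0 and 1 and together they sum to the number of fitted coefficients. A quick sanity check on a toy binomial fit (hypothetical data, not the cuse dataset):

```r
# Toy grouped-binomial fit (hypothetical data)
d <- data.frame(y = c(10, 20, 15, 8), n = c(40, 50, 30, 25), x = c(0, 1, 2, 3))
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = d)
h <- hatvalues(fit)
all(h > 0 & h < 1)                    # each leverage is between 0 and 1
all.equal(sum(h), length(coef(fit)))  # leverages sum to p, the number of coefficients
```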

The elements of the hat matrix can be used to standardize Pearson (or deviance) residuals and to compute influence statistics. These are easily computed “by hand”. (In Stata, the `rs` option of `predict` after `logit` calculates standardized Pearson residuals.)

```
> cuse <- mutate(cuse, ps = pr/sqrt(1-lev), ds = dr/sqrt(1-lev), sc=ps^2)
> select(cuse, age, educ, nomore, pobs, pfit, ps, ds, sc) |>
+   arrange(desc(sc)) |> head(3)
# A tibble: 3 × 8
  age    educ nomore  pobs  pfit    ps    ds    sc
  <fct> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 <25       1      1 0.167 0.308 -2.89 -3.06  8.34
2 40-49     0      0 0.146 0.315 -2.72 -2.92  7.40
3 25-29     1      1 0.293 0.397 -2.69 -2.74  7.22
```

We identify the same three observations picked up by the unstandardized residuals, but the absolute values are now closer to three, highlighting the lack of fit to these groups.

Cook’s distance can be calculated with `cooks.distance()`. Don’t be surprised if things like `residuals()` and `cooks.distance()` look like the same functions used in linear models. Like many other R functions, these are generic functions: R looks at the class of the object and calls the appropriate method, depending on whether the object is a linear model fitted by `lm()` or a generalized linear model fitted by `glm()`.

```
> cuse <- mutate(cuse, cook = cooks.distance(m))
> select(cuse, age, educ, nomore, pobs, pfit, lev, cook) |>
+   arrange(desc(cook)) |> head(3)
# A tibble: 3 × 7
  age    educ nomore  pobs  pfit   lev  cook
  <fct> <dbl>  <dbl> <dbl> <dbl> <dbl> <dbl>
1 <25       1      0 0.197 0.162 0.670 2.39
2 30-39     0      1 0.510 0.444 0.542 1.18
3 25-29     1      1 0.293 0.397 0.432 0.915
```

The group with the highest leverage also turned out to be the most influential: women under 25, with upper primary or more, who want more children.
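Cook's distance, the Pearson residual, and the leverage are linked by a simple formula that can be checked by hand. For a GLM with dispersion one, as in the binomial case, R computes D = r² h / (p (1-h)²), where r is the Pearson residual, h the leverage, and p the number of coefficients. A check on toy data (hypothetical, not the cuse dataset):

```r
# Verify cooks.distance() against the formula for a binomial GLM (dispersion = 1)
d <- data.frame(y = c(10, 20, 15, 8), n = c(40, 50, 30, 25), x = c(0, 1, 2, 3))
fit <- glm(cbind(y, n - y) ~ x, family = binomial, data = d)
h <- hatvalues(fit)
r <- residuals(fit, type = "pearson")
p <- length(coef(fit))
D <- r^2 * h / (p * (1 - h)^2)   # Cook's distance "by hand"
all.equal(as.numeric(cooks.distance(fit)), as.numeric(D))
```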

### Goodness of Fit

With grouped data, or even with individual data where the number of covariate patterns is small, the deviance provides a goodness-of-fit test. But what if you have truly individual data with many covariate patterns?

Hosmer and Lemeshow (1980) proposed a goodness-of-fit test for logistic regression models that can be used with individual data. The basic idea is to create groups using the predicted probabilities, and then compare observed and fitted counts of successes and failures in those groups using a chi-squared statistic. Simulation has shown that with g groups the large-sample distribution of the test statistic is approximately chi-squared with g-2 degrees of freedom.

Let us apply this test to the Hosmer and Lemeshow low birth weight data, which happen to be available on the Stata website.

```
> lbw <- read_dta("https://www.stata-press.com/data/r13/lbw.dta")
> lbw <- mutate(lbw, race=as_factor(race))
> names(lbw)
 [1] "id"    "low"   "age"   "lwt"   "race"  "smoke" "ptl"   "ht"    "ui"    "ftv"
[11] "bwt"
```

The outcome is an indicator of low birth weight (< 2500 grams). The predictors include mother’s age, her weight at the last menstrual period, and indicators of ethnicity, smoking during pregnancy, history of premature labor, history of hypertension, and presence of uterine irritability. Here’s the fit:

```
> model <- glm(low ~ age + lwt + race + smoke + ptl + ht + ui,
+   family=binomial, data=lbw)
> model

Call:  glm(formula = low ~ age + lwt + race + smoke + ptl + ht + ui,
    family = binomial, data = lbw)

Coefficients:
(Intercept)          age          lwt    raceblack    raceother        smoke
    0.46122     -0.02710     -0.01515      1.26265      0.86208      0.92334
        ptl           ht           ui
    0.54184      1.83252      0.75851

Degrees of Freedom: 188 Total (i.e. Null);  180 Residual
Null Deviance:      234.7
Residual Deviance: 201.4    AIC: 219.4
```
```
> # count covariate patterns
> group_by(lbw, age, lwt, race, smoke, ptl, ht, ui) |>
+   summarize(dummy=1) |> nrow()
`summarise()` has grouped output by 'age', 'lwt', 'race', 'smoke', 'ptl', 'ht'.
You can override using the `.groups` argument.
[1] 182
```

The sample of 189 observations has 182 different covariate patterns, so we can’t test goodness of fit this way. Instead we will compute predicted probabilities and create ten groups of approximately equal size by breaking at the deciles of the predicted probabilities. It pays to encapsulate the calculations in a function that can be reused:

```
> hosmer <- function(y, fv, groups=10, table=TRUE, type=2) {
+ # A simple implementation of the Hosmer-Lemeshow test
+   q <- quantile(fv, seq(0, 1, 1/groups), type=type)
+   fv.g <- cut(fv, breaks=q, include.lowest=TRUE)
+   obs <- xtabs(~ fv.g + y)
+   fit <- cbind(e.0 = tapply(1 - fv, fv.g, sum), e.1 = tapply(fv, fv.g, sum))
+   if (table) print(cbind(obs, fit))
+   chi2 <- sum((obs - fit)^2 / fit)
+   pval <- pchisq(chi2, groups - 2, lower.tail=FALSE)
+   data.frame(test="Hosmer-Lemeshow", groups=groups, chi.sq=chi2, pvalue=pval)
+ }
```

We calculate quantiles, defaulting to deciles, and use these to create groups, taking care to include all values. We then tabulate the observed and expected outcomes, using the sum of the predicted probabilities as the expected number of “successes” in each group. There is an option to print a table of observed and expected counts. The function returns the chi-squared statistic, the number of groups, and the p-value.

Here’s the result of applying our function with all the defaults:

```
> hosmer(lbw$low, fitted(model))
                 0  1       e.0       e.1
[0.0273,0.0827] 19  0 17.822223  1.177777
(0.0827,0.128]  17  2 16.973902  2.026098
(0.128,0.201]   13  6 15.828544  3.171456
(0.201,0.243]   18  1 14.695710  4.304290
(0.243,0.279]   12  7 14.106205  4.893795
(0.279,0.314]   12  7 13.360124  5.639876
(0.314,0.387]   13  6 12.462805  6.537195
(0.387,0.483]   12  7 10.824166  8.175834
(0.483,0.594]    9 10  8.690142 10.309858
(0.594,0.839]    5 13  5.236180 12.763820
             test groups   chi.sq    pvalue
1 Hosmer-Lemeshow     10 9.650683 0.2904041
```

We get a Hosmer-Lemeshow chi-squared value of 9.65 on 8 d.f. with a p-value of 0.2904, and thus no evidence of lack of fit.

Statistical packages differ in how they calculate quantiles. R implements nine types; the default is type 7 for compatibility with S and with versions of R before 2.0.0, but the documentation recommends type 8. Type `?quantile` to learn more. We used type 2, the inverse of the empirical distribution function with averaging at discontinuities, for compatibility with Stata, but our function lets you try other types. A test using R’s recommended definition of deciles yields a chi-squared of 10.55 on the same 8 d.f., with a p-value of 0.2283.
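The difference between the two definitions is easiest to see on a tiny vector, where the first quartile falls exactly at a discontinuity of the empirical distribution function:

```r
x <- 1:4
quantile(x, 0.25, type = 7)  # default: linear interpolation, gives 1.75
quantile(x, 0.25, type = 2)  # inverse ECDF with averaging, gives 1.5
```

With ties or coarse data these definitions can place group boundaries differently, which is why the choice of type matters for the Hosmer-Lemeshow statistic.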

Paul Allison (2013) has noted that the Hosmer-Lemeshow test is sensitive to the number of groups used. He provides an example where a test with 10 groups yields a p-value just below 0.05, but working with 9 or 11 groups raises it to 0.11 and 0.64, respectively. In our example the p-values also vary with the grouping, but the conclusion does not change.