Generalized linear models are just as easy to fit in R as ordinary linear model. In fact, they require only an additional parameter to specify the variance and link functions.
The basic tool for fitting generalized linear models is the
glm() function, which has the folllowing general
> glm(formula, family, data, weights, subset, ...)
... stands for more esoteric options. The only
parameter that we have not encountered before is
which is a simple way of specifying a choice of variance and link
functions. There are six choices of family:
|binomial||binomial||logit, probit or cloglog|
|poisson||poisson||log, identity or sqrt|
|gamma||gamma||inverse, identity or log|
As can be seen, each of the first five choices has an associated variance function (for binomial, the binomial variance \(\mu(1 - \mu)\), and one or more choices of link functions (for binomial, the logit, probit or complementary log-log links).
As long as you want the default link, all you have to specify is the family name. If you want an alternative link, you must add a link argument. For example to do probits you use
> glm(formula, family = binomial(link = probit))
The last family on the list,
quasi, is there to allow
fitting user-defined models by maximum quasi-likelihood.
We will illustrate fitting logistic regression models using the contraceptive use data excerpted below (and shown in full further below):
age education wantsMore notUsing using <25 low yes 53 6 <25 low no 10 4 ... lines omitted ... 40-49 high no 12 31
The data are available on my website in a file called
cuse.dat, and can be read directly from within R:
> cuse <- read.table("https://grodri.github.io/datasets/cuse.dat", header + = TRUE) > cuse
age education wantsMore notUsing using 1 <25 low yes 53 6 2 <25 low no 10 4 3 <25 high yes 212 52 4 <25 high no 50 10 5 25-29 low yes 60 14 6 25-29 low no 19 10 7 25-29 high yes 155 54 8 25-29 high no 65 27 9 30-39 low yes 112 33 10 30-39 low no 77 80 11 30-39 high yes 118 46 12 30-39 high no 68 78 13 40-49 low yes 35 6 14 40-49 low no 46 48 15 40-49 high yes 8 8 16 40-49 high no 12 31
I specified the
header parameter as
because otherwise it would not have been obvious that the first line in
the file has the variable names. There are no row names specified, so
the rows will be numbered from 1 to 16. I also printed the data to make
sure we got it alright.
Strings as Factors. We encountered factors, or
categorical variables that take one of a discrete number of
levels, in Section 4.4. Internally a factor is represented as
an integer vector with the levels as an attribute. Versions of R prior
to 4 would automatically read all strings as factors, but the default in
functions such as
read.table() is now
StringAsFactor=FALSE. Because we kept the default,
variables such as
education have been read as strings, with
high. Modeling functions
treat strings pretty much the same as factors, but if necessary one can
convert a string variable to a factor using the
Let us try a simple additive model where contraceptive use depends on age, education and whether or not the woman wants more children:
> lrfit <- glm( cbind(using, notUsing) ~ age + education + wantsMore, + data = cuse, family = binomial)
There are a few things to explain here. First, the function is
glm() and I have assigned its value to an object called
lrfit (for logistic regression fit). The first argument of
the function is a model formula, which defines the response and linear
With binomial data the response can be either a vector or a matrix with two columns.
If the response is a vector, it can be numeric with 0 for failure and 1 for success, or a factor with the first level representing “failure” and all others representing “success”. In these cases R generates a vector of ones to represent the binomial denominators.
Alternatively, the response can be a matrix where the first column is the number of “successes” and the second column is the number of “failures”. In this case R adds the two columns together to produce the correct binomial denominator.
Because the latter approach is clearly the right one for us, I used
cbind() to create a matrix by binding the
column vectors containing the numbers using and not using
Following the special symbol
~ that separates the
response from the predictors, we have a standard Wilkinson-Rogers model
formula. In this case we are specifying main effects of
Because all three predictors are string vectors they are treated
automatically as categorical variables and represented using indicators
for the categories, as you can see by inspecting the results:
Call: glm(formula = cbind(using, notUsing) ~ age + education + wantsMore, family = binomial, data = cuse) Coefficients: (Intercept) age25-29 age30-39 age40-49 educationlow -0.8082 0.3894 0.9086 1.1892 -0.3250 wantsMoreyes -0.8330 Degrees of Freedom: 15 Total (i.e. Null); 10 Residual Null Deviance: 165.8 Residual Deviance: 29.92 AIC: 113.4
R sorts the levels of a factor or string variable in alphabetical
<25 comes before
40-49, it has been picked as the
reference cell for
high is the
reference cell for
education because high comes
low! Finally, R picked
no as the base for
If you are unhappy about these choices, which are admittedly not ideal, you can
convert the variable to a factor and then change the reference
relevel(); for example for education we could
set “low” as the reference by coding
cuse$education <- relevel(as.factor(cuse$education), "low"),
define your own indicator variables.
I will use the second approach, defining indicators for women with
high education, and for and women who want no more children, both added
cuse data frame:
> cuse$noMore <- cuse$wantsMore == "no" > cuse$hiEduc <- cuse$education == "high"
Now try the model with these predictors
> glm(cbind(using, notUsing) ~ age + hiEduc + noMore, + data = cuse, family = binomial)
Call: glm(formula = cbind(using, notUsing) ~ age + hiEduc + noMore, family = binomial, data = cuse) Coefficients: (Intercept) age25-29 age30-39 age40-49 hiEducTRUE noMoreTRUE -1.9662 0.3894 0.9086 1.1892 0.3250 0.8330 Degrees of Freedom: 15 Total (i.e. Null); 10 Residual Null Deviance: 165.8 Residual Deviance: 29.92 AIC: 113.4
Our indicator for high education is a Boolean variable that takes the
TRUE. The corresponding
coefficient is labeled
hiEducTRUE to make it clear that it
represents the case when the condition is true. (Alternatively, we could
make the indicator take the values
as.numeric(), coding for example
cuse$hiEduc <- as.numeric(cuse$education == "high"). In
this case the coefficient would be labeled just
The residual deviance of 29.92 on 10 d.f. is highly significant, so the additive model does not fit the data.
> pchisq(29.92, 10, lower.tail = FALSE)
To obtain a p-value I specified
FALSE. This is more accurate than computing the default
lower tail and subtracting from one.
So, we need a better model. One of my favorites for this dataset introduces an interaction between age and wanting no more children, which is easily specified:
> lrfit2 <- glm( cbind(using, notUsing) ~ age * noMore + hiEduc , data = cuse, + family = binomial) > lrfit2
Call: glm(formula = cbind(using, notUsing) ~ age * noMore + hiEduc, family = binomial, data = cuse) Coefficients: (Intercept) age25-29 age30-39 -1.80317 0.39460 0.54666 age40-49 noMoreTRUE hiEducTRUE 0.57952 0.06622 0.34065 age25-29:noMoreTRUE age30-39:noMoreTRUE age40-49:noMoreTRUE 0.25918 1.11266 1.36167 Degrees of Freedom: 15 Total (i.e. Null); 7 Residual Null Deviance: 165.8 Residual Deviance: 12.63 AIC: 102.1
Note how R built the interaction terms automatically, and even came up with sensible labels for them. The model’s deviance of 12.63 on 7 d.f. is not significant at the conventional five per cent level, so we have no evidence against this model.
To obtain more detailed information about this fit, try the
Call: glm(formula = cbind(using, notUsing) ~ age * noMore + hiEduc, family = binomial, data = cuse) Deviance Residuals: Min 1Q Median 3Q Max -1.30027 -0.66163 -0.03286 0.81945 1.73851 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.80317 0.18018 -10.008 < 2e-16 *** age25-29 0.39460 0.20145 1.959 0.05013 . age30-39 0.54666 0.19842 2.755 0.00587 ** age40-49 0.57952 0.34742 1.668 0.09530 . noMoreTRUE 0.06622 0.33071 0.200 0.84130 hiEducTRUE 0.34065 0.12577 2.709 0.00676 ** age25-29:noMoreTRUE 0.25918 0.40975 0.633 0.52704 age30-39:noMoreTRUE 1.11266 0.37404 2.975 0.00293 ** age40-49:noMoreTRUE 1.36167 0.48433 2.811 0.00493 ** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 165.77 on 15 degrees of freedom Residual deviance: 12.63 on 7 degrees of freedom AIC: 102.14 Number of Fisher Scoring iterations: 4
R follows the popular custom of flagging significant coefficients
with one, two or three stars depending on their p-values. Try
plot(lrfit2). You get the same plots as in a linear model,
but adapted to a generalized linear model; for example the residuals
plotted are deviance residuals (the square root of the contribution of
an observation to the deviance, with the same sign as the raw
The functions that can be used to extract results from the fit include
resid(), for the deviance residuals
fitted.values(), for the fitted values (estimated probabilities)
predict(), for the linear predictor (estimated logits)
coefficients(), for the coefficients, and
deviance(), for the deviance.
Some of these functions have optional arguments; for example, you can
extract five different types of residuals, called “deviance”, “pearson”,
“response” (defined as response - fitted value), “working” (the working
dependent variable in the IRLS algorithm - linear predictor), and
“partial” (a matrix of working residuals formed by omitting each term in
the model). You specify the one you want using the
argument, for example
residuals(lrfit2, type = "pearson").
If you want to modify a model you may consider using the special
update(). For example to drop the
age:noMore interaction in our model, one could use
> lrfit1 <- update(lrfit2, ~ . - age:noMore)
The first argument is the result of a fit, and the second an updating
formula. The tilde
~ separates the response from the
predictors, and the dot
. refers to the right-hand side of
the original formula, so here we simply remove
Alternatively, one can give a new formula as the second argument.
The update function may also be used to fit the same model to
different datasets, using the argument
data to specify a
new data frame. Another useful argument is
subset, to fit
the model to a different subsample. This function works with linear
models as well as generalized linear models.
If you plan to fit a sequence of models you will find the anova function useful. Given a series of nested models, it will calculate the change in deviance between them. Try
> anova(lrfit1, lrfit2)
Analysis of Deviance Table Model 1: cbind(using, notUsing) ~ age + noMore + hiEduc Model 2: cbind(using, notUsing) ~ age * noMore + hiEduc Resid. Df Resid. Dev Df Deviance 1 10 29.917 2 7 12.630 3 17.288
Adding the interaction has reduced the deviance by 17.288 at the expense of 3 d.f.
If the argument to
anova() is a single model, the
function will show the change in deviance obtained by adding each of the
terms in the order listed in the model formula, just as it did for
linear models. Because this requires fitting as many models as there are
terms in the formula, the function may take a while to complete its
anova() function lets you specify an optional test.
The usual choices will be “F” for linear models and “Chisq” for
generalized linear models. Adding the parameter
test = "Chisq" adds p-values next to the deviances. In our
> anova(lrfit2, test = "Chisq")
Analysis of Deviance Table Model: binomial, link: logit Response: cbind(using, notUsing) Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 15 165.772 age 3 79.192 12 86.581 < 2.2e-16 *** noMore 1 49.693 11 36.888 1.798e-12 *** hiEduc 1 6.971 10 29.917 0.0082860 ** age:noMore 3 17.288 7 12.630 0.0006167 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that all terms were highly significant when they were introduced into the model.
A very powerful tool in R is a function for stepwise regression that has three remarkable features:
It works with generalized linear models, so it will do stepwise logistic regression, or stepwise Poisson regression,
It understands hierarchical models, so it will only consider adding interactions after including the corresponding main effects in the models, and
It understands terms involving more than one degree of freedom, so it it will keep together dummy variables representing the effects of a factor.
The basic idea of the procedure is to start from a given model (which could well be the null model) and take a series of steps, by either deleting a term already in the model, or adding a term from a list of candidates for inclusion, called the scope of the search and defined, of course, by a model formula.
Selection of terms for deletion or inclusion is based on Akaike’s information criterion (AIC). R defines AIC as
AIC = –2 maximized log-likelihood + 2 number of parameters
The procedure stops when the AIC criterion cannot be improved.
In R all of this work is done by calling a couple of functions,
drop1(), that consider adding or
dropping one term from a model. These functions can be very useful in
model selection, and both of them accept a
drop1(). For our logistic regression
> drop1(lrfit2, test = "Chisq")
Single term deletions Model: cbind(using, notUsing) ~ age * noMore + hiEduc Df Deviance AIC LRT Pr(>Chi) <none> 12.630 102.14 hiEduc 1 20.099 107.61 7.4695 0.0062755 ** age:noMore 3 29.917 113.42 17.2877 0.0006167 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Obviously we can’t drop any of these terms. Note that R considered dropping the main effect of education, and the age by want no more interaction, but did not examine the main effects of age or want no more, because one would not drop these main effects while retaining the interaction.
The sister function
add1() requires a scope to define
the additional terms to be considered. In our example we will consider
all possible two-factor interactions:
> add1(lrfit2, ~ .^2, test = "Chisq")
Single term additions Model: cbind(using, notUsing) ~ age * noMore + hiEduc Df Deviance AIC LRT Pr(>Chi) <none> 12.6296 102.14 age:hiEduc 3 5.7983 101.31 6.8313 0.07747 . noMore:hiEduc 1 10.8240 102.33 1.8055 0.17905 --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We see that neither of the missing two-factor interactions is significant by itself at the conventional five percent level. (However, they happen to be jointly significant.) Note that the model with the age by education interaction has a lower AIC than our starting model.
step() function will do an automatic search. Here we
let it start from the additive model and search in a scope defined by
all two-factor interactions.
> search <- step(lrfit1, ~.^2)
step() function produces detailed trace output that
I have supressed. The returned object, however, includes an
anova component that summarizes the search:
Step Df Deviance Resid. Df Resid. Dev AIC 1 NA NA 10 29.917222 113.4251 2 + age:noMore -3 17.287669 7 12.629553 102.1375 3 + age:hiEduc -3 6.831288 4 5.798265 101.3062 4 + noMore:hiEduc -1 3.356777 3 2.441488 99.9494
As you can see, the automated procedure introduced, one by one, all three remaining two-factor interactions, to yield a final AIC of 99.9. This is an example where AIC, by requiring a deviance improvement of only 2 per parameter, may have led to overfitting the data.
Some analysts prefer a higher penalty per parameter. In particular,
log(n) instead of 2 as a multiplier yields BIC, the
Bayesian Information Criterion. In our example
log(1607) = 7.38, so we would require a deviance reduction
of 7.38 per additional parameter. The
k as an argument, with default 2. You may verify
k = log(1607) leads to a much simpler
model; not only are no new interactions introduced, but the main effect
of education is dropped (even though it is significant).
In this example AIC would lead to a model that may be too complex, and BIC would lead to a model that may be too simple. In my opinion, the model with only one interaction is just right.
Continue with Conclusion