
We apply the theory of generalized linear models to the case of binary data, and in particular to logistic regression models.

B.4.1 The Binomial Distribution

First we verify that the binomial distribution \(B(n_i,\pi_i)\) belongs to the exponential family of Nelder and Wedderburn (1972). The binomial probability distribution function (p.d.f.) is

\[ \tag{B.14} f_i(y_i) = {n_i \choose y_i} \pi_i^{y_i} (1-\pi_i)^{n_i-y_i}. \]

Taking logs we find that

\[ \log f_i(y_i) = y_i \log(\pi_i) + (n_i-y_i)\log(1-\pi_i) + \log {n_i \choose y_i}. \]

Collecting terms on \(y_i\) we can write

\[ \log f_i(y_i) = y_i \log ( \frac{\pi_i}{1-\pi_i} ) + n_i\log(1-\pi_i) + \log{n_i \choose y_i}. \]

This expression has the general exponential form

\[ \log f_i(y_i) = \frac{y_i \theta_i - b(\theta_i)}{a_i(\phi)} + c(y_i,\phi) \]

with the following equivalences. Looking first at the coefficient of \(y_i\), we note that the canonical parameter is the logit of \(\pi_i\),

\[ \tag{B.15} \theta_i = \log ( \frac{\pi_i}{1-\pi_i} ). \]

Solving for \(\pi_i\) we see that

\[ \pi_i = \frac{e^{\theta_i}}{1 + e^{\theta_i}}, \quad\mbox{so}\quad 1-\pi_i = \frac{1}{1 + e^{\theta_i}}. \]

If we rewrite the second term in the p.d.f. as a function of \(\theta_i\), so \(\log(1-\pi_i) = -\log(1+e^{\theta_i})\), we can identify the cumulant function \(b(\theta_i)\) as

\[ b(\theta_i) = n_i \log(1+e^{\theta_i}). \]

The remaining term in the p.d.f. is a function of \(y_i\) but not \(\pi_i\), leading to

\[ c(y_i,\phi) = \log {n_i \choose y_i}. \]

Note finally that we may set \(a_i(\phi)=\phi\) and \(\phi=1\).
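As a quick numerical check (not part of the original derivation, and assuming NumPy and SciPy are available), the following sketch verifies that the exponential-family expression with the \(\theta_i\), \(b(\theta_i)\), \(c(y_i,\phi)\) and \(a_i(\phi)=\phi=1\) identified above reproduces the binomial log-p.d.f.; the values of \(n_i\) and \(\pi_i\) are arbitrary.

```python
# Numerical check (illustration only): the exponential-family form
#   log f(y) = [y*theta - b(theta)]/a(phi) + c(y, phi),  with a(phi) = phi = 1,
#   theta = logit(pi), b(theta) = n*log(1 + exp(theta)), c(y) = log C(n, y),
# should reproduce the binomial log-p.d.f. The values of n and pi are arbitrary.
import numpy as np
from scipy.stats import binom
from scipy.special import gammaln

n, pi = 10, 0.3
y = np.arange(n + 1)

theta = np.log(pi / (1 - pi))                              # canonical parameter (B.15)
b = n * np.log1p(np.exp(theta))                            # cumulant function
c = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)   # log of the binomial coefficient

logf_expfam = y * theta - b + c
logf_direct = binom.logpmf(y, n, pi)

print(np.allclose(logf_expfam, logf_direct))               # True
```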

Let us verify the mean and variance. Differentiating \(b(\theta_i)\) with respect to \(\theta_i\) we find that

\[ \mu_i = b'(\theta_i) = n_i \frac{e^{\theta_i}}{1+e^{\theta_i}} = n_i \pi_i, \]

in agreement with what we knew from elementary statistics. Differentiating again using the quotient rule, we find that

\[ v_i = a_i(\phi) b''(\theta_i) = n_i \frac{e^{\theta_i}}{(1+e^{\theta_i})^2} = n_i \pi_i (1-\pi_i), \]

again in agreement with what we knew before.
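The same identities can be checked numerically. The sketch below (an illustration, not part of the text) differentiates \(b(\theta_i)\) by finite differences and compares the results with \(n_i\pi_i\) and \(n_i\pi_i(1-\pi_i)\); the step size and the values of \(n_i\) and \(\pi_i\) are arbitrary choices.

```python
# Check (illustration only) that the derivatives of the cumulant function
# b(theta) = n*log(1 + exp(theta)) recover the binomial mean and variance.
# Finite differences are used in place of symbolic differentiation.
import numpy as np

n, pi = 10, 0.3
theta = np.log(pi / (1 - pi))
b = lambda t: n * np.log1p(np.exp(t))

h = 1e-4
b1 = (b(theta + h) - b(theta - h)) / (2 * h)                  # b'(theta)
b2 = (b(theta + h) - 2 * b(theta) + b(theta - h)) / h**2      # b''(theta)

print(np.isclose(b1, n * pi))              # mean:     n * pi
print(np.isclose(b2, n * pi * (1 - pi)))   # variance: n * pi * (1 - pi)
```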

In this development I have worked with the binomial count \(y_i\), which takes values \(0(1)n_i\). McCullagh and Nelder (1989) work with the proportion \(p_i=y_i/n_i\), which takes values \(0(1/n_i)1\). This explains the differences between my results and their Table 2.1.

B.4.2 Fisher Scoring in Logistic Regression

Let us now find the working dependent variable and the iterative weight used in the Fisher scoring algorithm for estimating the parameters in logistic regression, where we model

\[ \tag{B.16} \eta_i = \mbox{logit}(\pi_i). \]

It will be convenient to write the link function in terms of the mean \(\mu_i = n_i\pi_i\):

\[ \eta_i = \log(\frac{\pi_i}{1-\pi_i}) = \log(\frac{\mu_i}{n_i-\mu_i}), \]

which can also be written as \(\eta_i = \log(\mu_i)-\log(n_i-\mu_i)\).

Differentiating with respect to \(\mu_i\) we find that

\[ \frac{d\eta_i}{d\mu_i} = \frac{1}{\mu_i}+\frac{1}{n_i-\mu_i} = \frac{n_i}{\mu_i(n_i-\mu_i)} = \frac{1}{n_i \pi_i (1-\pi_i)}. \]

The working dependent variable, which in general is

\[ z_i = \eta_i + (y_i-\mu_i)\frac{d\eta_i}{d\mu_i}, \]

turns out to be

\[ \tag{B.17} z_i = \eta_i + \frac{y_i-n_i\pi_i}{n_i \pi_i (1-\pi_i)}. \]

The iterative weight turns out to be

\[ \tag{B.18}\begin{align} w_i &= 1 / \left[ b''(\theta_i) (\frac{d\eta_i}{d\mu_i})^2 \right] \\ &= \frac{1}{n_i \pi_i (1-\pi_i)} [ n_i \pi_i (1-\pi_i) ]^2,\end{align} \]

and simplifies to

\[ \tag{B.19} w_i = n_i \pi_i (1-\pi_i). \]

Note that the weight is inversely proportional to the variance of the working dependent variable. The results here agree exactly with the results in Chapter 4 of McCullagh and Nelder (1989).
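The following Python sketch (an illustration on simulated data, not the author's code) implements this Fisher scoring algorithm as iteratively reweighted least squares, using the working dependent variable (B.17) and the weight (B.19); the design matrix, group sizes and true coefficients are made up for the example.

```python
# Fisher scoring / IRLS sketch for logistic regression with grouped binomial data.
# The simulated data and starting values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# grouped data: n_i trials per group, one covariate x_i plus an intercept
k = 8
x = np.linspace(-2, 2, k)
X = np.column_stack([np.ones(k), x])          # design matrix
n = np.full(k, 50)
true_beta = np.array([-0.5, 1.0])
pi_true = 1 / (1 + np.exp(-X @ true_beta))
y = rng.binomial(n, pi_true)                  # binomial counts

beta = np.zeros(X.shape[1])                   # starting values
for it in range(25):
    eta = X @ beta
    pi = 1 / (1 + np.exp(-eta))
    mu = n * pi
    w = n * pi * (1 - pi)                     # iterative weight (B.19)
    z = eta + (y - mu) / w                    # working dependent variable (B.17)
    # weighted least-squares step: solve (X'WX) beta = X'Wz
    WX = X * w[:, None]
    beta_new = np.linalg.solve(X.T @ WX, X.T @ (w * z))
    if np.max(np.abs(beta_new - beta)) < 1e-10:
        beta = beta_new
        break
    beta = beta_new

print(beta)   # estimated coefficients, roughly recovering true_beta up to sampling noise
```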

Exercise: Obtain analogous results for Probit analysis, where one models

\[ \eta_i = \Phi^{-1}(\mu_i), \]

where \(\Phi()\) is the standard normal cdf. Hint: To calculate the derivative of the link function find \(d\mu_i/d\eta_i\) and take reciprocals.\(\Box\)

B.4.3 The Binomial Deviance

Finally, let us figure out the binomial deviance. Let \(\hat{\mu_i}\) denote the m.l.e. of \(\mu_i\) under the model of interest, and let \(\tilde{\mu_i}=y_i\) denote the m.l.e. under the saturated model. From first principles,

\[ \tag{B.20} \begin{split} D &= 2 \sum [ y_i\log( \frac{y_i}{n_i} ) + (n_i-y_i)\log( \frac{n_i-y_i}{n_i} ) \\ &- y_i \log( \frac{\hat{\mu_i}}{n_i} ) - (n_i-y_i)\log( \frac{n_i-\hat{\mu_i}}{n_i} ) ] \end{split} \]

Note that all terms involving \(\log(n_i)\) cancel out. Collecting terms on \(y_i\) and on \(n_i-y_i\) we find that

\[ \tag{B.21} D = 2 \sum \left[ y_i \log(\frac{y_i}{\hat{\mu_i}}) + (n_i-y_i) \log( \frac{n_i-y_i}{n_i-\hat{\mu_i}}) \right] \]

Alternatively, you may obtain this result from the general form of the deviance given in Section B.3.

Note that the binomial deviance has the form

\[ D = 2 \sum o_i \log(\frac{o_i}{e_i}), \]

where \(o_i\) denotes observed, \(e_i\) denotes expected (under the model of interest) and the sum is over both “successes” and “failures” for each \(i\) (i.e. we have a contribution from \(y_i\) and one from \(n_i-y_i\)).
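The two expressions can be checked against each other numerically. In the sketch below (an illustration, not part of the text) the counts, group sizes and fitted values are arbitrary, and terms with a zero observed count use the standard convention \(0\log 0 = 0\).

```python
# Illustration only: the deviance (B.21) and the 2 * sum o*log(o/e) form agree.
# Observed counts y, group sizes n and fitted means mu_hat are arbitrary numbers.
import numpy as np

y = np.array([3.0, 7.0, 12.0, 0.0])
n = np.array([10.0, 10.0, 20.0, 5.0])
mu_hat = np.array([4.0, 6.5, 11.0, 1.0])       # fitted means under some model

def olog(o, e):
    """o * log(o/e), with the convention 0 * log(0) = 0."""
    o, e = np.asarray(o, float), np.asarray(e, float)
    out = np.zeros_like(o)
    nz = o > 0
    out[nz] = o[nz] * np.log(o[nz] / e[nz])
    return out

# deviance as in (B.21)
D1 = 2 * np.sum(olog(y, mu_hat) + olog(n - y, n - mu_hat))

# deviance as 2 * sum o*log(o/e) over successes and failures
o = np.concatenate([y, n - y])
e = np.concatenate([mu_hat, n - mu_hat])
D2 = 2 * np.sum(olog(o, e))

print(np.isclose(D1, D2))   # True
```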

For grouped data the deviance has an asymptotic chi-squared distribution as \(n_i \rightarrow \infty\) for all \(i\), and can be used as a goodness of fit test.

More generally, the difference in deviances between nested models (i.e. the likelihood ratio test criterion, minus twice the log of the likelihood ratio) has an asymptotic chi-squared distribution as the number of groups \(k \rightarrow \infty\) or the size of each group \(n_i \rightarrow \infty\), provided the number of parameters stays fixed.

As a general rule of thumb due to Cochran (1950), the asymptotic chi-squared distribution provides a reasonable approximation when all expected frequencies (both \(\hat{\mu_i}\) and \(n_i-\hat{\mu_i}\)) under the larger model exceed one, and at least 80% exceed five.
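A small helper along these lines (a hypothetical illustration, not code from the text) can check the rule of thumb given fitted means \(\hat{\mu_i}\) and group sizes \(n_i\) from the larger model.

```python
# Illustration only: check the rule of thumb stated above, namely that all
# expected frequencies exceed one and at least 80% exceed five.
import numpy as np

def chi_squared_approx_ok(mu_hat, n):
    """True if the expected counts mu_hat and n - mu_hat all exceed 1
    and at least 80% of them exceed 5."""
    mu_hat = np.asarray(mu_hat, float)
    n = np.asarray(n, float)
    expected = np.concatenate([mu_hat, n - mu_hat])
    return bool(expected.min() > 1 and np.mean(expected > 5) >= 0.8)

# arbitrary illustrative fitted means and group sizes
print(chi_squared_approx_ok([8.0, 12.0, 15.0, 9.0, 3.0],
                            [20.0, 20.0, 25.0, 15.0, 10.0]))   # True
```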
