Germán Rodríguez
Generalized Linear Models Princeton University

B.3 Tests of Hypotheses

We consider Wald tests and likelihood ratio tests, introducing the deviance statistic.

B.3.1 Wald Tests

The Wald test follows immediately from the fact that the information matrix for generalized linear models is given by

\[\tag{B.9}\boldsymbol{I}(\boldsymbol{\beta}) = \boldsymbol{X}'\boldsymbol{W}\boldsymbol{X}/\phi,\]

so the large sample distribution of the maximum likelihood estimator \( \hat{\boldsymbol{\beta}} \) is multivariate normal

\[\tag{B.10}\hat{\boldsymbol{\beta}} \sim N_p( \boldsymbol{\beta}, (\boldsymbol{X}'\boldsymbol{W}\boldsymbol{X})^{-1}\phi ),\]

with mean \( \boldsymbol{\beta} \) and variance-covariance matrix \( (\boldsymbol{X}'\boldsymbol{W}\boldsymbol{X})^{-1}\phi \).

Tests for subsets of \( \boldsymbol{\beta} \) are based on the corresponding marginal normal distributions.

Example: In the case of normal errors with identity link we have \( \boldsymbol{W}=\boldsymbol{I} \) (where \( \boldsymbol{I} \) denotes the identity matrix), \( \phi=\sigma^2 \), and the exact distribution of \( \hat{\boldsymbol{\beta}} \) is multivariate normal with mean \( \boldsymbol{\beta} \) and variance-covariance matrix \( (\boldsymbol{X}'\boldsymbol{X})^{-1}\sigma^2 \).
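To make the Wald test concrete, here is a minimal sketch for the normal-errors case just described, using simulated data (the sample size, coefficients, and seed are arbitrary choices for illustration): we fit by least squares, estimate \( \phi=\sigma^2 \) from the residuals, form the covariance matrix \( (\boldsymbol{X}'\boldsymbol{X})^{-1}\sigma^2 \), and compute a Wald \( z \)-statistic for each coefficient.

```python
import numpy as np

# Illustrative sketch of the Wald test for a normal linear model
# (identity link), where beta-hat ~ N(beta, (X'X)^{-1} sigma^2) exactly.
# The data below are simulated; all values are arbitrary.
rng = np.random.default_rng(0)
n, p = 100, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([1.0, 0.5])
y = X @ beta_true + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)        # estimate of phi = sigma^2
cov = np.linalg.inv(X.T @ X) * sigma2_hat   # (X'X)^{-1} sigma^2, as in (B.10)
se = np.sqrt(np.diag(cov))
z = beta_hat / se                           # Wald statistics for H0: beta_j = 0
print(z)
```

Each \( z_j \) is compared with the standard normal distribution (or, exactly under normality, \( \hat{\beta}_j/\mathrm{se} \) follows a \( t \) distribution with \( n-p \) degrees of freedom).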

B.3.2 Likelihood Ratio Tests and The Deviance

We will show how the likelihood ratio criterion for comparing any two nested models, say \( \omega_1 \subset \omega_2 \), can be constructed in terms of a statistic called the deviance and an unknown scale parameter \( \phi \).

Consider first comparing a model of interest \( \omega \) with a saturated model \( \Omega \) that provides a separate parameter for each observation.

Let \( \hat{\mu}_i \) denote the fitted values under \( \omega \) and let \( \hat{\theta}_i \) denote the corresponding estimates of the canonical parameters. Similarly, let \( \tilde{\mu}_i=y_i \) and \( \tilde{\theta}_i \) denote the corresponding estimates under \( \Omega \).

The likelihood ratio criterion to compare these two models in the exponential family has the form

\[ -2\log\lambda = 2 \sum_{i=1}^n \frac { y_i(\tilde{\theta_i}-\hat{\theta_i})- b(\tilde{\theta_i}) + b(\hat{\theta_i}) } {a_i(\phi)}. \]

Assume as usual that \( a_i(\phi)=\phi/p_i \) for known prior weights \( p_i \). Then we can write the likelihood-ratio criterion as follows:

\[\tag{B.11}-2\log\lambda = \frac{D(\boldsymbol{y},\hat{\boldsymbol{\mu}})}{\phi}.\]

The numerator of this expression does not depend on unknown parameters and is called the deviance:

\[\tag{B.12}D(\boldsymbol{y},\hat{\boldsymbol{\mu}}) = 2 \sum_{i=1}^n p_i [ y_i(\tilde{\theta_i}-\hat{\theta_i})- b(\tilde{\theta_i}) + b(\hat{\theta_i}) ].\]

The likelihood ratio criterion \( -2\log\lambda \) is the deviance divided by the scale parameter \( \phi \), and is called the scaled deviance.
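As a concrete instance of equation (B.12), consider the Poisson case (treated in detail in Section B.5), where the canonical parameter is \( \theta_i=\log\mu_i \), \( b(\theta_i)=e^{\theta_i} \), and \( p_i=1 \); under the saturated model \( \tilde{\mu}_i=y_i \). The sketch below implements the resulting deviance, with the usual convention that \( y_i\log(y_i/\hat{\mu}_i)=0 \) when \( y_i=0 \); the function name and test values are illustrative.

```python
import numpy as np

# Sketch of (B.12) for Poisson data: theta = log(mu), b(theta) = exp(theta),
# prior weights p_i = 1. Substituting gives
#   D = 2 * sum( y*log(y/mu_hat) - (y - mu_hat) ).
def poisson_deviance(y, mu_hat):
    y = np.asarray(y, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    # take y*log(y/mu_hat) = 0 when y = 0 (its limit as y -> 0)
    term = np.where(y > 0, y * np.log(np.where(y > 0, y, 1.0) / mu_hat), 0.0)
    return 2.0 * np.sum(term - (y - mu_hat))

# Sanity check: the deviance is zero when the fit reproduces the data exactly
y = np.array([2.0, 5.0, 1.0])
print(poisson_deviance(y, y))  # -> 0.0
```

Because \( \phi=1 \) for the Poisson family, this deviance and the scaled deviance coincide.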

Example: Recall that for the normal distribution we had \( \theta_i=\mu_i \), \( b(\theta_i) = \frac{1}{2}\theta_i^2 \), and \( a_i(\phi)=\sigma^2 \), so the prior weights are \( p_i=1 \).

Thus, the deviance is \[ \tag{B.13}\begin{align} D(\boldsymbol{y},\hat{\boldsymbol{\mu}}) &= 2 \sum\{ y_i(y_i-\hat{\mu_i})- \frac{1}{2} y_i^2 +\frac{1}{2}\hat{\mu_i}^2\} \\ &= 2\sum\{ \frac{1}{2} y_i^2 - y_i\hat{\mu_i} + \frac{1}{2}\hat{\mu_i}^2\} \\ &= \sum(y_i - \hat{\mu_i})^2, \end{align} \] our good old friend, the residual sum of squares. \(\Box\)
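The identity in (B.13) is easy to verify numerically: plugging \( \theta_i=\mu_i \) and \( b(\theta_i)=\frac{1}{2}\theta_i^2 \) into the general deviance formula reproduces the residual sum of squares. The data and fitted values below are hypothetical.

```python
import numpy as np

# Numerical check of (B.13): with theta_i = mu_i and b(theta) = theta^2/2,
# the deviance of (B.12) reduces to the residual sum of squares.
y = np.array([1.2, 0.7, 2.5, 1.9])
mu_hat = np.array([1.0, 1.0, 2.0, 2.0])   # hypothetical fitted values

deviance = 2 * np.sum(y * (y - mu_hat) - 0.5 * y**2 + 0.5 * mu_hat**2)
rss = np.sum((y - mu_hat)**2)
print(deviance, rss)   # the two agree
```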

Let us now return to the comparison of two nested models \( \omega_1 \), with \( p_1 \) parameters, and \( \omega_2 \), with \( p_2 \) parameters, such that \( \omega_1 \subset \omega_2 \) and \( p_2 > p_1 \).

The log of the ratio of maximized likelihoods under the two models can be written as a difference of deviances, since the maximized log-likelihood under the saturated model cancels out. Thus, we have

\[\tag{B.14}-2\log\lambda = \frac{D(\omega_1)-D(\omega_2)}{\phi}.\]

The scale parameter \( \phi \) is either known or estimated using the larger model \( \omega_2 \).

Large sample theory tells us that the asymptotic distribution of this criterion under the usual regularity conditions is \( \chi^2_\nu \) with \( \nu=p_2-p_1 \) degrees of freedom.

Example: In the linear model with normal errors we estimate the unknown scale parameter \( \phi \) using the residual sum of squares of the larger model, so the criterion becomes \[ -2\log\lambda= \frac{\mbox{RSS}(\omega_1)-\mbox{RSS}(\omega_2)} {\mbox{RSS}(\omega_2)/(n-p_2)}. \]

In large samples the approximate distribution of this criterion is \( \chi^2_\nu \) with \( \nu=p_2-p_1 \) degrees of freedom. Under normality, however, we have an exact result: dividing the criterion by \( p_2-p_1 \) we obtain an \( F \) with \( p_2-p_1 \) and \( n-p_2 \) degrees of freedom. Note that as \( n \rightarrow \infty \) the degrees of freedom in the denominator approach \( \infty \) and \( (p_2-p_1)F \) converges to a \( \chi^2 \) with \( p_2-p_1 \) degrees of freedom, so the asymptotic and exact criteria become equivalent.\( \Box \)
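The nested-model comparison in the example can be sketched as follows: we fit an intercept-only model \( \omega_1 \) inside an intercept-plus-slope model \( \omega_2 \) by least squares, then form \( \{\mathrm{RSS}(\omega_1)-\mathrm{RSS}(\omega_2)\}/\{\mathrm{RSS}(\omega_2)/(n-p_2)\} \), which is \( p_2-p_1 \) times the usual \( F \) statistic. The simulated data are arbitrary.

```python
import numpy as np

# Sketch of the likelihood ratio criterion for nested normal linear models,
# using simulated data (coefficients and seed chosen arbitrarily).
rng = np.random.default_rng(1)
n = 50
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(size=n)

X1 = np.ones((n, 1))                       # omega_1: intercept only, p1 = 1
X2 = np.column_stack([np.ones(n), x])      # omega_2: intercept + slope, p2 = 2

def rss(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

p1, p2 = X1.shape[1], X2.shape[1]
criterion = (rss(X1, y) - rss(X2, y)) / (rss(X2, y) / (n - p2))
F = criterion / (p2 - p1)                  # exact F(p2 - p1, n - p2) under H0
print(criterion, F)
```

The criterion is referred to \( \chi^2_{p_2-p_1} \) in large samples, while \( F \) can be referred to its exact \( F(p_2-p_1,\,n-p_2) \) distribution under normality.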

In Sections B.4 and B.5 we will construct likelihood ratio tests for binomial and Poisson data. In those cases \( \phi=1 \) (unless one allows over-dispersion and estimates \( \phi \), but that’s another story) and the deviance is the same as the scaled deviance. All our tests will be based on asymptotic \( \chi^2 \) statistics.
