Germán Rodríguez
Generalized Linear Models, Princeton University

2.3 Tests of Hypotheses


Consider testing hypotheses about the regression coefficients β. Sometimes we will be interested in testing the significance of a single coefficient, say βj, but on other occasions we will want to test the joint significance of several components of β. In the next few sections we consider tests based on the sampling distribution of the maximum likelihood estimator and likelihood ratio tests.

2.3.1 Wald Tests

Consider first testing the significance of one particular coefficient, say

$$ H_0: \beta_j = 0. $$

The m.l.e. $\hat\beta_j$ has a distribution with mean 0 (under $H_0$) and variance given by the $j$-th diagonal element of the matrix in Equation 2.9. Thus, we can base our test on the ratio

$$ t = \frac{\hat\beta_j}{\sqrt{\mathrm{var}(\hat\beta_j)}}. \tag{2.10} $$

Note from Equation 2.9 that $\mathrm{var}(\hat\beta_j)$ depends on $\sigma^2$, which is usually unknown. In practice we replace $\sigma^2$ by the unbiased estimate based on the residual sum of squares.

Under the assumption of normality of the data, the ratio of the coefficient to its standard error has under $H_0$ a Student's t distribution with $n-p$ degrees of freedom when $\sigma^2$ is estimated, and a standard normal distribution if $\sigma^2$ is known. This result provides a basis for exact inference in samples of any size.

Under the weaker second-order assumptions concerning the means, variances and covariances of the observations, the ratio has approximately in large samples a standard normal distribution. This result provides a basis for approximate inference in large samples.

Many analysts treat the ratio as a Student's t statistic regardless of the sample size. If normality is suspect one should not conduct the test unless the sample is large, in which case it really makes no difference which distribution is used. If the sample size is moderate, using the t test provides a more conservative procedure. (The Student's t distribution converges to a standard normal as the degrees of freedom increases to $\infty$. For example the 95% two-tailed critical value is 2.09 for 20 d.f., and 1.98 for 100 d.f., compared to the normal critical value of 1.96.)
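To make this concrete, here is a minimal Python sketch of the ratio in Equation 2.10 using scipy.stats; the estimate, standard error, sample size and number of parameters are hypothetical values chosen only for illustration. The last line reproduces the critical values quoted above.

```python
# A sketch of the t ratio in Equation 2.10; all numeric inputs are hypothetical.
from scipy.stats import norm, t

beta_hat = 0.52          # hypothetical estimate of beta_j
se = 0.21                # hypothetical sqrt(var(beta_hat_j)), sigma^2 estimated
n, p = 25, 5             # hypothetical sample size and number of parameters

t_ratio = beta_hat / se
p_value = 2 * t.sf(abs(t_ratio), n - p)   # two-sided test with n - p d.f.
print(f"t = {t_ratio:.2f}, p = {p_value:.4f}")

# Critical values quoted in the text: Student's t approaches the normal.
print(t.ppf(0.975, 20), t.ppf(0.975, 100), norm.ppf(0.975))  # ~2.09, 1.98, 1.96
```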

The t test can also be used to construct a confidence interval for a coefficient. Specifically, we can state with $100(1-\alpha)\%$ confidence that $\beta_j$ is between the bounds

$$ \hat\beta_j \pm t_{1-\alpha/2,\,n-p}\,\sqrt{\mathrm{var}(\hat\beta_j)}, \tag{2.11} $$

where $t_{1-\alpha/2,\,n-p}$ is the two-sided critical value of Student's t distribution with $n-p$ d.f. for a test of size $\alpha$.
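A similar sketch, under the same hypothetical values as above, gives the bounds in Equation 2.11.

```python
# A sketch of the interval in Equation 2.11; the inputs are hypothetical.
from scipy.stats import t

beta_hat, se = 0.52, 0.21     # hypothetical estimate and standard error
n, p, alpha = 25, 5, 0.05

t_crit = t.ppf(1 - alpha / 2, n - p)               # t_{1-alpha/2, n-p}
print(beta_hat - t_crit * se, beta_hat + t_crit * se)
```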

The Wald test can also be used to test the joint significance of several coefficients. Let us partition the vector of coefficients into two components, say $\beta = (\beta_1, \beta_2)$ with $p_1$ and $p_2$ elements, respectively, and consider the hypothesis

$$ H_0: \beta_2 = 0. $$

In this case the Wald statistic is given by the quadratic form

$$ W = \hat\beta_2' \, \mathrm{var}^{-1}(\hat\beta_2) \, \hat\beta_2, $$

where $\hat\beta_2$ is the m.l.e. of $\beta_2$ and $\mathrm{var}(\hat\beta_2)$ is its variance-covariance matrix. Note that the variance depends on $\sigma^2$, which is usually unknown; in practice we substitute the estimate based on the residual sum of squares.

In the case of a single coefficient, $p_2 = 1$ and this formula reduces to the square of the t statistic in Equation 2.10.

Asymptotic theory tells us that under $H_0$ the large-sample distribution of the m.l.e. is multivariate normal with mean vector 0 and variance-covariance matrix $\mathrm{var}(\hat\beta_2)$. Consequently, the large-sample distribution of the quadratic form $W$ is chi-squared with $p_2$ degrees of freedom. This result holds whether $\sigma^2$ is known or estimated.

Under the assumption of normality we have a stronger result. The distribution of $W$ is exactly chi-squared with $p_2$ degrees of freedom if $\sigma^2$ is known. In the more general case where $\sigma^2$ is estimated using a residual sum of squares based on $n-p$ d.f., the distribution of $W/p_2$ is an F with $p_2$ and $n-p$ d.f.

Note that as $n$ approaches infinity for fixed $p$ (so $n-p$ approaches infinity), the F distribution times $p_2$ approaches a chi-squared distribution with $p_2$ degrees of freedom. Thus, in large samples it makes no difference whether one treats $W$ as chi-squared or $W/p_2$ as an F statistic. Many analysts treat $W/p_2$ as F for all sample sizes.
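As an illustration, the following sketch computes the quadratic form $W$ for a hypothetical estimated subvector and variance-covariance matrix, and obtains a p-value both ways: treating $W$ as chi-squared with $p_2$ d.f. and $W/p_2$ as an F with $p_2$ and $n-p$ d.f.

```python
# A sketch of the Wald test of H0: beta_2 = 0; the estimated subvector, its
# variance-covariance matrix and the dimensions are all hypothetical.
import numpy as np
from scipy.stats import chi2, f

beta2_hat = np.array([0.8, -0.3])        # hypothetical m.l.e. of beta_2
V = np.array([[0.04, 0.01],              # hypothetical var(beta_2_hat),
              [0.01, 0.09]])             # computed with the estimated sigma^2
n, p = 40, 5                             # hypothetical sample size and parameters
p2 = beta2_hat.size

W = beta2_hat @ np.linalg.solve(V, beta2_hat)   # the quadratic form W above

print("chi-squared p-value:", chi2.sf(W, p2))          # large-sample version
print("F p-value:", f.sf(W / p2, p2, n - p))           # exact under normality
```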

The situation is exactly analogous to the choice between the normal and Student’s t distributions in the case of one variable. In fact, a chi-squared with one degree of freedom is the square of a standard normal, and an F with one and v degrees of freedom is the square of a Student’s t with v degrees of freedom.
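These identities are easy to verify numerically; in the quick check below the denominator degrees of freedom (30) are arbitrary.

```python
# Numerical check: chi2(1) vs squared normal, F(1, v) vs squared t(v).
from scipy.stats import chi2, f, norm, t

v = 30
print(chi2.ppf(0.95, 1), norm.ppf(0.975) ** 2)   # both about 3.84
print(f.ppf(0.95, 1, v), t.ppf(0.975, v) ** 2)   # both about 4.17
```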

2.3.2 The Likelihood Ratio Test

Consider again testing the joint significance of several coefficients, say

$$ H_0: \beta_2 = 0 $$

as in the previous subsection. Note that we can partition the model matrix into two components $X = (X_1, X_2)$ with $p_1$ and $p_2$ predictors, respectively. The hypothesis of interest states that the response does not depend on the last $p_2$ predictors.

We now build a likelihood ratio test for this hypothesis. The general theory directs us to (1) fit two nested models: a smaller model with the first p1 predictors in X1, and a larger model with all p predictors in X; and (2) compare their maximized likelihoods (or log-likelihoods).

Suppose then that we fit the smaller model with the predictors in X1 only. We proceed by maximizing the log-likelihood of Equation 2.5 for a fixed value of σ2. The maximized log-likelihood is

$$ \max \log L(\beta_1) = c - \tfrac{1}{2}\,\mathrm{RSS}(X_1)/\sigma^2, $$

where $c = -\tfrac{n}{2}\log(2\pi\sigma^2)$ is a constant depending on $\pi$ and $\sigma^2$ but not on the parameters of interest. In a slight abuse of notation, we have written $\mathrm{RSS}(X_1)$ for the residual sum of squares after fitting $X_1$, which is of course a function of the estimate $\hat\beta_1$.

Consider now fitting the larger model X1+X2 with all predictors. The maximized log-likelihood for a fixed value of σ2 is

$$ \max \log L(\beta_1, \beta_2) = c - \tfrac{1}{2}\,\mathrm{RSS}(X_1 + X_2)/\sigma^2, $$

where $\mathrm{RSS}(X_1 + X_2)$ is the residual sum of squares after fitting $X_1$ and $X_2$, itself a function of the estimate $\hat\beta$.

To compare these log-likelihoods we calculate minus twice their difference. The constants cancel out and we obtain the likelihood ratio criterion

$$ -2\log\lambda = \frac{\mathrm{RSS}(X_1) - \mathrm{RSS}(X_1 + X_2)}{\sigma^2}. \tag{2.12} $$

There are two things to note about this criterion. First, we are directed to look at the reduction in the residual sum of squares when we add the predictors in X2. Basically, these variables are deemed to have a significant effect on the response if including them in the model results in a reduction in the residual sum of squares. Second, the reduction is compared to σ2, the error variance, which provides a unit of comparison.

To determine if the reduction (in units of σ2) exceeds what could be expected by chance alone, we compare the criterion to its sampling distribution. Large sample theory tells us that the distribution of the criterion converges to a chi-squared with p2 d.f.  The expected value of a chi-squared distribution with ν degrees of freedom is ν (and the variance is 2ν). Thus, chance alone would lead us to expect a reduction in the RSS of about one σ2 for each variable added to the model. To conclude that the reduction exceeds what would be expected by chance alone, we usually require an improvement that exceeds the 95-th percentile of the reference distribution.
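A minimal sketch of this comparison, with hypothetical residual sums of squares and a value of $\sigma^2$ that is assumed known, might look as follows.

```python
# A sketch of the criterion in Equation 2.12 with sigma^2 taken as known;
# the residual sums of squares and dimensions are hypothetical.
from scipy.stats import chi2

rss_x1 = 180.0        # RSS(X1), smaller model (hypothetical)
rss_x1_x2 = 150.0     # RSS(X1 + X2), larger model (hypothetical)
sigma2 = 4.0          # error variance, assumed known here
p2 = 3                # number of predictors added in X2

criterion = (rss_x1 - rss_x1_x2) / sigma2
print(criterion, chi2.ppf(0.95, p2))       # compare to the 95th percentile
print("p-value:", chi2.sf(criterion, p2))
```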

One slight difficulty with the development so far is that the criterion depends on σ2, which is not known. In practice, we substitute an estimate of σ2 based on the residual sum of squares of the larger model. Thus, we calculate the criterion in Equation 2.12 using

$$ \hat\sigma^2 = \mathrm{RSS}(X_1 + X_2)/(n-p). $$

The large-sample distribution of the criterion continues to be chi-squared with p2 degrees of freedom, even if σ2 has been estimated.

Under the assumption of normality, however, we have a stronger result. The likelihood ratio criterion $-2\log\lambda$ has an exact chi-squared distribution with $p_2$ d.f. if $\sigma^2$ is known. In the usual case where $\sigma^2$ is estimated, the criterion divided by $p_2$, namely

$$ F = \frac{\left(\mathrm{RSS}(X_1) - \mathrm{RSS}(X_1 + X_2)\right)/p_2}{\mathrm{RSS}(X_1 + X_2)/(n-p)}, \tag{2.13} $$

has an exact F distribution with $p_2$ and $n-p$ d.f.

The numerator of F is the reduction in the residual sum of squares per degree of freedom spent. The denominator is the average residual sum of squares, a measure of noise in the model. Thus, an F-ratio of one would indicate that the variables in $X_2$ are just adding noise. A ratio in excess of one would be indicative of signal. We usually reject $H_0$, and conclude that the variables in $X_2$ have an effect on the response, if the F criterion exceeds the 95-th percentage point of the F distribution with $p_2$ and $n-p$ degrees of freedom.
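The same comparison with $\sigma^2$ estimated from the larger model, that is the F statistic of Equation 2.13, can be sketched as follows; the residual sums of squares and dimensions are again hypothetical.

```python
# A sketch of the F test in Equation 2.13, with hypothetical RSS values;
# sigma^2 is now estimated from the larger model.
from scipy.stats import f

rss_x1, rss_x1_x2 = 180.0, 150.0   # hypothetical RSS for the two nested models
n, p, p2 = 40, 6, 3                # hypothetical n, total parameters, added terms

F = ((rss_x1 - rss_x1_x2) / p2) / (rss_x1_x2 / (n - p))
print(F, f.ppf(0.95, p2, n - p))           # reject H0 if F exceeds this value
print("p-value:", f.sf(F, p2, n - p))
```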

A Technical Note: In this section we have built the likelihood ratio test for the linear parameters $\beta$ by treating $\sigma^2$ as a nuisance parameter. In other words, we have maximized the log-likelihood with respect to $\beta$ for fixed values of $\sigma^2$. You may feel reassured to know that if we had maximized the log-likelihood with respect to both $\beta$ and $\sigma^2$ we would have ended up with an equivalent criterion based on a comparison of the logarithms of the residual sums of squares of the two models of interest. The approach adopted here leads more directly to the distributional results of interest and is typical of the treatment of scale parameters in generalized linear models.

2.3.3 Student’s t, F and the Anova Table

You may be wondering at this point whether you should use the Wald test, based on the large-sample distribution of the m.l.e., or the likelihood ratio test, based on a comparison of maximized likelihoods (or log-likelihoods). The answer in general is that in large samples the choice does not matter because the two types of tests are asymptotically equivalent.

In linear models, however, we have a much stronger result: the two tests are identical. The proof is beyond the scope of these notes, but we will verify it in the context of specific applications. The result is unique to linear models. When we consider logistic or Poisson regression models later in the sequel we will find that the Wald and likelihood ratio tests differ.

At least for linear models, however, we can offer some simple practical advice:

- To test hypotheses about a single coefficient, use the t-test based on the estimator and its standard error, as given in Equation 2.10.
- To test hypotheses about several coefficients, or more generally to compare nested models, use the F-test based on a comparison of residual sums of squares, as given in Equation 2.13.

The calculations leading to an F-test are often set out in an analysis of variance (anova) table, showing how the total sum of squares (the RSS of the null model) can be partitioned into a sum of squares associated with X1, a sum of squares added by X2, and a residual sum of squares. The table also shows the degrees of freedom associated with each sum of squares, and the mean square, or ratio of the sum of squares to its d.f.

Table 2.2 shows the usual format. We use ϕ to denote the null model. We also assume that one of the columns of $X_1$ was the constant, so this block adds only $p_1 - 1$ variables to the null model.

Table 2.2. The Hierarchical Anova Table

Source of variation    Sum of squares                 Degrees of freedom
X1                     RSS(ϕ) − RSS(X1)               p1 − 1
X2 given X1            RSS(X1) − RSS(X1 + X2)         p2
Residual               RSS(X1 + X2)                   n − p
Total                  RSS(ϕ)                         n − 1

Sometimes the component associated with the constant is shown explicitly and the bottom line becomes the total (also called 'uncorrected') sum of squares: $\sum y_i^2$. More detailed analysis of variance tables may be obtained by introducing the predictors one at a time, while keeping track of the reduction in residual sum of squares at each step.

Rather than give specific formulas for these cases, we stress here that all anova tables can be obtained by calculating differences in RSS’s and differences in the number of parameters between nested models. Many examples will be given in the applications that follow. A few descriptive measures of interest, such as simple, partial and multiple correlation coefficients, turn out to be simple functions of these sums of squares, and will be introduced in the context of the applications.
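As an illustration of this recipe, the sketch below builds the rows of Table 2.2 by fitting three nested models by ordinary least squares and differencing their residual sums of squares; the simulated data, the dimensions and the small rss helper are all hypothetical and serve only to show the bookkeeping.

```python
# A sketch of the bookkeeping behind Table 2.2: fit nested models by ordinary
# least squares and difference their RSS's. Data and dimensions are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, p1, p2 = 40, 3, 2                                # X1 includes the constant
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, p1 - 1))])
X2 = rng.normal(size=(n, p2))
y = X1 @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)

def rss(X, y):
    """Residual sum of squares after regressing y on the columns of X."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    return e @ e

rss_null = rss(np.ones((n, 1)), y)                  # RSS(phi), null model
rss_1 = rss(X1, y)                                  # RSS(X1)
rss_12 = rss(np.column_stack([X1, X2]), y)          # RSS(X1 + X2)

print("X1:          ", rss_null - rss_1, "on", p1 - 1, "d.f.")
print("X2 given X1: ", rss_1 - rss_12, "on", p2, "d.f.")
print("Residual:    ", rss_12, "on", n - p1 - p2, "d.f.")
print("Total:       ", rss_null, "on", n - 1, "d.f.")
```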

An important point to note before we leave the subject is that the order in which the variables are entered in the anova table (reflecting the order in which they are added to the model) is extremely important. In Table 2.2, we show the effect of adding the predictors in X2 to a model that already has X1. This net effect of X2 after allowing for X1 can be quite different from the gross effect of X2 when considered by itself. The distinction is important and will be stressed in the context of the applications that follow.
