Consider testing hypotheses about the regression coefficients $\boldsymbol{\beta}$. Sometimes we will be interested in testing the significance of a single coefficient, say $\beta_j$, but on other occasions we will want to test the joint significance of several components of $\boldsymbol{\beta}$.
In the next few sections we consider tests based on the
sampling distribution of the maximum likelihood estimator
and likelihood ratio tests.
2.3.1 Wald Tests
Consider first testing the significance of one particular coefficient, say

$$ H_0: \beta_j = 0. $$

The m.l.e. $\hat{\beta}_j$ has a distribution with mean 0 (under $H_0$) and variance given by the $j$-th diagonal element of the matrix in Equation 2.9. Thus, we can base our test on the ratio

$$ t = \frac{\hat{\beta}_j}{\sqrt{\mathrm{var}(\hat{\beta}_j)}} \qquad (2.10) $$

Note from Equation 2.9 that $\mathrm{var}(\hat{\beta}_j)$ depends on $\sigma^2$, which is usually unknown.
In practice we replace $\sigma^2$ by the unbiased estimate based on the residual sum of squares, $\hat{\sigma}^2 = \mathrm{RSS}(\hat{\boldsymbol{\beta}})/(n-p)$.
Under the assumption of normality of the data, the ratio of the coefficient to its standard error has, under $H_0$, a Student's $t$ distribution with $n-p$ degrees of freedom when $\sigma^2$ is estimated, and a standard normal distribution if $\sigma^2$ is known. This result provides a basis for exact inference in samples of any size.
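As a concrete illustration, the following is a minimal sketch of the test based on Equation 2.10 in Python; the simulated data, sample size and variable names are purely illustrative.

```python
# Sketch of the Wald t-test for a single coefficient (Equation 2.10),
# using simulated data; the setup and names are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 50, 3                                 # n observations, p coefficients
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 0.5, 0.0])             # "true" coefficients for the simulation
y = X @ beta + rng.normal(scale=2.0, size=n)

# Ordinary least squares / maximum likelihood estimates
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = rss / (n - p)                   # unbiased estimate of sigma^2

# t ratio for the j-th coefficient and its two-sided p-value
j = 2
se_j = np.sqrt(sigma2_hat * XtX_inv[j, j])
t_j = beta_hat[j] / se_j
p_value = 2 * stats.t.sf(abs(t_j), df=n - p)
print(t_j, p_value)
```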
Under the weaker second-order assumptions concerning the
means, variances and covariances of the observations, the
ratio has approximately in large samples a standard normal
distribution. This result provides a basis for approximate
inference in large samples.
Many analysts treat the ratio as a Student's $t$ statistic regardless of the sample size. If normality is suspect one should not conduct the test unless the sample is large, in which case it really makes no difference which distribution is used. If the sample size is moderate, using the $t$ test provides a more conservative procedure. (The Student's $t$ distribution converges to a standard normal as the degrees of freedom increase to $\infty$. For example, the 95% two-tailed critical value is 2.09 for 20 d.f. and 1.98 for 100 d.f., compared to the normal critical value of 1.96.)
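These critical values are easily reproduced; for instance, using scipy:

```python
# Two-tailed 95% critical values quoted above.
from scipy import stats
print(round(stats.t.ppf(0.975, df=20), 2))   # 2.09 for 20 d.f.
print(round(stats.t.ppf(0.975, df=100), 2))  # 1.98 for 100 d.f.
print(round(stats.norm.ppf(0.975), 2))       # 1.96 for the standard normal
```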
The $t$ test can also be used to construct a confidence interval for a coefficient. Specifically, we can state with $100(1-\alpha)\%$ confidence that $\beta_j$ lies between the bounds

$$ \hat{\beta}_j \pm t_{1-\alpha/2,\,n-p} \sqrt{\mathrm{var}(\hat{\beta}_j)}, \qquad (2.11) $$

where $t_{1-\alpha/2,\,n-p}$ is the two-sided critical value of Student's $t$ distribution with $n-p$ d.f. for a test of size $\alpha$.
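A small sketch of these bounds in Python, with the point estimate, standard error and degrees of freedom standing in as illustrative placeholders rather than values from any particular fit:

```python
# Confidence interval for a single coefficient (Equation 2.11).
from scipy import stats

beta_hat_j, se_j, df = 0.42, 0.17, 47        # illustrative placeholder values
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df)      # two-sided critical value
lower = beta_hat_j - t_crit * se_j
upper = beta_hat_j + t_crit * se_j
print(lower, upper)
```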
The Wald test can also be used to test the joint significance of several coefficients.
Let us partition the vector of coefficients into two components, say $\boldsymbol{\beta}' = (\boldsymbol{\beta}_1', \boldsymbol{\beta}_2')$, with $p_1$ and $p_2$ elements, respectively, and consider the hypothesis

$$ H_0: \boldsymbol{\beta}_2 = \mathbf{0}. $$

In this case the Wald statistic is given by the quadratic form

$$ W = \hat{\boldsymbol{\beta}}_2' \, \mathrm{var}^{-1}(\hat{\boldsymbol{\beta}}_2) \, \hat{\boldsymbol{\beta}}_2, $$

where $\hat{\boldsymbol{\beta}}_2$ is the m.l.e. of $\boldsymbol{\beta}_2$ and $\mathrm{var}(\hat{\boldsymbol{\beta}}_2)$ is its variance-covariance matrix.
Note that the variance depends on $\sigma^2$, which is usually unknown; in practice we substitute the estimate based on the residual sum of squares.
In the case of a single coefficient, $p_2 = 1$ and this formula reduces to the square of the $t$ statistic in Equation 2.10.
Asymptotic theory tells us that under $H_0$ the large-sample distribution of the m.l.e. $\hat{\boldsymbol{\beta}}_2$ is multivariate normal with mean vector $\mathbf{0}$ and variance-covariance matrix $\mathrm{var}(\hat{\boldsymbol{\beta}}_2)$.
Consequently, the large-sample distribution of the quadratic form $W$ is chi-squared with $p_2$ degrees of freedom.
This result holds whether $\sigma^2$ is known or estimated.
Under the assumption of normality we have a stronger result.
The distribution of $W$ is exactly chi-squared with $p_2$ degrees of freedom if $\sigma^2$ is known. In the more general case where $\sigma^2$ is estimated using a residual sum of squares based on $n-p$ d.f., the distribution of $W/p_2$ is an $F$ with $p_2$ and $n-p$ d.f.
Note that as $n$ approaches infinity for fixed $p$ (so $n-p$ approaches infinity), the $F$ distribution with $p_2$ and $n-p$ d.f. times $p_2$ approaches a chi-squared distribution with $p_2$ degrees of freedom. Thus, in large samples it makes no difference whether one treats $W$ as chi-squared or $W/p_2$ as an $F$ statistic. Many analysts treat $W/p_2$ as $F$ for all sample sizes.
The situation is exactly analogous to the choice between the normal and Student's $t$ distributions in the case of one variable.
In fact, a chi-squared with one degree of freedom is the square of a standard normal, and an $F$ with one and $\nu$ degrees of freedom is the square of a Student's $t$ with $\nu$ degrees of freedom.
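The following sketch illustrates the Wald test of the joint hypothesis $\boldsymbol{\beta}_2 = \mathbf{0}$ on simulated data; the design, coefficient values and names are illustrative.

```python
# Wald test of beta_2 = 0: quadratic form W, its chi-squared approximation,
# and the exact F version W/p2 under normality. Simulated, illustrative data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p1, p2 = 60, 2, 3
p = p1 + p2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 0.5, 0.3, 0.0, 0.0])
y = X @ beta + rng.normal(scale=1.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
rss = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = rss / (n - p)

# Estimated variance-covariance matrix of the last p2 coefficients
V2 = sigma2_hat * XtX_inv[p1:, p1:]
b2 = beta_hat[p1:]

W = b2 @ np.linalg.solve(V2, b2)             # quadratic form
p_chi2 = stats.chi2.sf(W, df=p2)             # large-sample chi-squared reference
p_f = stats.f.sf(W / p2, dfn=p2, dfd=n - p)  # exact F reference under normality
print(W, p_chi2, p_f)
```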
2.3.2 The Likelihood Ratio Test
Consider again testing the joint significance of several coefficients, say

$$ H_0: \boldsymbol{\beta}_2 = \mathbf{0}, $$

as in the previous subsection.
Note that we can partition the model matrix into two components $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2)$, with $p_1$ and $p_2$ predictors, respectively.
The hypothesis of interest states that the response does not depend on the last $p_2$ predictors.
We now build a likelihood ratio test for this hypothesis.
The general theory directs us to (1) fit two nested models: a smaller model with the first $p_1$ predictors in $\mathbf{X}_1$, and a larger model with all $p$ predictors in $\mathbf{X}$; and (2) compare their maximized likelihoods (or log-likelihoods).
Suppose then that we fit the smaller model with the predictors in $\mathbf{X}_1$ only.
We proceed by maximizing the log-likelihood of Equation 2.5 for a fixed value of $\sigma^2$.
The maximized log-likelihood is

$$ \max \log L(\boldsymbol{\beta}_1) = c - \tfrac{1}{2}\,\mathrm{RSS}(\mathbf{X}_1)/\sigma^2, $$

where $c = -\tfrac{n}{2}\log(2\pi\sigma^2)$ is a constant depending on $\pi$ and $\sigma^2$ but not on the parameters of interest.
In a slight abuse of notation, we have written $\mathrm{RSS}(\mathbf{X}_1)$ for the residual sum of squares after fitting $\mathbf{X}_1$, which is of course a function of the estimate $\hat{\boldsymbol{\beta}}_1$.
Consider now fitting the larger model $\mathbf{X}_1 + \mathbf{X}_2$ with all predictors. The maximized log-likelihood for a fixed value of $\sigma^2$ is

$$ \max \log L(\boldsymbol{\beta}_1, \boldsymbol{\beta}_2) = c - \tfrac{1}{2}\,\mathrm{RSS}(\mathbf{X}_1 + \mathbf{X}_2)/\sigma^2, $$

where $\mathrm{RSS}(\mathbf{X}_1 + \mathbf{X}_2)$ is the residual sum of squares after fitting $\mathbf{X}_1$ and $\mathbf{X}_2$, itself a function of the estimate $\hat{\boldsymbol{\beta}}$.

To compare these log-likelihoods we calculate minus twice their difference. The constants cancel out and we obtain the likelihood ratio criterion

$$ -2\log\lambda = \frac{\mathrm{RSS}(\mathbf{X}_1) - \mathrm{RSS}(\mathbf{X}_1 + \mathbf{X}_2)}{\sigma^2} \qquad (2.12) $$
There are two things to note about this criterion. First, we are directed to look at the reduction in the residual sum of squares when we add the predictors in $\mathbf{X}_2$. Basically, these variables are deemed to have a significant effect on the response if including them in the model results in a reduction in the residual sum of squares.
Second, the reduction is compared to $\sigma^2$, the error variance, which provides a unit of comparison.
To determine if the reduction (in units of $\sigma^2$) exceeds what could be expected by chance alone, we compare the criterion to its sampling distribution.
Large-sample theory tells us that the distribution of the criterion converges to a chi-squared with $p_2$ d.f.
The expected value of a chi-squared distribution with $p_2$ degrees of freedom is $p_2$ (and the variance is $2p_2$). Thus, chance alone would lead us to expect a reduction in the $\mathrm{RSS}$ of about one $\sigma^2$ for each variable added to the model. To conclude that the reduction exceeds what would be expected by chance alone, we usually require an improvement that exceeds the 95-th percentile of the reference distribution.
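For reference, the 95-th percentiles of the chi-squared distribution for a few values of $p_2$ are easy to obtain; for instance:

```python
# 95-th percentiles of the chi-squared reference distribution.
from scipy import stats
for p2 in (1, 2, 5):
    print(p2, round(stats.chi2.ppf(0.95, df=p2), 2))
# prints 3.84 for 1 d.f., 5.99 for 2 d.f., 11.07 for 5 d.f.
```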
One slight difficulty with the development so far is that the criterion depends on $\sigma^2$, which is not known. In practice, we substitute an estimate of $\sigma^2$ based on the residual sum of squares of the larger model. Thus, we calculate the criterion in Equation 2.12 using

$$ \hat{\sigma}^2 = \frac{\mathrm{RSS}(\mathbf{X}_1 + \mathbf{X}_2)}{n - p}. $$

The large-sample distribution of the criterion continues to be chi-squared with $p_2$ degrees of freedom, even if $\sigma^2$ has been estimated.
Under the assumption of normality, however, we have a stronger result. The likelihood ratio criterion $-2\log\lambda$ has an exact chi-squared distribution with $p_2$ d.f. if $\sigma^2$ is known. In the usual case where $\sigma^2$ is estimated, the criterion divided by $p_2$, namely

$$ F = \frac{\left(\mathrm{RSS}(\mathbf{X}_1) - \mathrm{RSS}(\mathbf{X}_1 + \mathbf{X}_2)\right)/p_2}{\mathrm{RSS}(\mathbf{X}_1 + \mathbf{X}_2)/(n - p)}, \qquad (2.13) $$

has an exact $F$ distribution with $p_2$ and $n-p$ d.f.
The numerator of $F$ is the reduction in the residual sum of squares per degree of freedom spent. The denominator is the average residual sum of squares, a measure of noise in the model.
Thus, an $F$-ratio of one would indicate that the variables in $\mathbf{X}_2$ are just adding noise. A ratio in excess of one would be indicative of signal. We usually reject $H_0$, and conclude that the variables in $\mathbf{X}_2$ have an effect on the response, if the $F$ criterion exceeds the 95-th percentage point of the $F$ distribution with $p_2$ and $n-p$ degrees of freedom.
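A minimal sketch of this $F$ test in Python, fitting the two nested models and comparing their residual sums of squares as in Equation 2.13; the simulated data and names are illustrative.

```python
# F test comparing nested models (Equation 2.13) on simulated data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p1, p2 = 80, 2, 2
p = p1 + p2
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, p1 - 1))])
X2 = rng.normal(size=(n, p2))
X = np.column_stack([X1, X2])
y = X1 @ np.array([1.0, 0.8]) + 0.5 * X2[:, 0] + rng.normal(size=n)

def rss(design, y):
    """Residual sum of squares from an ordinary least squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return resid @ resid

rss_small = rss(X1, y)                  # RSS(X1)
rss_large = rss(X, y)                   # RSS(X1 + X2)

F = ((rss_small - rss_large) / p2) / (rss_large / (n - p))
p_value = stats.f.sf(F, dfn=p2, dfd=n - p)
print(F, p_value)
```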
A Technical Note: In this section we have built the likelihood ratio test for the linear parameters $\boldsymbol{\beta}$ by treating $\sigma^2$ as a nuisance parameter.
In other words, we have maximized the log-likelihood with respect to $\boldsymbol{\beta}$ for fixed values of $\sigma^2$.
You may feel reassured to know that if we had maximized the log-likelihood with respect to both $\boldsymbol{\beta}$ and $\sigma^2$ we would have ended up with an equivalent criterion
of squares of the two models of interest. The approach
adopted here leads more directly to the distributional
results of interest and is typical of the treatment of
scale parameters in generalized linear models.
2.3.3 Student’s t, F and the Anova Table
You may be wondering at this point whether
you should use the Wald test, based on the large-sample
distribution of the m.l.e., or the likelihood ratio test,
based on a comparison of maximized likelihoods (or
log-likelihoods).
The answer in general is that in large samples
the choice does not matter because the two types of
tests are asymptotically equivalent.
In linear models, however, we have a much stronger result:
the two tests are identical.
The proof is beyond the scope of these notes, but we will
verify it in the context of specific applications.
The result is unique to linear models. When we
consider logistic or Poisson regression models later
in the sequel we will find that the Wald and likelihood
ratio tests differ.
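A quick numerical check of this identity on simulated data (the design and names are illustrative): the Wald statistic divided by $p_2$ and the $F$ criterion of Equation 2.13 agree up to rounding error.

```python
# Check that W/p2 (Wald) equals the F criterion of Equation 2.13 in a linear model.
import numpy as np

rng = np.random.default_rng(4)
n, p1, p2 = 40, 2, 2
p = p1 + p2
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, p1 - 1))])
X2 = rng.normal(size=(n, p2))
X = np.column_stack([X1, X2])
y = X @ np.array([1.0, 0.5, 0.3, -0.2]) + rng.normal(size=n)

# Wald version: quadratic form in the last p2 coefficients
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
rss_large = np.sum((y - X @ beta_hat) ** 2)
sigma2_hat = rss_large / (n - p)
V2 = sigma2_hat * XtX_inv[p1:, p1:]
b2 = beta_hat[p1:]
wald_f = (b2 @ np.linalg.solve(V2, b2)) / p2

# Likelihood ratio version: compare the RSS's of the nested models
coef1, *_ = np.linalg.lstsq(X1, y, rcond=None)
rss_small = np.sum((y - X1 @ coef1) ** 2)
lr_f = ((rss_small - rss_large) / p2) / (rss_large / (n - p))

print(wald_f, lr_f)                     # identical up to rounding error
```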
At least for linear models, however, we can
offer some simple practical advice:
To test hypotheses about a single coefficient, use the $t$-test based on the estimator and its standard error, as given in Equation 2.10.
To test hypotheses about several coefficients, or more generally to compare nested models, use the $F$-test based on a comparison of $\mathrm{RSS}$'s, as given in Equation 2.13.
The calculations leading to an $F$-test are often set out in an analysis of variance (anova) table, showing how the total sum of squares (the $\mathrm{RSS}$ of the null model) can be partitioned into a sum of squares associated with $\mathbf{X}_1$, a sum of squares added by $\mathbf{X}_2$, and a residual sum of squares. The table also shows the degrees of freedom associated with each sum of squares, and the mean square, or ratio of the sum of squares to its d.f.
Table 2.2 shows the usual format. We use $\phi$ to denote the null model. We also assume that one of the columns of $\mathbf{X}_1$ was the constant, so this block adds only $p_1 - 1$ variables to the null model.
Table 2.2. The Hierarchical Anova Table

Source of variation                  | Sum of squares                                                           | Degrees of freedom
$\mathbf{X}_1$                       | $\mathrm{RSS}(\phi) - \mathrm{RSS}(\mathbf{X}_1)$                        | $p_1 - 1$
$\mathbf{X}_2$ given $\mathbf{X}_1$  | $\mathrm{RSS}(\mathbf{X}_1) - \mathrm{RSS}(\mathbf{X}_1 + \mathbf{X}_2)$ | $p_2$
Residual                             | $\mathrm{RSS}(\mathbf{X}_1 + \mathbf{X}_2)$                              | $n - p$
Total                                | $\mathrm{RSS}(\phi)$                                                     | $n - 1$
Sometimes the component associated with the constant is shown explicitly and the bottom line becomes the total (also called 'uncorrected') sum of squares: $\sum y_i^2$.
More detailed analysis of variance tables may be obtained by
introducing the predictors one at a time, while keeping
track of the reduction in residual sum of squares at
each step.
Rather than give specific formulas for these
cases, we stress here that all anova tables can
be obtained by calculating differences in $\mathrm{RSS}$'s and
differences in the number of parameters between nested models.
Many examples will be given in the applications that follow.
A few descriptive measures of interest, such as simple,
partial and multiple correlation coefficients, turn out to
be simple functions of these sums of squares, and will be
introduced in the context of the applications.
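As a sketch of the recipe just described, the hierarchical anova table of Table 2.2 can be assembled directly from differences in $\mathrm{RSS}$'s and numbers of parameters between nested models; the simulated data and names below are illustrative.

```python
# Hierarchical anova table built from RSS differences between nested models.
import numpy as np

rng = np.random.default_rng(3)
n, p1, p2 = 100, 3, 2
p = p1 + p2
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, p1 - 1))])
X2 = rng.normal(size=(n, p2))
y = X1 @ np.array([2.0, 1.0, 0.5]) + X2 @ np.array([0.7, 0.0]) + rng.normal(size=n)

def rss(design, y):
    """Residual sum of squares from an ordinary least squares fit."""
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return resid @ resid

rss_null = np.sum((y - y.mean()) ** 2)        # RSS of the null model (constant only)
rss_x1 = rss(X1, y)                           # RSS(X1)
rss_x12 = rss(np.column_stack([X1, X2]), y)   # RSS(X1 + X2)

rows = [
    ("X1",          rss_null - rss_x1, p1 - 1),
    ("X2 given X1", rss_x1 - rss_x12,  p2),
    ("Residual",    rss_x12,           n - p),
    ("Total",       rss_null,          n - 1),
]
print(f"{'Source':<12} {'SS':>10} {'df':>4} {'MS':>10}")
for source, ss, df in rows:
    print(f"{source:<12} {ss:10.2f} {df:4d} {ss / df:10.2f}")
```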
An important point to note before we leave the subject
is that the order in which the variables are entered in
the anova table (reflecting the order in which they are
added to the model) is extremely important.
In Table 2.2, we show the effect of adding the predictors in $\mathbf{X}_2$ to a model that already has $\mathbf{X}_1$. This net effect of $\mathbf{X}_2$ after allowing for $\mathbf{X}_1$ can be quite different from the gross effect of $\mathbf{X}_2$ when considered by itself. The distinction is
important and will be stressed in the context of the applications
that follow.