Statistics and Population

Lecture Notes

2.6 One-Way Analysis of Variance

We now consider models where the predictors are categorical variables or factors with a discrete number of levels. To illustrate the use of these models we will group the index of social setting (and later the index of family planning effort) into discrete categories.

2.6.1 The One-Way Layout

Table 2.10 shows the percent decline in the CBR for the 20 countries in our illustrative dataset, classified according to the index of social setting in three categories: low (under 70 points), medium (70–79) and high (80 or more).

Table 2.10. CBR Decline by Levels of Social Setting

Setting	Percent decline in CBR
Low	1, 0, 7, 21, 13, 4, 7
Medium	10, 6, 2, 0, 25
High	9, 11, 29, 29, 40, 21, 22, 29

It will be convenient to modify our notation to reflect the one-way layout of the data explicitly. Let \( k \) denote the number of groups or levels of the factor, \( n_i \) denote the number of observations in group \( i \), and let \( y_{ij} \) denote the response for the \( j \)-th unit in the \( i \)-th group, for \( j=1,\ldots,n_i \), and \( i=1,\ldots,k \). In our example \( k=3 \) and \( y_{ij} \) is the CBR decline in the \( j \)-th country in the \( i \)-th category of social setting, with \( i=1,2,3 \); \( j=1, \ldots, n_i \); \( n_1=7 \), \( n_2=5 \) and \( n_3=8 \).
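
The notation can be made concrete with a short sketch. The following Python fragment is only an illustration (the variable names `cbr_decline`, `k`, `n` and `y` are our own choices); it stores the data of Table 2.10 by group and recovers \( k \) and the \( n_i \).

```python
# CBR decline (percent) by level of social setting, transcribed from Table 2.10.
cbr_decline = {
    "low": [1, 0, 7, 21, 13, 4, 7],
    "medium": [10, 6, 2, 0, 25],
    "high": [9, 11, 29, 29, 40, 21, 22, 29],
}

k = len(cbr_decline)                             # number of groups (levels)
n = [len(y_i) for y_i in cbr_decline.values()]   # group sizes n_1, ..., n_k

# y[i][j] holds y_ij, but with zero-based indices (the text counts from one).
y = list(cbr_decline.values())

print("k =", k, " n_i =", n, " total =", sum(n))
```

Note that Python indexes from zero, so `y[2][4]` corresponds to \( y_{35} \) in the notation of the text.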

2.6.2 The One-Factor Model

As usual, we treat \( y_{ij} \) as a realization of a random variable \( Y_{ij} \sim N(\mu_{ij}, \sigma^2) \), where the variance is the same for all observations. In terms of the systematic structure of the model, we assume that

\[\tag{2.18}\mu_{ij} = \mu + \alpha_i,\]

where \( \mu \) plays the role of the constant and \( \alpha_i \) represents the effect of level \( i \) of the factor.

Before we proceed further, it is important to note that the model as written is not identified. We have essentially \( k \) groups but have introduced \( k+1 \) linear parameters. The solution is to introduce a constraint, and there are several ways in which we could proceed.

One approach is to set \( \mu=0 \) (or simply drop \( \mu \)). If we do this, the \( \alpha_i \)’s become cell means, with \( \alpha_i \) representing the expected response in group \( i \). While simple and attractive, this approach does not generalize well to models with more than one factor.

Our preferred alternative is to set one of the \( \alpha_i \)’s to zero. Conventionally we set \( \alpha_1=0 \), but any of the groups could be chosen as the reference cell or level. In this approach \( \mu \) becomes the expected response in the reference cell, and \( \alpha_i \) becomes the effect of level \( i \) of the factor, compared to the reference level.

A third alternative is to require the group effects to add up to zero, so \( \sum \alpha_i = 0 \). In this case \( \mu \) represents some sort of overall expected response, and \( \alpha_i \) measures the extent to which responses at level \( i \) of the factor deviate from the overall mean. Some statistics texts refer to this constraint as the ‘usual’ restrictions, but I think the reference cell method is now used more widely in social research.

A variant of the ‘usual’ restrictions is to require a weighted sum of the effects to add up to zero, so \( \sum w_i \alpha_i = 0 \). The weights are often taken to be the number of observations in each group, so \( w_i=n_i \). In this case \( \mu \) is a weighted average representing the expected response, and \( \alpha_i \) is, as before, the extent to which responses at level \( i \) of the factor deviate from the overall mean.

Each of these parameterizations can easily be translated into one of the others, so the choice can rest on practical considerations. The reference cell method is easy to implement in a regression context and the resulting parameters have a clear interpretation.
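
To see how the parameterizations translate into one another, here is a minimal sketch in Python. It uses the sample group means of Table 2.10 as stand-ins for the expected responses, which is an assumption made purely for illustration; the names `mu_ref`, `mu_sum`, `mu_wtd` and so on are ours.

```python
from statistics import mean

groups = {
    "low": [1, 0, 7, 21, 13, 4, 7],
    "medium": [10, 6, 2, 0, 25],
    "high": [9, 11, 29, 29, 40, 21, 22, 29],
}
n = [len(y) for y in groups.values()]

# Cell means parameterization (mu = 0): one parameter per group.
cell_means = [mean(y) for y in groups.values()]

# Reference cell (alpha_1 = 0): mu is the mean of the first group.
mu_ref = cell_means[0]
alpha_ref = [m - mu_ref for m in cell_means]

# Sum-to-zero (sum of alpha_i = 0): mu is the unweighted mean of the cell means.
mu_sum = mean(cell_means)
alpha_sum = [m - mu_sum for m in cell_means]

# Weighted sum-to-zero (sum of n_i * alpha_i = 0): mu is the grand mean.
mu_wtd = sum(n_i * m for n_i, m in zip(n, cell_means)) / sum(n)
alpha_wtd = [m - mu_wtd for m in cell_means]

# Every parameterization reproduces the same cell means mu + alpha_i.
for mu, alpha in [(mu_ref, alpha_ref), (mu_sum, alpha_sum), (mu_wtd, alpha_wtd)]:
    fitted = [mu + a for a in alpha]
    assert all(abs(f - m) < 1e-9 for f, m in zip(fitted, cell_means))
```

The constant and the effects change from one constraint to the next, but the implied group means do not, which is why the choice can rest on convenience.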

2.6.3 Estimates and Standard Errors

The model in Equation 2.18 is a special case of the generalized linear model, where the design matrix \( \boldsymbol{X} \) has \( k+1 \) columns: a column of ones representing the constant, and \( k \) columns of indicator variables, say \( x_1, \ldots, x_k \), where \( x_i \) takes the value one for observations at level \( i \) of the factor and the value zero otherwise.

Note that the model matrix as defined so far is rank deficient, because the first column is the sum of the last \( k \). Hence the need for constraints. The cell means approach is equivalent to dropping the constant, and the reference cell method is equivalent to dropping one of the indicator or dummy variables representing the levels of the factor. Both approaches are easily implemented. The other two approaches, which set to zero either the unweighted or weighted sum of the effects, are best implemented using Lagrange multipliers and will not be considered here.
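
The rank deficiency is easy to verify directly. The sketch below, written in plain Python with names of our own choosing, builds the full model matrix for the example, checks that the constant minus the three indicators is identically zero, and then forms the two reduced matrices described above.

```python
# One label per country, in the order of Table 2.10.
levels = ["low"] * 7 + ["medium"] * 5 + ["high"] * 8
names = ["low", "medium", "high"]

# Full model matrix: a column of ones followed by k indicator columns.
X_full = [[1] + [int(level == name) for name in names] for level in levels]

# The combination (1, -1, -1, -1) of the columns is zero in every row,
# so the columns are linearly dependent and the matrix is rank deficient.
c = [1, -1, -1, -1]
Xc = [sum(x * c_j for x, c_j in zip(row, c)) for row in X_full]
assert all(value == 0 for value in Xc)

# Cell means approach: drop the constant.
X_cell_means = [row[1:] for row in X_full]

# Reference cell approach: drop the indicator of the first level.
X_reference = [[row[0]] + row[2:] for row in X_full]

print(X_reference[0], X_reference[7], X_reference[19])
```

Each reduced matrix has three columns for three groups, so the dependence disappears.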

Parameter estimates, standard errors and \( t \) ratios can then be obtained from the general results of Sections 2.2 and 2.3. You may be interested to know that the estimates of the regression coefficients in the one-way layout are simple functions of the cell means. Using the reference cell method,

\[ \hat{\mu} = \bar{y}_1 \quad\mbox{and}\quad \hat{\alpha}_i = \bar{y}_i-\bar{y}_1 \:\mbox{for}\:i>1, \]

where \( \bar{y}_i \) is the average of the responses at level \( i \) of the factor.

Table 2.11 shows the estimates for our sample data. We expect a CBR decline of almost 8% in countries with low social setting (the reference cell). Increasing social setting to medium or high is associated with additional declines of one and 16 percentage points, respectively, compared to low setting.

Table 2.11. Estimates for One-Way Anova Model of
CBR Decline by Levels of Social Setting

Parameter	Symbol	Estimate	Std. Error	\(t\)-ratio
Low	\(\mu\)	7.571	3.498	2.16
Medium (vs. low)	\(\alpha_2\)	1.029	5.420	0.19
High (vs. low)	\(\alpha_3\)	16.179	4.790	3.38
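
The entries of Table 2.11 can be reproduced from the cell means. The sketch below does not use the general matrix results of Section 2.2; instead it assumes the standard shortcut for the one-way layout, where the variance of a group mean is \( \sigma^2/n_i \) and the variance of a difference between two independent group means is \( \sigma^2(1/n_1 + 1/n_i) \), with \( \sigma^2 \) estimated by the residual mean square. Variable names are our own.

```python
from math import sqrt
from statistics import mean

groups = {
    "low": [1, 0, 7, 21, 13, 4, 7],
    "medium": [10, 6, 2, 0, 25],
    "high": [9, 11, 29, 29, 40, 21, 22, 29],
}
n = {name: len(y) for name, y in groups.items()}
ybar = {name: mean(y) for name, y in groups.items()}

# Residual sum of squares and estimate of sigma^2 on N - k d.f.
rss = sum((y_ij - ybar[name]) ** 2 for name, y in groups.items() for y_ij in y)
df_resid = sum(n.values()) - len(groups)
sigma2 = rss / df_resid

# Reference cell estimates: the constant is the mean of the reference group,
# and each effect is a difference from that mean.
ref = "low"
estimates = {ref: ybar[ref]}
std_errors = {ref: sqrt(sigma2 / n[ref])}
for name in groups:
    if name != ref:
        estimates[name] = ybar[name] - ybar[ref]
        std_errors[name] = sqrt(sigma2 * (1 / n[ref] + 1 / n[name]))

t_ratios = {name: estimates[name] / std_errors[name] for name in estimates}

for name in estimates:
    print(f"{name:8s} {estimates[name]:8.3f} {std_errors[name]:8.3f} {t_ratios[name]:6.2f}")
```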

Looking at the \( t \) ratios we see that the difference between medium and low setting is not significant, so we do not reject \( H_0: \alpha_2=0 \), whereas the difference between high and low setting, with a \( t \)-ratio of 3.38 on 17 d.f. and a two-sided P-value of 0.004, is highly significant, so we reject \( H_0:\alpha_3=0 \). These \( t \)-ratios test the significance of two particular contrasts: medium vs. low and high vs. low. In the next subsection we consider an overall test of the significance of social setting.

2.6.4 The One-Way Anova Table

Fitting the model with social setting treated as a factor reduces the \( \mbox{RSS} \) from 2650.2 (for the null model) to 1456.4, a gain of 1193.8 at the expense of two degrees of freedom (the two \( \alpha \)’s). We can contrast this gain with the remaining \( \mbox{RSS} \) of 1456.4 on 17 d.f. The calculations are laid out in Table 2.12, and lead to an \( F \)-ratio of 6.97 on 2 and 17 d.f., which has a P-value of 0.006. We therefore reject the hypothesis \( H_0: \alpha_2=\alpha_3=0 \) of no setting effects, and conclude that the expected response depends on social setting.

Table 2.12. Analysis of Variance for One-Factor Model
of CBR Decline by Levels of Social Setting

Source of variation	Sum of squares	Degrees of freedom	Mean square	\(F\)-ratio
Setting	1193.8	2	596.9	6.97
Residual	1456.4	17	85.7
Total	2650.2	19
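
The anova table can be rebuilt from the raw data of Table 2.10, as in the following sketch (variable names are ours). The P-value uses a closed form for the upper tail of the \( F \) distribution that holds only when the numerator has exactly two degrees of freedom, as it does here; with any other numerator d.f. one would need a statistical library.

```python
from statistics import mean

groups = {
    "low": [1, 0, 7, 21, 13, 4, 7],
    "medium": [10, 6, 2, 0, 25],
    "high": [9, 11, 29, 29, 40, 21, 22, 29],
}
all_y = [y_ij for y in groups.values() for y_ij in y]
grand_mean = mean(all_y)
group_means = {name: mean(y) for name, y in groups.items()}

# Total SS (RSS of the null model) and residual SS of the one-factor model.
tss = sum((y_ij - grand_mean) ** 2 for y_ij in all_y)
rss = sum((y_ij - group_means[name]) ** 2
          for name, y in groups.items() for y_ij in y)
ss_setting = tss - rss

df_setting = len(groups) - 1          # k - 1
df_resid = len(all_y) - len(groups)   # N - k

ms_setting = ss_setting / df_setting
ms_resid = rss / df_resid
f_ratio = ms_setting / ms_resid

# Upper tail of F(2, d): P(F > f) = (1 + 2 f / d) ** (-d / 2).
# This shortcut is valid only for two numerator d.f.
p_value = (1 + 2 * f_ratio / df_resid) ** (-df_resid / 2)

print(f"Setting   {ss_setting:8.1f} {df_setting:3d} {ms_setting:8.1f} {f_ratio:6.2f}")
print(f"Residual  {rss:8.1f} {df_resid:3d} {ms_resid:8.1f}")
print(f"Total     {tss:8.1f} {df_setting + df_resid:3d}")
print(f"P-value   {p_value:.3f}")
```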

Having established that social setting has an effect on CBR decline, we can inspect the parameter estimates and \( t \)-ratios to learn more about the nature of the effect. As noted earlier, the difference between high and low settings is significant, while that between medium and low is not.

It is instructive to calculate the Wald test for this example. Let \( \boldsymbol{\alpha} = (\alpha_2,\alpha_3)' \) denote the two setting effects. The estimate and its variance-covariance matrix, calculated using the general results of Section 2.2, are

\[ \hat{\boldsymbol{\alpha}} = \left( \begin{array}{r} 1.029\\16.179 \end{array} \right) \quad\mbox{and}\quad \hat{\mbox{var}}(\hat{\boldsymbol{\alpha}}) = \left( \begin{array}{rr} 29.373& 12.239 \\ 12.239& 22.948 \end{array} \right). \]

The Wald statistic is

\[ W = \hat{\boldsymbol{\alpha}}' \: \hat{\mbox{var}}^{-1}(\hat{\boldsymbol{\alpha}}) \: \hat{\boldsymbol{\alpha}} = 13.94, \]

and has approximately a chi-squared distribution with two d.f. Under the assumption of normality, however, we can divide by two to obtain \( F \) = 6.97, which has an \( F \) distribution with two and 17 d.f., and coincides with the test based on the reduction in the residual sum of squares, as shown in Table 2.12.
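
As a check on the arithmetic, the quadratic form can be evaluated by hand, since a two-by-two matrix is easily inverted. The sketch below takes the rounded estimates and variances printed above as inputs, so the result agrees with the reported value only up to rounding; the variable names are ours.

```python
# Estimates and variance-covariance matrix as printed in the text (rounded).
alpha = [1.029, 16.179]
V = [[29.373, 12.239],
     [12.239, 22.948]]

# Inverse of a 2x2 matrix: swap the diagonal, negate the off-diagonal,
# and divide by the determinant.
det = V[0][0] * V[1][1] - V[0][1] * V[1][0]
V_inv = [[V[1][1] / det, -V[0][1] / det],
         [-V[1][0] / det, V[0][0] / det]]

# Wald statistic W = alpha' V^{-1} alpha.
W = sum(alpha[i] * V_inv[i][j] * alpha[j] for i in range(2) for j in range(2))

# Dividing by the number of restrictions gives the F statistic.
F = W / 2

print(f"W = {W:.2f}, F = {F:.2f}")
```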

2.6.5 The Correlation Ratio

Note from Table 2.12 that the model treating social setting as a factor with three levels has reduced the \( \mbox{RSS} \) by 1193.8 out of 2650.2, thereby explaining 45.0%. The square root of the proportion of variance explained by a discrete factor is called the correlation ratio, and is often denoted \( \eta \). In our example \( \hat{\eta}=0.671 \).
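
The correlation ratio can be computed directly from the two sums of squares. The following sketch recomputes them from the data of Table 2.10 rather than from the rounded entries of Table 2.12; the names `eta_squared` and `eta` are ours.

```python
from math import sqrt
from statistics import mean

groups = {
    "low": [1, 0, 7, 21, 13, 4, 7],
    "medium": [10, 6, 2, 0, 25],
    "high": [9, 11, 29, 29, 40, 21, 22, 29],
}
all_y = [y_ij for y in groups.values() for y_ij in y]
grand_mean = mean(all_y)

# Total and residual sums of squares.
tss = sum((y_ij - grand_mean) ** 2 for y_ij in all_y)
rss = 0.0
for y in groups.values():
    group_mean = mean(y)
    rss += sum((y_ij - group_mean) ** 2 for y_ij in y)

# Proportion of variation explained by the factor, and its square root.
eta_squared = (tss - rss) / tss
eta = sqrt(eta_squared)

print(f"proportion explained = {eta_squared:.3f}, eta = {eta:.3f}")
```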

If the factor has only two categories the resulting coefficient is called the point-biserial correlation, a measure often used in psychometrics to correlate a test score (a continuous variable) with the answer to a dichotomous item (correct or incorrect). Note that both measures are identical in construction to Pearson’s correlation coefficient. The difference in terminology reflects whether the predictor is a continuous variable with a linear effect or a discrete variable with two or more than two categories.
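
The equivalence in the two-category case can be checked numerically. As an illustration only, the sketch below keeps just the low and high groups of Table 2.10 (a subset that is not analyzed in these notes), codes group membership as a 0/1 indicator, and compares Pearson's correlation with the correlation ratio obtained from the one-way decomposition.

```python
from math import sqrt
from statistics import mean

low = [1, 0, 7, 21, 13, 4, 7]
high = [9, 11, 29, 29, 40, 21, 22, 29]

y = low + high
x = [0] * len(low) + [1] * len(high)   # indicator of the high group

# Pearson's correlation between the indicator and the response.
xbar, ybar = mean(x), mean(y)
sxy = sum((x_i - xbar) * (y_i - ybar) for x_i, y_i in zip(x, y))
sxx = sum((x_i - xbar) ** 2 for x_i in x)
syy = sum((y_i - ybar) ** 2 for y_i in y)
r = sxy / sqrt(sxx * syy)

# Correlation ratio: square root of the proportion of variation explained
# by treating group membership as a two-level factor.
low_mean, high_mean = mean(low), mean(high)
rss = (sum((v - low_mean) ** 2 for v in low)
       + sum((v - high_mean) ** 2 for v in high))
eta = sqrt(1 - rss / syy)

print(f"r = {r:.4f}, eta = {eta:.4f}")
```

The correlation ratio is a square root and therefore non-negative, so in general it agrees with the absolute value of the point-biserial correlation; here the high group has the larger mean and the two coincide.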
