Germán Rodríguez
Generalized Linear Models Princeton University

2.6 One-Way Analysis of Variance

Let us read the data again, and then group social setting into three categories: < 70, 70-79 and 80+.

First we will make a copy, which I’ll call setting_g for social setting grouped. (Everyone has their own conventions for naming variables. I try to keep variable names short, lowercase, and hopefully not too cryptic. Because we are just starting I will emphasize the ‘not too cryptic’ part, otherwise I might have used ssg. Stata allows variable names to have up to 32 characters, but most commands print only 12, so it is best to stick to a maximum of 12.)

. use https://grodri.github.io/datasets/effort, clear
(Family Planning Effort Data)

. generate setting_g = setting

Then we recode it into categories <70, 70-79, and 80+, thus creating a discrete factor with three levels.

. recode setting_g min/69=1 70/79=2 80/max=3
(20 changes made to setting_g)
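A quick way to convince yourself that the recode did what was intended is to rebuild the grouping a second way and compare. The sketch below is mine rather than part of the original log: it uses the irecode() function, and the name setting_chk is a hypothetical helper. It assumes setting takes integer values, since a value such as 69.5 would fall between the recode ranges and be left unchanged.

```stata
* Sketch: rebuild the grouping with irecode() and compare it with recode.
use https://grodri.github.io/datasets/effort, clear
generate setting_g = setting
recode setting_g min/69=1 70/79=2 80/max=3

* irecode(x, 69, 79) is 0 if x <= 69, 1 if 69 < x <= 79, and 2 if x > 79
generate byte setting_chk = 1 + irecode(setting, 69, 79)  // hypothetical name

* Stops with an error if the two codings ever disagree
assert setting_g == setting_chk

* Range of the original variable within each group
tabstat setting, by(setting_g) statistics(min max n)
```

The tabstat line is optional, but seeing the minimum and maximum of setting in each group makes the cutpoints easy to check by eye.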

It might be a good idea to label the new variable and its categories. I will define a new set of labels called setting_g and assign it to the values of the variable. The names of the variable and the label don’t have to be the same. For example, one could have a label called yesno assigned to the values of all variables that take “Yes” and “No” values. In this case it makes sense to use the same name.

. label var setting_g "Social Setting (Grouped)"

. label define setting_g 1 "Low" 2 "Medium" 3 "High"

. label values setting_g setting_g
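To illustrate the point about sharing one label among several variables, here is a minimal sketch of the yesno idea. The two indicators, high_setting and strong_effort, are hypothetical variables created only for this illustration, and the cutoffs of 80 and 15 are my assumptions; neither variable is used elsewhere in this log.

```stata
* Sketch: one set of value labels attached to two different variables.
use https://grodri.github.io/datasets/effort, clear

* Two hypothetical 0/1 indicators, created only to illustrate label sharing
generate byte high_setting  = setting >= 80 if !missing(setting)
generate byte strong_effort = effort  >= 15 if !missing(effort)

* Define the labels once ...
label define yesno 0 "No" 1 "Yes"

* ... and attach them to both variables
label values high_setting strong_effort yesno

tabulate high_setting strong_effort
```

Because the label is stored separately from the variables, editing yesno later changes how both variables are displayed.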

By the way, one can shorten this process using options of the recode command, as shown in Section 2.7 of this log, but I think it’s good to see all the steps once.
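For the curious, a compact version might look like the sketch below. It relies on the generate() and label() options of recode, with the value labels written inside each rule; the exact form used in Section 2.7 may differ. The /// continuation requires running this from a do-file.

```stata
* Sketch: copy, recode, and label the values in a single recode command.
use https://grodri.github.io/datasets/effort, clear

recode setting (min/69 = 1 "Low") (70/79 = 2 "Medium") (80/max = 3 "High"), ///
    generate(setting_g) label(setting_g)

* The variable label still needs its own command
label variable setting_g "Social Setting (Grouped)"

tabulate setting_g
```

The generate() option leaves the original setting untouched, so the separate copy step is no longer needed.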

Let us look at the mean response by level of social setting:

. tabulate setting_g, summarize(change)

     Social │ Summary of % Change in CBR between
    Setting │            1965 and 1975
  (Grouped) │        Mean   Std. dev.       Freq.
────────────┼────────────────────────────────────
        Low │   7.5714286   7.3452284           7
     Medium │         8.6   9.9398189           5
       High │       23.75   10.264363           8
────────────┼────────────────────────────────────
      Total │        14.3   11.810343          20

We observe substantially more fertility decline in countries with higher setting, but only a small difference between the low and medium categories.
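The differences just described can be computed directly from the group means. The following is a sketch of one way to do it, not part of the original log; the scalar names m_low, m_med, and m_high are hypothetical, and the block reloads the data so that it runs on its own.

```stata
* Sketch: group means and their differences from the low-setting group.
use https://grodri.github.io/datasets/effort, clear
recode setting (min/69=1) (70/79=2) (80/max=3), generate(setting_g)

* Same summary as tabulate, summarize(), in a more compact layout
tabstat change, by(setting_g) statistics(mean sd n)

* Store each group mean in a scalar (hypothetical names)
quietly summarize change if setting_g == 1
scalar m_low = r(mean)
quietly summarize change if setting_g == 2
scalar m_med = r(mean)
quietly summarize change if setting_g == 3
scalar m_high = r(mean)

display "Medium - Low = " %6.3f (m_med - m_low)
display "High   - Low = " %6.3f (m_high - m_low)
```

These two differences, about 1.03 and 16.18, are the quantities that reappear as coefficients when the grouped factor is used in a regression.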

A One-Factor Model

Stata has an anova command that can fit linear models with discrete factors as predictors. We will use regress instead, to emphasize that all these models are in fact regression models. This will help us along when we move on to logit and Poisson models, which no longer make this distinction.

To handle a categorical variable in a regression model we need indicators for all the categories except one, usually called the reference cell. Stata 11 introduced factor variables, a powerful way to specify main effects and interactions in regression models, and Stata 13 improved the labeling of the results, so there’s really no need to “roll your own” anymore. For learning purposes we will still show how to do that further below. First, though, we run the model using i.setting_g to specify that we want indicator variables for setting grouped. Stata automatically picks the lowest code as the reference cell.

. regress change i.setting_g

      Source │       SS           df       MS      Number of obs   =        20
─────────────┼──────────────────────────────────   F(2, 17)        =      6.97
       Model │  1193.78571         2  596.892857   Prob > F        =    0.0062
    Residual │  1456.41429        17  85.6714286   R-squared       =    0.4505
─────────────┼──────────────────────────────────   Adj R-squared   =    0.3858
       Total │      2650.2        19  139.484211   Root MSE        =    9.2559

─────────────┬────────────────────────────────────────────────────────────────
      change │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
   setting_g │
     Medium  │   1.028571   5.419692     0.19   0.852    -10.40598    12.46312
       High  │   16.17857   4.790376     3.38   0.004     6.071761    26.28538
             │
       _cons │   7.571429   3.498396     2.16   0.045     .1904579     14.9524
─────────────┴────────────────────────────────────────────────────────────────

Fertility declined, on average, 16 percentage points more in countries with high setting than in countries with low setting. Compare the parameter estimates with the values in Table 2.11 and the anova table with Table 2.12 in the notes.

You can verify that the constant is the average decline in low setting countries, and the coefficients for medium and high are the differences between medium and low, and between high and low; in other words, differences with respect to the omitted category.
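One way to carry out this verification is to ask Stata to add the constant to each coefficient, which should return the group means of 7.57, 8.6, and 23.75 seen in the earlier table. The sketch below uses lincom and margins for that, and then refits the model with ib3. to make high setting the reference cell. It is an illustration I am adding, and it reloads the data so that it runs on its own.

```stata
* Sketch: recover the group means from the regression coefficients.
use https://grodri.github.io/datasets/effort, clear
recode setting (min/69=1) (70/79=2) (80/max=3), generate(setting_g)

regress change i.setting_g

lincom _cons                 // mean for low setting (the reference cell)
lincom _cons + 2.setting_g   // mean for medium setting
lincom _cons + 3.setting_g   // mean for high setting

* margins reports the same three predicted means in one table
margins setting_g

* Changing the reference cell: ib3. makes level 3 (high) the base, so the
* constant becomes the high-setting mean and the coefficients change sign
regress change ib3.setting_g
```

The fit is identical under either reference cell; only the parametrization changes.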

Just for the record, this is how you could get exactly the same results by creating indicators for medium and high setting:

gen setting_med  = setting_g==2 // or setting >= 70 & setting < 80
gen setting_high = setting_g==3 // or setting >= 80 & !missing(setting)
regress change setting_med setting_high

We could have coded the conditions in terms of the original variable as shown in the comments above, with exactly the same result. I probably would have used that approach if the dummies were called setting70to79 and setting80plus.
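If you want to confirm that the two ways of coding the indicators agree, a sketch along the following lines should do. The names by_group and by_range for the stored estimates are hypothetical; the variable names follow the ones mentioned in the text.

```stata
* Sketch: indicators built from the grouped factor and from the original
* variable, with a check that they coincide.
use https://grodri.github.io/datasets/effort, clear
recode setting (min/69=1) (70/79=2) (80/max=3), generate(setting_g)

generate setting_med   = setting_g == 2
generate setting_high  = setting_g == 3
generate setting70to79 = setting >= 70 & setting < 80
generate setting80plus = setting >= 80 & !missing(setting)

* Stop with an error if the two codings differ anywhere
assert setting_med  == setting70to79
assert setting_high == setting80plus

* Fit both versions and show the estimates side by side
quietly regress change setting_med setting_high
estimates store by_group      // hypothetical name
quietly regress change setting70to79 setting80plus
estimates store by_range      // hypothetical name
estimates table by_group by_range, b(%9.4f) se(%9.4f)
```

The !missing(setting) condition matters in general because Stata treats missing values as larger than any number, so setting >= 80 alone would be true for a missing value.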

The F-Test

Stata has a test command that can be used to test one or more terms in a model. With factor variables you are better off using the testparm command, which automatically finds the terms involving a factor. Here’s the F test for the indicators of setting. The result is, of course, the same as in the anova table: the differences by setting are significant at the one-percent level.

. testparm i.setting_g

 ( 1)  2.setting_g = 0
 ( 2)  3.setting_g = 0

       F(  2,    17) =    6.97
            Prob > F =    0.0062

As the output shows, Stata names the coefficients of a factor variable using the number of the level followed by a dot and the name of the factor, as in 2.setting_g. You could reproduce this F-test using the command test 2.setting_g 3.setting_g, which works fine because these are terms (single variables) in the model.

On a related matter, Stata stores the coefficients in a matrix called e(b), and you can list them using mat list e(b). This is how you can find the names of the coefficients representing factor variables.
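As a small illustration of finding and using these names, the sketch below lists e(b), replays the regression with the coeflegend option (which prints the name of each coefficient in place of the usual statistics), and then refers to the coefficients by name. It reloads the data so that it runs on its own.

```stata
* Sketch: finding the names of factor-variable coefficients and using them.
use https://grodri.github.io/datasets/effort, clear
recode setting (min/69=1) (70/79=2) (80/max=3), generate(setting_g)
quietly regress change i.setting_g

* Column names of e(b) are the coefficient names; the base level is
* listed as well, marked with a b (1b.setting_g)
matrix list e(b)

* Replay the results with a legend showing how to refer to each term
regress, coeflegend

* Refer to individual coefficients by name
display _b[2.setting_g]
display _b[3.setting_g]

* The same joint test that testparm performs
test 2.setting_g 3.setting_g
```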

The F-statistic of 6.97 on 2 and 17 d.f. tells us that the differences between the social setting categories are much larger than one would expect by chance if all experienced the same decline in fertility.
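To see where the statistic comes from, one can rebuild it from the sums of squares that regress leaves behind. This sketch uses the stored results e(mss), e(rss), e(df_m), and e(df_r) together with the Ftail() function; the scalar name F is my own choice.

```stata
* Sketch: the F statistic as a ratio of mean squares, with its p-value.
use https://grodri.github.io/datasets/effort, clear
recode setting (min/69=1) (70/79=2) (80/max=3), generate(setting_g)
quietly regress change i.setting_g

* Model mean square over residual mean square: (1193.79/2) / (1456.41/17)
scalar F = (e(mss)/e(df_m)) / (e(rss)/e(df_r))
display "F = " %5.2f F

* Upper-tail probability of an F distribution with 2 and 17 d.f.
display "p = " %6.4f Ftail(e(df_m), e(df_r), F)
```

Both numbers should agree with the header of the regression output and with testparm, namely 6.97 and 0.0062.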

Exercise: Obtain the parameter estimates and anova table for the model with family planning effort grouped into three categories: 0-4, 5-14 and 15+, labelled weak, moderate and strong.

Updated fall 2022