Let us read the data again, and then group social setting into three categories: < 70, 70-79 and 80+.
We will use the dplyr
package for data manipulation, so
make sure you have installed it. To recode setting we will use
mutate()
with the function cut()
, which takes
as arguments the original variable and a set of cutpoints or “breaks”,
here the minimum (23), 70, 80, and the maximum (91). By default, the
cutpoint itself is included in the lower category, so the intervals are
closed on the right, as in (70,80], but we want the opposite, as in
[70,80), so we specify right=FALSE
. To make sure we include
the highest cutpoint in the last category we specify
include.lowest=TRUE
. (I know, this option can be confusing;
by default it will include the lowest cutpoint in the first category, as
the name suggests, but having specified that right
is
false, it will actually include the highest cutpoint in the last
category.)
R generates labels for the categories, and you should probably accept them at first to ensure that the grouping was done as intended, but I’ll call the categories “low”, medium”, and “high”, for consistency with the notes. We also add this variable to the data frame.
> library(dplyr)
> fpe <- read.table("https://grodri.github.io/datasets/effort.dat")
> fpe <- mutate(fpe, setting.g = cut(setting,
+ breaks=c(min(setting), 70, 80, max(setting)),
+ right=FALSE, include.lowest=TRUE, labels=c("Low", "Medium", "High")))
To verify the grouping we need to split the data by categories of social setting and compute the range of setting in each group.
> group_by(fpe, setting.g) |> summarize(bot=min(setting), top=max(setting))
# A tibble: 3 × 3
setting.g bot top
<fct> <int> <int>
1 Low 35 68
2 Medium 70 77
3 High 83 91
Here group_by()
splits fpe
by categories of
setting.g
. The pipe operator |>
passes the
result to summarize()
, which computes the min and max of
setting
. We see that the ranges included in each category
are 35-68, 70-77 and 83-91, so obviously the cut was correct. We can use
the same approach to compute the average fertility decline in each
category of social setting:
Let us look at the mean response by level of social setting
> group_by(fpe, setting.g) |> summarize(change = mean(change))
# A tibble: 3 × 2
setting.g change
<fct> <dbl>
1 Low 7.57
2 Medium 8.6
3 High 23.8
We observe substantially more fertility decline in countries with higher setting, but only a small difference between the low and medium categories.
To fit a linear model treating social setting as a factor we simply use the categorical variable in the model formula. R will automatically create dummy variables for each category other than the first.
> m1g <- lm(change ~ setting.g, data=fpe)
> summary(m1g)
Call:
lm(formula = change ~ setting.g, data = fpe)
Residuals:
Min 1Q Median 3Q Max
-14.750 -6.579 -1.161 5.250 16.400
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.571 3.498 2.164 0.04497 *
setting.gMedium 1.029 5.420 0.190 0.85173
setting.gHigh 16.179 4.790 3.377 0.00358 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 9.256 on 17 degrees of freedom
Multiple R-squared: 0.4505, Adjusted R-squared: 0.3858
F-statistic: 6.967 on 2 and 17 DF, p-value: 0.006167
> anova(m1g)
Analysis of Variance Table
Response: change
Df Sum Sq Mean Sq F value Pr(>F)
setting.g 2 1193.8 596.89 6.9672 0.006167 **
Residuals 17 1456.4 85.67
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Fertility declined, on average, 16 percentage points more in countries with high setting than in countries with low setting. Compare the parameter estimates with the values in Table 2.11 and the anova with Table 2.12 in the notes.
You can verify that the constant is the average decline in low setting countries, and the coefficients for medium and high are the differences between medium and low, and between high and low; in other words, differences with respect to the omitted category.
We could, of course, compute the dummy variables “by hand”, with the
same results. Below I use as.numeric()
to coerce a logical
expression to take the values 1 and 0, otherwise the coefficients would
have names like settingMediumTRUE
.
fpe <- mutate(fpe, settingMedium = as.numeric(setting.g=="Medium"),
settingHigh=as.numeric(setting.g=="High"))
lm(change ~ settingMedium + settingHigh, data=fpe)
Using the factor notation is not only simpler but tells R that the terms belong together.
The t-statistics produced by summary()
compare each
category to the reference cell. To obtain an overall test of the
significance of social setting we need to compare the models with and
without setting. This can be done using the anova()
function:
> anova(m1g)
Analysis of Variance Table
Response: change
Df Sum Sq Mean Sq F value Pr(>F)
setting.g 2 1193.8 596.89 6.9672 0.006167 **
Residuals 17 1456.4 85.67
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-test of 6.97 on 2 and 17 d.f. tells us that the differences between the social setting categories are much larger than one would expect by chance if all experienced the same decline in fertility.
It may be instructive to calculate the test “by hand” using the residual sum of squares for the null model and the model treating setting as a factor:
> rss <- function(lm) sum(residuals(lm)^2)
> m0 <- lm(change ~ 1, data=fpe)
> ((rss(m0)-rss(m1g))/2)/(rss(m1g)/df.residual(m1g))
[1] 6.967234
We obtain the same 6.97 on 2 and 17 d.f. We can also compute the test
from the coefficients as their variance-covariance matrix, as shown on
page 32 of the notes. We do this using coef()
to extract
the coefficients and vcov()
for their variance-covariance
matrix, using -1 to exclude the constant:
> b = coef(m1g)[-1]
> V = vcov(m1g)[-1,-1]
> W = t(b) %*% solve(V) %*% b
> c(W, W/2)
[1] 13.934467 6.967234
We obtain a Wald statistic of 13.94 on 2 d.f. in agreement with the notes. Dividing by 2 we obtain, once again, an F statistic of 6.92 on 2 and 17 d.f. Recall that the Wald statistic is asymptotically chi-squared, whereas under normality the F statistic has an F distribution.
Exercise: Obtain the parameter estimates and anova table for the model with family planning effort grouped into three categories: 0-4, 5-14 and 15+, labelled weak, moderate and strong.
Updated fall 2022