This continues our work on smoothing. Let's start by making sure we have loaded the data.
. infile age pop using /// > https://grodri.github.io/datasets/cohhpop.dat, clear (99 observations read)
> co <- read.table("https://grodri.github.io/datasets/cohhpop.dat", + col.names=c("age","pop"), header=FALSE)
Our application will focus on regression splines, because they are the easiest ones to use, but we will mention briefly natural regression splines and smoothing splines.
A cubic spline S(x) with knots t1 … tk has linear, quadratic and cubic terms on x, and one term of the form (x - t1)3+ for each knot. The + notation instructs us to take the positive part of the argument. These terms are easy to compute and can be entered as predictors in a regression, although one should be careful about multi-collinearity when there are lots of knots.
A better solution is to use b-splines, a well-conditioned
basis for splines. Stata does not have built-in
b-splines, but Roger Newson has contributed a command called
bspline
.R has a function
bs()
included in the splines
package as part
of the base installation.The Stata and R implementations use
somewhat different bases, but lead to the same fitted values.
Let us use spline regression to smooth the Colombian data. We will
use a cubic spline with three internal knots at ages 25, 50 and 75. This
spline has a total of 7 parameters. In
bspline
you need to specify the minimum and maximum as
knots. You should alsospecify
p(3)
to get cubic splines (the default is linear) and seven
output variables.In R the spline basis may be
specified as part of the model.
. bspline, xvar(age) knots(0 25 50 75 100) p(3) gen(_bs3k)
We now regress the population counts on the spline basis omitting the constant (or one of the generated variables).I skip the detailed output because we are interested in the fitted values only.
. quietly regress pop _bs3k*, noconstant . predict bs3k (option xb assumed; fitted values) . twoway (scatter pop age)(line bs3k age) , legend(off) /// > note(knots 25 50 75) title(A Regression Spline) . graph export cohhrs.png, width(500) replace file cohhrs.png saved as PNG format
> library(splines) > library(dplyr) > library(ggplot2) > sf <- lm(pop ~ bs(age, knots=c(25, 50, 75)), data=co) > co <- mutate(co, smooth=fitted(sf)) > ggplot(co, aes(age, pop)) + geom_point() + + geom_line(aes(age, smooth)) + ggtitle("A Regression Spline") > ggsave("cohhrsr.png", width=500/72, height=400/72, dpi=72)
As you can see, the spline does an excellent job smoothing the data. Try using four knots at ages 20, 40, 60 and 80. The fit will look very similar. Placing the knots is an art; a common choice is to place them at given quantiles, for example the quartiles Q1, Q2 and Q3 if you want three internal knots. The number of knots is chosen to balance smoothness and goodness of fit.
Just to convince yourself that there is nothing magic about b-splines, we will reproduce the results “by hand” using the power series as described in the handout
. gen age2 = age^2 . gen age3 = age^3 . gen k25 = cond(age > 25, (age - 25)^3, 0) . gen k50 = cond(age > 50, (age - 50)^3, 0) . gen k75 = cond(age > 75, (age - 75)^3, 0) . quietly regress pop age age2 age3 k25 k50 k75 . predict myspline (option xb assumed; fitted values) . sum bs3k myspline Variable │ Obs Mean Std. dev. Min Max ─────────────┼───────────────────────────────────────────────────────── bs3k │ 99 563.1313 556.2422 -.7697959 1657.724 myspline │ 99 563.1313 556.2422 -.7688386 1657.724 . corr bs3k myspline (obs=99) │ bs3k myspline ─────────────┼────────────────── bs3k │ 1.0000 myspline │ 1.0000 1.0000
> cox <- mutate(co, age2 = age^2, age3 = age^3, + k25 = ifelse(age > 25, (age - 25)^3, 0), + k50 = ifelse(age > 50, (age - 50)^3, 0), + k75 = ifelse(age > 75, (age - 75)^3, 0)) > sf2 <- lm(pop ~ age + age2 + age3 + k25 + k50 + k75, data=cox) > fits <- mutate(cox, myspline = fitted(sf2)) |> + select(smooth, myspline) > summary(fits) smooth myspline Min. : -0.7688 Min. : -0.7688 1st Qu.: 82.1765 1st Qu.: 82.1765 Median : 380.0158 Median : 380.0158 Mean : 563.1313 Mean : 563.1313 3rd Qu.: 920.1823 3rd Qu.: 920.1823 Max. :1657.7240 Max. :1657.7240 > cor(fits) smooth myspline smooth 1 1 myspline 1 1
As you can see, we get essentially the same results, a tribute to the numerical prowess of modern statistical software in the presence of high multicollineary. In case you are curious the correlation between age and its square is 0.9676, between age2 and age3 is 0.9859, and between age3 and the first knot is 0.9867. The terms in the b-spline basis have much lower inter-correlations.
Sometimes we have little information at the extremes of the range.
Natural cubic splines, which are constrained to be linear outside the
range of the data, provide a useful tool in those circumstances. Note
that requiring linearity outside the range of the data imposes
additional smoothness constraints inside the range; for example the
polynomials used at the ends must terminate with zero curvature. Stata does not have a natural cubic spline function, but
coding one is not too hard.R’s function
ns()
in the splines
package provides a natural
spline basis.
A smoothing spline has a knot at each data point, but introduces a penalty for lack of smoothness. If the penalty is zero you get a function that interpolates the data. If the penalty is infinite you get a straight line fitted by ordinary least squares. Usually a nice compromise can be found somewhere in between. We usually focus on splines of odd degree, particularly on cubic splines which have some nice properties as noted in the handout.
Stata and R do not have built-in functions for computing smoothing splines, but it is not too difficult to construct one using the results on page 7 of the handout.