Stata and R have extensive facilities for fitting survival models. In this course we discuss only the use of Poisson regression to fit piece-wise exponential survival models. For other approaches see the survival analysis course.
The datasets section has the original tabulation of children by sex, cohort, age and survival status (dead or still alive at interview), as analyzed by Somoza (1980).
As is often the case with survival data, a good part of the effort is to convert the raw data into the counts of events and exposure needed for analysis.
We will start by reading the data and collapsing over sex, and will then compute events and exposure to reproduce Table 7.1 in the lecture notes.
> library(haven)
> library(dplyr)
> somoza <- read_dta("https://grodri.github.io/datasets/somoza.dta") |>
+ mutate(age=as_factor(age), cohort=as_factor(cohort))
> somoza <- group_by(somoza, cohort, age) |>
+ summarize(deaths=sum(dead), survivors=sum(alive))
`summarise()` has grouped output by 'cohort'. You can override using the `.groups`
argument.
We take advantage of the fact that the data have been grouped by `cohort’. We compute the number who start each age as the total number of cases minus the number of deaths and survivors in all subsequent intervals. We add the width of the age interval in month, and then compute exposure as the number who start minus half the number who die or at alive in that age intervals, dividing by 12 to express exposure in years. Finally we drop age 5-10 years.
> somoza <- mutate(somoza,
+ start = sum(deaths+survivors) - c(0, cumsum(deaths+survivors)[-8]),
+ width = c(1,2,3,6,12,36,60,NA)[as.numeric(age)],
+ exposure = width * (start -0.5*(deaths+survivors))/12) |>
+ filter(as.numeric(age) <= 7)
We then list the results, which coincide with Table 7.1 in the notes.
> library(tibble) # to format exposure for printing
> select(somoza, cohort, age, deaths, exposure) |>
+ mutate(exposure = num(exposure, digits=1))
# A tibble: 21 × 4
# Groups: cohort [3]
cohort age deaths exposure
<fct> <fct> <dbl> <num:.1!>
1 1941-59 0-1 months 168 278.4
2 1941-59 1-3 months 48 538.8
3 1941-59 3-6 months 63 794.4
4 1941-59 6-12 months 89 1550.8
5 1941-59 1-2 years 102 3006.0
6 1941-59 2-5 years 81 8743.5
7 1941-59 5-10 years 40 14270.0
8 1960-67 0-1 months 197 403.2
9 1960-67 1-3 months 48 786.0
10 1960-67 3-6 months 62 1165.2
# … with 11 more rows
Table 7.1 is also available as a Stata file called
somoza2.dta
in the datasets section.
In preparation for modeling, let us calculate the logarithm of exposure time, to be used as an offset.
> somoza <- mutate(somoza, os=log(exposure))
Let us fit the null model, which is equivalent to a simple exponential survival model.
> mexp <- glm(deaths ~ 1, offset=os, family=poisson, data=somoza)
> summary(mexp)
Call:
glm(formula = deaths ~ 1, family = poisson, data = somoza, offset = os)
Deviance Residuals:
Min 1Q Median 3Q Max
-21.602 -6.959 5.532 8.679 30.220
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.99645 0.02372 -168.5 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4239.9 on 20 degrees of freedom
Residual deviance: 4239.9 on 20 degrees of freedom
AIC: 4370.2
Number of Fisher Scoring iterations: 6
Note the astronomical deviance. The estimate of the constant happens to be the log of the overall mortality rate. Let’s verify this fact
> exp(coef(mexp))
(Intercept)
0.0183808
> data.frame("observed.rate"= sum(somoza$deaths)/sum(somoza$exposure))
observed.rate
1 0.0183808
We have an overall mortality rate of 18.4 deaths per thousand child-years of exposure.
On to the one-factor models. We start with the cohort model, which is equivalent to a separate exponential survival model for each cohort:
> mcoh <- glm(deaths ~ cohort, offset=os, family=poisson, data=somoza)
> summary(mcoh)
Call:
glm(formula = deaths ~ cohort, family = poisson, data = somoza,
offset = os)
Deviance Residuals:
Min 1Q Median 3Q Max
-18.697 -7.419 4.800 8.254 31.485
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.89949 0.04113 -94.799 < 2e-16 ***
cohort1960-67 -0.30204 0.05733 -5.268 1.38e-07 ***
cohort1968-76 0.07422 0.05897 1.259 0.208
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4239.9 on 20 degrees of freedom
Residual deviance: 4190.7 on 18 degrees of freedom
AIC: 4325.1
Number of Fisher Scoring iterations: 6
> exp(coef(mcoh)[-1])
cohort1960-67 cohort1968-76
0.7393088 1.0770409
Compare these results with the gross effect estimates in Table 7.3. Note that the hazard rate declined 26% between the 1941-59 and 1960-67 cohorts, but appears to have increased almost 8% for the 1968-76 cohort compared to the 1941-59 cohort. (We will return to this issue.)
The extremely large deviance shows that this model does not provide a reasonable description of the data. It is, however, better than the model where all cohorts follow the same exponential survival curve, as evidenced by the model chi-squared or the Wald test.
> anova(mexp, mcoh)
Analysis of Deviance Table
Model 1: deaths ~ 1
Model 2: deaths ~ cohort
Resid. Df Resid. Dev Df Deviance
1 20 4239.9
2 18 4190.7 2 49.161
> wald <- function(model, pattern) {
+ indices = grep(pattern, names(coef(model)))
+ b <- coef(model)[indices]
+ V <- vcov(model)[indices, indices]
+ data.frame(wald=t(b) %*% solve(V) %*% b)
+ }
> wald(mcoh, "cohort")
wald
1 47.99641
Both tests are highly significant, indicating that overall mortality rates are not the same across cohorts.
Now we consider the age model, where the hazard depends on the age of the child:
> mage <- glm(deaths ~ age, offset=os, family=poisson, data=somoza)
> summary(mage)
Call:
glm(formula = deaths ~ age, family = poisson, data = somoza,
offset = os)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7308 -1.0399 -0.5649 1.3472 3.4629
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.74268 0.04226 -17.57 <2e-16 ***
age1-3 months -1.97263 0.09170 -21.51 <2e-16 ***
age3-6 months -2.16186 0.08515 -25.39 <2e-16 ***
age6-12 months -2.48789 0.07555 -32.93 <2e-16 ***
age1-2 years -3.00435 0.07268 -41.34 <2e-16 ***
age2-5 years -4.08593 0.07565 -54.01 <2e-16 ***
age5-10 years -5.35520 0.11411 -46.93 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4239.87 on 20 degrees of freedom
Residual deviance: 72.64 on 14 degrees of freedom
AIC: 214.98
Number of Fisher Scoring iterations: 4
> exp(coef(mage)[-1])
age1-3 months age3-6 months age6-12 months age1-2 years age2-5 years
0.139090922 0.115111026 0.083085029 0.049570876 0.016807471
age5-10 years
0.004723507
The age model is equivalent to a piece-wise exponential survival model with no cohort effects. Compare the results with the gross effects in Table 7.3. Note the dramatic decrease in risk with age. At age one the risk of death is only 5% of what it was in the first month of life.
This model still doesn’t fit the data, as evidenced by the deviance or goodness of fit chi-squared. It is, however, a remarkable improvement over the null, as indicated by the model chi-squared or the Wald test,
> anova(mexp, mage)
Analysis of Deviance Table
Model 1: deaths ~ 1
Model 2: deaths ~ age
Resid. Df Resid. Dev Df Deviance
1 20 4239.9
2 14 72.6 6 4167.2
> wald(mage, "age")
wald
1 4689.271
You can see why demographers prefer age-specific mortality rates :)
Now on to the additive model with main effects of age and cohort, which is equivalent to a proportional hazards model:
> mphaz <- glm(deaths ~ age + cohort, offset=os, family=poisson, data=somoza)
> summary(mphaz)
Call:
glm(formula = deaths ~ age + cohort, family = poisson, data = somoza,
offset = os)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.08098 -0.35175 -0.00231 0.35063 0.81568
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.44848 0.05454 -8.223 < 2e-16 ***
age1-3 months -1.97269 0.09170 -21.513 < 2e-16 ***
age3-6 months -2.16332 0.08515 -25.406 < 2e-16 ***
age6-12 months -2.49167 0.07555 -32.980 < 2e-16 ***
age1-2 years -3.01405 0.07270 -41.457 < 2e-16 ***
age2-5 years -4.11538 0.07583 -54.274 < 2e-16 ***
age5-10 years -5.43589 0.11477 -47.364 < 2e-16 ***
cohort1960-67 -0.32424 0.05734 -5.655 1.56e-08 ***
cohort1968-76 -0.47836 0.05933 -8.063 7.43e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 4239.871 on 20 degrees of freedom
Residual deviance: 6.183 on 12 degrees of freedom
AIC: 152.53
Number of Fisher Scoring iterations: 3
> exp(coef(mphaz)[-1])
age1-3 months age3-6 months age6-12 months age1-2 years age2-5 years
0.13908251 0.11494288 0.08277124 0.04909233 0.01631970
age5-10 years cohort1960-67 cohort1968-76
0.00435737 0.72307607 0.61979965
Note that this model fits reasonably well, with a deviance of 6.18 on 12 d.f., so the assumption of proportional hazards is consistent with the data.
Compare the results with the net effect estimates in Table 7.3, and note that the anomaly with the youngest cohort has dissappeared. The estimates now indicate a steady decline in mortality across cohorts. Taking the 1941-59 cohort as a baseline, mortality at every age from zero to ten was 28% lower for the 1960-67 cohort and 36% lower for the more recent 1968-76 cohort. Can you explain why this trend did not emerge until we controlled for age? Hint: the survey was conducted in 1976.
Here’s a likelihood ratio test for the cohort effect adjusted for age. Note that we compare the age model with the additive model that has age and cohort. That is followed by the Wald test.
> anova(mage, mphaz)
Analysis of Deviance Table
Model 1: deaths ~ age
Model 2: deaths ~ age + cohort
Resid. Df Resid. Dev Df Deviance
1 14 72.640
2 12 6.183 2 66.457
> wald(mphaz, "cohort")
wald
1 68.59521
The cohort differences within age groups are highly significant.
Let us calculate the fitted life table shown in Table 7.4 of the lecture notes.
The predict()
function returns the predicted number of
events, so we divide by exposure to obtain a predicted rate. we then
calculate the cumulative hazard as a cumulative sum of the hazard times
the width of the intervals in years. Finally we exponentiate minus the
cumulative hazard to obtain the survival function.
> surv <- ungroup(somoza) |>
+ mutate(rate = fitted(mphaz)/exposure,
+ cumhaz = cumsum(rate*width)/12,
+ survival = exp(-cumhaz)) |>
+ select(age, cohort, survival)
The last thing to do is print our handy work. I will pivot the dataset to print a two-way table of survival probabilities.
> library(tidyr)
> pivot_wider(surv, names_from="cohort", values_from="survival")
# A tibble: 7 × 4
age `1941-59` `1960-67` `1968-76`
<fct> <dbl> <dbl> <dbl>
1 0-1 months 0.948 0.796 0.698
2 1-3 months 0.934 0.788 0.692
3 3-6 months 0.917 0.777 0.684
4 6-12 months 0.893 0.763 0.673
5 1-2 years 0.866 0.746 0.660
6 2-5 years 0.839 0.729 0.647
7 5-10 years 0.828 0.722 0.642
We see that the probability of surviving to age one increased from 89.3% to 92.2% and then to 93.2% across cohorts. The complement of the probability of surviving to age one is known as the infant mortality rate (although it is a probability, not a rate) and is usually expressed per thousand births; it declined from 106.7 to 78.3 to 67.5 across cohorts.
For another example of piecewise exponential survival, this time applied to recidivism and using individual rather than group data, and illustrating the creating of person-year files, see this page.
Our software is able to fit some of the parametric models discussed in the bibliographic notes, such as the Weibull model. It also has non-parametric methods, including procedures for calculating Kaplan-Meier estimates and for fitting Cox regression models by partial likelihood. Finally, the data management facilities include facilities for generating person-year files. For more information see the survival course pages.
Somoza, J.L. (1980). “Illustrative Analysis: Infant and Child Mortality in Colombia.” WFS Scientific Reports, Number 10. Scanned version here
Updated fall 2022