Germán Rodríguez
Generalized Linear Models Princeton University

7. Survival Models

Stata and R have extensive facilities for fitting survival models. In this course we discuss only the use of Poisson regression to fit piece-wise exponential survival models. For other approaches see the survival analysis course.

7.5 Infant and Child Mortality in Colombia

The datasets section has the original tabulation of children by sex, cohort, age and survival status (dead or still alive at interview), as analyzed by Somoza (1980).

As is often the case with survival data, a good part of the effort is to convert the raw data into the counts of events and exposure needed for analysis.

Data Preparation

We will start by reading the data and collapsing over sex, and will then compute events and exposure to reproduce Table 7.1 in the lecture notes.

> library(haven)
> library(dplyr)
> somoza <- read_dta("https://grodri.github.io/datasets/somoza.dta") |>
+   mutate(age=as_factor(age), cohort=as_factor(cohort))
> somoza <- group_by(somoza, cohort, age) |> 
+   summarize(deaths=sum(dead), survivors=sum(alive))
`summarise()` has grouped output by 'cohort'. You can override using the `.groups`
argument.

We take advantage of the fact that the data have been grouped by `cohort’. We compute the number who start each age as the total number of cases minus the number of deaths and survivors in all subsequent intervals. We add the width of the age interval in month, and then compute exposure as the number who start minus half the number who die or at alive in that age intervals, dividing by 12 to express exposure in years. Finally we drop age 5-10 years.

> somoza <- mutate(somoza, 
+   start = sum(deaths+survivors) - c(0, cumsum(deaths+survivors)[-8]),
+   width = c(1,2,3,6,12,36,60,NA)[as.numeric(age)],       
+   exposure = width * (start -0.5*(deaths+survivors))/12) |>
+ filter(as.numeric(age) <= 7)

We then list the results, which coincide with Table 7.1 in the notes.

> library(tibble) # to format exposure for printing
> select(somoza, cohort, age, deaths, exposure) |>
+   mutate(exposure = num(exposure, digits=1))
# A tibble: 21 × 4
# Groups:   cohort [3]
   cohort  age         deaths  exposure
   <fct>   <fct>        <dbl> <num:.1!>
 1 1941-59 0-1 months     168     278.4
 2 1941-59 1-3 months      48     538.8
 3 1941-59 3-6 months      63     794.4
 4 1941-59 6-12 months     89    1550.8
 5 1941-59 1-2 years      102    3006.0
 6 1941-59 2-5 years       81    8743.5
 7 1941-59 5-10 years      40   14270.0
 8 1960-67 0-1 months     197     403.2
 9 1960-67 1-3 months      48     786.0
10 1960-67 3-6 months      62    1165.2
# … with 11 more rows

Table 7.1 is also available as a Stata file called somoza2.dta in the datasets section.

Offset and Predictors

In preparation for modeling, let us calculate the logarithm of exposure time, to be used as an offset.

> somoza <- mutate(somoza, os=log(exposure))

Exponential Survival

Let us fit the null model, which is equivalent to a simple exponential survival model.

> mexp <- glm(deaths ~ 1, offset=os, family=poisson, data=somoza)
> summary(mexp)

Call:
glm(formula = deaths ~ 1, family = poisson, data = somoza, offset = os)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-21.602   -6.959    5.532    8.679   30.220  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -3.99645    0.02372  -168.5   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4239.9  on 20  degrees of freedom
Residual deviance: 4239.9  on 20  degrees of freedom
AIC: 4370.2

Number of Fisher Scoring iterations: 6

Note the astronomical deviance. The estimate of the constant happens to be the log of the overall mortality rate. Let’s verify this fact

> exp(coef(mexp))
(Intercept) 
  0.0183808 
> data.frame("observed.rate"= sum(somoza$deaths)/sum(somoza$exposure))
  observed.rate
1     0.0183808

We have an overall mortality rate of 18.4 deaths per thousand child-years of exposure.

Three Exponentials

On to the one-factor models. We start with the cohort model, which is equivalent to a separate exponential survival model for each cohort:

> mcoh <- glm(deaths ~ cohort, offset=os, family=poisson, data=somoza)
> summary(mcoh)

Call:
glm(formula = deaths ~ cohort, family = poisson, data = somoza, 
    offset = os)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-18.697   -7.419    4.800    8.254   31.485  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -3.89949    0.04113 -94.799  < 2e-16 ***
cohort1960-67 -0.30204    0.05733  -5.268 1.38e-07 ***
cohort1968-76  0.07422    0.05897   1.259    0.208    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4239.9  on 20  degrees of freedom
Residual deviance: 4190.7  on 18  degrees of freedom
AIC: 4325.1

Number of Fisher Scoring iterations: 6
> exp(coef(mcoh)[-1])
cohort1960-67 cohort1968-76 
    0.7393088     1.0770409 

Compare these results with the gross effect estimates in Table 7.3. Note that the hazard rate declined 26% between the 1941-59 and 1960-67 cohorts, but appears to have increased almost 8% for the 1968-76 cohort compared to the 1941-59 cohort. (We will return to this issue.)

The extremely large deviance shows that this model does not provide a reasonable description of the data. It is, however, better than the model where all cohorts follow the same exponential survival curve, as evidenced by the model chi-squared or the Wald test.

> anova(mexp, mcoh)
Analysis of Deviance Table

Model 1: deaths ~ 1
Model 2: deaths ~ cohort
  Resid. Df Resid. Dev Df Deviance
1        20     4239.9            
2        18     4190.7  2   49.161
> wald <- function(model, pattern) {
+   indices = grep(pattern, names(coef(model)))
+   b <- coef(model)[indices]
+   V <- vcov(model)[indices, indices]
+   data.frame(wald=t(b) %*% solve(V) %*% b)
+ } 
> wald(mcoh, "cohort")
      wald
1 47.99641

Both tests are highly significant, indicating that overall mortality rates are not the same across cohorts.

Piece-Wise Exponential Survival

Now we consider the age model, where the hazard depends on the age of the child:

> mage <- glm(deaths ~ age, offset=os, family=poisson, data=somoza)
> summary(mage)

Call:
glm(formula = deaths ~ age, family = poisson, data = somoza, 
    offset = os)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.7308  -1.0399  -0.5649   1.3472   3.4629  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -0.74268    0.04226  -17.57   <2e-16 ***
age1-3 months  -1.97263    0.09170  -21.51   <2e-16 ***
age3-6 months  -2.16186    0.08515  -25.39   <2e-16 ***
age6-12 months -2.48789    0.07555  -32.93   <2e-16 ***
age1-2 years   -3.00435    0.07268  -41.34   <2e-16 ***
age2-5 years   -4.08593    0.07565  -54.01   <2e-16 ***
age5-10 years  -5.35520    0.11411  -46.93   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4239.87  on 20  degrees of freedom
Residual deviance:   72.64  on 14  degrees of freedom
AIC: 214.98

Number of Fisher Scoring iterations: 4
> exp(coef(mage)[-1])
 age1-3 months  age3-6 months age6-12 months   age1-2 years   age2-5 years 
   0.139090922    0.115111026    0.083085029    0.049570876    0.016807471 
 age5-10 years 
   0.004723507 

The age model is equivalent to a piece-wise exponential survival model with no cohort effects. Compare the results with the gross effects in Table 7.3. Note the dramatic decrease in risk with age. At age one the risk of death is only 5% of what it was in the first month of life.

This model still doesn’t fit the data, as evidenced by the deviance or goodness of fit chi-squared. It is, however, a remarkable improvement over the null, as indicated by the model chi-squared or the Wald test,

> anova(mexp, mage)
Analysis of Deviance Table

Model 1: deaths ~ 1
Model 2: deaths ~ age
  Resid. Df Resid. Dev Df Deviance
1        20     4239.9            
2        14       72.6  6   4167.2
> wald(mage, "age")
      wald
1 4689.271

You can see why demographers prefer age-specific mortality rates :)

The Proportional Hazards Model

Now on to the additive model with main effects of age and cohort, which is equivalent to a proportional hazards model:

> mphaz <-  glm(deaths ~ age + cohort, offset=os, family=poisson, data=somoza)
> summary(mphaz)

Call:
glm(formula = deaths ~ age + cohort, family = poisson, data = somoza, 
    offset = os)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.08098  -0.35175  -0.00231   0.35063   0.81568  

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -0.44848    0.05454  -8.223  < 2e-16 ***
age1-3 months  -1.97269    0.09170 -21.513  < 2e-16 ***
age3-6 months  -2.16332    0.08515 -25.406  < 2e-16 ***
age6-12 months -2.49167    0.07555 -32.980  < 2e-16 ***
age1-2 years   -3.01405    0.07270 -41.457  < 2e-16 ***
age2-5 years   -4.11538    0.07583 -54.274  < 2e-16 ***
age5-10 years  -5.43589    0.11477 -47.364  < 2e-16 ***
cohort1960-67  -0.32424    0.05734  -5.655 1.56e-08 ***
cohort1968-76  -0.47836    0.05933  -8.063 7.43e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 4239.871  on 20  degrees of freedom
Residual deviance:    6.183  on 12  degrees of freedom
AIC: 152.53

Number of Fisher Scoring iterations: 3
> exp(coef(mphaz)[-1])
 age1-3 months  age3-6 months age6-12 months   age1-2 years   age2-5 years 
    0.13908251     0.11494288     0.08277124     0.04909233     0.01631970 
 age5-10 years  cohort1960-67  cohort1968-76 
    0.00435737     0.72307607     0.61979965 

Note that this model fits reasonably well, with a deviance of 6.18 on 12 d.f., so the assumption of proportional hazards is consistent with the data.

Compare the results with the net effect estimates in Table 7.3, and note that the anomaly with the youngest cohort has dissappeared. The estimates now indicate a steady decline in mortality across cohorts. Taking the 1941-59 cohort as a baseline, mortality at every age from zero to ten was 28% lower for the 1960-67 cohort and 36% lower for the more recent 1968-76 cohort. Can you explain why this trend did not emerge until we controlled for age? Hint: the survey was conducted in 1976.

Here’s a likelihood ratio test for the cohort effect adjusted for age. Note that we compare the age model with the additive model that has age and cohort. That is followed by the Wald test.

> anova(mage, mphaz)
Analysis of Deviance Table

Model 1: deaths ~ age
Model 2: deaths ~ age + cohort
  Resid. Df Resid. Dev Df Deviance
1        14     72.640            
2        12      6.183  2   66.457
> wald(mphaz, "cohort")
      wald
1 68.59521

The cohort differences within age groups are highly significant.

Estimating Survival Probabilities

Let us calculate the fitted life table shown in Table 7.4 of the lecture notes.

The predict() function returns the predicted number of events, so we divide by exposure to obtain a predicted rate. we then calculate the cumulative hazard as a cumulative sum of the hazard times the width of the intervals in years. Finally we exponentiate minus the cumulative hazard to obtain the survival function.

> surv <- ungroup(somoza) |> 
+   mutate(rate = fitted(mphaz)/exposure,
+     cumhaz = cumsum(rate*width)/12,
+     survival = exp(-cumhaz)) |>
+ select(age, cohort, survival)

The last thing to do is print our handy work. I will pivot the dataset to print a two-way table of survival probabilities.

> library(tidyr)
> pivot_wider(surv, names_from="cohort", values_from="survival")
# A tibble: 7 × 4
  age         `1941-59` `1960-67` `1968-76`
  <fct>           <dbl>     <dbl>     <dbl>
1 0-1 months      0.948     0.796     0.698
2 1-3 months      0.934     0.788     0.692
3 3-6 months      0.917     0.777     0.684
4 6-12 months     0.893     0.763     0.673
5 1-2 years       0.866     0.746     0.660
6 2-5 years       0.839     0.729     0.647
7 5-10 years      0.828     0.722     0.642

We see that the probability of surviving to age one increased from 89.3% to 92.2% and then to 93.2% across cohorts. The complement of the probability of surviving to age one is known as the infant mortality rate (although it is a probability, not a rate) and is usually expressed per thousand births; it declined from 106.7 to 78.3 to 67.5 across cohorts.

Other Methods

For another example of piecewise exponential survival, this time applied to recidivism and using individual rather than group data, and illustrating the creating of person-year files, see this page.

Our software is able to fit some of the parametric models discussed in the bibliographic notes, such as the Weibull model. It also has non-parametric methods, including procedures for calculating Kaplan-Meier estimates and for fitting Cox regression models by partial likelihood. Finally, the data management facilities include facilities for generating person-year files. For more information see the survival course pages.

References

Somoza, J.L. (1980). “Illustrative Analysis: Infant and Child Mortality in Colombia.” WFS Scientific Reports, Number 10. Scanned version here

Updated fall 2022