Stata and R have extensive facilities for fitting survival models. In this course we discuss only the use of Poisson regression to fit piece-wise exponential survival models. For other approaches see the survival analysis course.
The datasets section has the original tabulation of children by sex, cohort, age and survival status (dead or still alive at interview), as analyzed by Somoza (1980).
As is often the case with survival data, a good part of the effort is to convert the raw data into the counts of events and exposure needed for analysis.
We will start by reading the data and collapsing over sex, and will then compute events and exposure to reproduce Table 7.1 in the lecture notes.
. use https://grodri.github.io/datasets/somoza, clear (Infant and Child Survival in Colombia, 1976) . collapse (sum) alive (sum) dead, by (cohort age)
We make sure the data are sorted by cohort and then age, use
egen
to count the total number of children in each cohort,
and then use replace
with a by cohort
prefix
to recompute the number of children alive at the start of each age
group, calculated as the number who started the previous age group minus
those still alive at the previous age group and minus those who died in
the previous age group. Having done this we can drop kids older than 10,
as we are only interested in survival to age ten.
. sort cohort age // make sure data are sorted . egen start = total(alive+dead), by(cohort) . by cohort: replace start = /// > start[_n-1] - alive[_n-1] - dead[_n-1] if _n > 1 (21 real changes made) . drop if age > 7 (3 observations deleted)
The next step is to use recode
to generate a variable
representing the width of the age intervals in months. We then use
generate
to compute exposure, assuming everyone is exposed
the full width of the interval except those censored or who die in the
interval, who are exposed on average half the interval. We divide by 12
to express exposure in person-years.
. recode age 4=6 5=12 6=36 7=60, gen(width) // interval width in months (12 differences between age and width) . gen exposure = width * (start - 0.5 * (alive + dead))/12 // in years
For convenience we rename dead to deaths
and set a
format, so exposure
prints with one decimal.
We then list the results, which coincide with Table 7.1 in the notes.
. rename dead deaths . format expo %8.1f . list cohort age deaths expo, sep(7) ┌───────────────────────────────────────────┐ │ cohort age deaths exposure │ ├───────────────────────────────────────────┤ 1. │ 1941-59 0-1 months 168 278.4 │ 2. │ 1941-59 1-3 months 48 538.8 │ 3. │ 1941-59 3-6 months 63 794.4 │ 4. │ 1941-59 6-12 months 89 1550.8 │ 5. │ 1941-59 1-2 years 102 3006.0 │ 6. │ 1941-59 2-5 years 81 8743.5 │ 7. │ 1941-59 5-10 years 40 14270.0 │ ├───────────────────────────────────────────┤ 8. │ 1960-67 0-1 months 197 403.2 │ 9. │ 1960-67 1-3 months 48 786.0 │ 10. │ 1960-67 3-6 months 62 1165.3 │ 11. │ 1960-67 6-12 months 81 2294.8 │ 12. │ 1960-67 1-2 years 97 4500.5 │ 13. │ 1960-67 2-5 years 103 13201.5 │ 14. │ 1960-67 5-10 years 39 19525.0 │ ├───────────────────────────────────────────┤ 15. │ 1968-76 0-1 months 195 495.3 │ 16. │ 1968-76 1-3 months 55 956.7 │ 17. │ 1968-76 3-6 months 58 1381.4 │ 18. │ 1968-76 6-12 months 85 2604.5 │ 19. │ 1968-76 1-2 years 87 4618.5 │ 20. │ 1968-76 2-5 years 70 9814.5 │ 21. │ 1968-76 5-10 years 10 5802.5 │ └───────────────────────────────────────────┘
We label the dataset and save it. The resulting file is available in
the datasets section as somoza2
.
. label data "Infant and Child Mortality in Colombia, 1976" . notes : "Events and Exposure in Table 7.1, GLM Notes" . save somoza2, replace file somoza2.dta saved
In preparation for modeling, let us calculate the logarithm of exposure time, to be used as an offset.
. gen logexp = log(exposure)
Let us fit the null model, which is equivalent to a simple exponential survival model. We will also store the estimates for use in later tests.
. poisson deaths, offset(logexp) Iteration 0: log likelihood = -2184.107 Iteration 1: log likelihood = -2184.107 (backed up) Poisson regression Number of obs = 21 LR chi2(0) = 0.00 Prob > chi2 = . Log likelihood = -2184.107 Pseudo R2 = 0.0000 ─────────────┬──────────────────────────────────────────────────────────────── deaths │ Coefficient Std. err. z P>|z| [95% conf. interval] ─────────────┼──────────────────────────────────────────────────────────────── _cons │ -3.996449 .0237156 -168.52 0.000 -4.04293 -3.949967 logexp │ 1 (offset) ─────────────┴──────────────────────────────────────────────────────────────── . estat gof Deviance goodness-of-fit = 4239.871 Prob > chi2(20) = 0.0000 Pearson goodness-of-fit = 15397.26 Prob > chi2(20) = 0.0000 . estimates store null
Note the astronomical deviance. The estimate of the constant happens to be the log of the overall mortality rate. Let’s verify this fact
. di "Fitted rate = " exp(_b[_cons]) Fitted rate = .0183808 . quietly summarize deaths . scalar ndeaths = r(sum) . quietly summarize exposure . di "Observed Rate = " ndeaths/r(sum) Observed Rate = .0183808
We have an overall mortality rate of 18.4 deaths per thousand child-years of exposure.
On to the one-factor models. We start with the cohort model, which is equivalent to a separate exponential survival model for each cohort:
. poisson deaths i.cohort, offset(logexp) Iteration 0: log likelihood = -2160.0647 Iteration 1: log likelihood = -2159.5266 Iteration 2: log likelihood = -2159.5264 Iteration 3: log likelihood = -2159.5264 Poisson regression Number of obs = 21 LR chi2(2) = 49.16 Prob > chi2 = 0.0000 Log likelihood = -2159.5264 Pseudo R2 = 0.0113 ─────────────┬──────────────────────────────────────────────────────────────── deaths │ Coefficient Std. err. z P>|z| [95% conf. interval] ─────────────┼──────────────────────────────────────────────────────────────── cohort │ 1960-67 │ -.3020391 .0573319 -5.27 0.000 -.4144075 -.1896707 1968-76 │ .0742177 .0589726 1.26 0.208 -.0413664 .1898017 │ _cons │ -3.899487 .0411345 -94.80 0.000 -3.980109 -3.818865 logexp │ 1 (offset) ─────────────┴──────────────────────────────────────────────────────────────── . di exp(_b[2.cohort]), exp(_b[3.cohort]) // 1960-67 amd 1968-76 .73930913 1.0770412 . estat gof Deviance goodness-of-fit = 4190.709 Prob > chi2(18) = 0.0000 Pearson goodness-of-fit = 15387.58 Prob > chi2(18) = 0.0000
Compare these results with the gross effect estimates in Table 7.3. Note that the hazard rate declined 26% between the 1941-59 and 1960-67 cohorts, but appears to have increased almost 8% for the 1968-76 cohort compared to the 1941-59 cohort. (We will return to this issue.)
The extremely large deviance shows that this model does not provide a reasonable description of the data. It is, however, better than the model where all cohorts follow the same exponential survival curve, as evidenced by the model chi-squared or the Wald test.
. lrtest null . Likelihood-ratio test Assumption: null nested within . LR chi2(2) = 49.16 Prob > chi2 = 0.0000 . testparm i.cohort ( 1) [deaths]2.cohort = 0 ( 2) [deaths]3.cohort = 0 chi2( 2) = 48.00 Prob > chi2 = 0.0000
Both tests are highly significant, indicating that overall mortality rates are not the same across cohorts.
Now we consider the age model, where the hazard depends on the age of the child:
. poisson deaths i.age, offset(logexp) Iteration 0: log likelihood = -100.89918 Iteration 1: log likelihood = -100.49174 Iteration 2: log likelihood = -100.49167 Iteration 3: log likelihood = -100.49167 Poisson regression Number of obs = 21 LR chi2(6) = 4167.23 Prob > chi2 = 0.0000 Log likelihood = -100.49167 Pseudo R2 = 0.9540 ─────────────┬──────────────────────────────────────────────────────────────── deaths │ Coefficient Std. err. z P>|z| [95% conf. interval] ─────────────┼──────────────────────────────────────────────────────────────── age │ 1-3 months │ -1.972627 .0916964 -21.51 0.000 -2.152349 -1.792906 3-6 months │ -2.161858 .0851481 -25.39 0.000 -2.328745 -1.994971 6-12 months │ -2.487891 .0755466 -32.93 0.000 -2.635959 -2.339822 1-2 years │ -3.004351 .0726789 -41.34 0.000 -3.146799 -2.861904 2-5 years │ -4.085932 .0756487 -54.01 0.000 -4.2342 -3.937663 5-10 years │ -5.355204 .1141125 -46.93 0.000 -5.57886 -5.131547 │ _cons │ -.7426813 .0422577 -17.58 0.000 -.8255049 -.6598577 logexp │ 1 (offset) ─────────────┴──────────────────────────────────────────────────────────────── . estimates store age . mata exp(st_matrix("e(b)")) 1 2 3 4 5 6 ┌───────────────────────────────────────────────────────────────────────────────────── 1 │ 1 .1390909507 .1151110572 .0830850445 .0495708909 .0168074734 └───────────────────────────────────────────────────────────────────────────────────── 7 8 ─────────────────────────────┐ 1 .0047235073 .4758363547 │ ─────────────────────────────┘ . estat gof Deviance goodness-of-fit = 72.64008 Prob > chi2(14) = 0.0000 Pearson goodness-of-fit = 76.7292 Prob > chi2(14) = 0.0000
The age model is equivalent to a piece-wise exponential survival model with no cohort effects. Compare the results with the gross effects in Table 7.3. Note the dramatic decrease in risk with age. At age one the risk of death is only 5% of what it was in the first month of life.
This model still doesn’t fit the data, as evidenced by the deviance or goodness of fit chi-squared. It is, however, a remarkable improvement over the null, as indicated by the model chi-squared or the Wald test,
. lrtest null . Likelihood-ratio test Assumption: null nested within age LR chi2(6) = 4167.23 Prob > chi2 = 0.0000 . testparm i.age ( 1) [deaths]2.age = 0 ( 2) [deaths]3.age = 0 ( 3) [deaths]4.age = 0 ( 4) [deaths]5.age = 0 ( 5) [deaths]6.age = 0 ( 6) [deaths]7.age = 0 chi2( 6) = 4689.27 Prob > chi2 = 0.0000
You can see why demographers prefer age-specific mortality rates :)
Now on to the additive model with main effects of age and cohort, which is equivalent to a proportional hazards model:
. poisson deaths i.age i.cohort, offset(logexp) Iteration 0: log likelihood = -67.794175 Iteration 1: log likelihood = -67.263248 Iteration 2: log likelihood = -67.263109 Iteration 3: log likelihood = -67.263109 Poisson regression Number of obs = 21 LR chi2(8) = 4233.69 Prob > chi2 = 0.0000 Log likelihood = -67.263109 Pseudo R2 = 0.9692 ─────────────┬──────────────────────────────────────────────────────────────── deaths │ Coefficient Std. err. z P>|z| [95% conf. interval] ─────────────┼──────────────────────────────────────────────────────────────── age │ 1-3 months │ -1.972688 .0916965 -21.51 0.000 -2.152409 -1.792966 3-6 months │ -2.16332 .0851488 -25.41 0.000 -2.330208 -1.996431 6-12 months │ -2.491674 .075551 -32.98 0.000 -2.639752 -2.343597 1-2 years │ -3.014052 .0727035 -41.46 0.000 -3.156548 -2.871556 2-5 years │ -4.115382 .0758262 -54.27 0.000 -4.263999 -3.966766 5-10 years │ -5.435887 .114768 -47.36 0.000 -5.660828 -5.210945 │ cohort │ 1960-67 │ -.3242407 .0573352 -5.66 0.000 -.4366156 -.2118657 1968-76 │ -.4783589 .0593256 -8.06 0.000 -.594635 -.3620828 │ _cons │ -.4484824 .0545389 -8.22 0.000 -.5553767 -.341588 logexp │ 1 (offset) ─────────────┴──────────────────────────────────────────────────────────────── . di exp(_b[2.cohort]), exp(_b[3.cohort]) // 1960-67 and 1968-76 .72307619 .61979973 . estat gof Deviance goodness-of-fit = 6.182966 Prob > chi2(12) = 0.9066 Pearson goodness-of-fit = 6.178637 Prob > chi2(12) = 0.9068
Note that this model fits reasonably well, with a deviance of 6.18 on 12 d.f., so the assumption of proportional hazards is consistent with the data.
Compare the results with the net effect estimates in Table 7.3, and note that the anomaly with the youngest cohort has dissappeared. The estimates now indicate a steady decline in mortality across cohorts. Taking the 1941-59 cohort as a baseline, mortality at every age from zero to ten was 28% lower for the 1960-67 cohort and 36% lower for the more recent 1968-76 cohort. Can you explain why this trend did not emerge until we controlled for age? Hint: the survey was conducted in 1976.
Here’s a likelihood ratio test for the cohort effect adjusted for age. Note that we compare the age model with the additive model that has age and cohort. That is followed by the Wald test.
. lrtest age . Likelihood-ratio test Assumption: age nested within . LR chi2(2) = 66.46 Prob > chi2 = 0.0000 . testparm i.cohort ( 1) [deaths]2.cohort = 0 ( 2) [deaths]3.cohort = 0 chi2( 2) = 68.59 Prob > chi2 = 0.0000
The cohort differences within age groups are highly significant.
Let us calculate the fitted life table shown in Table 7.4 of the lecture notes.
The predict
command following a Poisson regression
calculates the expected number of events, so we need to divide by
exposure to obtain fitted rates. An alternative is to use the
xb
and nooffset
options (you need both) to
obtain the linear predictor or log-hazard, which you can then
exponentiate to obtain the fitted hazard rate.
. predict events (option n assumed; predicted number of events) . gen hazard = events/exposure
At this point recall that the age intervals have different widths. We stored the widths in months in the variable width. We will now convert it to years, so it is in the same units as exposure.
. quietly replace width=width/12
Now we will sort the data by age within each cohort and calculate the cumulative hazard for each cohort as a running sum of the hazard times the interval width. We then use the fact that S(t)= exp{-Λ(t)} to obtain the survival function.
. bysort cohort (age): gen cumhaz = sum(hazard * width) . gen survival = exp(-cumhaz)
The last thing to do is print our handy work. I will use the
tabulate
command rather than list
to obtain a
suitable two-way layout. I specify the “mean” to list the single value
in each combination of age and cohort.
. tab age cohort, sum(survival) mean Means of survival Age (in │ Year of birth (cohort) groups) │ 1941-59 1960-67 1968-76 │ Total ───────────┼─────────────────────────────────┼────────── 0-1 month │ .94817483 .96225142 .96755451 │ .95932692 1-3 month │ .93424243 .95200676 .95871794 │ .94832238 3-6 month │ .91725492 .93945819 .94787562 │ .93486291 6-12 mont │ .89333057 .92167562 .93247539 │ .91582719 1-2 years │ .8657589 .90101755 .91453147 │ .8937693 2-5 years │ .83910966 .88087672 .89698023 │ .8723222 5-10 year │ .8275159 .8720594 .88927853 │ .86295128 ───────────┼─────────────────────────────────┼────────── Total │ .88934103 .91847795 .92963053 │ .91248317
We see that the probability of surviving to age one increased from 89.3% to 92.2% and then to 93.2% across cohorts. The complement of the probability of surviving to age one is known as the infant mortality rate (although it is a probability, not a rate) and is usually expressed per thousand births; it declined from 106.7 to 78.3 to 67.5 across cohorts.
For another example of piecewise exponential survival, this time applied to recidivism and using individual rather than group data, and illustrating the creating of person-year files, see this page.
Our software is able to fit some of the parametric models discussed in the bibliographic notes, such as the Weibull model. It also has non-parametric methods, including procedures for calculating Kaplan-Meier estimates and for fitting Cox regression models by partial likelihood. Finally, the data management facilities include facilities for generating person-year files. For more information see the survival course pages.
Somoza, J.L. (1980). “Illustrative Analysis: Infant and Child Mortality in Colombia.” WFS Scientific Reports, Number 10. Scanned version here
Updated fall 2022