Statistics and Population

Stata Logs

Home Lecture Notes Stata Logs R Logs Datasets Problem Sets

7.A Piecewise Exponential Models

This is an illustration of piecewise exponential survival models using individual-level data. We use Stata, relying on the commands stset and stsplit to create pseudo-observations, and poisson to fit the model using the Poisson equivalence. Stata can also fit this model using streg with distribution(exponential) on the split data.

The Data

The dataset we will consider is analyzed in Wooldridge (2002) and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum length of observation is 81 months. The data are available from the Stata website in Stata format.

. use https://www.stata.com/data/jwooldridge/eacsap/recid, clear

. desc, s

Contains data from https://www.stata.com/data/jwooldridge/eacsap/recid.dta
 Observations:         1,445                  
    Variables:            18                  28 Sep 1998 13:28
Sorted by:

The file has a censoring indicator, which we subtract from 1 to get a failure indicator. We also create an id variable and list observation number 9, which goes back to prison after 54 months.

. gen fail = 1 - cens

. gen id = _n

. list id durat fail in 9    

     ┌───────────────────┐
     │ id   durat   fail │
     ├───────────────────┤
  9. │  9      54      1 │
     └───────────────────┘

Creating Pseudo-Observations

To create pseudo-observations for survival analysis using the piecewise exponential model we stset the data ,making sure we specify an id variable, and then use stsplit to split the data into single-year intervals of duration from 0-12 to 48-60, with an open-ended category 60+. The first command generates the built-in variables _t0 for entering time, _t for exit time and _d for failure. These are adjusted after the split to reflect what happens in each interval. We compute exposure as the difference between the exit and entering times. I also create a variable for the number of events, but this is not necessary as _d would serve the same purpose. We list these variables for individual 9 after the split to illustrate how the episodes are created.

. stset durat, fail(fail) id(id)

Survival-time data settings

           ID variable: id
         Failure event: fail!=0 & fail<.
Observed time interval: (durat[_n-1], durat]
     Exit on or before: failure

──────────────────────────────────────────────────────────────────────────
      1,445  total observations
          0  exclusions
──────────────────────────────────────────────────────────────────────────
      1,445  observations remaining, representing
      1,445  subjects
        552  failures in single-failure-per-subject data
     80,013  total analysis time at risk and under observation
                                                At risk from t =         0
                                     Earliest observed entry t =         0
                                          Last observed exit t =        81

. list id _t0 _t _d if id==9

      ┌────────────────────┐
      │ id   _t0   _t   _d │
      ├────────────────────┤
   9. │  9     0   54    1 │
      └────────────────────┘

. stsplit year, at(0 12 24 36 48 60 100)
(5,273 observations (episodes) created)

. gen exposure = _t - _t0

. gen events = _d

. list id _t0 _t _d year exposure events if id==9

      ┌───────────────────────────────────────────────┐
      │ id   _t0   _t   _d   year   exposure   events │
      ├───────────────────────────────────────────────┤
  41. │  9     0   12    0      0         12        0 │
  42. │  9    12   24    0     12         12        0 │
  43. │  9    24   36    0     24         12        0 │
  44. │  9    36   48    0     36         12        0 │
  45. │  9    48   54    1     48          6        1 │
      └───────────────────────────────────────────────┘

The sample observation, which goes back to prison after 54 months, contributes five episodes or pseudo-observations; one each for years one to four, with 12 months of exposure and no events, and another one for year five, with 6 months of exposure and one event. Note that the variable generated by Stata to identify episodes, here year, reflects the time at which the interval starts, the same as _t0.

A PWE Proportional Hazards Model

We are now ready to fit a proportional hazards model with a piecewise exponential baseline where the hazard changes from year to year. We use the same model as Wooldridge(2002), involving ten predictors, all fixed covariates. I specify the offset using the exposure() option. I could, of course, take logs and then use the offset() option.]{.stata}

. local predictors workprg priors tserved felon alcohol drugs black married educ
>  age

. poisson events i.year `predictors', exposure(exposure)

Iteration 0:   log likelihood = -2244.0023  
Iteration 1:   log likelihood = -2241.5492  
Iteration 2:   log likelihood = -2241.5353  
Iteration 3:   log likelihood = -2241.5353  

Poisson regression                                      Number of obs =  6,718
                                                        LR chi2(15)   = 312.36
                                                        Prob > chi2   = 0.0000
Log likelihood = -2241.5353                             Pseudo R2     = 0.0651

─────────────┬────────────────────────────────────────────────────────────────
      events │ Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
        year │
         12  │    .036532   .1093659     0.33   0.738    -.1778212    .2508851
         24  │  -.3738156   .1296172    -2.88   0.004    -.6278607   -.1197706
         36  │  -.8115436   .1564067    -5.19   0.000    -1.118095   -.5049921
         48  │  -.9382311   .1683272    -5.57   0.000    -1.268146   -.6083159
         60  │  -1.547178   .2033594    -7.61   0.000    -1.945755   -1.148601
             │
     workprg │   .0838291   .0907983     0.92   0.356    -.0941323    .2617906
      priors │   .0872458   .0134763     6.47   0.000     .0608327     .113659
     tserved │   .0130089   .0016863     7.71   0.000     .0097039    .0163139
       felon │  -.2839252   .1061534    -2.67   0.007     -.491982   -.0758684
     alcohol │   .4324425   .1057254     4.09   0.000     .2252245    .6396605
       drugs │   .2747141   .0978667     2.81   0.005     .0828989    .4665293
       black │    .433556   .0883658     4.91   0.000     .2603622    .6067497
     married │  -.1540477   .1092154    -1.41   0.158    -.3681059    .0600104
        educ │  -.0214162   .0194453    -1.10   0.271    -.0595283     .016696
         age │    -.00358   .0005223    -6.85   0.000    -.0046037   -.0025563
       _cons │  -3.830127    .280282   -13.67   0.000     -4.37947   -3.280785
ln(exposure) │          1  (exposure)
─────────────┴────────────────────────────────────────────────────────────────

. di 1 - exp(_b[felon])
.24717705

We see that the risk of recidivism is about the same in the first two years, but then decreases substantially with duration since release. At any given duration, felons have 25% lower risk of recidivism than non-felons with the same observed characteristics. Subjects imprisoned for alcohol or drug related offenses have much higher risk of recidivism, everything else being equal.

Survival Probabilities

We now illustrate the calculation of survival probabilities, starting with the baseline hazard. There are different ways to do these calculations in Stata, but I will proceed from first principles using Mata.. We will retrieve the coefficients, add the constant and the age effects to obtain the log hazard, exponentiate to obtain hazards, multiply by 12 and sum to obtain the cumulative baseline hazard, and then exponentiate to obtain the baseline survival. This is not particularly meaningful as it would apply with all covariates set to zero, including age, which is measured in months.

. mata
───────────────────────────────────────────────── mata (type end to exit) ──────
:     b = st_matrix("e(b)")

:     logh = b[17] :+ b[1..5] // constant is last coefficient

:     h = exp(logh)

:     H = runningsum(12 * h)

:     S = exp(-H)

:     S
                 1             2             3             4             5
    ┌───────────────────────────────────────────────────────────────────────┐
  1 │  .7706798852   .5882188201   .4916958096   .4379748047   .3955311901  │
    └───────────────────────────────────────────────────────────────────────┘

: end
────────────────────────────────────────────────────────────────────────────────

We will now estimate the probability of staying out of prison for five years given average values of the predictors. First we calculate the mean of each predictor; we have to be careful to include only one observation per person, so we restrict the data to the first interval. The easiest way to compute means in Stata is to collapse.

. keep if year == 0
(5,273 observations deleted)

. save recid1, replace
file recid1.dta saved

. collapse `predictors'

Now that we have the means, we multiply each by the corresponding coefficient to obtain the linear predictor xb, exponentiate to obtain a relative risk, multiply by the baseline hazard, and then calculated the predicted survival.

. scalar xb = 0

. foreach var of local predictors {
  2.   scalar xb = xb + `var' * _b[`var']
  3. }

. di "xb=" xb
xb=-.79219682

. mata: exp( -H[5] * exp(st_numscalar("xb")) )
  .6570278064

Thus, the probability of staying out of prison for five years for the average person is 65.7%.

We now calculate this probability for felons and non-felons, keeping all other variables at their means. All we need to do is subtract from xb the coefficient of felon times the mean, which gives the linear predictor for a non-felon. We then add the coefficient of felon to get the linear predictor for a felon. In both cases the other variables stay at their means.

. scalar xb0 = xb - felon*_b[felon]

. scalar xb1 = xb0 + _b[felon]

. mata exp( -H[5] * exp(st_numscalar("xb0")) )
  .6317763048

. mata exp( -H[5] * exp(st_numscalar("xb1")) )
  .7077167906

The predicted probability is 70.8% for felons and 63.2% for non-felons when all other characteristics are set to the mean, a difference of 7.6 percentage points. This is a marginal effect at the means.

An alternative calculation sets every person to be a felon or non-felon, leaving all other characteristics as they are, and then averages the predicted probability of surviving five years without returning to prison. To do this we need the file with the first episode for each person, which conveniently I saved. I’ll also store the cumulative hazard at duration 60 in scalar H5.

. use recid1, clear

. scalar drop _all // to reuse names xb0 and xb1

. mata st_numscalar("H5", H[5])

. gen xb0 = 0

. local predictors workprg priors tserved felon alcohol drugs black married educ
>  age

. foreach var of local predictors {
  2.   if "`var'" == "felon" continue
  3.   quietly replace xb0 = xb0 + `var' * _b[`var']
  4. }

. gen xb1 = xb0 + _b[felon]

. gen S0 = exp(-H5 * exp(xb0) )

. gen S1 = exp(-H5 * exp(xb1) )

. sum S0 S1 

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
          S0 │      1,445    .6118797    .1549424   .0021267   .9595686
          S1 │      1,445    .6857928    .1392872   .0097329   .9694076

The average probability of staying out of prison for five years is 68.6% for felons 61.2% for non-felons, a difference of 7.4 percentage points. This can be interpreted as an average marginal effect.

References

Wooldridge, Jeffrey M. (2010). Econometric Analysis of Cross Section and Panel Data. 2nd Edition. Cambridge, Massachusetts: The MIT Press.

Chung, C-F, P. Schmidt and A.D. Witte (1991). “Survival Analysis: A Survey”. Journal of Quantitative Criminology,7:59-98.

Updated fall 2022