Germán Rodríguez
Generalized Linear Models Princeton University

## 7.A Piecewise Exponential Models

This is an illustration of piecewise exponential survival models using individual-level data. We use R, relying on the functions `survSplit()` to create pseudo-observations and `glm()` to fit the model using the Poisson equivalence. I also use the package `dplyr` for data manipulation. The call to `survSplit()` now uses a formula interface. Hat tip to Jack Hoskins, who alerted me to this change starting with version 2.40-1.

### The Data

The dataset we will consider is analyzed in Wooldridge (2002) and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum length of observation is 81 months. The data are available from the Stata website in Stata format.

``````> library(haven)
> nrow(recid)``````
``[1] 1445``

The file has a censoring indicator, which we subtract from 1 to get a failure indicator. We also create an id variable and list observation number 9, which goes back to prison after 54 months.

``````> library(dplyr)
> recid <- mutate(recid, fail = 1 - cens, id = row_number())
> filter(recid, id == 9) |> select(id, durat, fail)``````
``````# A tibble: 1 × 3
id durat  fail
<int> <dbl> <dbl>
1     9    54     1``````

### Creating Pseudo-Observations

To create pseudo-observations for survival analysis we will use the `survSplit()` function in the `survival` package. We will split the data into single-year intervals of duration, from 0-12 to 48-60, with an open-ended category 60+. The function codes the interval variable using integer codes, and we turn that into a factor for convenience. We calculate exposure time for each episode as the difference between duration at the start and end. We list these data for individual 9 to illustrate how the episodes are created.

``````> library(survival)
> breaks <- seq(12, 60, by=12)
> recidx <- survSplit(Surv(durat, fail) ~ ., data = recid,
+   cut = breaks, episode = "interval", start = "start")
> recidx <- mutate(recidx, exposure = durat - start,
+ interval = factor(interval,  labels =
+   paste("(", c(0,breaks), ",", c(breaks,100), "]", sep=""))) |>
+   rename(events = fail)
> nrow(recidx)``````
``[1] 6718``
``> filter(recidx, id==9) |> select(id,start,durat,interval, events, exposure)``
``````  id start durat interval events exposure
1  9     0    12   (0,12]      0       12
2  9    12    24  (12,24]      0       12
3  9    24    36  (24,36]      0       12
4  9    36    48  (36,48]      0       12
5  9    48    54  (48,60]      1        6``````

The sample observation, which goes back to prison after 54 months, contributes five episodes or pseudo-observations; one each for years one to four, with 12 months of exposure and no events, and another one for year five, with 6 months of exposure and one event.

### A PWE Proportional Hazards Model

We are now ready to fit a proportional hazards model with a piecewise exponential baseline where the hazard changes from year to year. We use the same model as Wooldridge(2002), involving ten predictors, all fixed covariates. I specify the offset using the `exposure()` option. I could, of course, take logs and then use the `offset()` option.]{.stata}

``````> fit <- glm(events~interval+workprg+priors+tserved+felon+alcohol+drugs+
+   black+married+educ+age+offset(log(exposure)),
+   data=recidx, family=poisson)
> summary(fit)``````
``````
Call:
glm(formula = events ~ interval + workprg + priors + tserved +
felon + alcohol + drugs + black + married + educ + age +
offset(log(exposure)), family = poisson, data = recidx)

Deviance Residuals:
Min       1Q   Median       3Q      Max
-1.6130  -0.4598  -0.3468  -0.2543   3.4746

Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept)      -3.8301275  0.2802673 -13.666  < 2e-16 ***
interval(12,24]   0.0365320  0.1093618   0.334  0.73834
interval(24,36]  -0.3738156  0.1296119  -2.884  0.00393 **
interval(36,48]  -0.8115436  0.1564015  -5.189 2.12e-07 ***
interval(48,60]  -0.9382311  0.1683212  -5.574 2.49e-08 ***
interval(60,100] -1.5471779  0.2033489  -7.608 2.77e-14 ***
workprg           0.0838291  0.0907942   0.923  0.35586
priors            0.0872458  0.0134735   6.475 9.46e-11 ***
tserved           0.0130089  0.0016859   7.716 1.20e-14 ***
felon            -0.2839252  0.1061488  -2.675  0.00748 **
alcohol           0.4324425  0.1057211   4.090 4.31e-05 ***
drugs             0.2747141  0.0978635   2.807  0.00500 **
black             0.4335560  0.0883623   4.907 9.27e-07 ***
married          -0.1540477  0.1092119  -1.411  0.15838
educ             -0.0214162  0.0194440  -1.101  0.27071
age              -0.0035800  0.0005222  -6.855 7.13e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 3691.4  on 6717  degrees of freedom
Residual deviance: 3379.1  on 6702  degrees of freedom
AIC: 4515.1

Number of Fisher Scoring iterations: 6``````
``> 1 - exp(coef(fit)["felon"])``
``````   felon
0.247177 ``````

We see that the risk of recidivism is about the same in the first two years, but then decreases substantially with duration since release. At any given duration, felons have 25% lower risk of recidivism than non-felons with the same observed characteristics. Subjects imprisoned for alcohol or drug related offenses have much higher risk of recidivism, everything else being equal.

### Survival Probabilities

We now illustrate the calculation of survival probabilities, starting with the baseline hazard. There are different ways to do these calculations in Stata, but I will proceed from first principles using Mata.. We will retrieve the coefficients, add the constant and the age effects to obtain the log hazard, exponentiate to obtain hazards, multiply by 12 and sum to obtain the cumulative baseline hazard, and then exponentiate to obtain the baseline survival. This is not particularly meaningful as it would apply with all covariates set to zero, including age, which is measured in months.

``````> b <- coef(fit)
> h <- exp(b[1] + c(0, b[2:6]))
> H <- cumsum(12*h)
> S <- exp(-H)
> names(S)[1] <- names(H)[1] <- "interval(0,12])"
> S``````
`````` interval(0,12])  interval(12,24]  interval(24,36]  interval(36,48]  interval(48,60]
0.7706799        0.5882188        0.4916958        0.4379748        0.3955312
interval(60,100]
0.3741986 ``````

We will now estimate the probability of staying out of prison for five years given average values of the predictors. First we calculate the mean of each predictor; we have to be careful to include only one observation per person, so we restrict the data to the first interval.

``````> xvars <- names(coef(fit))[7:16]
> pset <- filter(recidx, interval == "(0,12]") |>  select(xvars) ``````
``````Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(xvars)` instead of `xvars` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.``````
``> means <- colMeans(pset)``

Now that we have the means, we multiply each by the corresponding coefficient to obtain the linear predictor `xb`, exponentiate to obtain a relative risk, multiply by the baseline hazard, and then calculated the predicted survival.

``````> xb <- sum(coef(fit)[xvars] * means)
> exp(-(H * exp(xb)))[5]``````
``````interval(48,60]
0.6570278 ``````

Thus, the probability of staying out of prison for five years for the average person is 65.7%.

We now calculate this probability for felons and non-felons, keeping all other variables at their means. All we need to do is subtract from `xb` the coefficient of felon times the mean, which gives the linear predictor for a non-felon. We then add the coefficient of felon to get the linear predictor for a felon. In both cases the other variables stay at their means.

``````> xb0 <- xb - coef(fit)["felon"] * means["felon"]
> xb1 <- xb0 + coef(fit)["felon"]
> data.frame(non.felon = exp(-H[5]*exp(xb0)), felon= exp(-H[5]*exp(xb1)))``````
``````                non.felon     felon
interval(48,60] 0.6317763 0.7077168``````

The predicted probability is 70.8% for felons and 63.2% for non-felons when all other characteristics are set to the mean, a difference of 7.6 percentage points. This is a marginal effect at the means.

An alternative calculation sets every person to be a felon or non-felon, leaving all other characteristics as they are, and then averages the predicted probability of surviving five years without returning to prison.

``````> xb0 <- as.matrix(mutate(pset, felon=0)) %*% coef(fit)[xvars]
> xb1 <- as.matrix(mutate(pset, felon=1)) %*% coef(fit)[xvars]
> data.frame(non.felon=mean(exp(-H[5] * exp(xb0))),
+                felon=mean(exp(-H[5] * exp(xb1))))``````
``````  non.felon     felon
1 0.6118797 0.6857928``````

The average probability of staying out of prison for five years is 68.6% for felons 61.2% for non-felons, a difference of 7.4 percentage points. This can be interpreted as an average marginal effect.

### References

Wooldridge, Jeffrey M. (2010). Econometric Analysis of Cross Section and Panel Data. 2nd Edition. Cambridge, Massachusetts: The MIT Press.

Chung, C-F, P. Schmidt and A.D. Witte (1991). “Survival Analysis: A Survey”. Journal of Quantitative Criminology,7:59-98.

Updated fall 2022