Germán Rodríguez

Generalized Linear Models
Princeton University
This is an illustration of piecewise exponential survival models
using individual-level data. We use R, relying on the functions
`survSplit()`

to create pseudo-observations and
`glm()`

to fit the model using the Poisson equivalence. I
also use the package `dplyr`

for data manipulation. The call
to `survSplit()`

now uses a formula interface. Hat tip to
Jack Hoskins, who alerted me to this change starting with version
2.40-1.

The dataset we will consider is analyzed in Wooldridge (2002) and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum length of observation is 81 months. The data are available from the Stata website in Stata format.

```
> library(haven)
> recid <- read_dta("https://www.stata.com/data/jwooldridge/eacsap/recid.dta")
> nrow(recid)
```

`[1] 1445`

The file has a censoring indicator, which we subtract from 1 to get a failure indicator. We also create an id variable and list observation number 9, which goes back to prison after 54 months.

```
> library(dplyr)
> recid <- mutate(recid, fail = 1 - cens, id = row_number())
> filter(recid, id == 9) |> select(id, durat, fail)
```

```
# A tibble: 1 × 3
id durat fail
<int> <dbl> <dbl>
1 9 54 1
```

To create pseudo-observations for survival analysis we will use the
`survSplit()`

function in the `survival`

package.
We will split the data into single-year intervals of duration, from 0-12
to 48-60, with an open-ended category 60+. The function codes the
interval variable using integer codes, and we turn that into a factor
for convenience. We calculate exposure time for each episode as the
difference between duration at the start and end. We list these data for
individual 9 to illustrate how the episodes are created.

```
> library(survival)
> breaks <- seq(12, 60, by=12)
> recidx <- survSplit(Surv(durat, fail) ~ ., data = recid,
+ cut = breaks, episode = "interval", start = "start")
> recidx <- mutate(recidx, exposure = durat - start,
+ interval = factor(interval, labels =
+ paste("(", c(0,breaks), ",", c(breaks,100), "]", sep=""))) |>
+ rename(events = fail)
> nrow(recidx)
```

`[1] 6718`

`> filter(recidx, id==9) |> select(id,start,durat,interval, events, exposure)`

```
id start durat interval events exposure
1 9 0 12 (0,12] 0 12
2 9 12 24 (12,24] 0 12
3 9 24 36 (24,36] 0 12
4 9 36 48 (36,48] 0 12
5 9 48 54 (48,60] 1 6
```

The sample observation, which goes back to prison after 54 months, contributes five episodes or pseudo-observations; one each for years one to four, with 12 months of exposure and no events, and another one for year five, with 6 months of exposure and one event.

We are now ready to fit a proportional hazards model with a piecewise
exponential baseline where the hazard changes from year to year. We use
the same model as Wooldridge(2002), involving ten predictors, all fixed
covariates. I specify the offset using the `exposure()`

option. I could, of course, take logs and then use the
`offset()`

option.]{.stata}

```
> fit <- glm(events~interval+workprg+priors+tserved+felon+alcohol+drugs+
+ black+married+educ+age+offset(log(exposure)),
+ data=recidx, family=poisson)
> summary(fit)
```

```
Call:
glm(formula = events ~ interval + workprg + priors + tserved +
felon + alcohol + drugs + black + married + educ + age +
offset(log(exposure)), family = poisson, data = recidx)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.6130 -0.4598 -0.3468 -0.2543 3.4746
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.8301275 0.2802673 -13.666 < 2e-16 ***
interval(12,24] 0.0365320 0.1093618 0.334 0.73834
interval(24,36] -0.3738156 0.1296119 -2.884 0.00393 **
interval(36,48] -0.8115436 0.1564015 -5.189 2.12e-07 ***
interval(48,60] -0.9382311 0.1683212 -5.574 2.49e-08 ***
interval(60,100] -1.5471779 0.2033489 -7.608 2.77e-14 ***
workprg 0.0838291 0.0907942 0.923 0.35586
priors 0.0872458 0.0134735 6.475 9.46e-11 ***
tserved 0.0130089 0.0016859 7.716 1.20e-14 ***
felon -0.2839252 0.1061488 -2.675 0.00748 **
alcohol 0.4324425 0.1057211 4.090 4.31e-05 ***
drugs 0.2747141 0.0978635 2.807 0.00500 **
black 0.4335560 0.0883623 4.907 9.27e-07 ***
married -0.1540477 0.1092119 -1.411 0.15838
educ -0.0214162 0.0194440 -1.101 0.27071
age -0.0035800 0.0005222 -6.855 7.13e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 3691.4 on 6717 degrees of freedom
Residual deviance: 3379.1 on 6702 degrees of freedom
AIC: 4515.1
Number of Fisher Scoring iterations: 6
```

`> 1 - exp(coef(fit)["felon"])`

```
felon
0.247177
```

We see that the risk of recidivism is about the same in the first two years, but then decreases substantially with duration since release. At any given duration, felons have 25% lower risk of recidivism than non-felons with the same observed characteristics. Subjects imprisoned for alcohol or drug related offenses have much higher risk of recidivism, everything else being equal.

We now illustrate the calculation of survival probabilities, starting with the baseline hazard. There are different ways to do these calculations in Stata, but I will proceed from first principles using Mata.. We will retrieve the coefficients, add the constant and the age effects to obtain the log hazard, exponentiate to obtain hazards, multiply by 12 and sum to obtain the cumulative baseline hazard, and then exponentiate to obtain the baseline survival. This is not particularly meaningful as it would apply with all covariates set to zero, including age, which is measured in months.

```
> b <- coef(fit)
> h <- exp(b[1] + c(0, b[2:6]))
> H <- cumsum(12*h)
> S <- exp(-H)
> names(S)[1] <- names(H)[1] <- "interval(0,12])"
> S
```

```
interval(0,12]) interval(12,24] interval(24,36] interval(36,48] interval(48,60]
0.7706799 0.5882188 0.4916958 0.4379748 0.3955312
interval(60,100]
0.3741986
```

We will now estimate the probability of staying out of prison for five years given average values of the predictors. First we calculate the mean of each predictor; we have to be careful to include only one observation per person, so we restrict the data to the first interval.

```
> xvars <- names(coef(fit))[7:16]
> pset <- filter(recidx, interval == "(0,12]") |> select(xvars)
```

```
Note: Using an external vector in selections is ambiguous.
ℹ Use `all_of(xvars)` instead of `xvars` to silence this message.
ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
This message is displayed once per session.
```

`> means <- colMeans(pset)`

Now that we have the means, we multiply each by the corresponding
coefficient to obtain the linear predictor `xb`

, exponentiate
to obtain a relative risk, multiply by the baseline hazard, and then
calculated the predicted survival.

```
> xb <- sum(coef(fit)[xvars] * means)
> exp(-(H * exp(xb)))[5]
```

```
interval(48,60]
0.6570278
```

Thus, the probability of staying out of prison for five years for the average person is 65.7%.

We now calculate this probability for felons and non-felons, keeping
all other variables at their means. All we need to do is subtract from
`xb`

the coefficient of felon times the mean, which gives the
linear predictor for a non-felon. We then add the coefficient of felon
to get the linear predictor for a felon. In both cases the other
variables stay at their means.

```
> xb0 <- xb - coef(fit)["felon"] * means["felon"]
> xb1 <- xb0 + coef(fit)["felon"]
> data.frame(non.felon = exp(-H[5]*exp(xb0)), felon= exp(-H[5]*exp(xb1)))
```

```
non.felon felon
interval(48,60] 0.6317763 0.7077168
```

The predicted probability is 70.8% for felons and 63.2% for non-felons when all other characteristics are set to the mean, a difference of 7.6 percentage points. This is a marginal effect at the means.

An alternative calculation sets every person to be a felon or non-felon, leaving all other characteristics as they are, and then averages the predicted probability of surviving five years without returning to prison.

```
> xb0 <- as.matrix(mutate(pset, felon=0)) %*% coef(fit)[xvars]
> xb1 <- as.matrix(mutate(pset, felon=1)) %*% coef(fit)[xvars]
> data.frame(non.felon=mean(exp(-H[5] * exp(xb0))),
+ felon=mean(exp(-H[5] * exp(xb1))))
```

```
non.felon felon
1 0.6118797 0.6857928
```

The average probability of staying out of prison for five years is 68.6% for felons 61.2% for non-felons, a difference of 7.4 percentage points. This can be interpreted as an average marginal effect.

Wooldridge, Jeffrey M. (2010). *Econometric Analysis of Cross
Section and Panel Data*. 2nd Edition. Cambridge, Massachusetts: The
MIT Press.

Chung, C-F, P. Schmidt and A.D. Witte (1991). “Survival Analysis: A
Survey”. *Journal of Quantitative
Criminology*,**7**:59-98.

Updated fall 2022