By popular request, here’s an example using a piecewise exponential model with a regression spline, applied to the recidivism data. The first thing to do is read the data and create a failure indicator.
. use https://www.stata.com/data/jwooldridge/eacsap/recid.dta, clear . gen fail = 1 - cens
> library(haven) > recid <- read_dta("https://www.stata.com/data/jwooldridge/eacsap/recid.dta") > recid$fail <- 1 - recid$cens > nrow(recid) [1] 1445
Next I will split the data every three months, and create a new variable to reflect exposure (time lived in each interval)
. gen id = _n . stset durat, fail(fail) id(id) Survival-time data settings ID variable: id Failure event: fail!=0 & fail<. Observed time interval: (durat[_n-1], durat] Exit on or before: failure ────────────────────────────────────────────────────────────────────────── 1,445 total observations 0 exclusions ────────────────────────────────────────────────────────────────────────── 1,445 observations remaining, representing 1,445 subjects 552 failures in single-failure-per-subject data 80,013 total analysis time at risk and under observation At risk from t = 0 Earliest observed entry t = 0 Last observed exit t = 81 . stsplit trimester, at(3(3)78) (25,750 observations (episodes) created)
> library(survival) > recidx <- survSplit(Surv(durat, fail) ~ ., cut=seq(3,78,3), data=recid) > nrow(recidx) [1] 27195 > recidx$exposure <- recidx$durat - recidx$tstart
To model the baseline hazard I will use a regression spline with internal knots at 30, 40 and 60 weeks. These are approximately the quartiles of overall survival as calculated using Kaplan-Meier. To define the spline I used the mid-point of each interval, 1.5 months after the start. I then fit a Poisson model with the log of exposure as an offset
. gen mid = _t0 + 1.5 . bspline, xvar(mid) knots(0 30 40 60 80) p(3) gen(bspl) . gen expo = _t - _t0 . poisson _d bspl* workprg priors tserved felon drugs black married /// > educ age, exposure(expo) nocons Iteration 0: log likelihood = -39499.985 Iteration 1: log likelihood = -2902.2437 Iteration 2: log likelihood = -2828.6165 Iteration 3: log likelihood = -2804.828 Iteration 4: log likelihood = -2804.7856 Iteration 5: log likelihood = -2804.7856 Poisson regression Number of obs = 27,195 Wald chi2(16) = 12472.26 Log likelihood = -2804.7856 Prob > chi2 = 0.0000 ─────────────┬──────────────────────────────────────────────────────────────── _d │ Coefficient Std. err. z P>|z| [95% conf. interval] ─────────────┼──────────────────────────────────────────────────────────────── bspl1 │ -10.76743 3.01364 -3.57 0.000 -16.67406 -4.860806 bspl2 │ -2.446414 .7114559 -3.44 0.001 -3.840841 -1.051986 bspl3 │ -4.048105 .3876143 -10.44 0.000 -4.807815 -3.288395 bspl4 │ -4.701796 .383244 -12.27 0.000 -5.452941 -3.950652 bspl5 │ -4.827016 .5760656 -8.38 0.000 -5.956084 -3.697948 bspl6 │ -5.630131 1.271756 -4.43 0.000 -8.122728 -3.137534 bspl7 │ -13.58216 12.08349 -1.12 0.261 -37.26537 10.10105 workprg │ .0529768 .0903916 0.59 0.558 -.1241876 .2301411 priors │ .0949319 .0134198 7.07 0.000 .0686297 .1212342 tserved │ .0123086 .0016645 7.39 0.000 .0090462 .0155711 felon │ -.3046545 .104908 -2.90 0.004 -.5102704 -.0990386 drugs │ .2639392 .0979932 2.69 0.007 .0718761 .4560023 black │ .393758 .0876149 4.49 0.000 .2220359 .56548 married │ -.1347573 .1090351 -1.24 0.216 -.3484622 .0789475 educ │ -.0197291 .019438 -1.01 0.310 -.0578269 .0183686 age │ -.003322 .0005175 -6.42 0.000 -.0043362 -.0023078 ln(expo) │ 1 (exposure) ─────────────┴────────────────────────────────────────────────────────────────
> library(splines) > mf <- fail ~ offset(log(exposure)) + workprg + priors + tserved + felon + + drugs + black + married + educ + age + bs(tstart + 1.5, knots=c(30, 40, 60)) > pwe <- glm(mf, family=poisson, data=recidx) > coef(summary(pwe))[1:10,] Estimate Std. Error z value Pr(>|z|) (Intercept) -4.032652566 0.3017915359 -13.3623780 1.003077e-40 workprg 0.052976769 0.0903916357 0.5860804 5.578215e-01 priors 0.094931924 0.0134197589 7.0740410 1.504858e-12 tserved 0.012308637 0.0016645485 7.3945799 1.418560e-13 felon -0.304654512 0.1049079755 -2.9040167 3.684085e-03 drugs 0.263939139 0.0979931661 2.6934443 7.071795e-03 black 0.393757912 0.0876148714 4.4941904 6.983513e-06 married -0.134757315 0.1090350844 -1.2359078 2.164928e-01 educ -0.019729166 0.0194379965 -1.0149794 3.101156e-01 age -0.003321962 0.0005174536 -6.4198264 1.364298e-10
You may want to check that the coefficients for our predictors are very similar to those obtained using a Cox model. As for the shape of the hazard, we can predict using the spline coefficients.
. preserve // for safety . egen tag = tag(mid) . keep if tag // one obs per interval (27,168 observations deleted) . gen shat = 0 . forvalues i=1/7 { 2. quietly replace shat = shat + bspl`i' * _b[bspl`i'] 3. } . expand 2 (27 observations created) . gen first = _n <= 27 . replace mid = mid - 1.5 if first (27 real changes made) . replace mid = mid + 1.5 if !first (27 real changes made) . sort mid first . line shat mid, xtitle(duration) . graph export pweSpline.png, width(500) replace file pweSpline.png saved as PNG format
> library(ggplot2) > m <- seq(1.5, 79.5, 3) > y <- bs(m, knots=c(30,40,60)) %*% coef(pwe)[11:16] > xy <- data.frame( duration = sort(c(m-1.5, m+1.5)), hazard = rep(y, rep(2,length(y)))) > ggplot(xy, aes(duration, hazard)) + geom_line() > ggsave("pweSpliner.png", width=500/72, height=400/72, dpi=72)
Showing once agan that the hazard appears to rise initially and then decline steadily, so a log-normal is not a bad fit. Note that plotting the spline would be a bit misleading, as we really didn’t fit a curve, but rather a piecewise constant hazard where the value in each three-month interval is taken from the curve. The graph above takes this into account.