The dataset considered here is analyzed in Wooldridge (2002) and credited to Chung, Schmidt and Witte (1991). The data pertain to a random sample of convicts released from prison between July 1, 1977 and June 30, 1978. Of interest is the time until they return to prison. The information was collected retrospectively by looking at records in April 1984, so the maximum possible length of observation is 81 months.
Wooldridge fits Weibull and log-normal models using the following predictors:
Variable | Description |
---|---|
workprg | an indicator of participation in a work program |
priors | the number of previous convictions |
tserved | the time served rounded to months |
felon | and indicator of felony sentences |
alcohol | an indicator of alcohol problems |
drugs | an indicator of drug use history |
black | an indicator for African Americans |
married | an indicator if married when incarcerated |
educ | the number of years of schooling, and |
age | in months |
We will reproduce his results using R and Stata. The data are
available in binary format from the Stata website and consists of 1445
observations on 18 variables. The duration variable is called
durat
and represents time in months until return to prison
or end of follow up. The censoring indicator is called cens
and is coded 1 if the observation was censored (i.e. the individual had
not returned to prison).
We create a new variable fail
coded 1 for failures.
. use https://www.stata.com/data/jwooldridge/eacsap/recid, clear . desc, short Contains data from https://www.stata.com/data/jwooldridge/eacsap/recid.dta Observations: 1,445 Variables: 18 28 Sep 1998 13:28 Sorted by: . gen fail = 1 - cens
> library(haven) > library(dplyr) > recid <- read_dta("https://www.stata.com/data/jwooldridge/eacsap/recid.dta") > recid <- mutate(recid, fail = 1 - cens) > head(recid) # A tibble: 6 × 19 black alcohol drugs super married felon workprg property person priors educ <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 0 1 0 1 1 0 1 0 0 0 7 2 1 0 0 1 0 1 1 1 0 0 12 3 0 0 0 0 0 0 1 1 0 0 9 4 0 0 1 1 0 1 1 1 0 2 9 5 0 0 1 1 0 0 0 0 0 0 9 6 1 0 0 1 0 0 1 0 0 1 12 # … with 8 more variables: rules <dbl>, age <dbl>, tserved <dbl>, follow <dbl>, # durat <dbl>, cens <dbl>, ldurat <dbl>, fail <dbl>
Let us first fit a proportional hazards model with a Weibull
baseline, using stset
to set the data
and survreg
to fit the model. To avoid repetition we will
save the predictors in a macro. which we can do
using survreg
with dist
set to “weibull”. For
convenience we save the model formula so we can reuse it
later.
. stset durat, failure(fail) Survival-time data settings Failure event: fail!=0 & fail<. Observed time interval: (0, durat] Exit on or before: failure ────────────────────────────────────────────────────────────────────────── 1,445 total observations 0 exclusions ────────────────────────────────────────────────────────────────────────── 1,445 observations remaining, representing 552 failures in single-record/single-failure data 80,013 total analysis time at risk and under observation At risk from t = 0 Earliest observed entry t = 0 Last observed exit t = 81 . local predictors workprg priors tserved felon alcohol drugs /// > black married educ age . streg `predictors', distrib(weibull) nolog Failure _d: fail Analysis time _t: durat Weibull PH regression No. of subjects = 1,445 Number of obs = 1,445 No. of failures = 552 Time at risk = 80,013 LR chi2(10) = 165.48 Log likelihood = -1633.0325 Prob > chi2 = 0.0000 ─────────────┬──────────────────────────────────────────────────────────────── _t │ Haz. ratio Std. err. z P>|z| [95% conf. interval] ─────────────┼──────────────────────────────────────────────────────────────── workprg │ 1.095148 .0992728 1.00 0.316 .9168814 1.308074 priors │ 1.092848 .014683 6.61 0.000 1.064445 1.122008 tserved │ 1.013655 .0017037 8.07 0.000 1.010321 1.017 felon │ .7412054 .0785485 -2.83 0.005 .6021898 .9123128 alcohol │ 1.564179 .165389 4.23 0.000 1.271406 1.92437 drugs │ 1.325064 .1296765 2.88 0.004 1.093791 1.605237 black │ 1.574149 .1390031 5.14 0.000 1.32398 1.871587 married │ .8593436 .0938794 -1.39 0.165 .6937084 1.064527 educ │ .9769709 .0189724 -1.20 0.230 .9404845 1.014873 age │ .9962823 .000523 -7.09 0.000 .9952577 .997308 _cons │ .0333035 .0100249 -11.30 0.000 .0184613 .0600781 ─────────────┼──────────────────────────────────────────────────────────────── /ln_p │ -.2158398 .0389149 -5.55 0.000 -.2921115 -.1395681 ─────────────┼──────────────────────────────────────────────────────────────── p │ .8058644 .0313601 .7466852 .8697338 1/p │ 1.240904 .0482896 1.149777 1.339252 ─────────────┴──────────────────────────────────────────────────────────────── Note: _cons estimates baseline hazard.
By default Stata exponentiates the coefficients to show relative
risks. Use the option nohr
for no hazard ratios to
obtain the coefficients.
. streg, nohr Weibull PH regression No. of subjects = 1,445 Number of obs = 1,445 No. of failures = 552 Time at risk = 80,013 LR chi2(10) = 165.48 Log likelihood = -1633.0325 Prob > chi2 = 0.0000 ─────────────┬──────────────────────────────────────────────────────────────── _t │ Coefficient Std. err. z P>|z| [95% conf. interval] ─────────────┼──────────────────────────────────────────────────────────────── workprg │ .0908893 .0906478 1.00 0.316 -.0867772 .2685558 priors │ .0887867 .0134355 6.61 0.000 .0624535 .1151198 tserved │ .0135625 .0016808 8.07 0.000 .0102682 .0168567 felon │ -.2994775 .105974 -2.83 0.005 -.5071826 -.0917723 alcohol │ .4473611 .1057353 4.23 0.000 .2401236 .6545985 drugs │ .2814605 .0978644 2.88 0.004 .0896499 .4732711 black │ .4537147 .0883037 5.14 0.000 .2806426 .6267867 married │ -.1515864 .1092454 -1.39 0.165 -.3657035 .0625307 educ │ -.0232984 .0194196 -1.20 0.230 -.0613601 .0147633 age │ -.0037246 .000525 -7.09 0.000 -.0047536 -.0026956 _cons │ -3.402094 .3010177 -11.30 0.000 -3.992077 -2.81211 ─────────────┼──────────────────────────────────────────────────────────────── /ln_p │ -.2158398 .0389149 -5.55 0.000 -.2921115 -.1395681 ─────────────┼──────────────────────────────────────────────────────────────── p │ .8058644 .0313601 .7466852 .8697338 1/p │ 1.240904 .0482896 1.149777 1.339252 ─────────────┴────────────────────────────────────────────────────────────────
> library(survival) > mf <- Surv(durat, fail) ~ workprg + priors + tserved + felon + + alcohol + drugs + black + married + educ + age > wf <- survreg(mf, data=recid, dist="weibull")
By default R reports coefficients in the accelerated failure time metric. I wrote a simple function to convert them to a proportional hazards metric, which you can download from this website as shown below. I also work with p = 1/scale.
> source("https://grodri.github.io/survival/aft2ph.R") > phfit <- aft2ph(wf); phfit term estimate std.error statistic p.value 1 (Intercept) -3.402093509 0.3010177314 -11.301970 1.283063e-29 2 workprg 0.090889277 0.0906478280 1.002664 3.160232e-01 3 priors 0.088786675 0.0134355280 6.608350 3.886262e-11 4 tserved 0.013562467 0.0016807811 8.069145 7.079190e-16 5 felon -0.299477494 0.1059739596 -2.825954 4.714009e-03 6 alcohol 0.447361071 0.1057353368 4.230951 2.327050e-05 7 drugs 0.281460523 0.0978643554 2.876027 4.027153e-03 8 black 0.453714680 0.0883036888 5.138117 2.775052e-07 9 married -0.151586401 0.1092454397 -1.387576 1.652661e-01 10 educ -0.023298411 0.0194196040 -1.199737 2.302416e-01 11 age -0.003724612 0.0005249981 -7.094526 1.297959e-12 12 log.p -0.215839833 0.0389148545 -5.546464 2.915049e-08
This reproduces Table 22.1 in Wooldridge. All but three of the predictors have significant coefficients, the exceptions being participation in a work program, marital status, and education.
To obtain hazard ratios we exponentiate the coefficients. We also report the Weibull parameter p, while R uses scale = 1/p.
> transmute(phfit, term, hazard.ratio = exp(estimate)) term hazard.ratio 1 (Intercept) 0.03330348 2 workprg 1.09514774 3 priors 1.09284750 4 tserved 1.01365485 5 felon 0.74120540 6 alcohol 1.56417898 7 drugs 1.32506369 8 black 1.57414880 9 married 0.85934363 10 educ 0.97697090 11 age 0.99628232 12 log.p 0.80586436 > exp(tail(phfit,1)$estimate) [1] 0.8058644
The Weibull parameter p is 0.8, indicating that the risk of recidivism declines over time, about 21% per month. The hypothesis that the risk is constant over time would be soundly rejected.
The exponentiated coefficient of drugs indicates that former inmates with a history of drug use have 32.5% higher risk or returning to jail at any given time than peers with identical characteristics but no history of drug use.
We can also work with the Weibull model in an accelerated failure
time framework, which we can do by simply adding the
time
option: which is in fact the
default in R. We’ll use the summary()
function. (An
alternative is to use tidy
to produce a data
frame.
. streg `predictors', distrib(weibull) time nolog Failure _d: fail Analysis time _t: durat Weibull AFT regression No. of subjects = 1,445 Number of obs = 1,445 No. of failures = 552 Time at risk = 80,013 LR chi2(10) = 165.48 Log likelihood = -1633.0325 Prob > chi2 = 0.0000 ─────────────┬──────────────────────────────────────────────────────────────── _t │ Coefficient Std. err. z P>|z| [95% conf. interval] ─────────────┼──────────────────────────────────────────────────────────────── workprg │ -.1127848 .1125346 -1.00 0.316 -.3333486 .107779 priors │ -.1101757 .0170675 -6.46 0.000 -.1436273 -.0767241 tserved │ -.0168297 .0021303 -7.90 0.000 -.021005 -.0126544 felon │ .3716227 .1319951 2.82 0.005 .112917 .6303284 alcohol │ -.555132 .1322427 -4.20 0.000 -.8143229 -.295941 drugs │ -.3492654 .1218801 -2.87 0.004 -.5881461 -.1103847 black │ -.5630162 .110817 -5.08 0.000 -.7802135 -.3458189 married │ .1881041 .1357519 1.39 0.166 -.0779647 .4541729 educ │ .0289111 .0241153 1.20 0.231 -.0183541 .0761763 age │ .0046219 .0006648 6.95 0.000 .0033189 .0059249 _cons │ 4.22167 .3413114 12.37 0.000 3.552712 4.890628 ─────────────┼──────────────────────────────────────────────────────────────── /ln_p │ -.2158398 .0389149 -5.55 0.000 -.2921115 -.1395681 ─────────────┼──────────────────────────────────────────────────────────────── p │ .8058644 .0313601 .7466852 .8697338 1/p │ 1.240904 .0482896 1.149777 1.339252 ─────────────┴────────────────────────────────────────────────────────────────
> summary(wf) Call: survreg(formula = mf, data = recid, dist = "weibull") Value Std. Error z p (Intercept) 4.221670 0.341311 12.37 < 2e-16 workprg -0.112785 0.112535 -1.00 0.3162 priors -0.110176 0.017067 -6.46 1.1e-10 tserved -0.016830 0.002130 -7.90 2.8e-15 felon 0.371623 0.131995 2.82 0.0049 alcohol -0.555132 0.132243 -4.20 2.7e-05 drugs -0.349265 0.121880 -2.87 0.0042 black -0.563016 0.110817 -5.08 3.8e-07 married 0.188104 0.135752 1.39 0.1659 educ 0.028911 0.024115 1.20 0.2306 age 0.004622 0.000665 6.95 3.6e-12 Log(scale) 0.215840 0.038915 5.55 2.9e-08 Scale= 1.24 Weibull distribution Loglik(model)= -3192.1 Loglik(intercept only)= -3274.8 Chisq= 165.48 on 10 degrees of freedom, p= 2.4e-30 Number of Newton-Raphson Iterations: 5 n= 1445
The substantive results are the same as before, which is not surprising given that it is really the same model. You may want to verity that the AFT parameters are exactly the same as the PH parameters with opposite sign and divided by p. For example the coefficient for drugs is -0.28/0.8 = -0.35.
The coefficients may be exponentiated, which allows interpreting them
in terms of time ratios. You can do this in
Stata using the tr
option. We will focus on one
coefficient.
> exp(coef(wf)) (Intercept) workprg priors tserved felon alcohol 68.1472035 0.8933429 0.8956767 0.9833111 1.4500858 0.5739965 drugs black married educ age 0.7052060 0.5694888 1.2069592 1.0293331 1.0046326
Exponentiating the drug coefficient that see that former inmates with a history of drug use spend 29% less time out of prison than peers with the same observed characteristics by no history of drug use. This is easier to see substracting one from the exponentiated coefficient
. di exp(_b[drugs]) - 1 -.29479404
> exp(coef(wf)["drugs"]) - 1 drugs -0.294794
We can also say that time outside of prison passes 42% faster for former inmates with a history of drug use than for those without, everything else being equal, which you can verify by changing signs and exponentiating:
. di exp(-_b[drugs]) - 1 .41802546
> exp(-coef(wf)["drugs"]) - 1 drugs 0.4180255
The two interpretations are, of course, equivalent.
The Weibull model allows the hazard to increase or decrease with time, but at a constant rate. Wooldridge notes that the log-normal distribution provides a better fit to the data. We can fit a log-normal model with the same tools, just by changing the distribution to “lognormal”.
. streg `predictors', distrib(lognormal) nolog Failure _d: fail Analysis time _t: durat Lognormal AFT regression No. of subjects = 1,445 Number of obs = 1,445 No. of failures = 552 Time at risk = 80,013 LR chi2(10) = 166.74 Log likelihood = -1597.059 Prob > chi2 = 0.0000 ─────────────┬──────────────────────────────────────────────────────────────── _t │ Coefficient Std. err. z P>|z| [95% conf. interval] ─────────────┼──────────────────────────────────────────────────────────────── workprg │ -.0625714 .1200369 -0.52 0.602 -.2978394 .1726965 priors │ -.1372528 .0214587 -6.40 0.000 -.179311 -.0951946 tserved │ -.0193305 .0029779 -6.49 0.000 -.0251671 -.0134939 felon │ .4439944 .1450865 3.06 0.002 .1596302 .7283586 alcohol │ -.6349088 .1442165 -4.40 0.000 -.9175681 -.3522496 drugs │ -.2981599 .1327355 -2.25 0.025 -.5583168 -.0380031 black │ -.5427175 .1174427 -4.62 0.000 -.772901 -.312534 married │ .3406835 .139843 2.44 0.015 .0665962 .6147707 educ │ .0229195 .0253974 0.90 0.367 -.0268584 .0726975 age │ .0039103 .0006062 6.45 0.000 .0027221 .0050984 _cons │ 4.099386 .3475349 11.80 0.000 3.41823 4.780542 ─────────────┼──────────────────────────────────────────────────────────────── /lnsigma │ .5935861 .0344122 17.25 0.000 .5261395 .6610327 ─────────────┼──────────────────────────────────────────────────────────────── sigma │ 1.810469 .0623022 1.692386 1.936791 ─────────────┴────────────────────────────────────────────────────────────────
We do not need to specify time
, as this distribution is
only available in the AFT metric.
> lnf <- survreg(mf, data=recid, dist="lognormal") > summary(lnf) Call: survreg(formula = mf, data = recid, dist = "lognormal") Value Std. Error z p (Intercept) 4.099386 0.347535 11.80 < 2e-16 workprg -0.062572 0.120037 -0.52 0.6022 priors -0.137253 0.021459 -6.40 1.6e-10 tserved -0.019331 0.002978 -6.49 8.5e-11 felon 0.443995 0.145087 3.06 0.0022 alcohol -0.634909 0.144217 -4.40 1.1e-05 drugs -0.298160 0.132736 -2.25 0.0247 black -0.542718 0.117443 -4.62 3.8e-06 married 0.340684 0.139843 2.44 0.0148 educ 0.022920 0.025397 0.90 0.3668 age 0.003910 0.000606 6.45 1.1e-10 Log(scale) 0.593586 0.034412 17.25 < 2e-16 Scale= 1.81 Log Normal distribution Loglik(model)= -3156.1 Loglik(intercept only)= -3239.5 Chisq= 166.74 on 10 degrees of freedom, p= 1.3e-30 Number of Newton-Raphson Iterations: 4 n= 1445
We see that the log-likelihood is indeed higher for the log-normal model, -1597.1 compared to -1633.0-3156.1 compared to -3192.1 for the Weibull, so we now have a better fit to the data.
You may be interested to know that Stata and R report different log-likelihoods. For example the log-normal log-likelihood is -1597.1 in Stata and -3156.1 in R. This is because R works with the distribution of time and Stata with the distribution of log(time). The difference turns out to be the Jacobian of the transformation, which is the sum of log(t) over all failures, or 1559.1. Differences between log-likelihood are not affected by this constant term.
Most of the effects are robust to the choice of distribution, but note that the protective effect of marriage is now significant. The coefficient for drugs, at -0.30, is smaller in magnituded and less significant than before.
Fitting a generalized gamma model leads to similar conclusions, except that the effect of drugs is no longer significant. This result suggest that there may be an interaction between drug history and duration, as the effect depends on how the hazard is specified. We will return to this issue later.
The generalized gama model is available in Stata
as part of streg
. not
available in R’s survival
package, but you will find it in
the package flexsurv
, which also allows fitting Gompertz
models.
Wooldridge, Jeffrey M. (2010). Econometric Analysis of Cross Section and Panel Data. 2nd Edition. Cambridge, Massachusetts: The MIT Press.
Chung, C-F, P. Schmidt and A.D. Witte (1991). “Survival Analysis: A Survey”. Journal of Quantitative Criminology 7:59-98.