---
title: Cox Regression
author: Germán Rodríguez
date: 12 December 2017
---

::: {.container}

We continue our analysis of the Gehan data by fitting a proportional hazards
model. This is the same dataset used as an example in Cox's original paper:
Cox, D.R. (1972). Regression Models and Life-Tables (with discussion).
*Journal of the Royal Statistical Society, Series B*, __34__: 187–220.

```r
# set up a couple of libraries and a function to ggfy generic plots
library(dplyr)
library(ggplot2)
ggfy <- function(obj, ...) {
    p <- par(c("mar", "mgp", "tck", "cex"))
    par(mar = c(3, 3, 2, 1), mgp = c(2, .7, 0), tck = -.01, cex = 0.7)
    plot(obj, ...)
    u <- par("usr")
    rect(u[1], u[3], u[2], u[4], col = "#ebebeb", border = "#ebebeb")
    grid(lty = 1, col = "white")
    par(new = TRUE)
    plot(obj, axes = FALSE, ...)
    par(p)
    invisible(NULL)
}
```

The first task is to read [and `stset`]{.stata} the data. I also create a
dummy variable for `treated`.

```s
infile group weeks relapse using ///
    https://grodri.github.io/datasets/gehan.raw, clear
gen treated = group == 2
stset weeks, failure(relapse)
```

```r
gehan <- read.table("https://grodri.github.io/datasets/gehan.dat")
names(gehan)
summarize(gehan, events = sum(relapse), exposure = sum(weeks))
gehan <- mutate(gehan, treated = as.numeric(group == "treated"))
```

Here's a run fitting a Cox model with all the defaults:

```s
stcox treated
di _b[treated]
```

```r
library(survival)
cm <- coxph(Surv(weeks, relapse) ~ treated, data = gehan)
cm
```

[Stata reports hazard ratios unless you specify the option `nohr`.]{.stata}
[R reports log-relative risks, but also exponentiates the coefficients to
obtain hazard ratios.]{.r} We see that the treatment reduced the risk of
relapse by almost 80% at any duration.

## Handling Ties

There are various options for handling ties. Cox's original proposal relies
on the discrete partial likelihood. A closely-related alternative due to
Kalbfleisch and Prentice uses the marginal likelihood of the ranks. Both
methods are computationally intensive. A good fast approximation is due to
Efron, and a simpler and faster, though somewhat less accurate, method is due
to Breslow and Peto. See the notes for details.

In terms of our software, [Stata implements all four using the options
`exactp`, `exactm`, `efron` and `breslow`. The default is `breslow`, but I
recommend you always use `efron`.]{.stata} [R implements all but the marginal
likelihood, using the argument `ties` with possible values "efron",
"breslow", and "exact".]{.r} Let us compare them all.

```s
estimates store breslow
quietly stcox treated, efron
estimates store efron
quietly stcox treated, exactm
estimates store exactm
quietly stcox treated, exactp
estimates store exactp
estimates table breslow efron exactm exactp
```

```r
cmb <- coxph(Surv(weeks, relapse) ~ treated, data = gehan, ties = "breslow")
cmp <- coxph(Surv(weeks, relapse) ~ treated, data = gehan, ties = "exact")
data.frame(breslow = coef(cmb), efron = coef(cm), exact = coef(cmp))
```

As you can see, Efron's approximation is closer to the exact partial
likelihood than Breslow's. [The marginal likelihood is even closer.]{.stata}
Cox reported an estimate of -1.65 in his paper, which he obtained by
evaluating the log-likelihood in a grid of points. The more exact
calculations here yield -1.63, so he did pretty well by hand (see page 197 in
the paper).

## Proportionality of Hazards

One way to test proportionality of hazards is to introduce interactions with
duration. In his original paper Cox tried a linear interaction with time. We
will do the same, except that he worked with *t - 10* to achieve more
orthogonality and we will use *t*.
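[For R users, one direct way to add such an interaction, without the data
splitting used later in this section, is the `tt` argument of `coxph()`. The
sketch below is an aside rather than part of the original analysis; it
assumes the `gehan` data frame and the `treated` dummy created above, and the
object name `cm_tt` is just for illustration.]{.r}

```r
library(survival)
# the tt(treated) term is evaluated as treated * t at each event time, so its
# coefficient measures a linear change in the log hazard ratio over time
cm_tt <- coxph(Surv(weeks, relapse) ~ treated + tt(treated), data = gehan,
               tt = function(x, t, ...) x * t)
summary(cm_tt)
```

[This should essentially reproduce the interaction test obtained below by
splitting the data at the failure times.]{.r}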
### Stata's tvc() and texp()

Stata makes it very easy to introduce interactions with time by providing two
options:

1. `tvc(varlist)`, to specify the variable(s) we want to interact with time,
   and
1. `texp(expression)`, to specify a function of time `_t`, typically just
   time, `texp(_t)`, or log-time, `texp(log(_t))`.

Stata will then create a variable equal to the product of the variable
specified in `tvc()` by the time expression specified in `texp()` and add it
to the model. Let us use this technique to interact treatment and time:

```s
stcox treated, tvc(treated) texp(_t) efron
```

We see that there's no evidence that the treatment effect changes linearly
with time. By the way, we didn't really have to specify `texp(_t)` because
that's the default.

Another possibility is to allow different treatment effects at early and late
durations, say before and after 10 weeks. This is easily done by changing the
time expression:

```s
stcox treated, tvc(treated) texp(_t > 10) efron
```

The point estimates indicate a 73% reduction in risk in the first ten weeks
and an additional 44% reduction after ten weeks, for a total of 85% in the
later period. However, the difference in treatment effects between the two
periods is not significant.
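[R does not have `tvc()` and `texp()`, but the same before-and-after-ten-weeks
comparison can be sketched with the `tt` argument of `coxph()`. This is again
an aside; the document's own R solution, splitting the data, follows in the
next section.]{.r}

```r
library(survival)
# interact treatment with an indicator for time > 10 weeks; the tt() term
# contributes only to risk sets at event times after week 10
cm_tt10 <- coxph(Surv(weeks, relapse) ~ treated + tt(treated), data = gehan,
                 tt = function(x, t, ...) x * (t > 10))
summary(cm_tt10)
```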
### Splitting at Failures

Because only times with observed failures contribute to the partial
likelihood, we can introduce arbitrary interactions by splitting the data at
each failure time. As a sanity check, we verify that we obtain the same
estimate as before.

```s
gen id = _n              // need an id variable to split
streset, id(id)
stsplit, at(failures)
quietly stcox treated, efron
di _b[treated]
estimates store ph
```

```r
failure_times <- sort(unique(gehan$weeks[gehan$relapse == 1]))
gehanx <- survSplit(gehan, cut = failure_times, event = "relapse",
    start = "t0", end = "weeks")
coef(coxph(Surv(t0, weeks, relapse) ~ treated, data = gehanx))
```

We now introduce a linear interaction with time using our dummy variable for
`treated`. [(You could specify the model as `group * t0`. R will omit `t0`
because it is implicit in the baseline hazard and complain that the model
matrix is singular, but the results will be correct. My approach is a bit
cleaner.)]{.r}

```s
stcox treated c.treated#c._t, efron
lrtest ph .
stjoin // back to normal
```

```r
cmx <- coxph(Surv(t0, weeks, relapse) ~ treated + treated:weeks, data = gehanx)
summary(cmx)
c(logLik(cmx), logLik(cm))
```

We get a Wald test for the interaction term of z = 0.01, and twice the
difference in log-likelihoods is 0.00, so clearly there is no evidence of an
interaction between treatment and time at risk. [Note that these are exactly
the same results we got with `tvc()` and `texp()`.]{.stata}

### Split at 10

As an alternative we could allow different treatment effects before and after
10 weeks. We could use the current dataset, but all we really need is to
split at 10, so we'll do just that:

```s
stsplit dur, at(10)
quietly stcox treated, efron // so we can do lrtest
estimates store ph
gen after10 = dur == 10
stcox treated c.treated#c.after10, efron
lrtest ph .
drop dur after10
stjoin
```

```r
gehan10 <- survSplit(gehan, cut = 10, event = "relapse",
        start = "t0", end = "weeks") %>%
    mutate(after10 = as.numeric(t0 == 10),
        treated = as.numeric(group == "treated"))
cm10 <- coxph(Surv(t0, weeks, relapse) ~ treated + treated:after10,
    data = gehan10)
summary(cm10)
chisq <- 2 * (logLik(cm10) - logLik(cm)); chisq
```

The Wald test now yields z = -0.73 (a chi-squared of 0.53), and the
likelihood ratio test concurs, with a chi-squared of 0.54 on one d.f. The
risk ratio is a bit larger after 10 weeks, but the difference is still not
significant. [Note that these are exactly the same results we got with
`tvc()` and `texp()`.]{.stata}

## Schoenfeld Residuals

Another way to check for proportionality of hazards is to use Schoenfeld
residuals (and their scaled counterparts). You can obtain an overall test
using the Schoenfeld residuals, or a variable-by-variable test based on the
scaled variant. In this case with just one predictor there is only one test,
but we'll see later an example with several predictors.

Stata and R offer several possible transformations of time for the test,
including a user-specified function, but they choose different defaults. In
Stata the default is time, but one of the options is `km`, for the
Kaplan-Meier estimate of overall survival. In R the default transform is
"km", for the K-M estimate, but one of the options is "identity".

```s
quietly stcox treated, efron
estat phtest
```

```r
zph <- cox.zph(cm, transform = "identity")
zph
```

The test shows that there is no evidence against the proportional hazards
assumption.
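[As a quick aside (not part of the original analysis), we can also check that
this conclusion does not hinge on the choice of time transform by rerunning
the test with R's default Kaplan-Meier transform, assuming the `cm` fit from
above.]{.r}

```r
# Schoenfeld residual test using the K-M transform of time (R's default)
cox.zph(cm, transform = "km")
```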
If there had been, we could get a hint of the nature of the time dependence
by plotting the (scaled) residuals against time and using a smoother to glean
the trend, if any. [In R the `cox.zph` class has a `plot()` method which uses
a spline smoother. I specified `df=2` because of the small sample size.]{.r}

```s
estat phtest, plot(treated)
graph export phplot.png, width(500) replace
```

```r
png("phplotr.png", width=500, height=400)
ggfy(zph, df=2)
dev.off()
```

![](phplot.png){.img-responsive .img-center .stata}
![](phplotr.png){.img-responsive .img-center .r}

The residuals show no time trend at all, indicating that the treatment hazard
ratio is fairly constant over time. (We will confirm this result below with a
plot of cumulative hazards that provides more direct evidence.)

## Baseline Survival

The emphasis in the Cox model is on hazard ratios, but one can calculate a
Kaplan-Meier or a Nelson-Aalen estimate of the baseline survival, as shown in
the notes. The baseline is defined as the case where all covariate values are
zero, and this may not make sense in your data. A popular alternative is to
estimate the "baseline" at average values of all covariates. In our case a
much better approach is to estimate and plot the estimated survival functions
for the two groups. [Stata makes this very easy via the `stcurve`
command.]{.stata}

```s
stcurve, surv at(treated=0) at(treated=1)
```

It is instructive to compute these "by hand" and compare them with separate
Kaplan-Meier estimates for each group, which I will plot using different
symbols for treated and controls. The plots connect the point estimates using
a step function.

```s
predict S0, basesurv           // control (not mean!)
gen S1 = S0^exp(_b[treated])   // treated
sts gen KM = s, by(treated)    // two Kaplan-Meiers
twoway (scatter S0 _t, c(J) ms(none) sort)                 /// baseline
       (scatter S1 _t, c(J) ms(none) sort)                 /// treated
       (scatter KM _t if treated, msymbol(circle_hollow))  /// KM treated
       (scatter KM _t if !treated, msymbol(X))             /// KM base
     , legend(off)                                         ///
       title(Kaplan-Meier and Proportional Hazards Estimates)
graph export coxkm.png, width(500) replace
```

```r
sf <- survfit(cm, newdata = list(treated = c(1, 0)))
km <- survfit(Surv(weeks, relapse) ~ treated, data = gehan)
dsf <- data.frame(time = rep(c(0, sf$time), 2),
    survival = c(1, sf$surv[, 1], 1, sf$surv[, 2]),
    group = factor(rep(c("treated", "control"),
        rep(length(sf$time) + 1, 2))))
dkm <- data.frame(time = km$time, survival = km$surv,
    group = factor(rep(c("treated", "control"), km$strata)))
ggplot(dsf, aes(time, survival, color = group)) + geom_step() +
    geom_point(data = dkm, aes(time, survival, shape = group), color = "black") +
    scale_shape_manual(values = c(1, 4))
ggsave("coxkmr.png", width = 500/72, height = 400/72, dpi = 72)
```

![](coxkm.png){.img-responsive .img-center .stata}
![](coxkmr.png){.img-responsive .img-center .r}

The figure looks just like Figure 1 in Cox's paper. If the purpose of the
graph is to check the proportional hazards assumption, a much better
alternative is to plot the log-log transformation of the survival function,
namely -log(-log(S(t))), against log(t) for each group. Under the
proportional hazards assumption the resulting curves should be parallel. This
plot is useful because the eye is much better at judging whether curves are
parallel than whether they are proportional.
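[Before building that plot with ggplot2, note that a quick base-R version is
available through the `cloglog` option of `plot.survfit()`. The sketch below
is an aside, assuming the `gehan` data frame from above; it plots
log(-log(S(t))), the negative of the transform used next, which does not
affect the check for parallelism.]{.r}

```r
library(survival)
# Kaplan-Meier fits by group, as in the km object above
km <- survfit(Surv(weeks, relapse) ~ treated, data = gehan)
# complementary log-log plot: log(-log(S(t))) against time on a log scale;
# under proportional hazards the two curves should be roughly parallel
plot(km, fun = "cloglog", lty = 1:2,
     xlab = "weeks (log scale)", ylab = "log(-log(S(t)))")
legend("topleft", legend = c("control", "treated"), lty = 1:2, bty = "n")
```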
```s
stphplot, by(treated) legend(off) title(Plot of -log(-log(S(t))))
graph export coxphplot.png, width(500) replace
```

```r
dkm <- mutate(dkm, lls = -log(-log(survival)))
ggplot(dkm, aes(log(time), lls, color = group)) + geom_point() +
    geom_line() + ylab("-log(-log(S(t)))")
ggsave("coxphplotr.png", width = 500/72, height = 400/72, dpi = 72)
```

![](coxphplot.png){.img-responsive .img-center .stata}
![](coxphplotr.png){.img-responsive .img-center .r}

The two lines look quite parallel indeed.

:::