Germán Rodríguez
Demographic Methods Princeton University

Cox’s Proportional Hazards Model

We continue our analysis of the leukemia remission times introduced in the context of the Kaplan-Meier estimator. This is the dataset used as an example in Cox’s original paper: Cox, D.R. (1972) Regression Models and Life tables, (with discussion) Journal of the Royal Statistical Society, 34: 187–220.

. use https://grodri.github.io/datasets/gehan, clear    
(Dataset used in Cox's 1972 paper, JRSS 34:187-220)

. stset weeks, failure(relapse)

Survival-time data settings

         Failure event: relapse!=0 & relapse<.
Observed time interval: (0, weeks]
     Exit on or before: failure

──────────────────────────────────────────────────────────────────────────
         42  total observations
          0  exclusions
──────────────────────────────────────────────────────────────────────────
         42  observations remaining, representing
         30  failures in single-record/single-failure data
        541  total analysis time at risk and under observation
                                                At risk from t =         0
                                     Earliest observed entry t =         0
                                          Last observed exit t =        35
> library(survival)
> library(dplyr)
> library(ggplot2)
> gehan <- read.table("https://grodri.github.io/datasets/gehan.dat")
> summarize(gehan, events = sum(relapse), exposure = sum(weeks))
  events exposure
1     30      541

We fit a model that assumes a proportional effect of treatment at all durations. There are different ways of handling ties and we choose the Efron method, which is an option in Stata and the default in R.

. // gen treated = group == "treated"
. stcox treated, efron

        Failure _d: relapse
  Analysis time _t: weeks

Iteration 0:   log likelihood =  -93.18427
Iteration 1:   log likelihood = -85.018612
Iteration 2:   log likelihood = -85.008426
Iteration 3:   log likelihood = -85.008425
Refining estimates:
Iteration 0:   log likelihood = -85.008425

Cox regression with Efron method for ties

No. of subjects =  42                                   Number of obs =     42
No. of failures =  30
Time at risk    = 541
                                                        LR chi2(1)    =  16.35
Log likelihood = -85.008425                             Prob > chi2   = 0.0001

─────────────┬────────────────────────────────────────────────────────────────
          _t │ Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
     treated │   .2076035    .085615    -3.81   0.000     .0925128    .4658729
─────────────┴────────────────────────────────────────────────────────────────
> gehan <- mutate(gehan, treated = as.numeric(group == "treated"))
> cm <- coxph(Surv(weeks, relapse) ~ treated, data = gehan)
> cm
Call:
coxph(formula = Surv(weeks, relapse) ~ treated, data = gehan)

           coef exp(coef) se(coef)      z        p
treated -1.5721    0.2076   0.4124 -3.812 0.000138

Likelihood ratio test=16.35  on 1 df, p=5.261e-05
n= 42, number of events= 30 
> exp(coef(cm)) - 1
   treated 
-0.7923965 

The results show that at any given duration since remission, the risk of relapse is 79% lower in the treated group.

To check the assumption of proportionality of hazards one may introduce interactions with duration. In a two-group analysis like this it is also possible to plot the Kaplan-Meier estimates. Here’s the plot in Cox’s paper

. predict S0, basesurv            // control (not mean!)

. gen S1 = S0^exp(_b[treated])    // treated

. sts gen KM = s, by(treated)     // two Kaplan-Meiers

. twoway (scatter S0  _t, c(J) ms(none) sort)  /// baseline
>   (scatter S1 _t , c(J) ms(none) sort)     /// treated
>   (scatter KM _t if treated, msymbol(circle_hollow)) /// KM treated
>   (scatter KM _t if !treated, msymbol(X)) /// KM base
> , legend(off) title(Kaplan-Meier and Proportional Hazards Estimates)   

. graph export coxkm.png, width(500) replace          
file coxkm.png saved as PNG format

> km <- survfit(Surv(weeks, relapse) ~ treated, data=gehan)
> sf <- survfit(cm, newdata=list(treated=c(0, 1)))   
> dsf <- data.frame(time = rep(c(0,sf$time), 2),
+   survival = c(1, sf$surv[,1], 1, sf$surv[,2]), 
+   group = factor(rep(c("control","treated"),
+   rep(length(sf$time) + 1, 2))))
> dkm <- data.frame(time = km$time, 
+   survival = km$surv,
+   group = factor(rep(c("control","treated"),
+   km$strata)))    
> ggplot(dsf, aes(time, survival, color = group)) + geom_step() +
+   geom_point(data = dkm, aes(time, survival, shape=group), color="black") +
+   scale_shape_manual(values = c(4, 1))  # x and o
> ggsave("coxkmr.png", width=500/72, height=400/72, dpi=72) 

Another way to check proportionality of hazards is to plot the log-log transformation of the survival functions versus log time, that is log(−log(S(t)) versus log(t), as the lines should then be parallel if the assumption holds.

. stphplot, by(treated) legend(off) title(Plot of log-log Survival)

        Failure _d: relapse
  Analysis time _t: weeks

. graph export coxphplot.png, width(500) replace
file coxphplot.png saved as PNG format

> dkm <- mutate(dkm,  lls = -log(-log(survival)))
> ggplot(dkm, aes(log(time), lls, color=group)) + geom_point() +
+   geom_line() +  ylab("-log(-log(S(t)))")
> ggsave("coxphplotr.png", width=500/72, height=400/72, dpi=72)

The two lines look quite parallel indeed, showing a good fit of the proportional hazards assumption.

A more detailed treatment of these topics may be found in my survival analysis course. The discussion here is an excerpt from the pages dealing with Kaplan-Meier and Cox regression.