We’ll illustrate the Kaplan-Meier estimator with the classic dataset used by Cox in his seminal paper on proportional hazard models. The data shows the length of remission in weeks for two groups of leukemia patients, treated and controls.
We start by reading the data, available in the datasets section as a
plain text file gehan.dat and a Stata file
gehan.dta.
. use https://grodri.github.io/datasets/gehan, clear
(Dataset used in Cox's 1972 paper, JRSS 34:187-220)
. stset weeks, failure(relapse) // define as survival data
Survival-time data settings
Failure event: relapse!=0 & relapse<.
Observed time interval: (0, weeks]
Exit on or before: failure
──────────────────────────────────────────────────────────────────────────
42 total observations
0 exclusions
──────────────────────────────────────────────────────────────────────────
42 observations remaining, representing
30 failures in single-record/single-failure data
541 total analysis time at risk and under observation
At risk from t = 0
Earliest observed entry t = 0
Last observed exit t = 35
> library(survival)
> library(dplyr)
> library(ggplot2)
> gehan <- read.table("https://grodri.github.io/datasets/gehan.dat")
> summarize(gehan, events = sum(relapse), exposure = sum(weeks))
events exposure
1 30 541
As you can see, 30 of the 42 patients had a relapse, with a total exposure time of 541 weeks. The weekly relapse rate is 7.8%.
Next we compute and plot the Kaplan-Meir estimator of the survival function in each of the two groups.
. sts graph, by(group)
Failure _d: relapse
Analysis time _t: weeks
. graph export kmg.png, width(500) replace
file kmg.png saved as PNG format

> kmg <- survfit(Surv(weeks, relapse) ~ group, data=gehan)
> #plot(kmg)
> a = 1:kmg$strata[1]
> kmdf <- data.frame(group = rep(c("control", "treated"), kmg$strata+1),
+ time = c(0, kmg$time[a], 0, kmg$time[-a]),
+ surv = c(1, kmg$surv[a], 1, kmg$surv[-a])) %>% group_by(group)
> ggplot(kmdf, aes(time, surv, color=group)) + geom_step()
> ggsave("kmgr.png", width=500/72, height=400/72, dpi=72)

The graph shows that after 23 weeks all patients in the control group had relapsed, but about half those in the treated group remained in remission. We can list the survival function.
. sts list, by(group)
Failure _d: relapse
Analysis time _t: weeks
Kaplan–Meier survivor function
By variable: group
At Net Survivor Std.
Time risk Fail lost function error [95% conf. int.]
────────────────────────────────────────────────────────────────────────
control
1 21 2 0 0.9048 0.0641 0.6700 0.9753
2 19 2 0 0.8095 0.0857 0.5689 0.9239
3 17 1 0 0.7619 0.0929 0.5194 0.8933
4 16 2 0 0.6667 0.1029 0.4254 0.8250
5 14 2 0 0.5714 0.1080 0.3380 0.7492
8 12 4 0 0.3810 0.1060 0.1831 0.5778
11 8 2 0 0.2857 0.0986 0.1166 0.4818
12 6 2 0 0.1905 0.0857 0.0595 0.3774
15 4 1 0 0.1429 0.0764 0.0357 0.3212
17 3 1 0 0.0952 0.0641 0.0163 0.2612
22 2 1 0 0.0476 0.0465 0.0033 0.1970
23 1 1 0 0.0000 . . .
treated
6 21 3 1 0.8571 0.0764 0.6197 0.9516
7 17 1 0 0.8067 0.0869 0.5631 0.9228
9 16 0 1 0.8067 0.0869 0.5631 0.9228
10 15 1 1 0.7529 0.0963 0.5032 0.8894
11 13 0 1 0.7529 0.0963 0.5032 0.8894
13 12 1 0 0.6902 0.1068 0.4316 0.8491
16 11 1 0 0.6275 0.1141 0.3675 0.8049
17 10 0 1 0.6275 0.1141 0.3675 0.8049
19 9 0 1 0.6275 0.1141 0.3675 0.8049
20 8 0 1 0.6275 0.1141 0.3675 0.8049
22 7 1 0 0.5378 0.1282 0.2678 0.7468
23 6 1 0 0.4482 0.1346 0.1881 0.6801
25 5 0 1 0.4482 0.1346 0.1881 0.6801
32 4 0 2 0.4482 0.1346 0.1881 0.6801
34 2 0 1 0.4482 0.1346 0.1881 0.6801
35 1 0 1 0.4482 0.1346 0.1881 0.6801
────────────────────────────────────────────────────────────────────────
Note: Net lost equals the number lost minus the number who entered.
> summary(kmg)
Call: survfit(formula = Surv(weeks, relapse) ~ group, data = gehan)
group=control
time n.risk n.event survival std.err lower 95% CI upper 95% CI
1 21 2 0.9048 0.0641 0.78754 1.000
2 19 2 0.8095 0.0857 0.65785 0.996
3 17 1 0.7619 0.0929 0.59988 0.968
4 16 2 0.6667 0.1029 0.49268 0.902
5 14 2 0.5714 0.1080 0.39455 0.828
8 12 4 0.3810 0.1060 0.22085 0.657
11 8 2 0.2857 0.0986 0.14529 0.562
12 6 2 0.1905 0.0857 0.07887 0.460
15 4 1 0.1429 0.0764 0.05011 0.407
17 3 1 0.0952 0.0641 0.02549 0.356
22 2 1 0.0476 0.0465 0.00703 0.322
23 1 1 0.0000 NaN NA NA
group=treated
time n.risk n.event survival std.err lower 95% CI upper 95% CI
6 21 3 0.857 0.0764 0.720 1.000
7 17 1 0.807 0.0869 0.653 0.996
10 15 1 0.753 0.0963 0.586 0.968
13 12 1 0.690 0.1068 0.510 0.935
16 11 1 0.627 0.1141 0.439 0.896
22 7 1 0.538 0.1282 0.337 0.858
23 6 1 0.448 0.1346 0.249 0.807
The convention is to report the survival function immediately after each time. In the control group there are no censored observations, and the Kaplan-Meier estimate is simply the proportion alive after each distinct failure time. You can also check that the standard error is the usual binomial estimate. For example just after 8 weeks there are 8 out of 21 alive, the proportion is 8/21 = 0.381 and the standard error is v0.381(1 - 0.381)/21) = 0.106.
In the treated group 12 cases are censored and 9 relapse. Can you compute the estimate by hand? The distinct times of death are 6, 7, 10, 13, 16, 22 and 23. The counts of relapses are 3, 1, 1, 1, 1, 1, 1. When there are ties between event and censoring times it is customary to assume that the event occurred first; that is, observations censored at t are assumed to be exposed at that time, effectively censored just after t. The counts of censored observations after each death time (but before the next) are 1, 1, 2, 0, 3, 0 and 5. When the last observation is censored the K-M estimate is greater than zero, and is usually considered undefined after that.