Germán Rodríguez
Survival Analysis Princeton University

A Shared Frailty Model

We will show how to fit a shared frailty model using data on child mortality in Guatemala. The data come from a retrospective survey conducted in 1974-76 by the Instituto de Nutrición de Centroamérica y Panamá (INCAP) and RAND, and include data on multiple children per family.

The data were first analyzed ignoring the clustering by Pebley and Stupp (1987) Reproductive Patterns and Child Mortality in Guatemala. Demography,24:43-60 [1]. (This link requires access to JSTOR.)

The data were then reanalized by Guo and Rodríguez (1992), Estimating a Multivariate Proportional Hazards Model for Clustered Data Using the EM Algorithm, with an Application to Child Survival in Guatemala, Journal of the American Statistical Association, 87:969-976 [2], using what we now call a shared frailty model.

The dataset is available as a Stata file in long format (one record per child):

. use https://grodri.github.io/datasets/pebleystupp, clear
(Child mortality in Guatemala)

. describe

Contains data from https://grodri.github.io/datasets/pebleystupp.dta
 Observations:         3,120                  Child mortality in Guatemala
    Variables:            19                  8 Mar 2005 20:48
                                              (_dta has notes)
────────────────────────────────────────────────────────────────────────────────
Variable      Storage   Display    Value
    name         type    format    label      Variable label
────────────────────────────────────────────────────────────────────────────────
momid           float   %9.0g                 
kid             byte    %9.0g                 
kids            byte    %8.0g                 
time            float   %9.0g                 
income          float   %9.0g                 
death           byte    %8.0g                 
sex             byte    %8.0g                 
mage            float   %9.0g                 
borde           float   %9.0g                 
home            byte    %8.0g                 
pdead           byte    %8.0g                 
edyears         float   %9.0g                 
p0014           byte    %8.0g                 
p1523           byte    %8.0g                 
p2435           byte    %8.0g                 
p36up           byte    %8.0g                 
f0011           byte    %8.0g                 
f1223           byte    %8.0g                 
kidid           float   %9.0g                 
────────────────────────────────────────────────────────────────────────────────
Sorted by: momid  kid

A Piece-wise Exponential Model

The first step is to reproduce exactly the results in Pebley and Stupp. We start by stseting the data, noting that time’ is age at death or at interview, death is an indicator of whether the child died, and kidid’ uniquely identifies each kid. (This is 10 times the momid in the original dataset plus the kid number.)

. stset time, fail(death) id(kidid)

Survival-time data settings

           ID variable: kidid
         Failure event: death!=0 & death<.
Observed time interval: (time[_n-1], time]
     Exit on or before: failure

──────────────────────────────────────────────────────────────────────────
      3,120  total observations
          0  exclusions
──────────────────────────────────────────────────────────────────────────
      3,120  observations remaining, representing
      3,120  subjects
        403  failures in single-failure-per-subject data
  131,512.5  total analysis time at risk and under observation
                                                At risk from t =         0
                                     Earliest observed entry t =         0
                                          Last observed exit t =        60

Pebley and Stupp used a piece-wise exponential model where the hazard is constant in intervals with boundaries at 1, 6, 12 and 24 months. We can fit that kind of model in Stata using the standard Poisson trick, or using streg’ to fit an exponential model with dummy variables for the duration categories. We will do the latter because it allows us to introduce gamma frailty later. So we stsplit the dataset and generate dummies for all categories. (Note that the original paper shows the hazard for each age, so we generate dummies for all age categories and omit the constant.)

. stsplit dur, at(0 1 6 12 24 60)
(10,474 observations (episodes) created)

. gen a0      = dur==0

. gen a1to5   = dur==1

. gen a6to11  = dur==6

. gen a12to23 = dur==12

. gen a24up   = dur==24

. local age "a0 a1to5 a6to11 a12to23 a24up"

The model includes mother’s age and age squared, a linear term on birth order, and an indicator of whether the previous kid in the family died. There are also indicators for the length of the previous birth interval with categories 0-14, 15-23, 24-35 and 36 or more, with first births serving as the reference cell. We need to compute only one of these variables. It is useful to create macros for all:

. gen mage2 = mage^2

. local fixed "mage mage2 borde pdead"

. local previous "p0014 p1523 p2435 p36up"

The model also includes a time-varying covariate with time-varying effects. The variable in question is the length of the following birth interval, with indicators f0011’ for intervals under a year and f1223’ for intervals of one to two years. These dummies capture very short and short intervals, where a child would face competition from a subsequent sibling. Pebley and Stupp treat these as time-varying covariates because a sibling is assumed to affect the life of the index child only after it is born. So we consider very short intervals (< 12) only at ages 12 months and higher, and short intervals (12-23 months) only at ages 24 months and higher. This is the time-varying part. But the effect of very short intervals is also allowed to vary, with intervals < 12 having different effects at ages 12-23 and 24 or more. This means that we need to create three variables that interact the length of the following interval with the age of the child:

. gen i011a1223 = f0011 * (dur == 12)

. gen i011a24p  = f0011 * (dur == 24)

. gen i1223a24p = f1223 * (dur == 24)

. local follow "i011a1223 i011a24p i1223a24p"

We are now ready to fit the model. You should verify that this reproduces exactly the results in Pebley and Stupp, which is quite remarkable.

. streg `age' `fixed' `previous' `follow', dist(exponential) noconstant

        Failure _d: death
  Analysis time _t: time
       ID variable: kidid

Iteration 0:   log likelihood =  -2819.899  
Iteration 1:   log likelihood = -2115.6375  
Iteration 2:   log likelihood = -1864.5861  
Iteration 3:   log likelihood = -1849.9913  
Iteration 4:   log likelihood = -1848.7872  
Iteration 5:   log likelihood = -1848.7838  
Iteration 6:   log likelihood = -1848.7838  

Exponential PH regression

No. of subjects =     3,120                            Number of obs =  13,594
No. of failures =       403
Time at risk    = 131,512.5
                                                       Wald chi2(16) = 9905.61
Log likelihood = -1848.7838                            Prob > chi2   =  0.0000

─────────────┬────────────────────────────────────────────────────────────────
          _t │ Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
          a0 │   .3375566   .2532262    -1.45   0.148     .0775884    1.468575
       a1to5 │   .0245332   .0186095    -4.89   0.000     .0055473    .1084995
      a6to11 │   .0303346   .0229222    -4.63   0.000     .0068981    .1333968
     a12to23 │   .0180935   .0136752    -5.31   0.000     .0041132    .0795915
       a24up │   .0032229   .0025034    -7.39   0.000     .0007032    .0147712
        mage │   .8612339   .0501274    -2.57   0.010     .7683828    .9653051
       mage2 │   1.002581    .001035     2.50   0.013     1.000555    1.004612
       borde │   1.063634   .0355569     1.85   0.065     .9961777    1.135658
       pdead │   1.103248   .1649263     0.66   0.511     .8230493    1.478839
       p0014 │    1.71365   .3640741     2.54   0.011     1.130004     2.59875
       p1523 │   .8845005   .1648386    -0.66   0.510     .6138541    1.274474
       p2435 │   .7718101   .1424211    -1.40   0.160     .5375754    1.108107
       p36up │   .6764905    .141174    -1.87   0.061     .4493947    1.018346
   i011a1223 │   2.246906   1.609712     1.13   0.258     .5517892    9.149482
    i011a24p │   4.933551   3.633516     2.17   0.030     1.164816    20.89594
   i1223a24p │    1.07556   .4053648     0.19   0.847     .5138396    2.251343
─────────────┴────────────────────────────────────────────────────────────────
Note: No constant term was estimated in the main equation.

. estimates store pwe

A Shared-Frailty Model

We now introduce a shared frailty term at the mother’s level to allow for intra-family correlation in child survival. Following standard practice, we assume that the frailty term has a gamma distribution independent of observed covariates. Guo and Rodriguez derive an EM algorithm to fit this model (and Guo’s dissertation includes the Fortran program used to fit the model). I now show how to reproduce their results using streg. All we need to do is specify the frailty distribution and the variable that identifies the clusters:

. streg `age' `fixed' `previous' `follow', dist(exponential) noconstant ///
>     frailty(gamma) shared(momid)

        Failure _d: death
  Analysis time _t: time
       ID variable: kidid

Fitting exponential model ...

Fitting full model:
Iteration 0:   log likelihood = -1864.1367  
Iteration 1:   log likelihood = -1848.4841  
Iteration 2:   log likelihood = -1847.2191  
Iteration 3:   log likelihood =  -1847.139  
Iteration 4:   log likelihood = -1847.1382  
Iteration 5:   log likelihood = -1847.1382  

Exponential PH regression
Gamma shared frailty                               Number of obs     =  13,594
Group variable: momid                              Number of groups  =     851
                                                   Obs per group:   
No. of subjects =     3,120                                      min =       1
No. of failures =       403                                      avg =      16
Time at risk    = 131,512.5                                      max =      40
                                                   Wald chi2(16)     = 8796.54
Log likelihood = -1847.1382                        Prob > chi2       =  0.0000

─────────────┬────────────────────────────────────────────────────────────────
          _t │ Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
          a0 │   .3710262   .2874212    -1.28   0.201     .0812846    1.693562
       a1to5 │   .0271691   .0212782    -4.60   0.000     .0058538    .1261004
      a6to11 │   .0337345   .0263341    -4.34   0.000     .0073047    .1557928
     a12to23 │   .0202894   .0158516    -4.99   0.000     .0043878    .0938193
       a24up │   .0036443   .0029231    -7.00   0.000     .0007566    .0175539
        mage │   .8559554   .0514063    -2.59   0.010     .7609048    .9628796
       mage2 │   1.002689   .0010689     2.52   0.012     1.000596    1.004786
       borde │   1.059148   .0376965     1.61   0.106     .9877825     1.13567
       pdead │   .9272842   .1655893    -0.42   0.672     .6534463    1.315878
       p0014 │     1.7738   .3859226     2.63   0.008     1.158004    2.717059
       p1523 │   .9075818   .1718846    -0.51   0.609     .6261508    1.315505
       p2435 │   .7960679    .149776    -1.21   0.225     .5505554    1.151063
       p36up │   .6898139   .1463948    -1.75   0.080     .4550791    1.045628
   i011a1223 │   2.210407   1.593665     1.10   0.271     .5379858    9.081837
    i011a24p │   4.959532   3.677954     2.16   0.031       1.1593    21.21708
   i1223a24p │   1.076996   .4067595     0.20   0.844     .5137269    2.257852
─────────────┼────────────────────────────────────────────────────────────────
    /lntheta │  -1.540714    .632364    -2.44   0.015    -2.780125   -.3013038
─────────────┼────────────────────────────────────────────────────────────────
       theta │    .214228   .1354701                      .0620307     .739853
─────────────┴────────────────────────────────────────────────────────────────
Note: Estimates are transformed only in the first equation to hazard ratios.
LR test of theta=0: chibar2(01) = 3.29                 Prob >= chibar2 = 0.035
Note: No constant term was estimated in the main equation.

. estimates store gamma

You may verify that the results agree quite closely with the paper. (The labels for the complex interaction terms involving the following birth interval are out of order in the published paper, but OK in Guo’s dissertation.)

The estimated variance of frailty of 0.21 (or 0.22 in the paper) implies modest association between the lifetimes of siblings. Following Oakes, it corresponds to a rank correlation of 0.09. Even this small association, however, translates into substantial hazard ratios in the sense of Clayton. The conditional hazard for a child knowing that a sibling died at age a is 22% higher than if the sibling survived to age a, holding all covariates constant. Knowing that two siblings died raises the hazard by 44%. Note that to test the significance of the variance we have to be careful because the test is on a boundary of the parameter space. In the JASA paper we obtain a t-ratio of 1.6 for the variance, which agrees with Stata (0.214/0.135 = 1.6, although working in the log-scale would be better). Stata’s likelihood ratio test gives a chi-squared of 3.3 with an estimated p-value of 0.035, essentially the same result.

. estimates table pwe gamma, t eform

─────────────┬──────────────────────────
    Variable │    pwe         gamma     
─────────────┼──────────────────────────
_t           │
          a0 │  .33755656    .37102625  
             │      -1.45        -1.28  
       a1to5 │  .02453322    .02716914  
             │      -4.89        -4.60  
      a6to11 │  .03033465    .03373445  
             │      -4.63        -4.34  
     a12to23 │  .01809347    .02028935  
             │      -5.31        -4.99  
       a24up │  .00322289    .00364434  
             │      -7.39        -7.00  
        mage │  .86123393    .85595543  
             │      -2.57        -2.59  
       mage2 │  1.0025814    1.0026886  
             │       2.50         2.52  
       borde │  1.0636339    1.0591481  
             │       1.85         1.61  
       pdead │  1.1032484    .92728419  
             │       0.66        -0.42  
       p0014 │  1.7136501    1.7737999  
             │       2.54         2.63  
       p1523 │  .88450054    .90758181  
             │      -0.66        -0.51  
       p2435 │  .77181013    .79606791  
             │      -1.40        -1.21  
       p36up │  .67649048    .68981395  
             │      -1.87        -1.75  
   i011a1223 │  2.2469057    2.2104071  
             │       1.13         1.10  
    i011a24p │  4.9335514    4.9595321  
             │       2.17         2.16  
   i1223a24p │  1.0755599    1.0769956  
             │       0.19         0.20  
─────────────┼──────────────────────────
    /lntheta │                 .214228  
             │                   -2.44  
─────────────┴──────────────────────────
                             Legend: b/t

The other aspect of interest in the results is that the estimates of the covariate effects are remarkably stable. The one change worth mentioning is the coefficient for ‘previous child died’, which changes sign. This variable was clearly acting as a proxy for unobserved family effects. Note also that the t-ratios for observed covariate effects have generally declined in magnitude, which is consistent with the notion that correlated observations have less information than independent observations.

Guo and Rodríguez also fit a non-parametric frailty model using a two-point distribution. This distribution is not built into Stata, but the model can be fit in Stata with only modest effort by following the method described in Section 3.2 of the paper. Details are left as an exercise for the reader.

We continue our analysis of this dataset with a discussion of model interpretation, including calculation of predicted risks and probabilities.