Germán Rodríguez
Demographic Methods Princeton University

Rates and Standardization

We will work through the example in Preston et al., sections 2.2 and 2.3.

Sample Data

I copied the counts of mid-year population and deaths by Age for Sweden and Kazakhstan from Table 2.1 into a text file that is available in the datasets section. The file is in “long” format. As usual, I show the analysis in R and Stata.

. infile str10 country str5 ageg pop deaths ///
>   using https://grodri.github.io/datasets/preston21long.dat, clear
(38 observations read)

. list in 1/3

     ┌─────────────────────────────────────┐
     │    country   ageg      pop   deaths │
     ├─────────────────────────────────────┤
  1. │ Kazakhstan      0   174078     3720 │
  2. │ Kazakhstan   01-4   754758     1220 │
  3. │ Kazakhstan   05-9   879129      396 │
     └─────────────────────────────────────┘
> p21 <- read.table("https://grodri.github.io/datasets/preston21long.dat",
+   header=FALSE, col.names = c("country", "ageg", "pop", "deaths"))
> head(p21, 3)
     country ageg    pop deaths
1 Kazakhstan    0 174078   3720
2 Kazakhstan 01-4 754758   1220
3 Kazakhstan 05-9 879129    396

The first thing to do is to calculate the age-specific rates, dividing deaths by population and multiplying by 1000:

. gen rates = 1000 * deaths / pop
> library(dplyr)
> p21 <- mutate(p21, rates = 1000 * deaths / pop)

Crude death rates are just a weighted average of age-specific rates using the population in each age group as the weight. We can easily compute them like this:

. tabstat rates [fw=pop], by(country)

Summary for variables: rates
Group variable: country 

   country │      Mean
───────────┼──────────
Kazakhstan │  7.423042
    Sweden │  10.54756
───────────┼──────────
     Total │  8.470285
───────────┴──────────
> group_by(p21, country) |>
+   summarize(crude = weighted.mean(rates, pop))
# A tibble: 2 × 2
  country    crude
  <chr>      <dbl>
1 Kazakhstan  7.42
2 Sweden     10.5 

The interesting result here is that mortality appears to be lower in Kazakhstan than in Sweden.

Standardized Rates

Following Preston et al., we will standardize the rates using the average of the two population compositions as the standard. To do this we calculate first the percent distribution in each country and then average the percents. While we are at it we will also calculate the average rates.

. egen pcpop = pc(pop), by(country)

. egen avgcomp = mean(pcpop), by(ageg)
> p21 <- group_by(p21, country) |>
+   mutate(pct = pop/sum(pop))
> avg <- group_by(p21, ageg) |>
+   summarize(pop = mean(pct), rates = mean(rates))

Now we can easily compute the standardized rates using the average composition as the standard. Because we assigned the grouped data to p21 we can skip the grouping step.

. tabstat rates [aw=avgcomp], by(country)

Summary for variables: rates
Group variable: country 

   country │      Mean
───────────┼──────────
Kazakhstan │    11.882
    Sweden │  7.374094
───────────┼──────────
     Total │  9.628045
───────────┴──────────
> summarize(p21, std = weighted.mean(rates, avg$pop))
# A tibble: 2 × 2
  country      std
  <chr>      <dbl>
1 Kazakhstan 11.9 
2 Sweden      7.37

We see that once we adjust for age composition the mortality rate is in fact higher in Kzakhstan than in Sweden, actually 61.1% higher.

Indirect Standarization

Frequently we don’t have age-specific rates, but can easily obtain the age distribution. We can still do a form of standardization by applying the rates of one country (or any other standard) to the two age distributions. Let us use the rates for Sweden and apply them to both countries. To do this we sort by age and country, and for each age pick the second value

. bysort age (country): gen swrates = rates[2]

. tabstat swrates [fw=pop], by(country)

Summary for variables: swrates
Group variable: country 

   country │      Mean
───────────┼──────────
Kazakhstan │  4.200627
    Sweden │  10.54756
───────────┼──────────
     Total │  6.327926
───────────┴──────────
> sweden <- filter(p21, country == "Sweden")
> summarize(p21, istd = weighted.mean(sweden$rates, pop))
# A tibble: 2 × 2
  country     istd
  <chr>      <dbl>
1 Kazakhstan  4.20
2 Sweden     10.5 

So if Kazakhstan had Swedish rates the CDR would be 4.2. The ratio of the actual CDR to the expected rate under the standard age structure is known as the standardized mortality ratio (SMR) and turns out to be

. quietly sum rates [fw=pop] if country == "Kazakhstan"

. scalar kzcmr = r(mean)

. quietly sum swrates [fw=pop] if country == "Kazakhstan"

. di "SMR = ", kzcmr/r(mean)
SMR =  1.7671271
> filter(p21, country == "Kazakhstan") |>
+   summarize(istd = weighted.mean(sweden$rates,pop), 
+   crude = weighted.mean(rates,pop), smr = crude/istd)
# A tibble: 1 × 4
  country     istd crude   smr
  <chr>      <dbl> <dbl> <dbl>
1 Kazakhstan  4.20  7.42  1.77

Mortality in Kazakhstan is in fact 77% higher than it would be at Swedish rates with the observed age composition.

The indirect standardized rate for Kazakhstan is defined as the product of the SMR by the crude death rate for Sweden, 7.767 × 10.548 or 18.639. For more information see this handout.

Note: Stata has commands dstdize and istdize for direct and indirect standardization. The Stata Forum has an entry showing how to use those commands to reproduce the calculations on this page.

Decomposition of Differences between Rates

Preston et al. show how to decompose a difference between two rates into a part due to compositional differences and a part due to differences in rates. The two components are obtained by applying the average composition to the observed rates, and by applying the observed compositions to the average rates. An important property f this decomposition is that it is exact.

. egen avgrates = mean(rates), by(ageg)

. tabstat avgrates [fw=pop], by(country)

Summary for variables: avgrates
Group variable: country 

   country │      Mean
───────────┼──────────
Kazakhstan │  5.811834
    Sweden │  13.44426
───────────┼──────────
     Total │   8.36999
───────────┴──────────
> dbr <- summarize(p21,
+   crude = weighted.mean(rates,pop),
+   comp = weighted.mean(rates, avg$pop), 
+   rates = weighted.mean(avg$rates, pop)); dbr
# A tibble: 2 × 4
  country    crude  comp rates
  <chr>      <dbl> <dbl> <dbl>
1 Kazakhstan  7.42 11.9   5.81
2 Sweden     10.5   7.37 13.4 

We see that with the same rates the older composition of Sweden would result in much higher mortality than observed. We can now verify the decomposition:

. display "Difference = " 10.55 - 7.42 ///
>   "; Composition = " 13.44 - 5.81 ///
>   "; Rates = " 7.37 - 11.88    
Difference = 3.13; Composition = 7.63; Rates = -4.51
> c(-1, 1) %*% as.matrix(data.frame(dbr[,-1]))
        crude      comp    rates
[1,] 3.124519 -4.507903 7.632422

So the observed difference of 3.13 points in the CBR between Sweden and Kazakhstan results from a compositional effect of +7.63 and a difference in rates of -4.51.

I wrote a Stata program that can do these calculations for data in “wide” format. In net-aware Stata type net from https://grodri.github.io/demography and check out the ddrate command.

Age-Patterns of Mortality

So far we have focused on attempts to summarize mortality in a single number, but why not look at the complete set of rates? I would like to plot the rates using the mid-points of the ages groups (with 87.5 for 85+). To compute these I first convert age from a string to numbers (1 to 19) and then figure out the midpoint of each group.

. encode ageg, gen(agei)    // "05-9" becomes 3

. gen agem = 5*agei - 7.5   // 3 becomes 7.5

. replace agem = 0.5 if ageg == "0"
(2 real changes made)

. replace agem = 3   if ageg == "01-4"
(2 real changes made)
> p21 <- mutate(p21, 
+   agei = as.numeric(as.factor(ageg)), 
+   age = ifelse(agei > 2, 5*agei - 7.5, ifelse(agei == 1, 0, 3)))

We then plot the rates using a log scale (otherwise we wouldn’t see much)

. twoway (line rate agem if country=="Sweden")   ///
>    (line rate agem if country=="Kazakhstan")   ///
>   , yscale(log) ylabel(50 100)                 ///
>   title("Age-Specific Mortality Rates")        ///
>   xtitle("Age") ytitle("log(ASMR)")            ///
>   legend(order(2 "Kazakhstan" 1 "Sweden") ring(0) pos(5) cols(1))

. graph export swkzasmr.png, replace width(500) height(400)
file swkzasmr.png saved as PNG format

> library(ggplot2)
> ggplot(p21, aes(x=age, y=log(rates), color=country)) + geom_line()
> ggsave("swkzasmrr.png", width=500/72, height=400/72, dpi=72) 

We see clearly that mortality is higher in Kazakhstan than in Sweden at every age. The lines are not exactly parallel, in fact the relative difference seems to become smaller at older ages. (We will discuss later how heterogeneity can produce this type of convergence.)

The Modern Regression Approach

The aim of standardization is to control for a compositional variable. This, of course, is also one of the aims of regression analysis. I now show how one could analyze the data using Poisson regression, a method appropriate for count data.

This regression technique assumes that the number of deaths has mean (and variance) given by the product of a rate and exposure time. The log of the mean is then the log of the rate (which is modeled using a linear predictor, just as in linear regression) times the log of exposure (which is defined as an offset, or a fixed part of the linear predictor). That’s why the code below computes the log of the mid-year population counts. We also represent country using a dummy variable for Kazakhstan.

. gen logexpo = log(pop)

. gen kz  = country == "Kazakhstan"

. poisson deaths i.agei kz, offset(logexpo) nolog

Poisson regression                                   Number of obs =        38
                                                     LR chi2(19)   = 315966.62
                                                     Prob > chi2   =    0.0000
Log likelihood = -1274.853                           Pseudo R2     =    0.9920

─────────────┬────────────────────────────────────────────────────────────────
      deaths │ Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
        agei │
       01-4  │  -2.600295   .0322872   -80.54   0.000    -2.663577   -2.537013
       05-9  │  -3.823097   .0509118   -75.09   0.000    -3.922882   -3.723311
      10-14  │  -4.003534   .0571946   -70.00   0.000    -4.115633   -3.891434
      15-19  │   -3.29342    .043102   -76.41   0.000    -3.377899   -3.208942
      20-24  │   -2.99395   .0395719   -75.66   0.000    -3.071509    -2.91639
      25-29  │   -3.02696     .03777   -80.14   0.000    -3.100988   -2.952932
      30-34  │  -2.743422    .033986   -80.72   0.000    -2.810034   -2.676811
      35-39  │  -2.436497   .0318358   -76.53   0.000    -2.498894     -2.3741
      40-44  │  -1.985014   .0285733   -69.47   0.000    -2.041017   -1.929012
      45-49  │  -1.588591   .0280586   -56.62   0.000    -1.643584   -1.533597
      50-54  │  -1.231015   .0229505   -53.64   0.000    -1.275997   -1.186032
      55-59  │  -.6591381   .0220602   -29.88   0.000    -.7023753   -.6159008
      60-64  │  -.3400267   .0199076   -17.08   0.000    -.3790448   -.3010086
      65-69  │   .2502514   .0188483    13.28   0.000     .2133094    .2871935
      70-74  │   .6572624   .0186292    35.28   0.000     .6207498     .693775
      75-79  │   1.094818   .0178766    61.24   0.000      1.05978    1.129855
      80-84  │   1.719851    .017546    98.02   0.000     1.685462    1.754241
        85+  │   2.451387   .0170575   143.71   0.000     2.417955    2.484819
             │
          kz │   .4794218   .0062738    76.42   0.000     .4671254    .4917181
       _cons │  -4.445508   .0166385  -267.18   0.000    -4.478119   -4.412897
     logexpo │          1  (offset)
─────────────┴────────────────────────────────────────────────────────────────

. di exp(_b[kz])
1.6151402
> p21 <- mutate(p21, logexpo = log(pop), kz = as.numeric(country=="Kazakhstan"))
> fit <- glm(deaths ~ ageg + kz + offset(log(pop)), data = p21, family = poisson)
> coef(summary(fit))
              Estimate Std. Error    z value      Pr(>|z|)
(Intercept) -4.4455076 0.01663845 -267.18275  0.000000e+00
ageg01-4    -2.6002957 0.03228717  -80.53649  0.000000e+00
ageg05-9    -3.8230970 0.05091182  -75.09253  0.000000e+00
ageg10-14   -4.0035340 0.05719463  -69.99842  0.000000e+00
ageg15-19   -3.2934201 0.04310201  -76.40989  0.000000e+00
ageg20-24   -2.9939502 0.03957193  -75.65843  0.000000e+00
ageg25-29   -3.0269600 0.03777003  -80.14185  0.000000e+00
ageg30-34   -2.7434226 0.03398604  -80.72205  0.000000e+00
ageg35-39   -2.4364968 0.03183579  -76.53326  0.000000e+00
ageg40-44   -1.9850146 0.02857332  -69.47090  0.000000e+00
ageg45-49   -1.5885907 0.02805860  -56.61690  0.000000e+00
ageg50-54   -1.2310144 0.02295053  -53.63773  0.000000e+00
ageg55-59   -0.6591382 0.02206023  -29.87903 3.685603e-196
ageg60-64   -0.3400269 0.01990755  -17.08030  2.080363e-65
ageg65-69    0.2502514 0.01884833   13.27711  3.142725e-40
ageg70-74    0.6572621 0.01862924   35.28122 1.139967e-272
ageg75-79    1.0948175 0.01787660   61.24306  0.000000e+00
ageg80-84    1.7198508 0.01754596   98.01978  0.000000e+00
ageg85+      2.4513869 0.01705754  143.71277  0.000000e+00
kz           0.4794215 0.00627376   76.41694  0.000000e+00
> exp(coef(fit)["kz"])
     kz 
1.61514 

Exponentiating the coefficient of Kazakhstan, we see that on average age-specific mortality is 61.5% higher than in Sweden, a result very similar to the 61.2% that we obtained using direct standardization.

An advantage of the regression approach is that one can easily control for multiple confounders. Also, we can test for the presence of an interaction, which would question the validity of the additive model underlying direct standardization. In this case, adding the interaction would make the model saturated for this data, so we just test goodness of fit.

. estat gof 

         Deviance goodness-of-fit =  2218.545
         Prob > chi2(18)          =    0.0000

         Pearson goodness-of-fit  =  2026.899
         Prob > chi2(18)          =    0.0000
> data.frame(deviance = deviance(fit), df = fit$df.residual)
  deviance df
1 2218.549 18

The deviance of 2218 on 18 df is highly significant, so we conclude that the rates are not proportional. With data based on national counts, however, even small differences can be statistically significant. This interaction represents less than one percent of the combined effect of age and country.

. di r(chi2_d)/e(chi2)
.00702145
> deviance(fit)/(fit$null.deviance - deviance(fit))
[1] 0.007021467

If the discussion in the last section doesn’t make much sense to you, don’t worry. Just make a note to come back and read it again after taking a course on generalized linear models.

Updated 22-Jan-2016.