Germán Rodríguez
Generalized Linear Models Princeton University

Solutions to Problem Set 4
Doctor Visits

Cameron and Trivedi (2009) have some interesting data on the number of office-based doctor visits by adults aged 25-64 based on the 2002 Medical Expenditure Panel Survey. We will use data for the most recent wave, available at https://grodri.github.io/datasets/docvis.dta.

. use https://grodri.github.io/datasets/docvis.dta, clear
(Doctor visits from 2002 MEPS, Cameron and Trivedi (2009))
> library(haven)
> dv <- read_dta("https://grodri.github.io/datasets/docvis.dta")

[1] A Poisson Model

(a) Fit a Poisson regression model with the number of doctor visits (docvis), as the outcome. We will use the same predictors as Cameron and Trivedi, namely health insurance status (private), health status (chronic), gender (female) and income (income), but will add two indicators of ethnicity (black and hispanic). There are many more variables one could add, but we’ll keep things simple.

. glm docvis private chronic female income black hispanic, ///
>   family(poisson) nolog

Generalized linear models                         Number of obs   =      4,412
Optimization     : ML                             Residual df     =      4,405
                                                  Scale parameter =          1
Deviance         =  27870.99397                   (1/df) Deviance =   6.327127
Pearson          =  55945.39004                   (1/df) Pearson  =   12.70043

Variance function: V(u) = u                       [Poisson]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =   8.332044
Log likelihood   = -18373.48862                   BIC             =  -9096.133

─────────────┬────────────────────────────────────────────────────────────────
             │                 OIM
      docvis │ Coefficient  std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
     private │   .7203441   .0281465    25.59   0.000      .665178    .7755102
     chronic │   1.068487   .0158639    67.35   0.000     1.037395     1.09958
      female │   .4775823   .0160303    29.79   0.000     .4461636    .5090011
      income │   .0030057   .0002476    12.14   0.000     .0025204     .003491
       black │  -.1867826   .0365022    -5.12   0.000    -.2583256   -.1152395
    hispanic │  -.3510353   .0232585   -15.09   0.000    -.3966211   -.3054495
       _cons │  -.0499702   .0306643    -1.63   0.103    -.1100711    .0101307
─────────────┴────────────────────────────────────────────────────────────────

. estimates store poi
> mp <- glm(docvis ~ private + chronic + female + income + black + hispanic,
+   data = dv, family = poisson())
> summary(mp)

Call:
glm(formula = docvis ~ private + chronic + female + income + 
    black + hispanic, family = poisson(), data = dv)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.7947  -2.0468  -1.1881   0.2755  24.2026  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.0499702  0.0306640  -1.630    0.103    
private      0.7203442  0.0281462  25.593  < 2e-16 ***
chronic      1.0684873  0.0158638  67.354  < 2e-16 ***
female       0.4775823  0.0160303  29.793  < 2e-16 ***
income       0.0030057  0.0002476  12.139  < 2e-16 ***
black       -0.1867826  0.0365022  -5.117  3.1e-07 ***
hispanic    -0.3510353  0.0232584 -15.093  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 36984  on 4411  degrees of freedom
Residual deviance: 27871  on 4405  degrees of freedom
AIC: 36761

Number of Fisher Scoring iterations: 6

(b) Interpret the coefficient of black and test its significance using a Wald test and a likelihood ratio test.

. di exp(_b[black]) - 1
-.17037592

. di (_b[black]/_se[black])^2
26.183881

. quietly glm docvis private chronic female income hispanic, ///
>   family(poisson)

. lrtest . poi

Likelihood-ratio test
Assumption: . nested within poi

 LR chi2(1) =  27.67
Prob > chi2 = 0.0000
> b <- coef(mp)
> exp(b["black"])
    black 
0.8296241 
> se <- sqrt(diag(vcov(mp)))
> (b["black"]/se["black"])^2
   black 
26.18392 
> mpb <- update(mp,  ~ . - black)
> anova(mpb, mp)    
Analysis of Deviance Table

Model 1: docvis ~ private + chronic + female + income + hispanic
Model 2: docvis ~ private + chronic + female + income + black + hispanic
  Resid. Df Resid. Dev Df Deviance
1      4406      27899            
2      4405      27871  1    27.67

Blacks report 17% fewer visits to the doctor than white with the same insurance, health status, gender and income. The z-test of -5.12 in the output is equivalent to a χ2 of 26.18 on one d.f. and is highly significant. The likelihood ratio test obtained by fitting the model without black gives a similar χ2 of 27.67 on one d.f.

(c) Compute a 95% confidence interval for the effect of private insurance and interpret this result in terms of doctor visits.

. estimates restore poi
(results poi are active now)

. scalar cv = invnormal(0.975)

. di exp(_b[private] - cv * _se[private])
1.9448367

. di exp(_b[private] + cv * _se[private])
2.1716999
> ci <- b["private"] + c(-1, 1) * qnorm(0.975) * se["private"]
> exp(ci)
[1] 1.944838 2.171699

We can obtain a 95% confidence interval by exponentiating the bounds reported in the output. In Stata you can use the eform option. We find that respondents with private insurance visit the doctor between 1.94 and 2.17 times as often as respondents without insurance who have the same observed characteristics, namely gender, ethnicity, health status and income.

(d) Compute the deviance and Pearson chi-squared statistics for this model. Does the model fit the data? Is there evidence of overdispersion?

> deviance(mp)
[1] 27870.99
> pr <- residuals(mp, "pearson")
> sum(pr^2)
[1] 55945.39

The deviance of 27,870 on 4,405 d.f. is highly significant and the Pearson χ2 of 55,945 is even worse. The model clearly does not fit the data. There is overwhelming evidence of overdispersion. In terms of Pearson’s χ2, the variance is 13 times the mean.

(e) Predict the proportion expected to have exactly zero doctor visits and compare with the observed proportion. You will find the formula for Poisson probabilities in the notes. The probability of zero is simply eμ.

. predict mupoi
(option mu assumed; predicted mean docvis)

. gen zpoi = exp(-mupoi)

. gen zobs = docvis == 0

. sum zpoi zobs

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
        zpoi │      4,412    .1127876    .1392959   2.89e-07   .5118896
        zobs │      4,412    .3640073    .4812052          0          1
> mup <- exp(predict(mp))
> zp <- exp(-mup)
> c(mean(zp), mean(dv$docvis == 0))
[1] 0.1127876 0.3640073

The Poisson model substantially underestimates the probability of zero doctor visits, predicting 11.3% when in fact 36.4% of respondents report zero visits.

[2] Poisson Overdispersion

(a) Suppose the variance is proportional to the mean rather than equal to the mean. Estimate the proportionality parameter using Pearson’s chi-squared and use this estimate to correct the standard errors.

We know from the previous result that the proportionality factor is 12.7. We therefore need to inflate the standard errors by a factor of √12.7 = 3.564 or almost four. In Stata we can do this calculation using the scale(x2) option.

. glm docvis private chronic female income black hispanic, ///
>   family(poisson) scale(x2) nolog

Generalized linear models                         Number of obs   =      4,412
Optimization     : ML                             Residual df     =      4,405
                                                  Scale parameter =          1
Deviance         =  27870.99397                   (1/df) Deviance =   6.327127
Pearson          =  55945.39004                   (1/df) Pearson  =   12.70043

Variance function: V(u) = u                       [Poisson]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =   8.332044
Log likelihood   = -18373.48862                   BIC             =  -9096.133

─────────────┬────────────────────────────────────────────────────────────────
             │                 OIM
      docvis │ Coefficient  std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
     private │   .7203441   .1003075     7.18   0.000      .523745    .9169432
     chronic │   1.068487   .0565351    18.90   0.000     .9576805    1.179294
      female │   .4775823   .0571282     8.36   0.000     .3656131    .5895516
      income │   .0030057   .0008824     3.41   0.001     .0012762    .0047352
       black │  -.1867826   .1300854    -1.44   0.151    -.4417453    .0681801
    hispanic │  -.3510353   .0828879    -4.24   0.000    -.5134926    -.188578
       _cons │  -.0499702   .1092804    -0.46   0.647    -.2641558    .1642154
─────────────┴────────────────────────────────────────────────────────────────
(Standard errors scaled using square root of Pearson X2-based dispersion.)

. scalar phi = e(dispers_ps) // for later use
> phi <- sum(pr^2)/df.residual(mp); phi
[1] 12.70043
> summary(update(mp, family=quasipoisson()))

Call:
glm(formula = docvis ~ private + chronic + female + income + 
    black + hispanic, family = quasipoisson(), data = dv)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-4.7947  -2.0468  -1.1881   0.2755  24.2026  

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.0499702  0.1092799  -0.457 0.647501    
private      0.7203442  0.1003069   7.181 8.06e-13 ***
chronic      1.0684873  0.0565352  18.899  < 2e-16 ***
female       0.4775823  0.0571283   8.360  < 2e-16 ***
income       0.0030057  0.0008824   3.406 0.000665 ***
black       -0.1867826  0.1300858  -1.436 0.151118    
hispanic    -0.3510353  0.0828879  -4.235 2.33e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for quasipoisson family taken to be 12.70051)

    Null deviance: 36984  on 4411  degrees of freedom
Residual deviance: 27871  on 4405  degrees of freedom
AIC: NA

Number of Fisher Scoring iterations: 6

(b) What happens to the significance of the black coefficient once we allow for extra-Poisson variation? Could we test this coefficient using a likelihood ratio test. Explain.

> seo <- sqrt(phi) * se
> b["black"] / seo["black"]
    black 
-1.435847 

Once we adjust for overdispersion the black coefficient is no longer significant, with a z-statistic of -1.44 equivalent to a χ2 of just 2.07. We have no evidence that blacks differ from comparable whites in the number of doctor visits. We can’t do a likelihood ratio test because we haven’t specified a likelihood.

(c) Compare the standard errors adjusted for over-dispersion with the robust or “sandwich” estimator of the standard errors. To obtain robust standard errors we follow the procedure outlined in this log.

. estimates store odp

. glm docvis private chronic female income black hispanic, ///
>   family(poisson) vce(robust) nolog

Generalized linear models                         Number of obs   =      4,412
Optimization     : ML                             Residual df     =      4,405
                                                  Scale parameter =          1
Deviance         =  27870.99397                   (1/df) Deviance =   6.327127
Pearson          =  55945.39004                   (1/df) Pearson  =   12.70043

Variance function: V(u) = u                       [Poisson]
Link function    : g(u) = ln(u)                   [Log]

                                                  AIC             =   8.332044
Log pseudolikelihood = -18373.48862               BIC             =  -9096.133

─────────────┬────────────────────────────────────────────────────────────────
             │               Robust
      docvis │ Coefficient  std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
     private │   .7203441   .1133685     6.35   0.000      .498146    .9425423
     chronic │   1.068487   .0562956    18.98   0.000       .95815    1.178825
      female │   .4775823   .0582232     8.20   0.000     .3634669    .5916978
      income │   .0030057    .001117     2.69   0.007     .0008164     .005195
       black │  -.1867826   .1169954    -1.60   0.110    -.4160893    .0425241
    hispanic │  -.3510353    .086672    -4.05   0.000    -.5209094   -.1811612
       _cons │  -.0499702   .1230406    -0.41   0.685    -.2911253    .1911849
─────────────┴────────────────────────────────────────────────────────────────

. estimates table poi odp ., se

─────────────┬───────────────────────────────────────
    Variable │    poi          odp         Active    
─────────────┼───────────────────────────────────────
     private │  .72034413    .72034413    .72034413  
             │  .02814649     .1003075    .11336849  
     chronic │  1.0684873    1.0684873    1.0684873  
             │  .01586387    .05653512    .05629556  
      female │  .47758234    .47758234    .47758234  
             │  .01603029     .0571282    .05822322  
      income │  .00300568    .00300568    .00300568  
             │  .00024761    .00088241    .00111701  
       black │ -.18678259   -.18678259   -.18678259  
             │  .03650223    .13008541    .11699538  
    hispanic │ -.35103531   -.35103531   -.35103531  
             │  .02325851    .08288788    .08667204  
       _cons │ -.04997019   -.04997019   -.04997019  
             │   .0306643    .10928039    .12304056  
─────────────┴───────────────────────────────────────
                                         Legend: b/se
> library(sandwich)
> vce <- vcovHC(mp, type="HC1")
> ser <- sqrt(diag(vce))
> cbind(se, seo, ser)                      
                      se          seo         ser
(Intercept) 0.0306640486 0.1092794988 0.123124498
private     0.0281462298 0.1003065813 0.113445771
chronic     0.0158638443 0.0565350315 0.056333916
female      0.0160302626 0.0571281073 0.058262881
income      0.0002476067 0.0008824125 0.001117772
black       0.0365022047 0.1300853221 0.117074850
hispanic    0.0232584294 0.0828876037 0.086731010

The adjusted and robust estimates of standard errors are very similar and both much larger than the Poisson standard errors. (In case you are interested, the ratio of the robust to Poisson standard errors in this model is between 3.2 and 4.5.)

[3] A Negative Binomial Model

(a) Fit a negative binomial regression model using the same outcome and predictors as in part 1.a. Comment on any remarkable changes in the coefficients.

. nbreg docvis private chronic female income black hispanic, nolog

Negative binomial regression                           Number of obs =   4,412
                                                       LR chi2(6)    = 1119.19
Dispersion: mean                                       Prob > chi2   =  0.0000
Log likelihood = -9829.3167                            Pseudo R2     =  0.0539

─────────────┬────────────────────────────────────────────────────────────────
      docvis │ Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
     private │   .8086593   .0607191    13.32   0.000      .689652    .9276665
     chronic │   1.119804    .045522    24.60   0.000     1.030583    1.209026
      female │    .544408   .0446949    12.18   0.000     .4568075    .6320084
      income │   .0037342   .0008076     4.62   0.000     .0021515     .005317
       black │  -.3055959   .0985613    -3.10   0.002    -.4987724   -.1124193
    hispanic │  -.3898981   .0563762    -6.92   0.000    -.5003934   -.2794028
       _cons │   -.200886   .0680787    -2.95   0.003    -.3343179   -.0674542
─────────────┼────────────────────────────────────────────────────────────────
    /lnalpha │   .5306513   .0290657                      .4736836     .587619
─────────────┼────────────────────────────────────────────────────────────────
       alpha │   1.700039   .0494128                      1.605899    1.799698
─────────────┴────────────────────────────────────────────────────────────────
LR test of alpha=0: chibar2(01) = 1.7e+04              Prob >= chibar2 = 0.000

. scalar sig2 = e(alpha) // for later use

. estimates store nbreg

. estimates table poi nbreg

─────────────┬──────────────────────────
    Variable │    poi         nbreg     
─────────────┼──────────────────────────
docvis       │
     private │  .72034413    .80865928  
     chronic │  1.0684873    1.1198042  
      female │  .47758234    .54440796  
      income │  .00300568    .00373425  
       black │ -.18678259   -.30559588  
    hispanic │ -.35103531   -.38989811  
       _cons │ -.04997019   -.20088605  
─────────────┼──────────────────────────
    /lnalpha │               .53065128  
─────────────┴──────────────────────────

We need glm.nb() in the MASS package.

> library(MASS)
> mnb <- glm.nb(docvis ~ private + chronic + female + income + black + hispanic, 
+   data = dv)
> summary(mnb)

Call:
glm.nb(formula = docvis ~ private + chronic + female + income + 
    black + hispanic, data = dv, init.theta = 0.5882217452, link = log)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9182  -1.1895  -0.5376   0.0767   9.0036  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.2008860  0.0678193  -2.962  0.00306 ** 
private      0.8086593  0.0621755  13.006  < 2e-16 ***
chronic      1.1198042  0.0459214  24.385  < 2e-16 ***
female       0.5444080  0.0447150  12.175  < 2e-16 ***
income       0.0037342  0.0007852   4.756 1.98e-06 ***
black       -0.3055959  0.0994855  -3.072  0.00213 ** 
hispanic    -0.3898981  0.0571454  -6.823 8.92e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Negative Binomial(0.5882) family taken to be 1)

    Null deviance: 5885.8  on 4411  degrees of freedom
Residual deviance: 4571.7  on 4405  degrees of freedom
AIC: 19675

Number of Fisher Scoring iterations: 1


              Theta:  0.5882 
          Std. Err.:  0.0171 

 2 x log-likelihood:  -19658.6330 
> bnb <- coef(mnb)
> cbind(b, bnb)
                       b          bnb
(Intercept) -0.049970228 -0.200886049
private      0.720344181  0.808659282
chronic      1.068487258  1.119804214
female       0.477582342  0.544407965
income       0.003005676  0.003734245
black       -0.186782591 -0.305595880
hispanic    -0.351035309 -0.389898111

The estimates are very similar except for ethnicity, where the coefficient of black reflects a much larger negative effect, going from -0.187 to -0.306. Another change of note, but of smaller magnitude, is the coefficient of insurance, which now reflects a larger effect.

(b) Interpret the coefficient of black and test its significance using a Wald test and a likelihood ratio test. Compare your results with parts 1.b and 2.b

. di exp(_b[black]) - 1
-.26331573

. di (_b[black] / _se[black])^2
9.6135184

. estimates store nbreg

. quietly nbreg docvis private chronic female income hispanic

. lrtest . nbreg

Likelihood-ratio test
Assumption: . nested within nbreg

 LR chi2(1) =   9.07
Prob > chi2 = 0.0026
> exp(bnb["black"])
    black 
0.7366843 
> senb <- sqrt(diag(vcov(mnb)))
> (bnb["black"]/senb["black"])^2
   black 
9.435735 
> mnbb <- update(mnb, ~ . - black)
> anova(mnbb, mnb)
Likelihood ratio tests of Negative Binomial Models

Response: docvis
                                                   Model     theta Resid. df
1         private + chronic + female + income + hispanic 0.5866302      4406
2 private + chronic + female + income + black + hispanic 0.5882217      4405
     2 x log-lik.   Test    df LR stat.     Pr(Chi)
1       -19667.71                                  
2       -19658.63 1 vs 2     1 9.074465 0.002592035

We estimate that blacks have 26.3% fewer visits to the doctor than comparable whites. The effect is significant, with a Wald test of z = 3.10, equivalent to a χ2 of 9.61 on one d.f., and a likelihood ratio χ2 of 9.07, also on one d.f.

The magnitude of the effect is larger than estimated under a Poisson model. The standard error is larger than the Poisson, but comparable to the overdispersed Poisson. On balance the effect turns out to be significant.

(c) Predict the percent of respondents with zero doctor visits according to this model and compare with part 2.c. You will find a formula for negative binomial probabilities in the addendum to the notes. The probability of zero is (β/(μ + β))α where α = β = 1/σ2.

. estimates restore nbreg
(results nbreg are active now)

. predict munb
(option n assumed; predicted number of events)

. scalar ab = exp(- _b[/lnalpha])

. gen znb = (ab/(munb + ab))^ab 

. sum znb

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         znb │      4,412    .3658973    .1374207   .1313962   .6768509
> ab <- mnb$theta
> munb <- exp(predict(mnb))
> znb <- (ab/(munb + ab))^ab
> mean(znb)
[1] 0.3658973

We predict that 36.6% of respondents will have zero doctor visits. Much better than the Poisson estimate of 11.3% and remarkably close to the observed value of 36.4%

(d) Interpret the estimate of σ2 in this model and test its significance, noting carefully the distribution of the criterion.

> -2 * (logLik(mp) - logLik(mnb))
'log Lik.' 17088.34 (df=7)

The estimate of 1.7 reflects substantial unobserved heterogeneity in doctor visits, even after we take into account the available indicators of insurance and health status, gender, income and ethnicity. The χ2 statistic of 17,000 is clearly significant, exceeding by far the conservative critical value of 3.84, and hence even more significant if we treated it as a 50:50 mixture of χ2’s with 0 and 1 d.f.

One way to assess the magnitud of this effect is to compute quartiles of the gamma distribution with mean 1 and variance 1.7

. mata: invgammap(1/1.7, (1..3):/4) :* 1.7    
                 1             2             3
    ┌───────────────────────────────────────────┐
  1 │  .1396390685   .5186830327    1.35320038  │
    └───────────────────────────────────────────┘
> qgamma(1:3/4, shape = mnb$theta, scale = 1/mnb$theta)
[1] 0.1396333 0.5186743 1.3531969

In terms of unobserved characteristics we see that respondents at the first quartile had 86% fewer visits than expected, those at the median had 48% fewer than expected, and those at the third quartile had 35% more than expected. The fact that the median is so far below the mean indicates a very long right tail, as shown in the next figure

. gen theta = _n*6/100 in 1/100
(4,312 missing values generated)

. gen den = gammaden(1/1.7, 1.7, 0, theta)
(4,312 missing values generated)

. line den theta

. graph export ps4fig1.png, width(500) replace
file ps4fig1.png saved as PNG format
> qgamma(1:3/4, shape = mnb$theta, scale = 1/mnb$theta)
[1] 0.1396333 0.5186743 1.3531969
> library(ggplot2)
> z <- seq(0, 6, .1)
> gd <- data.frame(z,  
+   den = dgamma(z, shape=mnb$theta, scale=1/mnb$theta)) 
> ggplot(gd, aes(z, den)) + geom_line()
> ggsave("ps4fig1r.png", width=500/72, height = 400/72, dpi=72)

Gamma Density with Variance 1.7

(e) Use predicted values from this model to divide the sample into twenty groups of about equal size, compute the mean and variance of docvis in each group, and plot these values. Superimpose curves representing the over-dispersed Poisson and negative binomial variance functions and comment.

. egen nbg = cut(munb), group(20)

. preserve

. collapse (mean) docvis (sd) sd=docvis, by(nbg)

. gen var = sd^2

. scatter var docvis ///
>   || function y = phi*x, range(1 10) ///
>   || function y = x * (1 + sig2*x), range(1 10) ///
>      xtitle(mean) ytitle(variance) legend(off)

. graph export ps4fig2.png, width(500) replace
file ps4fig2.png saved as PNG format

. restore
> g <- cut(munb, breaks=quantile(munb, seq(0, 100, 5)/100))
> mv <- data.frame( 
+   mean = tapply(dv$docvis, g, mean),
+   var  = tapply(dv$docvis, g, var)    )
> sig2 <- 1/mnb$theta
> mc  <- seq(0, 10, .1)
> pv <- data.frame(mc = mc,
+   odv = mc * phi,
+   nbv = mc * (1 + sig2 * mc))
> p <- ggplot(mv, aes(mean, var)) + geom_point()
> odvar <- function(x) phi * x
> nbvar <- function(x) x * (1 + sig2 * x)
> p + stat_function(fun = odvar) + stat_function(fun=nbvar) + xlim(0.1, 10.5)
> ggsave("ps4fig2r.png", width=500/72, height=400/72, dpi=72)

Poisson and Negative Binomial Variance Functions

The situation at the high end is not clear at all, as one of the groups happens to have a much larger variance than its neighbors. The quadratic function comes closer to this point at the expense of a poorer fit through most of the range. On balance it seems to provide a better compromise at the high end, so I would say that the negative binomial is marginally better than the overdispersed Poisson.

[4] A Zero-Inflated Poisson Model

(a) Try a zero-inflated Poisson model with the same predictors of part 1a in both the Poisson and inflate equations.

. local predictors private chronic female income black hispanic

. zip docvis `predictors', inflate(`predictors')

Fitting constant-only model:

Iteration 0:   log likelihood = -19602.098  
Iteration 1:   log likelihood = -17533.867  
Iteration 2:   log likelihood = -17341.513  
Iteration 3:   log likelihood = -17340.808  
Iteration 4:   log likelihood = -17340.808  

Fitting full model:

Iteration 0:   log likelihood = -17340.808  
Iteration 1:   log likelihood = -16021.733  
Iteration 2:   log likelihood = -15956.936  
Iteration 3:   log likelihood =  -15956.73  
Iteration 4:   log likelihood =  -15956.73  

Zero-inflated Poisson regression                       Number of obs =   4,412
Inflation model: logit                                 Nonzero obs   =   2,806
                                                       Zero obs      =   1,606
                                                       LR chi2(6)    = 2768.16
Log likelihood = -15956.73                             Prob > chi2   =  0.0000

─────────────┬────────────────────────────────────────────────────────────────
      docvis │ Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
docvis       │
     private │   .3278247   .0289218    11.33   0.000     .2711389    .3845104
     chronic │   .6826474   .0160441    42.55   0.000     .6512016    .7140933
      female │   .2814155   .0162471    17.32   0.000     .2495718    .3132593
      income │    .001533    .000256     5.99   0.000     .0010312    .0020347
       black │  -.1865396   .0373371    -5.00   0.000    -.2597189   -.1133603
    hispanic │  -.2080369   .0237216    -8.77   0.000    -.2545304   -.1615433
       _cons │   .9731268   .0326301    29.82   0.000     .9091731    1.037081
─────────────┼────────────────────────────────────────────────────────────────
inflate      │
     private │  -1.129579   .0945051   -11.95   0.000    -1.314806   -.9443526
     chronic │  -1.755146   .0938851   -18.69   0.000    -1.939158   -1.571135
      female │  -.8811376    .075545   -11.66   0.000    -1.029203   -.7330721
      income │  -.0082545   .0014829    -5.57   0.000    -.0111608   -.0053481
       black │   .0891372   .1638255     0.54   0.586    -.2319549    .4102292
    hispanic │   .4284308   .0886405     4.83   0.000     .2546986     .602163
       _cons │   1.258985   .1059057    11.89   0.000     1.051414    1.466556
─────────────┴────────────────────────────────────────────────────────────────

We need the zeroinfl() function in the pscl package.

> library(pscl)
> mzip <- zeroinfl(docvis ~ private + chronic + female + income + black + hispanic,
+   data = dv)
> summary(mzip)

Call:
zeroinfl(formula = docvis ~ private + chronic + female + income + black + 
    hispanic, data = dv)

Pearson residuals:
    Min      1Q  Median      3Q     Max 
-2.7489 -0.9389 -0.5410  0.1442 58.5034 

Count model coefficients (poisson with log link):
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.9731268  0.0326277  29.825  < 2e-16 ***
private      0.3278247  0.0289209  11.335  < 2e-16 ***
chronic      0.6826474  0.0160441  42.548  < 2e-16 ***
female       0.2814155  0.0162471  17.321  < 2e-16 ***
income       0.0015330  0.0002555   6.000 1.97e-09 ***
black       -0.1865396  0.0373371  -4.996 5.85e-07 ***
hispanic    -0.2080369  0.0237211  -8.770  < 2e-16 ***

Zero-inflation model coefficients (binomial with logit link):
             Estimate Std. Error z value Pr(>|z|)    
(Intercept)  1.258985   0.105904  11.888  < 2e-16 ***
private     -1.129579   0.094506 -11.953  < 2e-16 ***
chronic     -1.755147   0.093885 -18.695  < 2e-16 ***
female      -0.881138   0.075544 -11.664  < 2e-16 ***
income      -0.008254   0.001483  -5.566 2.60e-08 ***
black        0.089137   0.163825   0.544    0.586    
hispanic     0.428431   0.088641   4.833 1.34e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

Number of iterations in BFGS optimization: 17 
Log-likelihood: -1.596e+04 on 14 Df

(b) Predict the proportion of respondents with zero doctor visits according to this model and compare with 1.e and 3.c. (Don’t forget that there are two ways of having an outcome of zero in this model.)

. predict pz, pr

. predict xbz, xb

. gen muz = exp(xbz)

. gen zfitz = pz + (1-pz) * exp(-muz)

. sum zfitz

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
       zfitz │      4,412    .3639919    .2355671   .0190273   .8620839
> pr <- predict(mzip, type = "zero")  # π
> mu <- predict(mzip, type = "count") # μ
> przip <- pr + (1 - pr) * exp(-mu)   # also predict(mzip, type="prob")[,1]
> mean(przip)
[1] 0.3639919

We predict that 36.4% of respondents would have no doctor visits, which not surprisingly, is almost exactly the observed proportion.

(c) Interpret the coefficients of black in the two equations. Is the effect related to whether blacks visit the doctor at all? To how often they visit?

. di exp(_b[inflate:black]), exp(_b[docvis:black])
1.0932306 .82982569

The “always zero” class is often hard to interpret. In this case it could represent lack of access to health care, but it could also represent people who are in pretty good health and don’t need to see a doctor. I’ll couch the interpretation in terms of access to health care, which seems more credible, but recognize that this choice is debatable.

Blacks are slightly more likely than comparable whites to lack access to a doctor, but the difference is not significant. Among those with access to health care, blacks have 17% fewer visits than comparable whites. Clearly most of the difference comes from the number of visits.

Notes: one way to elaborate on this issue would be to predict zero visits setting everyone to white and then to black, but this was not asked. I get probabilities of 34.1% for white and 36.3% for blacks. One can also predict the expected number of visits by combining the two equations: I get 4.2 for whites and 3.4 for blacks. Latinos have even fewer adjusted visits, 3.1 on average, resulting both from less access to health care and fewer visits from those with access.

[5] Model Selection

Considering the results obtained so far and bearing in mind parsimony and goodness of fit, which of the models used here provides the best description of the data? Make sure you provide a clear justification of your choice.

Turns out this is fairly simple because the log-likehoods are -18,374.5 for Poisson, -15,956.73 for the zero-inflated model, which by the way uses twice as many parameters, and -9,829.31 for the negative binomial model, which uses just one more parameter than the Poisson. The clear winner is the negative binomial.

Moreover, the model does a pretty good job reproducing the excess zeroes without the need for a separate equation, which also makes it a winner in terms of interpretation; I find the idea of unobserved heterogeneity of frailty much easier to interpret than a latent class that never sees a doctor.