Germán Rodríguez
Generalized Linear Models Princeton University

Solutions to Problem Set 3

Cameron and Trivedi (2010) have an interesting dataset based on wave 5 of the Health and Retirement Study (HRS), a survey conducted in 2002 as part of a panel sponsored by NIH. The sample consists of Medicare beneficiaries and the question of interest is whether or not they purchase supplemental insurance (ins). The explanatory variables include socio-economic and demographic factors and an indicator of health status. The data are available from the Stata website, but for your convenience also at https://grodri.github.io/datasets/mus14data.dta.

We start by reading the data

. use https://grodri.github.io/datasets/mus14data, clear
(Data from Cameron and Trivedi's (2009) Microeconometrics Using Stata )
> library(haven)
> ct <- read_dta("https://grodri.github.io/datasets/mus14data.dta")

[1] Probabilities, Odds and Logits

(a) What’s the proportion of respondents who have supplemental insurance? What are the odds of having insurance? What’s the logit? Construct a 95% confidence interval for the logit and translate it back into corresponding intervals for the odds and probability.

. sum ins

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         ins │      3,206    .3870867    .4871597          0          1

. di r(mean)/(1-r(mean))
.63155216

. di log(r(mean)/(1-r(mean)))
-.45957474

. scalar opri = r(mean) // save for later use
> library(dplyr)
> p <- summarize(ct, mean(ins)); p
# A tibble: 1 × 1
  `mean(ins)`
        <dbl>
1       0.387
> p/(1 - p)
  mean(ins)
1 0.6315522
> log(p/(1 - p))
   mean(ins)
1 -0.4595747

The proportion of respondents with insurance is 38.7%. The odds of insurance are 0.63 to one, roughly two persons with insurance for every three without. The logit is -0.46. I gave a formula for the variance in class, but we can also use the null model to do the calculations.

. logit ins

Iteration 0:   log likelihood = -2139.7712  
Iteration 1:   log likelihood = -2139.7712  

Logistic regression                                     Number of obs =  3,206
                                                        LR chi2(0)    =   0.00
                                                        Prob > chi2   =      .
Log likelihood = -2139.7712                             Pseudo R2     = 0.0000

─────────────┬────────────────────────────────────────────────────────────────
         ins │ Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
       _cons │  -.4595747   .0362589   -12.67   0.000    -.5306409   -.3885086
─────────────┴────────────────────────────────────────────────────────────────

. scalar z = invnormal(0.975)

. di exp(_b[_cons] - z * _se[_cons]), exp(_b[_cons] + z * _se[_cons])
.58822787 .6780674

. di invlogit(_b[_cons] - z * _se[_cons]), invlogit(_b[_cons] + z * _se[_cons])
.37036743 .40407637
> m1 <- glm(ins ~ 1, family=binomial, data = ct)
> summary(m1)

Call:
glm(formula = ins ~ 1, family = binomial, data = ct)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9895  -0.9895  -0.9895   1.3778   1.3778  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.45957    0.03626  -12.68   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4279.5  on 3205  degrees of freedom
Residual deviance: 4279.5  on 3205  degrees of freedom
AIC: 4281.5

Number of Fisher Scoring iterations: 4

> delta <- qnorm(0.975) * sqrt(vcov(m1))
> ci <-c(coef(m1) - delta, coef(m1) + delta)
> exp(ci)
[1] 0.5882279 0.6780674
> plogis(ci)
[1] 0.3703674 0.4040764

The odds are between 0.588 and 0.678 with 95% confidence. The probability is between 37.0 and 40.4% with 95% confidence.
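The null-model calculation can be verified with the standard delta-method formula var(logit p̂) = 1/(n p̂(1 − p̂)), mentioned in class. A quick numeric check (shown in Python for brevity; the inputs are taken from the output above):

```python
from math import exp, log, sqrt

n, p = 3206, 0.3870867          # sample size and proportion insured
logit = log(p / (1 - p))        # about -0.4596
se = 1 / sqrt(n * p * (1 - p))  # delta-method standard error, about 0.0363
z = 1.959964                    # 97.5th percentile of the standard normal
lo, hi = logit - z * se, logit + z * se

# translate the logit interval back to odds and probabilities
odds_ci = (exp(lo), exp(hi))                          # about (0.588, 0.678)
prob_ci = tuple(1 / (1 + exp(-x)) for x in (lo, hi))  # about (0.370, 0.404)
```

This reproduces the standard error of 0.0363 and the intervals shown above.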

(b) Hispanics are less likely to have supplemental insurance than others. Estimate the proportions with insurance among hispanics and others and calculate the difference. Estimate and interpret the odds ratio, and test its significance using a Wald test and a likelihood ratio test.

. tabstat ins, by(hisp)

Summary for variables: ins
Group variable: hisp 

    hisp │      Mean
─────────┼──────────
       0 │  .4056509
       1 │  .1502146
─────────┼──────────
   Total │  .3870867
─────────┴──────────

. logit ins hisp

Iteration 0:   log likelihood = -2139.7712  
Iteration 1:   log likelihood =  -2106.442  
Iteration 2:   log likelihood = -2106.0567  
Iteration 3:   log likelihood = -2106.0559  
Iteration 4:   log likelihood = -2106.0559  

Logistic regression                                     Number of obs =  3,206
                                                        LR chi2(1)    =  67.43
                                                        Prob > chi2   = 0.0000
Log likelihood = -2106.0559                             Pseudo R2     = 0.0158

─────────────┬────────────────────────────────────────────────────────────────
         ins │ Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
        hisp │  -1.350945   .1871284    -7.22   0.000     -1.71771   -.9841799
       _cons │  -.3819741   .0373513   -10.23   0.000    -.4551813   -.3087669
─────────────┴────────────────────────────────────────────────────────────────

. di exp(_b[hisp])
.25899543

. di (_b[hisp]/_se[hisp])^2    
52.118952

. di e(chi2)
67.4306
> group_by(ct, hisp) %>% summarize(mean(ins))
# A tibble: 2 × 2
   hisp `mean(ins)`
  <dbl>       <dbl>
1     0       0.406
2     1       0.150
> m2 <-  glm(ins ~ hisp, family=binomial, data = ct)
> exp(coef(m2)["hisp"])
     hisp 
0.2589954 
> deviance(m1) - deviance(m2)
[1] 67.4306

The proportion with insurance is 15.0% among hispanics and 40.6% among others, a difference of 25.5 percentage points. The odds ratio is 0.259 (by hand from the proportions, or by exponentiating the coefficient of hisp), so the odds of having insurance among hispanics are about a quarter the odds among non-hispanics. The Wald test given in the output as z = -7.22 is equivalent to a chi-squared of 52.12 on one d.f. The likelihood ratio test is 67.43, also on one d.f.
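The "by hand" calculation is just the ratio of the two odds, and the Wald chi-squared is the squared ratio of the coefficient to its standard error. A quick check (in Python for brevity, with the numbers transcribed from the output above):

```python
# proportions insured by ethnicity, from the tabstat output
p_hisp, p_other = 0.1502146, 0.4056509
odds_ratio = (p_hisp / (1 - p_hisp)) / (p_other / (1 - p_other))  # about 0.259

# Wald chi-squared: (coefficient / standard error)^2, from the logit output
b, se = -1.350945, 0.1871284
wald = (b / se) ** 2   # about 52.12, compared to a chi-squared on 1 d.f.
```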

(c) Let us jump directly to the model used by Cameron and Trivedi. Fit a logit model using retirement status (retire), age, health status (hstatusg, coded 1 for good, very good or excellent and 0 otherwise), household income (hhincome), education in years (educyear), the indicator for married, and the indicator of hispanic ethnicity (hisp). You should find that all but age have significant effects at the five percent level.

. logit ins retire age hstatusg hhincome educyear married hisp

Iteration 0:   log likelihood = -2139.7712  
Iteration 1:   log likelihood = -1996.7434  
Iteration 2:   log likelihood = -1994.8864  
Iteration 3:   log likelihood = -1994.8784  
Iteration 4:   log likelihood = -1994.8784  

Logistic regression                                     Number of obs =  3,206
                                                        LR chi2(7)    = 289.79
                                                        Prob > chi2   = 0.0000
Log likelihood = -1994.8784                             Pseudo R2     = 0.0677

─────────────┬────────────────────────────────────────────────────────────────
         ins │ Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
      retire │   .1969297   .0842067     2.34   0.019     .0318875    .3619718
         age │  -.0145955   .0112871    -1.29   0.196    -.0367178    .0075267
    hstatusg │   .3122654   .0916739     3.41   0.001     .1325878     .491943
    hhincome │   .0023036    .000762     3.02   0.003       .00081    .0037972
    educyear │   .1142626   .0142012     8.05   0.000     .0864288    .1420963
     married │    .578636   .0933198     6.20   0.000     .3957327    .7615394
        hisp │  -.8103059   .1957522    -4.14   0.000    -1.193973   -.4266387
       _cons │  -1.715578   .7486219    -2.29   0.022     -3.18285   -.2483064
─────────────┴────────────────────────────────────────────────────────────────

. 
> m3 <- glm(ins ~ retire + age + hstatusg + hhincome + educyear + married + hisp,
+   family = binomial, data = ct)
> summary(m3)

Call:
glm(formula = ins ~ retire + age + hstatusg + hhincome + educyear + 
    married + hisp, family = binomial, data = ct)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-2.456  -1.009  -0.703   1.224   2.373  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.715578   0.748622  -2.292 0.021926 *  
retire       0.196930   0.084207   2.339 0.019354 *  
age         -0.014596   0.011287  -1.293 0.195969    
hstatusg     0.312265   0.091674   3.406 0.000659 ***
hhincome     0.002304   0.000762   3.023 0.002503 ** 
educyear     0.114263   0.014201   8.046 8.55e-16 ***
married      0.578636   0.093320   6.201 5.63e-10 ***
hisp        -0.810306   0.195751  -4.139 3.48e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4279.5  on 3205  degrees of freedom
Residual deviance: 3989.8  on 3198  degrees of freedom
AIC: 4005.8

Number of Fisher Scoring iterations: 4

All coefficients other than age are indeed significant at the 5% level, as indicated by the z-statistics.

(d) Interpret carefully the odds ratio for hispanics, comparing it with the result of part (b). Test the significance of the ethnicity effect using a Wald test and a likelihood ratio test.

. di exp(_b[hisp])
.44472199

. di (_b[hisp]/_se[hisp])^2
17.135029

. estimates store ct

. quietly logit ins retire age hstatusg hhincome educyear married 

. lrtest . ct

Likelihood-ratio test
Assumption: . nested within ct

 LR chi2(1) =  19.46
Prob > chi2 = 0.0000
> b <- coef(m3)
> exp(b["hisp"])
    hisp 
0.444722 
> se = sqrt(vcov(m3)["hisp","hisp"])
> (b["hisp"]/se)^2
    hisp 
17.13528 
> deviance(update(m3, ~. - hisp)) - deviance(m3)
[1] 19.46244

The odds ratio for hispanics adjusted for the other variables is 0.445. That means that the odds of hispanics having insurance are about 45% of (a bit below half) the odds of non-hispanics with the same retirement status, age, health status, household income, education and marital status. Clearly hispanics are somewhat disadvantaged on at least some of these variables, so adjustment reduces the observed disparity, which nevertheless remains substantial. The ratio is significantly different from one, with a Wald chi-squared of 17.14 and a likelihood ratio chi-squared of 19.46 on one d.f.

(e) Turns out these estimates are sensitive to the specification of household income. An obvious alternative is log-income, but some households have no income. Instead we will group this variable into quartiles, which has been done in qhhinc. Refit the model using this alternative specification and comment on the odds ratio for hispanics. Use this alternative specification in what follows.

. logit ins retire age hstatusg i.qhhinc educyear married hisp

Iteration 0:   log likelihood = -2139.7712  
Iteration 1:   log likelihood = -1910.9252  
Iteration 2:   log likelihood = -1903.6387  
Iteration 3:   log likelihood = -1903.6075  
Iteration 4:   log likelihood = -1903.6075  

Logistic regression                                     Number of obs =  3,206
                                                        LR chi2(9)    = 472.33
                                                        Prob > chi2   = 0.0000
Log likelihood = -1903.6075                             Pseudo R2     = 0.1104

─────────────┬────────────────────────────────────────────────────────────────
         ins │ Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
      retire │   .1809582    .086045     2.10   0.035     .0123131    .3496033
         age │   -.021298   .0117381    -1.81   0.070    -.0443042    .0017082
    hstatusg │   .1046211   .0963592     1.09   0.278    -.0842394    .2934815
             │
      qhhinc │
         Q2  │   1.304387   .1399469     9.32   0.000     1.030096    1.578678
         Q3  │   1.814432   .1472722    12.32   0.000     1.525784    2.103081
         Q4  │   1.837259   .1550189    11.85   0.000     1.533428     2.14109
             │
    educyear │   .0637015     .01502     4.24   0.000     .0342628    .0931402
     married │   .0174329   .1060881     0.16   0.869    -.1904959    .2253617
        hisp │  -.6294994   .2024917    -3.11   0.002    -1.026376   -.2326229
       _cons │  -1.317642    .777195    -1.70   0.090    -2.840916    .2056324
─────────────┴────────────────────────────────────────────────────────────────

. di exp(_b[hisp])
.53285848
> ct <- mutate(ct, qhhinc = factor(qhhinc))
> m4 <- glm(ins ~ retire + age + hstatusg + qhhinc + educyear + married + hisp,
+  family = binomial, data = ct)
> summary(m4)

Call:
glm(formula = ins ~ retire + age + hstatusg + qhhinc + educyear + 
    married + hisp, family = binomial, data = ct)

Deviance Residuals: 
   Min      1Q  Median      3Q     Max  
-1.462  -1.041  -0.515   1.115   2.443  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.31764    0.77719  -1.695  0.09000 .  
retire       0.18096    0.08604   2.103  0.03546 *  
age         -0.02130    0.01174  -1.814  0.06961 .  
hstatusg     0.10462    0.09636   1.086  0.27759    
qhhinc2      1.30439    0.13994   9.321  < 2e-16 ***
qhhinc3      1.81443    0.14727  12.321  < 2e-16 ***
qhhinc4      1.83726    0.15502  11.852  < 2e-16 ***
educyear     0.06370    0.01502   4.241 2.22e-05 ***
married      0.01743    0.10609   0.164  0.86947    
hisp        -0.62950    0.20248  -3.109  0.00188 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4279.5  on 3205  degrees of freedom
Residual deviance: 3807.2  on 3196  degrees of freedom
AIC: 3827.2

Number of Fisher Scoring iterations: 4

> b = coef(m4)
> exp(b["hisp"])
     hisp 
0.5328585 

The alternative specification results in an adjusted odds ratio of 0.533, further reducing the disparity between hispanics and others. It appears that the adjustment is somewhat sensitive to the specification of income. (As I remarked in class, the model based on quartiles of household income seems to fit a bit better than the original. Using log income leads to a similar estimate.)

[2] Marginal and Adjusted Effects

We will estimate several types of marginal effects. I will refer to the calculation based on the derivative as “continuous” and to the unit change as “discrete”.
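For reference, the continuous version is the derivative of the response probability with respect to a covariate, which for the logit model has a simple closed form:

```latex
\pi(x) = \frac{e^{x\beta}}{1 + e^{x\beta}},
\qquad
\frac{\partial \pi}{\partial x_j} = \pi(x)\bigl(1 - \pi(x)\bigr)\,\beta_j
```

The derivative is largest at π = 1/2. The discrete version instead takes the difference in π when the covariate moves from 0 to 1, holding the other covariates fixed.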

(a) Estimate the marginal effect of hispanic using the continuous formula at the mean of all covariates. Try a quick approximation using the formula evaluated at the overall probability. This is known as the marginal effect at the mean.

. quietly sum ins

. di opri * (1 - opri) * _b[hisp]
-.14934911
> p * (1 - p) * b["hisp"]
   mean(ins)
1 -0.1493491

We estimate the difference between hispanics and non-hispanics, evaluated at the overall probability of 38.7%, to be 14.9 percentage points. This is much lower than the original (unadjusted) difference of 25.5 points.
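The arithmetic behind this approximation uses only the overall proportion and the hisp coefficient from the model with income quartiles. A quick check (in Python for brevity, numbers from the output above):

```python
p = 0.3870867        # overall proportion with insurance
b_hisp = -0.6294994  # hisp coefficient from the logit with income quartiles

# derivative formula p(1-p)b evaluated at the overall proportion
mem = p * (1 - p) * b_hisp   # about -0.149
```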

(b) Predict insurance for hispanics and others with the other covariates set to their means, and calculate the difference. This is the unit change version of the marginal effect at the mean.

. forvalues q = 2/4 {
  2.   gen hhincq`q' = qhhinc == `q'
  3. }

. quietly logit ins retire age hstatusg hhincq2-hhincq4 educyear married hisp

. preserve

. collapse retire age hstatusg hhincq2-hhincq4 educyear married hisp

. quietly replace hisp = 0

. predict pr0, pr

. quietly replace hisp = 1

. predict pr1, pr

. di pr1, pr0, pr1 - pr0
.23580998 .36672541 -.13091543

. restore
> X = model.matrix(m4)
> Xbar = colMeans(X)
> Xbar["hisp"] <- 1
> pr1 <- plogis( Xbar %*% b )
> Xbar["hisp"] <- 0
> pr0 <- plogis( Xbar %*% b )
> c(pr1, pr0, pr1 - pr0)
[1]  0.2358100  0.3667254 -0.1309154

We find predicted probabilities of 23.6% for hispanics and 36.7% for others, a difference of 13.1 percentage points. This is a bit smaller than the quick approximation of (a), but in the same ballpark.

(c) Predict insurance for everyone with all variables as they are, and compute the marginal effect of hispanic for each respondent using the continuous approximation. This is known as the average marginal effect.

. predict pr, pr

. gen mec = pr * (1-pr) * _b[hisp]

. sum mec

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         mec │      3,206   -.1295251    .0391199  -.1573748  -.0189168
> pri <- plogis( X %*% b )
> mec <- pri * (1 - pri) * b["hisp"]
> mean(mec)
[1] -0.1295251

The average marginal effect is 12.9 percentage points using the continuous approximation.

(d) Make a copy of hisp for safekeeping. Now set hisp to 1 for everyone, predict the probability of having insurance and average. Next set hisp to 0 for everyone, predict again and average. The difference between these two means is the discrete version of the average marginal effect. Don’t forget to set hisp back to its true value.

. preserve

. quietly replace hisp = 0

. predict pr0, pr

. quietly replace hisp = 1

. predict pr1, pr

. gen med = pr1 - pr0

. sum med

    Variable │        Obs        Mean    Std. dev.       Min        Max
─────────────┼─────────────────────────────────────────────────────────
         med │      3,206   -.1233619    .0414112  -.1560881  -.0246634

. restore
> X[, "hisp"] <- 1
> pr1 <- plogis( X %*% b )
> X[, "hisp"] <- 0
> pr0 <- plogis( X %*% b )
> colMeans(cbind(pr1, pr0, pr1 - pr0))
[1]  0.2698466  0.3932085 -0.1233619

The average marginal effect is 12.3 percentage points using the unit change. (Of all the calculations, this is probably the one that makes the most sense. Many analysts use (c) for continuous variables and (d) for discrete factors.)

(e) The last two approaches avoid setting variables such as married to its mean of 0.773, which is not meaningful. Another approach is to select a combination of predictors of interest and predict for that case. Cameron and Trivedi use a 75-year old married person with good health status, 12 years of education and an income in the third quartile. Calculate the predicted probabilities if the person is hispanic and if not, and compute the difference.

An easy way to do the calculation is to create an observation with the desired attributes, or use one in the dataset if the combination exists. We have plenty of cases except for age. (I forgot to specify a value for retire. Given the age it makes sense to assume yes.)

. preserve

. keep if retire & hstatus == 1 & educyear == 12 & qhhinc == 3 & married    
(3,042 observations deleted)

. keep in 1/2
(162 observations deleted)

. quietly replace age  = 75

. quietly replace hisp = _n == 1

. predict prp, pr

. list hisp prp

     ┌─────────────────┐
     │ hisp        prp │
     ├─────────────────┤
  1. │    1   .3401491 │
  2. │    0   .4917181 │
     └─────────────────┘

. restore
> proto <- matrix(c(
+   1, 1, 75, 1, 0, 1, 0, 12, 1, 1,
+   1, 1, 75, 1, 0, 1, 0, 12, 1, 0), 2, 10, byrow = TRUE)
> xb <- proto %*% coef(m4)
> plogis(xb)
          [,1]
[1,] 0.3401491
[2,] 0.4917181

We predict for the prototype person a probability of 34% if hispanic and 49% if not, a difference of 15 percentage points.
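The same predictions can be reproduced by hand: accumulate the linear predictor for the prototype and apply the inverse logit. A sketch (in Python for brevity), with the coefficients transcribed from the logit output of part 1 (e):

```python
from math import exp

def invlogit(xb):
    return 1 / (1 + exp(-xb))

# retired, age 75, good health, income in Q3, 12 years of education, married
xb = (-1.317642            # constant
      + 0.1809582 * 1      # retire
      - 0.021298  * 75     # age
      + 0.1046211 * 1      # hstatusg
      + 1.814432  * 1      # qhhinc Q3
      + 0.0637015 * 12     # educyear
      + 0.0174329 * 1)     # married

pr_hisp  = invlogit(xb - 0.6294994)  # hispanic: about 0.34
pr_other = invlogit(xb)              # non-hispanic: about 0.49
```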

Note: Stata’s powerful margins command can do these calculations, but you should do them “by hand”, so you know exactly what’s being done.

[3] Goodness of Fit

(a) With individual data the deviance does not have a χ2 distribution, and even with household income grouped into categories there are too many covariate patterns to trust the asymptotics. Calculate the Hosmer-Lemeshow goodness of fit test using ten groups of approximately equal size based on predicted probabilities.

. estat gof, group(10) table
note: obs collapsed on 10 quantiles of estimated probabilities.

Goodness-of-fit test after logistic model
Variable: ins

  Table collapsed on quantiles of estimated probabilities
  ┌───────┬────────┬───────┬───────┬───────┬───────┬───────┐
  │ Group │   Prob │ Obs_1 │ Exp_1 │ Obs_0 │ Exp_0 │ Total │
  ├───────┼────────┼───────┼───────┼───────┼───────┼───────┤
  │     1 │ 0.1169 │    20 │  26.8 │   301 │ 294.2 │   321 │
  │     2 │ 0.1497 │    40 │  42.9 │   283 │ 280.1 │   323 │
  │     3 │ 0.3223 │    77 │  68.1 │   241 │ 249.9 │   318 │
  │     4 │ 0.3824 │   122 │ 114.5 │   199 │ 206.5 │   321 │
  │     5 │ 0.4367 │   131 │ 130.5 │   190 │ 190.5 │   321 │
  ├───────┼────────┼───────┼───────┼───────┼───────┼───────┤
  │     6 │ 0.4922 │   155 │ 150.6 │   166 │ 170.4 │   321 │
  │     7 │ 0.5236 │   173 │ 167.1 │   156 │ 161.9 │   329 │
  │     8 │ 0.5448 │   164 │ 167.8 │   149 │ 145.2 │   313 │
  │     9 │ 0.5817 │   171 │ 179.2 │   148 │ 139.8 │   319 │
  │    10 │ 0.6568 │   188 │ 193.5 │   132 │ 126.5 │   320 │
  └───────┴────────┴───────┴───────┴───────┴───────┴───────┘

 Number of observations =  3,206
       Number of groups =     10
Hosmer–Lemeshow chi2(8) =   6.43
            Prob > chi2 = 0.5988
> source("https://grodri.github.io/glms/r/hosmer.R")
> hosmer(ct$ins, fitted(m4))
                0   1      e.0       e.1
[0.031,0.117] 301  20 294.2278  26.77222
(0.117,0.15]  283  40 280.0543  42.94574
(0.15,0.322]  241  77 249.8864  68.11363
(0.322,0.382] 199 122 206.4775 114.52255
(0.382,0.437] 190 131 190.5101 130.48987
(0.437,0.492] 166 155 170.3789 150.62107
(0.492,0.524] 156 173 161.9098 167.09021
(0.524,0.545] 149 164 145.2456 167.75442
(0.545,0.582] 148 171 139.8297 179.17029
(0.582,0.657] 132 188 126.4800 193.52002
             test groups   chi.sq    pvalue
1 Hosmer-Lemeshow     10 6.433716 0.5987689

We get a chi-squared of 6.43 on 8 d.f. which is quite reassuring, with a p-value of 0.6.
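The statistic itself is just a Pearson chi-squared over the 10 × 2 table of observed and expected counts, Σ (O − E)²/E. A quick check (in Python for brevity) using the rounded expected counts printed above:

```python
obs_1 = [20, 40, 77, 122, 131, 155, 173, 164, 171, 188]
exp_1 = [26.8, 42.9, 68.1, 114.5, 130.5, 150.6, 167.1, 167.8, 179.2, 193.5]
obs_0 = [301, 283, 241, 199, 190, 166, 156, 149, 148, 132]
exp_0 = [294.2, 280.1, 249.9, 206.5, 190.5, 170.4, 161.9, 145.2, 139.8, 126.5]

# sum (O - E)^2 / E over all 20 cells of the table
chi2 = sum((o - e) ** 2 / e
           for o, e in list(zip(obs_1, exp_1)) + list(zip(obs_0, exp_0)))
# about 6.46 with the rounded expected counts; 6.43 with full precision
```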

(b) Compute predicted probabilities of having insurance and predict that a respondent will have insurance if the predicted probability is 0.5 or more. Tabulate predicted versus actual outcomes. What’s the overall error rate? Comment on the numbers of false positives and false negatives. How well would we do if we just predicted that nobody would buy supplemental insurance? How much reduction in error did the model achieve?

. predict pri, pr

. gen preins = pri >= 0.5

. tab preins ins

           │          ins
    preins │         0          1 │     Total
───────────┼──────────────────────┼──────────
         0 │     1,421        587 │     2,008 
         1 │       544        654 │     1,198 
───────────┼──────────────────────┼──────────
     Total │     1,965      1,241 │     3,206 
> ct <- mutate(ct, predins = fitted(m4) >= 0.5)
> group_by(ct, predins, ins) %>% tally()
# A tibble: 4 × 3
# Groups:   predins [2]
  predins   ins     n
  <lgl>   <dbl> <int>
1 FALSE       0  1421
2 FALSE       1   587
3 TRUE        0   544
4 TRUE        1   654

We predict correctly 654 + 1,421 = 2,075 cases, or 64.7%, so the error rate is 35.3%. The false positives are too high: we get 45% wrong (544 of the 1,198 predicted to have insurance). The false negatives are a bit better: we get 29% wrong (587 of the 2,008 predicted not to have insurance). The overall proportion with insurance was 38.7%, so if we predicted that nobody buys supplemental insurance we would get only 38.7% wrong. With the model we reduce that to 35.3%. The proportionate reduction in error is only 9% (or 8.86%).
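The error-rate arithmetic, spelled out as a quick check (in Python for brevity, counts from the 2 × 2 table above):

```python
# rows: predicted no/yes, columns: actual no/yes, from the tabulation above
tn, fn = 1421, 587   # predicted no insurance
fp, tp = 544, 654    # predicted insurance

n = tn + fn + fp + tp                # 3,206 respondents
error_model = (fn + fp) / n          # about 0.353
error_null = (tp + fn) / n           # about 0.387, predicting "nobody buys"
pre = (error_null - error_model) / error_null   # about 0.089
```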

[4] Probits and OLS

(a) Fit the model using a probit specification, compute the continuous marginal effect using the overall probability of having insurance, and compare with the results of part 2 (a).

. probit ins retire age hstatusg hhincq2-hhincq4 educyear married hisp

Iteration 0:   log likelihood = -2139.7712  
Iteration 1:   log likelihood = -1904.9874  
Iteration 2:   log likelihood = -1901.9111  
Iteration 3:   log likelihood = -1901.9053  
Iteration 4:   log likelihood = -1901.9053  

Probit regression                                       Number of obs =  3,206
                                                        LR chi2(9)    = 475.73
                                                        Prob > chi2   = 0.0000
Log likelihood = -1901.9053                             Pseudo R2     = 0.1112

─────────────┬────────────────────────────────────────────────────────────────
         ins │ Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
      retire │   .1090073   .0522262     2.09   0.037     .0066458    .2113687
         age │  -.0125164   .0071196    -1.76   0.079    -.0264706    .0014378
    hstatusg │   .0722337   .0580082     1.25   0.213    -.0414602    .1859276
     hhincq2 │    .751445   .0786195     9.56   0.000     .5973537    .9055363
     hhincq3 │   1.067105   .0835774    12.77   0.000     .9032965    1.230914
     hhincq4 │   1.081014   .0884047    12.23   0.000     .9077443    1.254284
    educyear │   .0395948   .0090395     4.38   0.000     .0218777     .057312
     married │   .0170883   .0633161     0.27   0.787     -.107009    .1411855
        hisp │  -.3790227   .1162247    -3.26   0.001     -.606819   -.1512265
       _cons │    -.81568   .4718992    -1.73   0.084    -1.740585    .1092254
─────────────┴────────────────────────────────────────────────────────────────

. scalar eta = invnormal(opri)

. di normalden(eta) * _b[hisp]
-.14511058
> m5 <- update(m4, family = binomial(probit))
> xb = qnorm(as.numeric(p))
> dnorm(xb) * coef(m5)["hisp"]
      hisp 
-0.1451103 

We estimate the marginal effect at the mean to be 14.5 percentage points using the derivative. This is very close to the comparable logit estimate of 14.9 obtained in part 2 (a). (Incidentally, the coefficient of -0.38 tells us that the net utility of buying supplemental insurance for hispanics is 0.38 standard deviations lower than for comparable non-hispanics.)
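The same check can be done with the standard-normal quantile and density, here (in Python for brevity) via the standard library's statistics.NormalDist, with the numbers taken from the output above:

```python
from statistics import NormalDist

p = 0.3870867        # overall proportion with insurance
b_hisp = -0.3790227  # hisp coefficient from the probit

nd = NormalDist()
eta = nd.inv_cdf(p)           # probit of the overall proportion
mem = nd.pdf(eta) * b_hisp    # derivative formula phi(eta)*b, about -0.145
```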

(b) Fit the same model using OLS. The coefficient of hisp is often justified as a quick estimate of the average marginal effect of ethnicity adjusted for the other variables. How does it compare with the logit and probit estimates?

. reg ins retire age hstatusg hhincq2-hhincq4 educyear married hisp

      Source │       SS           df       MS      Number of obs   =     3,206
─────────────┼──────────────────────────────────   F(9, 3196)      =     53.40
       Model │  99.4266617         9  11.0474069   Prob > F        =    0.0000
    Residual │  661.198728     3,196  .206883207   R-squared       =    0.1307
─────────────┼──────────────────────────────────   Adj R-squared   =    0.1283
       Total │   760.62539     3,205  .237324615   Root MSE        =    .45484

─────────────┬────────────────────────────────────────────────────────────────
         ins │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
      retire │   .0364401   .0177052     2.06   0.040     .0017254    .0711549
         age │  -.0038497   .0023577    -1.63   0.103    -.0084724     .000773
    hstatusg │   .0200136   .0192637     1.04   0.299    -.0177568    .0577841
     hhincq2 │    .214232   .0247282     8.66   0.000     .1657472    .2627169
     hhincq3 │   .3435194   .0269492    12.75   0.000     .2906799    .3963589
     hhincq4 │   .3554378   .0287588    12.36   0.000     .2990503    .4118254
    educyear │   .0115776   .0029295     3.95   0.000     .0058336    .0173215
     married │   .0055474    .020842     0.27   0.790    -.0353177    .0464124
        hisp │  -.0793971   .0329527    -2.41   0.016    -.1440076   -.0147866
       _cons │   .2434602   .1563759     1.56   0.120     -.063147    .5500673
─────────────┴────────────────────────────────────────────────────────────────
> update(m4, family = gaussian)

Call:  glm(formula = ins ~ retire + age + hstatusg + qhhinc + educyear + 
    married + hisp, family = gaussian, data = ct)

Coefficients:
(Intercept)       retire          age     hstatusg      qhhinc2      qhhinc3  
   0.243460     0.036440    -0.003850     0.020014     0.214232     0.343519  
    qhhinc4     educyear      married         hisp  
   0.355438     0.011578     0.005547    -0.079397  

Degrees of Freedom: 3205 Total (i.e. Null);  3196 Residual
Null Deviance:	    760.6 
Residual Deviance: 661.2 	AIC: 4059

OLS estimates the average marginal effect as 7.9 percentage points, much lower than the previous estimates. Although the OLS coefficient is more directly comparable to the average marginal effect than to the marginal effect at the mean, it falls well short of both.