Cameron and Trivedi (2010) have an interesting dataset based on wave
5 of the Health and Retirement Study (HRS), a survey conducted in 2002
as part of a panel sponsored by NIH. The sample consists of Medicare
beneficiaries and the question of interest is whether or not they
purchase supplemental insurance (ins). The explanatory
variables include socio-economic and demographic factors and an
indicator of health status. The data are available from the Stata
website, but for your convenience also at
https://grodri.github.io/datasets/mus14data.dta.
We start by reading the data
. use https://grodri.github.io/datasets/mus14data, clear
(Data from Cameron and Trivedi's (2009) Microeconometrics Using Stata)
> library(haven)
> ct <- read_dta("https://grodri.github.io/datasets/mus14data.dta")
(a) What’s the proportion of respondents who have supplemental insurance? What are the odds of having insurance? What’s the logit? Construct a 95% confidence interval for the logit and translate it back into corresponding intervals for the odds and probability.
. sum ins
Variable │ Obs Mean Std. dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
ins │ 3,206 .3870867 .4871597 0 1
. di r(mean)/(1-r(mean))
.63155216
. di log(r(mean)/(1-r(mean)))
-.45957474
. scalar opri = r(mean) // save for later use
> library(dplyr)
> p <- summarize(ct, mean(ins)); p
# A tibble: 1 × 1
`mean(ins)`
<dbl>
1 0.387
> p/(1 - p)
mean(ins)
1 0.6315522
> log(p/(1 - p))
mean(ins)
1 -0.4595747
Overall, 38.7% of respondents have supplemental insurance. The odds of having insurance are 0.63 to one, roughly two persons with insurance for every three without. The logit is -0.46. I gave a formula for the variance in class, but we can also use the null model to do the calculations.
. logit ins
Iteration 0: log likelihood = -2139.7712
Iteration 1: log likelihood = -2139.7712
Logistic regression Number of obs = 3,206
LR chi2(0) = 0.00
Prob > chi2 = .
Log likelihood = -2139.7712 Pseudo R2 = 0.0000
─────────────┬────────────────────────────────────────────────────────────────
ins │ Coefficient Std. err. z P>|z| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
_cons │ -.4595747 .0362589 -12.67 0.000 -.5306409 -.3885086
─────────────┴────────────────────────────────────────────────────────────────
. scalar z = invnormal(0.975)
. di exp(_b[_cons] - z * _se[_cons]), exp(_b[_cons] + z * _se[_cons])
.58822787 .6780674
. di invlogit(_b[_cons] - z * _se[_cons]), invlogit(_b[_cons] + z * _se[_cons])
.37036743 .40407637
> m1 <- glm(ins ~ 1, family=binomial, data = ct)
> summary(m1)
Call:
glm(formula = ins ~ 1, family = binomial, data = ct)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.9895 -0.9895 -0.9895 1.3778 1.3778
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.45957 0.03626 -12.68 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4279.5 on 3205 degrees of freedom
Residual deviance: 4279.5 on 3205 degrees of freedom
AIC: 4281.5
Number of Fisher Scoring iterations: 4
> delta <- qnorm(0.975) * sqrt(vcov(m1))
> ci <-c(coef(m1) - delta, coef(m1) + delta)
> exp(ci)
[1] 0.5882279 0.6780674
> plogis(ci)
[1] 0.3703674 0.4040764
The odds are between 0.588 and 0.678 with 95% confidence. The probability is between 37.0 and 40.4% with 95% confidence.
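The chain of transformations can be checked with plain arithmetic. A minimal sketch in Python (not one of the document's languages), using the sample size and proportion printed above and the large-sample formula var(logit) = 1/(np(1 − p)), which I take to be the class formula referred to:

```python
from math import exp, log, sqrt

# Sample size and proportion insured, copied from the summary above
n, p = 3206, 0.3870867

logit = log(p / (1 - p))           # sample logit, about -0.46
se = 1 / sqrt(n * p * (1 - p))     # large-sample se; matches the null model's 0.0363
z = 1.959964                       # normal critical value for 95% confidence

lo, hi = logit - z * se, logit + z * se
odds_ci = (exp(lo), exp(hi))                          # CI for the odds
prob_ci = (1 / (1 + exp(-lo)), 1 / (1 + exp(-hi)))    # CI for the probability
```

The standard error from the formula agrees with the one from the null logit model, and exponentiating or inverse-logiting the endpoints reproduces the intervals in the text.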
(b) Hispanics are less likely to have supplemental insurance than others. Estimate the proportions with insurance among hispanics and others and calculate the difference. Estimate and interpret the odds ratio, and test its significance using a Wald test and a likelihood ratio test.
. tabstat ins, by(hisp)
Summary for variables: ins
Group variable: hisp
hisp │ Mean
─────────┼──────────
0 │ .4056509
1 │ .1502146
─────────┼──────────
Total │ .3870867
─────────┴──────────
. logit ins hisp
Iteration 0: log likelihood = -2139.7712
Iteration 1: log likelihood = -2106.442
Iteration 2: log likelihood = -2106.0567
Iteration 3: log likelihood = -2106.0559
Iteration 4: log likelihood = -2106.0559
Logistic regression Number of obs = 3,206
LR chi2(1) = 67.43
Prob > chi2 = 0.0000
Log likelihood = -2106.0559 Pseudo R2 = 0.0158
─────────────┬────────────────────────────────────────────────────────────────
ins │ Coefficient Std. err. z P>|z| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
hisp │ -1.350945 .1871284 -7.22 0.000 -1.71771 -.9841799
_cons │ -.3819741 .0373513 -10.23 0.000 -.4551813 -.3087669
─────────────┴────────────────────────────────────────────────────────────────
. di exp(_b[hisp])
.25899543
. di (_b[hisp]/_se[hisp])^2
52.118952
. di e(chi2)
67.4306
> group_by(ct, hisp) %>% summarize(mean(ins))
# A tibble: 2 × 2
hisp `mean(ins)`
<dbl> <dbl>
1 0 0.406
2 1 0.150
> m2 <- glm(ins ~ hisp, family=binomial, data = ct)
> exp(coef(m2)["hisp"])
hisp
0.2589954
> deviance(m1) - deviance(m2)
[1] 67.4306
The proportion with insurance is 15.0% among hispanics and 40.6% among others, a difference of 25.5 percentage points. The odds ratio is 0.259 (by hand from the proportions or by exponentiating the coefficient of hisp), so the odds of having insurance among hispanics are about a quarter the odds among non-hispanics. The Wald test given in the output as z = -7.22 is equivalent to a chi-squared of 52.12 on one d.f. The likelihood ratio test is 67.43, also on one d.f.
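The "by hand" calculation is worth spelling out. A quick check in Python, using the group proportions from the tabulation and the coefficient and standard error from the logit output:

```python
# Proportions with insurance by ethnicity, from the tabstat output above
p_hisp, p_other = 0.1502146, 0.4056509

# Odds ratio from the two sample odds
odds_hisp = p_hisp / (1 - p_hisp)
odds_other = p_other / (1 - p_other)
oratio = odds_hisp / odds_other        # about 0.259, matching exp(_b[hisp])

# The Wald chi-squared is the squared z-ratio of the coefficient
b, se = -1.350945, 0.1871284
wald = (b / se) ** 2                   # about 52.12
```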
(c) Let us jump directly to the model used by Cameron and Trivedi.
Fit a logit model using retirement status (retire),
age, health status (hstatusg, coded 1 for
good, very good or excellent and 0 otherwise), household income
(hhincome), education in years (educyear), the
indicator for married, and the indicator of hispanic
ethnicity (hisp). You should find that all but age have
significant effects at the five percent level.
. logit ins retire age hstatusg hhincome educyear married hisp
Iteration 0: log likelihood = -2139.7712
Iteration 1: log likelihood = -1996.7434
Iteration 2: log likelihood = -1994.8864
Iteration 3: log likelihood = -1994.8784
Iteration 4: log likelihood = -1994.8784
Logistic regression Number of obs = 3,206
LR chi2(7) = 289.79
Prob > chi2 = 0.0000
Log likelihood = -1994.8784 Pseudo R2 = 0.0677
─────────────┬────────────────────────────────────────────────────────────────
ins │ Coefficient Std. err. z P>|z| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
retire │ .1969297 .0842067 2.34 0.019 .0318875 .3619718
age │ -.0145955 .0112871 -1.29 0.196 -.0367178 .0075267
hstatusg │ .3122654 .0916739 3.41 0.001 .1325878 .491943
hhincome │ .0023036 .000762 3.02 0.003 .00081 .0037972
educyear │ .1142626 .0142012 8.05 0.000 .0864288 .1420963
married │ .578636 .0933198 6.20 0.000 .3957327 .7615394
hisp │ -.8103059 .1957522 -4.14 0.000 -1.193973 -.4266387
_cons │ -1.715578 .7486219 -2.29 0.022 -3.18285 -.2483064
─────────────┴────────────────────────────────────────────────────────────────
> m3 <- glm(ins ~ retire + age + hstatusg + hhincome + educyear + married + hisp,
+ family = binomial, data = ct)
> summary(m3)
Call:
glm(formula = ins ~ retire + age + hstatusg + hhincome + educyear +
married + hisp, family = binomial, data = ct)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.456 -1.009 -0.703 1.224 2.373
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.715578 0.748622 -2.292 0.021926 *
retire 0.196930 0.084207 2.339 0.019354 *
age -0.014596 0.011287 -1.293 0.195969
hstatusg 0.312265 0.091674 3.406 0.000659 ***
hhincome 0.002304 0.000762 3.023 0.002503 **
educyear 0.114263 0.014201 8.046 8.55e-16 ***
married 0.578636 0.093320 6.201 5.63e-10 ***
hisp -0.810306 0.195751 -4.139 3.48e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4279.5 on 3205 degrees of freedom
Residual deviance: 3989.8 on 3198 degrees of freedom
AIC: 4005.8
Number of Fisher Scoring iterations: 4
All coefficients other than age are indeed significant at the 5% level, as indicated by the z-statistics.
(d) Interpret carefully the odds ratio for hispanics, comparing it with the result of part (b). Test the significance of the ethnicity effect using a Wald test and a likelihood ratio test.
. di exp(_b[hisp])
.44472199
. di (_b[hisp]/_se[hisp])^2
17.135029
. estimates store ct
. quietly logit ins retire age hstatusg hhincome educyear married
. lrtest . ct
Likelihood-ratio test
Assumption: . nested within ct
LR chi2(1) = 19.46
Prob > chi2 = 0.0000
> b <- coef(m3)
> exp(b["hisp"])
hisp
0.444722
> se = sqrt(vcov(m3)["hisp","hisp"])
> (b["hisp"]/se)^2
hisp
17.13528
> deviance(update(m3, ~. - hisp)) - deviance(m3)
[1] 19.46244
The odds ratio for hispanics adjusted for the other variables is 0.445. That means that the odds of hispanics having insurance are about 45% (a bit below half) the odds of non-hispanics with the same retirement status, age, health status, household income, education and marital status. Clearly hispanics are somewhat disadvantaged on at least some of these variables, so adjustment reduces the observed disparity, which nevertheless remains substantial. The ratio is significantly different from one, with a Wald chi-squared of 17.14 and a likelihood ratio chi-squared of 19.46 on one d.f.
(e) Turns out these estimates are sensitive to the specification of
household income. An obvious alternative is log-income, but some
households have no income. Instead we will group this variable into
quartiles, which has been done in qhhinc. Refit the model
using this alternative specification and comment on the odds ratio for
hispanics. Use this alternative specification in what follows.
. logit ins retire age hstatusg i.qhhinc educyear married hisp
Iteration 0: log likelihood = -2139.7712
Iteration 1: log likelihood = -1910.9252
Iteration 2: log likelihood = -1903.6387
Iteration 3: log likelihood = -1903.6075
Iteration 4: log likelihood = -1903.6075
Logistic regression Number of obs = 3,206
LR chi2(9) = 472.33
Prob > chi2 = 0.0000
Log likelihood = -1903.6075 Pseudo R2 = 0.1104
─────────────┬────────────────────────────────────────────────────────────────
ins │ Coefficient Std. err. z P>|z| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
retire │ .1809582 .086045 2.10 0.035 .0123131 .3496033
age │ -.021298 .0117381 -1.81 0.070 -.0443042 .0017082
hstatusg │ .1046211 .0963592 1.09 0.278 -.0842394 .2934815
│
qhhinc │
Q2 │ 1.304387 .1399469 9.32 0.000 1.030096 1.578678
Q3 │ 1.814432 .1472722 12.32 0.000 1.525784 2.103081
Q4 │ 1.837259 .1550189 11.85 0.000 1.533428 2.14109
│
educyear │ .0637015 .01502 4.24 0.000 .0342628 .0931402
married │ .0174329 .1060881 0.16 0.869 -.1904959 .2253617
hisp │ -.6294994 .2024917 -3.11 0.002 -1.026376 -.2326229
_cons │ -1.317642 .777195 -1.70 0.090 -2.840916 .2056324
─────────────┴────────────────────────────────────────────────────────────────
. di exp(_b[hisp])
.53285848
> ct <- mutate(ct, qhhinc = factor(qhhinc))
> m4 <- glm(ins ~ retire + age + hstatusg + qhhinc + educyear + married + hisp,
+ family = binomial, data = ct)
> summary(m4)
Call:
glm(formula = ins ~ retire + age + hstatusg + qhhinc + educyear +
married + hisp, family = binomial, data = ct)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.462 -1.041 -0.515 1.115 2.443
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.31764 0.77719 -1.695 0.09000 .
retire 0.18096 0.08604 2.103 0.03546 *
age -0.02130 0.01174 -1.814 0.06961 .
hstatusg 0.10462 0.09636 1.086 0.27759
qhhinc2 1.30439 0.13994 9.321 < 2e-16 ***
qhhinc3 1.81443 0.14727 12.321 < 2e-16 ***
qhhinc4 1.83726 0.15502 11.852 < 2e-16 ***
educyear 0.06370 0.01502 4.241 2.22e-05 ***
married 0.01743 0.10609 0.164 0.86947
hisp -0.62950 0.20248 -3.109 0.00188 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4279.5 on 3205 degrees of freedom
Residual deviance: 3807.2 on 3196 degrees of freedom
AIC: 3827.2
Number of Fisher Scoring iterations: 4
> b = coef(m4)
> exp(b["hisp"])
hisp
0.5328585
The alternative specification results in an adjusted odds ratio of 0.533, further reducing the disparity between hispanics and others. It appears that the adjustment is somewhat sensitive to the specification of income. (As I remarked in class, the model based on quartiles of household income seems to fit a bit better than the original. Using log income leads to a similar estimate.)
We will estimate several types of marginal effects. I will refer to the calculation based on the derivative as “continuous” and to the unit change as “discrete”.
(a) Estimate the marginal effect of hispanic using the continuous formula at the mean of all covariates. Try a quick approximation using the formula evaluated at the overall probability. This is known as the marginal effect at the mean.
. quietly sum ins
. di opri * (1 - opri) * _b[hisp]
-.14934911
> p * (1 - p) * b["hisp"]
   mean(ins)
1 -0.1493491
We estimate the difference between hispanics and non-hispanics, evaluated at the overall probability of 38.7%, to be 14.9 percentage points. This is much lower than the original (unadjusted) difference of 25.5 points.
(b) Predict insurance for hispanics and others with the other covariates set to their means, and calculate the difference. This is the unit change version of the marginal effect at the mean.
. forvalues q = 2/4 {
2. gen hhincq`q' = qhhinc == `q'
3. }
. quietly logit ins retire age hstatusg hhincq2-hhincq4 educyear married hisp
. preserve
. collapse retire age hstatusg hhincq2-hhincq4 educyear married hisp
. quietly replace hisp = 0
. predict pr0, pr
. quietly replace hisp = 1
. predict pr1, pr
. di pr1, pr0, pr1 - pr0
.23580998 .36672541 -.13091543
. restore
> X = model.matrix(m4)
> Xbar = colMeans(X)
> Xbar["hisp"] <- 1
> pr1 <- plogis( Xbar %*% b )
> Xbar["hisp"] <- 0
> pr0 <- plogis( Xbar %*% b )
> c(pr1, pr0, pr1 - pr0)
[1]  0.2358100  0.3667254 -0.1309154
We find predicted probabilities of 23.6% for hispanics and 36.7% for others, a difference of 13.1 percentage points. A bit smaller than the quick approximation of (a) but in the same ballpark.
(c) Predict insurance for everyone with all variables as they are, and compute the marginal effect of hispanic for each respondent using the continuous approximation. This is known as the average marginal effect.
. predict pr, pr
. gen mec = pr * (1-pr) * _b[hisp]
. sum mec
Variable │ Obs Mean Std. dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
mec │ 3,206 -.1295251 .0391199 -.1573748 -.0189168
> pri <- plogis( X %*% b )
> mec <- pri * (1 - pri) * b["hisp"]
> mean(mec)
[1] -0.1295251
The average marginal effect is 12.9 percentage points using the continuous approximation.
(d) Make a copy of hisp for safekeeping. Now set
hisp to 1 for everyone, predict the probability of having
insurance and average. Next set hisp to 0 for everyone,
predict again and average. The difference between these two means is the
discrete version of the average marginal effect. Don’t forget to set
hisp back to its true value.
. preserve
. quietly replace hisp = 0
. predict pr0, pr
. quietly replace hisp = 1
. predict pr1, pr
. gen med = pr1 - pr0
. sum med
Variable │ Obs Mean Std. dev. Min Max
─────────────┼─────────────────────────────────────────────────────────
med │ 3,206 -.1233619 .0414112 -.1560881 -.0246634
. restore
> X[, "hisp"] <- 1
> pr1 <- plogis( X %*% b )
> X[, "hisp"] <- 0
> pr0 <- plogis( X %*% b )
> colMeans(cbind(pr1, pr0, pr1 - pr0))
[1]  0.2698466  0.3932085 -0.1233619
The average marginal effect is 12.3 percentage points using the unit change. (Of all the calculations, this is probably the one that makes the most sense. Many analysts use (c) for continuous variables and (d) for discrete factors.)
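The recipe generalizes to any fitted logit: predict with the indicator set to one for everyone, predict with it set to zero, and average the difference. A toy sketch in Python with made-up coefficients (not the fitted ones) and a handful of covariate values:

```python
from math import exp

def invlogit(z):
    return 1 / (1 + exp(-z))

# Toy model (hypothetical coefficients): logit p = -0.5 + 0.3*z - 0.8*x
zs = [0, 1, 2]                  # observed values of the other covariate
b0, bz, bx = -0.5, 0.3, -0.8

# Discrete average marginal effect of x: average over the observed z's of
# the difference in predicted probabilities with x = 1 versus x = 0
ame = sum(invlogit(b0 + bz * z + bx) - invlogit(b0 + bz * z) for z in zs) / len(zs)
```

Note that the effect of x on the probability scale differs across observations even though the coefficient is constant on the logit scale, which is why we average.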
(e) The last two approaches avoid setting variables such as
married to its mean of 0.773, which is not meaningful.
Another approach is to select a combination of predictors of interest
and predict for that case. Cameron and Trivedi use a 75-year old married
person with good health status, 12 years of education and an income in
the third quartile. Calculate the predicted probabilities if the person
is hispanic and if not, and compute the difference.
An easy way to do the calculation is to create an observation with
the desired attributes, or use one in the dataset if the combination
exists. We have plenty of cases except for age. (I forgot to specify a
value for retire. Given the age it makes sense to assume
yes.)
. preserve
. keep if retire & hstatus == 1 & educyear == 12 & qhhinc == 3 & married
(3,042 observations deleted)
. keep in 1/2
(162 observations deleted)
. quietly replace age = 75
. quietly replace hisp = _n == 1
. predict prp, pr
. list hisp prp
┌─────────────────┐
│ hisp prp │
├─────────────────┤
1. │ 1 .3401491 │
2. │ 0 .4917181 │
└─────────────────┘
. restore
> proto <- matrix(c(
+ 1, 1, 75, 1, 0, 1, 0, 12, 1, 1,
+ 1, 1, 75, 1, 0, 1, 0, 12, 1, 0), 2, 10, byrow = TRUE)
> xb <- proto %*% coef(m4)
> plogis(xb)
[,1]
[1,] 0.3401491
[2,] 0.4917181
We predict for the prototype person a probability of 34% if hispanic and 49% if not, a difference of 15 percentage points.
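The prediction can be reproduced by hand from the printed coefficients. A sketch in Python using the rounded estimates from the R summary of the quartile model (so the results agree to about three decimals):

```python
from math import exp

def invlogit(z):
    return 1 / (1 + exp(-z))

# Coefficients as printed in the R summary of m4, rounded to 5 decimals
b = {"_cons": -1.31764, "retire": 0.18096, "age": -0.02130, "hstatusg": 0.10462,
     "qhhinc3": 1.81443, "educyear": 0.06370, "married": 0.01743, "hisp": -0.62950}

# Prototype: retired, 75 years old, good health, income in the third
# quartile, 12 years of education, married
xb = (b["_cons"] + b["retire"] + 75 * b["age"] + b["hstatusg"]
      + b["qhhinc3"] + 12 * b["educyear"] + b["married"])
pr_hisp, pr_other = invlogit(xb + b["hisp"]), invlogit(xb)
```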
Note: Stata’s powerful margins command can do these
calculations, but you should do them “by hand”, so you know exactly
what’s being done.
(a) With individual data the deviance does not have a χ2 distribution, and even with household income grouped into categories there are too many covariate patterns to trust the asymptotics. Calculate the Hosmer-Lemeshow goodness of fit test using ten groups of approximately equal size based on predicted probabilities.
. estat gof, group(10) table
note: obs collapsed on 10 quantiles of estimated probabilities.
Goodness-of-fit test after logistic model
Variable: ins
Table collapsed on quantiles of estimated probabilities
┌───────┬────────┬───────┬───────┬───────┬───────┬───────┐
│ Group │ Prob │ Obs_1 │ Exp_1 │ Obs_0 │ Exp_0 │ Total │
├───────┼────────┼───────┼───────┼───────┼───────┼───────┤
│ 1 │ 0.1169 │ 20 │ 26.8 │ 301 │ 294.2 │ 321 │
│ 2 │ 0.1497 │ 40 │ 42.9 │ 283 │ 280.1 │ 323 │
│ 3 │ 0.3223 │ 77 │ 68.1 │ 241 │ 249.9 │ 318 │
│ 4 │ 0.3824 │ 122 │ 114.5 │ 199 │ 206.5 │ 321 │
│ 5 │ 0.4367 │ 131 │ 130.5 │ 190 │ 190.5 │ 321 │
├───────┼────────┼───────┼───────┼───────┼───────┼───────┤
│ 6 │ 0.4922 │ 155 │ 150.6 │ 166 │ 170.4 │ 321 │
│ 7 │ 0.5236 │ 173 │ 167.1 │ 156 │ 161.9 │ 329 │
│ 8 │ 0.5448 │ 164 │ 167.8 │ 149 │ 145.2 │ 313 │
│ 9 │ 0.5817 │ 171 │ 179.2 │ 148 │ 139.8 │ 319 │
│ 10 │ 0.6568 │ 188 │ 193.5 │ 132 │ 126.5 │ 320 │
└───────┴────────┴───────┴───────┴───────┴───────┴───────┘
Number of observations = 3,206
Number of groups = 10
Hosmer–Lemeshow chi2(8) = 6.43
Prob > chi2 = 0.5988
> source("https://grodri.github.io/glms/r/hosmer.R")
> hosmer(ct$ins, fitted(m4))
0 1 e.0 e.1
[0.031,0.117] 301 20 294.2278 26.77222
(0.117,0.15] 283 40 280.0543 42.94574
(0.15,0.322] 241 77 249.8864 68.11363
(0.322,0.382] 199 122 206.4775 114.52255
(0.382,0.437] 190 131 190.5101 130.48987
(0.437,0.492] 166 155 170.3789 150.62107
(0.492,0.524] 156 173 161.9098 167.09021
(0.524,0.545] 149 164 145.2456 167.75442
(0.545,0.582] 148 171 139.8297 179.17029
(0.582,0.657] 132 188 126.4800 193.52002
test groups chi.sq pvalue
1 Hosmer-Lemeshow 10 6.433716 0.5987689
We get a chi-squared of 6.43 on 8 d.f. which is quite reassuring, with a p-value of 0.6.
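The statistic itself is just a Pearson chi-squared over the ten groups. It can be verified in Python from the counts printed in the table (the expected counts are rounded to one decimal, so the result agrees with the reported 6.43 only approximately):

```python
# (Obs_1, Exp_1, Obs_0, Exp_0) copied from the ten rows of the table above
rows = [
    (20, 26.8, 301, 294.2), (40, 42.9, 283, 280.1), (77, 68.1, 241, 249.9),
    (122, 114.5, 199, 206.5), (131, 130.5, 190, 190.5), (155, 150.6, 166, 170.4),
    (173, 167.1, 156, 161.9), (164, 167.8, 149, 145.2), (171, 179.2, 148, 139.8),
    (188, 193.5, 132, 126.5),
]

# Hosmer-Lemeshow statistic: sum of (O - E)^2 / E over all twenty cells
hl = sum((o1 - e1) ** 2 / e1 + (o0 - e0) ** 2 / e0 for o1, e1, o0, e0 in rows)
```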
(b) Compute predicted probabilities of having insurance and predict that a respondent will have insurance if the predicted probability is 0.5 or more. Tabulate predicted versus actual outcomes. What’s the overall error rate? Comment on the numbers of false positives and false negatives. How well would we do if we just predicted that nobody would buy supplemental insurance? How much reduction in error did the model achieve?
. predict pri, pr
. gen preins = pri >= 0.5
. tab preins ins
│ ins
preins │ 0 1 │ Total
───────────┼──────────────────────┼──────────
0 │ 1,421 587 │ 2,008
1 │ 544 654 │ 1,198
───────────┼──────────────────────┼──────────
Total │ 1,965 1,241 │ 3,206
> ct <- mutate(ct, predins = fitted(m4) >= 0.5)
> group_by(ct, predins, ins) %>% tally()
# A tibble: 4 × 3
# Groups:   predins [2]
  predins   ins     n
  <lgl>   <dbl> <int>
1 FALSE       0  1421
2 FALSE       1   587
3 TRUE        0   544
4 TRUE        1   654
We predict correctly 654 + 1421 = 2,075 cases, or 64.7%, so the error rate is 35.3%. False positives are too high: of the 1,198 predicted to have insurance, 544 (45%) do not. False negatives are a bit better: of the 2,008 predicted not to have insurance, 587 (29%) actually do. The overall proportion with insurance was 38.7%, so if we predicted that nobody buys supplemental insurance we would get only 38.7% wrong. With the model we reduce that to 35.3%. The proportionate reduction in error is only 9% (or 8.86%).
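These percentages follow directly from the four cells of the table. A quick check in Python:

```python
# Cells of the classification table: rows are predicted, columns actual
tn, fn = 1421, 587   # predicted no insurance: correct, and false negatives
fp, tp = 544, 654    # predicted insurance: false positives, and correct
n = tn + fn + fp + tp

error_model = (fn + fp) / n            # error rate with the model, 35.3%
error_null = (fn + tp) / n             # 38.7%: everyone classified as uninsured
pre = (error_null - error_model) / error_null   # proportionate reduction in error
```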
(a) Fit the model using a probit specification, compute the continuous marginal effect using the overall probability of having insurance, and compare with the results of part 2 (a).
. probit ins retire age hstatusg hhincq2-hhincq4 educyear married hisp
Iteration 0: log likelihood = -2139.7712
Iteration 1: log likelihood = -1904.9874
Iteration 2: log likelihood = -1901.9111
Iteration 3: log likelihood = -1901.9053
Iteration 4: log likelihood = -1901.9053
Probit regression Number of obs = 3,206
LR chi2(9) = 475.73
Prob > chi2 = 0.0000
Log likelihood = -1901.9053 Pseudo R2 = 0.1112
─────────────┬────────────────────────────────────────────────────────────────
ins │ Coefficient Std. err. z P>|z| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
retire │ .1090073 .0522262 2.09 0.037 .0066458 .2113687
age │ -.0125164 .0071196 -1.76 0.079 -.0264706 .0014378
hstatusg │ .0722337 .0580082 1.25 0.213 -.0414602 .1859276
hhincq2 │ .751445 .0786195 9.56 0.000 .5973537 .9055363
hhincq3 │ 1.067105 .0835774 12.77 0.000 .9032965 1.230914
hhincq4 │ 1.081014 .0884047 12.23 0.000 .9077443 1.254284
educyear │ .0395948 .0090395 4.38 0.000 .0218777 .057312
married │ .0170883 .0633161 0.27 0.787 -.107009 .1411855
hisp │ -.3790227 .1162247 -3.26 0.001 -.606819 -.1512265
_cons │ -.81568 .4718992 -1.73 0.084 -1.740585 .1092254
─────────────┴────────────────────────────────────────────────────────────────
. scalar eta = invnormal(opri)
. di normalden(eta) * _b[hisp]
-.14511058
> m5 <- update(m4, family = binomial(probit))
> xb = qnorm(as.numeric(p))
> dnorm(xb) * coef(m5)["hisp"]
hisp
-0.1451103
We estimate the marginal effect at the mean to be 14.5 percentage points using the derivative. This is very close to the comparable logit estimate of 14.9 obtained in part 2(a). (By the way, the coefficient of -0.38 tells us that the (net) utility of buying supplemental insurance for hispanics is 0.38 standard deviations less than for comparable non-hispanics.)
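The probit analogue of the earlier calculation replaces the logistic density p(1 − p) with the normal density evaluated at the probit of the overall proportion. Python's standard library can do this via `statistics.NormalDist`:

```python
from statistics import NormalDist

nd = NormalDist()
# Overall proportion insured and the probit coefficient of hisp
p, b_hisp = 0.3870867, -0.3790227

eta = nd.inv_cdf(p)          # probit (inverse normal cdf) of the proportion
me = nd.pdf(eta) * b_hisp    # continuous marginal effect at the mean
```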
(b) Fit the same model using OLS. The coefficient of
hisp is often justified as a quick estimate of the average
marginal effect of ethnicity adjusted for the other variables. How does
it compare with the logit and probit estimates?
. reg ins retire age hstatusg hhincq2-hhincq4 educyear married hisp
Source │ SS df MS Number of obs = 3,206
─────────────┼────────────────────────────────── F(9, 3196) = 53.40
Model │ 99.4266617 9 11.0474069 Prob > F = 0.0000
Residual │ 661.198728 3,196 .206883207 R-squared = 0.1307
─────────────┼────────────────────────────────── Adj R-squared = 0.1283
Total │ 760.62539 3,205 .237324615 Root MSE = .45484
─────────────┬────────────────────────────────────────────────────────────────
ins │ Coefficient Std. err. t P>|t| [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
retire │ .0364401 .0177052 2.06 0.040 .0017254 .0711549
age │ -.0038497 .0023577 -1.63 0.103 -.0084724 .000773
hstatusg │ .0200136 .0192637 1.04 0.299 -.0177568 .0577841
hhincq2 │ .214232 .0247282 8.66 0.000 .1657472 .2627169
hhincq3 │ .3435194 .0269492 12.75 0.000 .2906799 .3963589
hhincq4 │ .3554378 .0287588 12.36 0.000 .2990503 .4118254
educyear │ .0115776 .0029295 3.95 0.000 .0058336 .0173215
married │ .0055474 .020842 0.27 0.790 -.0353177 .0464124
hisp │ -.0793971 .0329527 -2.41 0.016 -.1440076 -.0147866
_cons │ .2434602 .1563759 1.56 0.120 -.063147 .5500673
─────────────┴────────────────────────────────────────────────────────────────
> update(m4, family = gaussian)
Call: glm(formula = ins ~ retire + age + hstatusg + qhhinc + educyear +
married + hisp, family = gaussian, data = ct)
Coefficients:
(Intercept) retire age hstatusg qhhinc2 qhhinc3
0.243460 0.036440 -0.003850 0.020014 0.214232 0.343519
qhhinc4 educyear married hisp
0.355438 0.011578 0.005547 -0.079397
Degrees of Freedom: 3205 Total (i.e. Null); 3196 Residual
Null Deviance: 760.6
Residual Deviance: 661.2 AIC: 4059
OLS estimates the average marginal effect as 7.9 percentage points, much lower than the previous estimates. The value is more directly comparable to the average marginal effect than to the marginal effect at the mean, but it falls short of both.