In this section we will be working with the additive analysis of covariance model of the previous section. As usual, we start by reading the data and recreating the variables we need. We then fit the model.
. use https://grodri.github.io/datasets/effort, clear (Family Planning Effort Data) . recode effort (0/4=1 "Weak") (5/14=2 "Moderate") /// > (15/max=3 "Strong"), gen(effort_g) label(effort_g) (20 differences between effort and effort_g) . regress change setting i.effort_g Source │ SS df MS Number of obs = 20 ─────────────┼────────────────────────────────── F(3, 16) = 21.55 Model │ 2124.50633 3 708.168776 Prob > F = 0.0000 Residual │ 525.693673 16 32.8558546 R-squared = 0.8016 ─────────────┼────────────────────────────────── Adj R-squared = 0.7644 Total │ 2650.2 19 139.484211 Root MSE = 5.732 ─────────────┬──────────────────────────────────────────────────────────────── change │ Coefficient Std. err. t P>|t| [95% conf. interval] ─────────────┼──────────────────────────────────────────────────────────────── setting │ .1692677 .1055505 1.60 0.128 -.0544894 .3930247 │ effort_g │ Moderate │ 4.143915 3.191179 1.30 0.213 -2.621082 10.90891 Strong │ 19.44761 3.729295 5.21 0.000 11.54186 27.35336 │ _cons │ -5.954036 7.16597 -0.83 0.418 -21.14521 9.237141 ─────────────┴────────────────────────────────────────────────────────────────
All of the diagnostic measures discussed in the lecture notes can be calculated in Stata and R, some in more than one way.
Let us start with the residuals. The easiest way to get them is as
options of the predict
command. Specify the option
res
for the raw residuals, rstand
for the
standardized residuals, and rstud
for the studentized (or
jackknifed) residuals. Let us obtain all three:
. predict ri, res . predict si, rsta . predict ti, rstu . label var ti "Jack-knifed residuals"
To get the diagonal elements of the hat matrix and Cook’s distance we
use two more options of predict
, hat
and
cook
:
. predict hii, hat . predict di, cook
We are now ready to print Table 2.29 in the notes.
. list country ri si ti hii di, clean country ri si ti hii di 1. Bolivia -.8322767 -.1689738 -.1637543 .2616128 .002529 2. Brazil 3.428229 .6573142 .645213 .1720945 .0224529 3. Chile .4416054 .0834989 .0808651 .1486769 .0003044 4. Colombia -1.527183 -.2913581 -.2828576 .1637904 .0041569 5. CostaRica 1.287944 .242732 .2354582 .1431063 .0024599 6. Cuba 11.44161 2.163383 2.490349 .1486769 .2043412 7. DominicanRep 11.29992 2.161597 2.487445 .1682585 .2363079 8. Ecuador -10.03862 -1.925296 -2.126719 .1725536 .1932498 9. ElSalvador 4.654061 .8956616 .8898143 .178205 .0434895 10. Guatemala -3.4996 -.6853749 -.6735727 .206462 .030554 11. Haiti .0296676 .0069303 .0067103 .4422478 9.52e-06 12. Honduras .1774703 .0355449 .0344175 .2412746 .0001004 13. Jamaica -7.219859 -1.361729 -1.402245 .1444142 .0782469 14. Mexico .90482 .1830367 .1774104 .2562359 .0028855 15. Nicaragua 1.443835 .2726553 .2646128 .1465179 .0031905 16. Panama -5.712056 -1.076521 -1.082269 .1431063 .0483857 17. Paraguay -.5717711 -.109629 -.1061877 .1720945 .0006246 18. Peru -4.402503 -.8410965 -.8330122 .1661363 .0352372 19. TrinidadTobago 1.287944 .242732 .2354582 .1431063 .0024599 20. Venezuela -2.593236 -.5752294 -.5628135 .3814295 .051009
Here is an easy way to find the cases highlighted in Table 2.29, those with standardized or jackknifed residuals greater than 2 in magnitude:
. list country ri si ti hii di if abs(si) > 2 | abs(ti) > 2, clean country ri si ti hii di 6. Cuba 11.44161 2.163383 2.490349 .1486769 .2043412 7. DominicanRep 11.29992 2.161597 2.487445 .1682585 .2363079 8. Ecuador -10.03862 -1.925296 -2.126719 .1725536 .1932498
We will calculate the maximum acceptable leverage, which is 2p/n in general, and then list the cases exceeding that value (if any).
. scalar hiimax = 2*4/20 . list country ri si ti hii di if hii > hiimax, clean country ri si ti hii di 11. Haiti .0296676 .0069303 .0067103 .4422478 9.52e-06
We find that Haiti has a lot of leverage, but very little actual
influence. Let us list the six most influential countries. I will do
this by sorting the data in descending order of influence and
then listing the first six. Stata’s regular sort
command
sorts only in ascending order, but gsort
can do
descending if you specify -di
.
. gsort -di . list country di in 1/6, clean country di 1. DominicanRep .2363079 2. Cuba .2043412 3. Ecuador .1932498 4. Jamaica .0782469 5. Venezuela .051009 6. Panama .0483857
Turns out that the D.R., Cuba, and Ecuador are fairly influential observations. Try refitting the model without the D.R. to verify what I say on page 57 of the lecture notes.
On to plots! Here is the standard residual plot in Figure 2.6, produced using the following code:
. predict yhat (option xb assumed; fitted values) . label var yhat "Fitted values" . scatter ti yhat, title("Figure 2.6: Residual Plot for Ancova Model") . graph export fig26.png, width(500) replace file fig26.png saved as PNG format
Exercise: Can you label the points coresponding to Cuba, the D.R. and Ecuador?
Now for that lovely Q-Q-plot in Figure 2.7 of the notes:
. qnorm ti, title("Figure 2.7: Q-Q Plot for Residuals of Ancova Model") . graph export fig27.png, width(500) replace file fig27.png saved as PNG format
Wasn’t that easy? Stata’s qnorm
evaluates the inverse
normal cdf at i/(n+1) rather than at (i-3/8)/(n+1/4)
or some of the other approximations discussed in the notes. Of course
you can use any approximation you want, albeit at the expense of
additional work.
I will illustrate the general idea by calculating Filliben’s
approximation to the expected order statistics or rankits. I will use
Stata’s built-in system variables _n
for the observation
number and _N
for the number of cases.
. sort ti . gen pi = (_n-0.3175)/(_N+0.365) . replace pi = 1-0.5^(1/_N) if _n == 1 (1 real change made) . replace pi = 0.5^(1/_N) if _n ==_N (1 real change made) . gen filliben = invnorm(pi) . corr ti si filliben (obs=20) │ ti si filliben ─────────────┼─────────────────────────── ti │ 1.0000 si │ 0.9984 1.0000 filliben │ 0.9518 0.9655 1.0000
The correlation is 0.9518 using jack-knifed residuals, and 0.9655 using standardized residuals. The latter is the value quoted in the notes. Both are above (if barely) the minimum correlation of 0.95 needed to accept normality. I will skip the graph because it looks almost identical to the one produced above.
Updated fall 2022