Germán Rodríguez
Generalized Linear Models Princeton University

2.9 Regression Diagnostics

In this section we will be working with the additive analysis of covariance model of the previous section. As usual, we start by reading the data and recreating the variables we need. We then fit the model.

. use https://grodri.github.io/datasets/effort, clear
(Family Planning Effort Data)

. recode effort (0/4=1 "Weak") (5/14=2 "Moderate") ///
>   (15/max=3 "Strong"), gen(effort_g) label(effort_g)
(20 differences between effort and effort_g)

.  regress change setting i.effort_g

      Source │       SS           df       MS      Number of obs   =        20
─────────────┼──────────────────────────────────   F(3, 16)        =     21.55
       Model │  2124.50633         3  708.168776   Prob > F        =    0.0000
    Residual │  525.693673        16  32.8558546   R-squared       =    0.8016
─────────────┼──────────────────────────────────   Adj R-squared   =    0.7644
       Total │      2650.2        19  139.484211   Root MSE        =     5.732

─────────────┬────────────────────────────────────────────────────────────────
      change │ Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
─────────────┼────────────────────────────────────────────────────────────────
     setting │   .1692677   .1055505     1.60   0.128    -.0544894    .3930247
             │
    effort_g │
   Moderate  │   4.143915   3.191179     1.30   0.213    -2.621082    10.90891
     Strong  │   19.44761   3.729295     5.21   0.000     11.54186    27.35336
             │
       _cons │  -5.954036    7.16597    -0.83   0.418    -21.14521    9.237141
─────────────┴────────────────────────────────────────────────────────────────

All of the diagnostic measures discussed in the lecture notes can be calculated in Stata and R, some in more than one way.

Residuals

Let us start with the residuals. The easiest way to get them is as options of the predict command. Specify the option res for the raw residuals, rstand for the standardized residuals, and rstud for the studentized (or jackknifed) residuals. Let us obtain all three:

. predict ri, res

. predict si, rsta 

. predict ti, rstu

. label var ti "Jack-knifed residuals"

Leverage and Influence

To get the diagonal elements of the hat matrix and Cook’s distance we use two more options of predict, hat and cook:

. predict hii, hat

. predict di, cook

We are now ready to print Table 2.29 in the notes.

. list country ri si ti hii di, clean

              country          ri          si          ti        hii         di  
  1.          Bolivia   -.8322767   -.1689738   -.1637543   .2616128    .002529  
  2.           Brazil    3.428229    .6573142     .645213   .1720945   .0224529  
  3.            Chile    .4416054    .0834989    .0808651   .1486769   .0003044  
  4.         Colombia   -1.527183   -.2913581   -.2828576   .1637904   .0041569  
  5.        CostaRica    1.287944     .242732    .2354582   .1431063   .0024599  
  6.             Cuba    11.44161    2.163383    2.490349   .1486769   .2043412  
  7.     DominicanRep    11.29992    2.161597    2.487445   .1682585   .2363079  
  8.          Ecuador   -10.03862   -1.925296   -2.126719   .1725536   .1932498  
  9.       ElSalvador    4.654061    .8956616    .8898143    .178205   .0434895  
 10.        Guatemala     -3.4996   -.6853749   -.6735727    .206462    .030554  
 11.            Haiti    .0296676    .0069303    .0067103   .4422478   9.52e-06  
 12.         Honduras    .1774703    .0355449    .0344175   .2412746   .0001004  
 13.          Jamaica   -7.219859   -1.361729   -1.402245   .1444142   .0782469  
 14.           Mexico      .90482    .1830367    .1774104   .2562359   .0028855  
 15.        Nicaragua    1.443835    .2726553    .2646128   .1465179   .0031905  
 16.           Panama   -5.712056   -1.076521   -1.082269   .1431063   .0483857  
 17.         Paraguay   -.5717711    -.109629   -.1061877   .1720945   .0006246  
 18.             Peru   -4.402503   -.8410965   -.8330122   .1661363   .0352372  
 19.   TrinidadTobago    1.287944     .242732    .2354582   .1431063   .0024599  
 20.        Venezuela   -2.593236   -.5752294   -.5628135   .3814295    .051009  

Here is an easy way to find the cases highlighted in Table 2.29, those with standardized or jackknifed residuals greater than 2 in magnitude:

. list country ri si ti hii di if abs(si) > 2 | abs(ti) > 2, clean

            country          ri          si          ti        hii         di  
  6.           Cuba    11.44161    2.163383    2.490349   .1486769   .2043412  
  7.   DominicanRep    11.29992    2.161597    2.487445   .1682585   .2363079  
  8.        Ecuador   -10.03862   -1.925296   -2.126719   .1725536   .1932498  

We will calculate the maximum acceptable leverage, which is 2p/n in general, and then list the cases exceeding that value (if any).

. scalar hiimax = 2*4/20

. list country ri si ti hii di if hii > hiimax, clean 

       country         ri         si         ti        hii         di  
 11.     Haiti   .0296676   .0069303   .0067103   .4422478   9.52e-06  

We find that Haiti has a lot of leverage, but very little actual influence. Let us list the six most influential countries. I will do this by sorting the data in descending order of influence and then listing the first six. Stata’s regular sort command sorts only in ascending order, but gsort can do descending if you specify -di.

. gsort -di

. list country di in 1/6, clean

            country         di  
  1.   DominicanRep   .2363079  
  2.           Cuba   .2043412  
  3.        Ecuador   .1932498  
  4.        Jamaica   .0782469  
  5.      Venezuela    .051009  
  6.         Panama   .0483857  

Turns out that the D.R., Cuba, and Ecuador are fairly influential observations. Try refitting the model without the D.R. to verify what I say on page 57 of the lecture notes.

Residual Plots

On to plots! Here is the standard residual plot in Figure 2.6, produced using the following code:

. predict yhat
(option xb assumed; fitted values)

. label var yhat "Fitted values"

. scatter ti yhat, title("Figure 2.6: Residual Plot for Ancova Model") 

. graph export fig26.png, width(500) replace
file fig26.png saved as PNG format

Exercise: Can you label the points coresponding to Cuba, the D.R. and Ecuador?

Q-Q Plots

Now for that lovely Q-Q-plot in Figure 2.7 of the notes:

. qnorm ti, title("Figure 2.7: Q-Q Plot for Residuals of Ancova Model")

. graph export fig27.png, width(500) replace
file fig27.png saved as PNG format

Wasn’t that easy? Stata’s qnorm evaluates the inverse normal cdf at i/(n+1) rather than at (i-3/8)/(n+1/4) or some of the other approximations discussed in the notes. Of course you can use any approximation you want, albeit at the expense of additional work.

Filliben

I will illustrate the general idea by calculating Filliben’s approximation to the expected order statistics or rankits. I will use Stata’s built-in system variables _n for the observation number and _N for the number of cases.

. sort ti

. gen pi = (_n-0.3175)/(_N+0.365)

. replace pi = 1-0.5^(1/_N) if _n == 1
(1 real change made)

. replace pi = 0.5^(1/_N)   if _n ==_N 
(1 real change made)

. gen filliben = invnorm(pi)

. corr ti si filliben          
(obs=20)

             │       ti       si filliben
─────────────┼───────────────────────────
          ti │   1.0000
          si │   0.9984   1.0000
    filliben │   0.9518   0.9655   1.0000

The correlation is 0.9518 using jack-knifed residuals, and 0.9655 using standardized residuals. The latter is the value quoted in the notes. Both are above (if barely) the minimum correlation of 0.95 needed to accept normality. I will skip the graph because it looks almost identical to the one produced above.

Updated fall 2022