This page collects some notes on fitting multinomial logit models with (and without) random effects using Stata and Stan.
I use data from Hedeker (1999), “MIXNO: a computer program for mixed-effects nominal logistic regression” (Journal of Statistical Software, 4(5)), although the models I fit are not exactly the same as his. The data pertain to a panel study of mentally ill homeless individuals. The outcome is housing status (0=street, 1=community, 2=independent). Individuals were randomly assigned to a treatment group with access to Section 8 housing certificates or a control group without such access, and were assessed at baseline and at 6, 12 and 24-month follow-ups. The predictors of interest are time (coded 0-3) and sec8, a dummy variable coded 1 for the treatment group.
Below we read the data and check totals by group and time.

. use https://grodri.github.io/datasets/sdhouse, clear
(McKinney Homeless Research Project from Hedeker's MIXNO)

. tab sec8 time

           │                    time
      sec8 │         0          1          2          3 │     Total
───────────┼────────────────────────────────────────────┼──────────
         0 │       180        161        146        145 │       632
         1 │       181        161        157        158 │       657
───────────┼────────────────────────────────────────────┼──────────
     Total │       361        322        303        303 │     1,289
For comparison I fit a standard multinomial logit model in Stata:
. mlogit status c.sec8##c.time, base(0)

Iteration 0:  log likelihood = -1381.4444
Iteration 1:  log likelihood = -1227.7024
Iteration 2:  log likelihood = -1223.1853
Iteration 3:  log likelihood =   -1223.16
Iteration 4:  log likelihood =   -1223.16

Multinomial logistic regression                 Number of obs   =      1,289
                                                LR chi2(6)      =     316.57
                                                Prob > chi2     =     0.0000
Log likelihood = -1223.16                       Pseudo R2       =     0.1146

──────────────┬────────────────────────────────────────────────────────────────
       status │ Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
──────────────┼────────────────────────────────────────────────────────────────
street        │  (base outcome)
──────────────┼────────────────────────────────────────────────────────────────
community     │
         sec8 │   .2395584   .2130131     1.12   0.261    -.1779395    .6570564
         time │   .7961881   .1075957     7.40   0.000     .5853045    1.007072
              │
c.sec8#c.time │  -.5180044   .1576099    -3.29   0.001    -.8269142   -.2090947
              │
        _cons │  -.2109418   .1428502    -1.48   0.140    -.4909231    .0690395
──────────────┼────────────────────────────────────────────────────────────────
independent   │
         sec8 │   1.157348   .2551718     4.54   0.000     .6572208    1.657476
         time │   1.138287    .123074     9.25   0.000     .8970665    1.379508
              │
c.sec8#c.time │  -.2308719   .1637933    -1.41   0.159    -.5519008     .090157
              │
        _cons │  -1.405489   .1964876    -7.15   0.000    -1.790598    -1.02038
──────────────┴────────────────────────────────────────────────────────────────
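As an aside (not in the original), the same model can be cross-checked in R with nnet::multinom, which should reproduce the Stata estimates up to convergence tolerance:

require(haven)
require(nnet)
sd <- read_dta("https://grodri.github.io/datasets/sdhouse.dta")
# rows of the coefficient matrix are community and independent vs. street
summary(multinom(factor(status) ~ sec8 * time, data = sd))$coefficients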
To fit this model in Stan one can follow the example in Section 9.5 of the Stan manual. That model, however, is “typically only identified if there is a suitable prior on the coefficients”. I found that it ran, and that if I then computed the differences between the coefficients for categories 1 and 2 and those for category 0 (actually coded 2 and 3 versus 1), the results were similar to maximum likelihood.
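For completeness, here is a hypothetical sketch of that calculation, assuming the over-parameterized fit is stored in a stanfit object sf0 (a made-up name) with the full K by P coefficient matrix beta:

draws <- rstan::extract(sf0, pars = "beta")$beta       # iterations x K x P
contrasts <- sweep(draws, c(1, 3), draws[, 1, ], "-")  # beta[k] - beta[1], draw by draw
apply(contrasts[, 2:3, ], c(2, 3), mean)               # posterior means, cf. mlogit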
A better (and faster) alternative is to work with only k-1 equations for k response categories. I defined the coefficients to be estimated as a k-1 by p matrix beta, and the coefficients used to compute the actual probabilities as a k by p matrix betap (for beta-plus), such that the first row is zero and the remaining rows are filled from beta. (Obviously this assumes that the first category is the baseline. The code could be generalized to use any of the categories as baseline.)
I first defined betap in a transformed parameters block, but then the output includes both beta and betap, which is redundant. A better approach turned out to be creating a local block as part of the model definition and building the expanded parameters there, as illustrated below.
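To make the construction concrete, here is a small R illustration (not part of the original code) of what the local block computes for a single draw of beta and one observation:

softmax <- function(v) exp(v) / sum(exp(v))
probs_for <- function(beta, x_n) {
  betap <- rbind(0, beta)            # prepend a row of zeroes for the baseline
  softmax(as.vector(betap %*% x_n))  # K probabilities that sum to one
}
# e.g. probs_for(matrix(0, 2, 4), c(1, 0, 0, 0)) gives c(1/3, 1/3, 1/3)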
So here’s the code to read the data and prepare them for use by rstan:
require(haven)
sd <- read_dta("https://grodri.github.io/datasets/sdhouse.dta")
y <- as.numeric(sd$status) + 1   # recode 0,1,2 to 1,2,3
x <- model.matrix(~ sec8 * time, data = sd)
sd_data <- list(N = length(y), K = 3, K1 = 2, P = ncol(x), y = y, x = x)
Cautionary note: the outcome is coded 0, 1, 2 in Stata with labels “street”, “community” and “independent”. I treat it as numeric and add 1 to obtain 1, 2, 3.
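A quick sanity check (not in the original) confirms the recoding:

table(as.numeric(sd$status), y)

The next step is to define the model: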
sd_model <- "
data {
  int K;          // number of outcome categories
  int K1;         // K-1
  int N;          // number of observations
  int P;          // number of predictors (a.k.a. D in the Stan manual)
  int y[N];       // outcome, coded 1 to K for each obs
  vector[P] x[N]; // predictors, including constant
}
parameters {
  matrix[K1,P] beta;
}
model {
  // prior for beta (vectorized one row at a time)
  for(k in 1:K1) {
    beta[k] ~ normal(0,5);
  }
  { // separate local block for the transformation
    matrix[K,P] betap;
    for(p in 1:P) {
      betap[1,p] <- 0;           // first row is the baseline category
      for(k in 2:K) {
        betap[k,p] <- beta[k-1,p];
      }
    }
    // likelihood of outcome
    for(n in 1:N) {
      y[n] ~ categorical(softmax(betap * x[n]));
    }
  }
}
"
The parameter beta is a K1 by P matrix. The expression beta[k] refers to the k-th row. The prior is vectorized, so each element is normal(0,5). The expansion to add a row of zeroes is handled in the local block. Here’s an actual run:
> sf1 <- stan(model_code=sd_model,data=sd_data,iter=100,chains=1)
TRANSLATING MODEL 'sd_model' FROM Stan CODE TO C++ CODE NOW.
COMPILING THE C++ CODE FOR MODEL 'sd_model' NOW.
...
Iteration: 100 / 100 [100%] (Sampling)
Elapsed Time: 4.059 seconds (Warm-up)
4.762 seconds (Sampling)
8.821 seconds (Total)
This model ran much faster than the over-parameterized version. To print results in a more compact format I defined my own function:
stan.print <- function(fit) {
  print(fit, probs = c(0.5), digits_summary = 3)
}
> stan.print(sf1)
Inference for Stan model: sd_model.
1 chains, each with iter=100; warmup=50; thin=1;
post-warmup draws per chain=50, total post-warmup draws=50.
mean se_mean sd 50% n_eff Rhat
beta[1,1] -0.163 0.038 0.129 -0.145 11 1.199
beta[1,2] 0.169 0.031 0.174 0.172 31 1.017
beta[1,3] 0.756 0.026 0.105 0.756 16 1.072
beta[1,4] -0.462 0.029 0.168 -0.450 33 0.980
beta[2,1] -1.375 0.044 0.169 -1.351 15 0.980
beta[2,2] 1.112 0.055 0.232 1.119 18 1.030
beta[2,3] 1.099 0.029 0.130 1.102 20 0.980
beta[2,4] -0.171 0.037 0.185 -0.149 25 1.037
lp__ -1226.875 0.305 1.387 -1226.857 21 1.062
This model obviously needs to run much longer; we only have 50 draws! But the numbers are all in the right ballpark. When you compare results with Stata, note that beta[e,1] is the constant for equation e.
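To ease the comparison, this sketch (not in the original) labels the posterior means with the column names of the model matrix:

pm <- apply(rstan::extract(sf1, pars = "beta")$beta, c(2, 3), mean)  # K1 by P
dimnames(pm) <- list(c("community", "independent"), colnames(x))
pm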
Now on to the random intercept model. Stata 17 added an xtmlogit command. Users of earlier versions of Stata can fit this model with gsem, as explained by Rebecca Pope in the Stata Newsletter for Apr-Jun 2014.
. //v<17 gsem (1.status <- sec8 time secXtime RI1[id]) ///
. //         (2.status <- sec8 time secXtime RI2[id]), mlogit

. xtmlogit status c.sec8##c.time, i(id) base(0) covariance(unstructured)

Fitting comparison model ...

Refining starting values:
Grid node 0:  log likelihood = -1178.1888
Grid node 1:  log likelihood = -1201.0323

Fitting full model:
Iteration 0:  log likelihood = -1178.1888
Iteration 1:  log likelihood = -1163.1103
Iteration 2:  log likelihood = -1157.7496
Iteration 3:  log likelihood = -1157.6259
Iteration 4:  log likelihood = -1157.6253
Iteration 5:  log likelihood = -1157.6253

Random-effects multinomial logistic regression      Number of obs    =   1,289
Group variable: id                                  Number of groups =     361

Random effects u_i ~ Gaussian                       Obs per group:
                                                                 min =       1
                                                                 avg =     3.6
                                                                 max =       4

Integration method: mvaghermite                     Integration pts. =       7

                                                    Wald chi2(6)     =  210.34
Log likelihood = -1157.6253                         Prob > chi2      =  0.0000
──────────────┬────────────────────────────────────────────────────────────────
       status │ Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
──────────────┼────────────────────────────────────────────────────────────────
street        │  (base outcome)
──────────────┼────────────────────────────────────────────────────────────────
community     │
         sec8 │   .3835373   .2829853     1.36   0.175    -.1711037    .9381783
         time │   1.015074   .1329061     7.64   0.000     .7545826    1.275565
              │
c.sec8#c.time │  -.5786798    .181297    -3.19   0.001    -.9340153   -.2233442
              │
        _cons │  -.2063278   .1926945    -1.07   0.284    -.5840022    .1713466
──────────────┼────────────────────────────────────────────────────────────────
independent   │
         sec8 │    1.60725   .3791572     4.24   0.000     .8641157    2.350384
         time │   1.530787   .1579142     9.69   0.000     1.221281    1.840293
              │
c.sec8#c.time │  -.2223493   .1998801    -1.11   0.266    -.6141072    .1694085
              │
        _cons │  -2.047767   .3034708    -6.75   0.000    -2.642559   -1.452975
──────────────┼────────────────────────────────────────────────────────────────
      var(u1) │   1.744592   .4753894                      1.022697    2.976052
      var(u2) │   3.818815   .7779019                       2.56175    5.692728
──────────────┼────────────────────────────────────────────────────────────────
   cov(u1,u2) │   1.701683   .5029326     3.38   0.001     .7159536    2.687413
──────────────┴────────────────────────────────────────────────────────────────
LR test vs. multinomial logit: chi2(3) = 131.07           Prob > chi2 = 0.0000
Note: LR test is conservative and provided only for reference.

. di 1.701683/sqrt(1.744592*3.818815)
.65927622

. di sqrt(1.744592),sqrt(3.818815)
1.32083 1.9541789
Getting this model to run in Stan is fun. Basically all we need to do is add the random effects, which depend on the subject id. To this end we expand the data to include the number of groups (individuals) and a mapping of observations to groups:
sd_data <- list(N=length(y), K=3, K1=2, P=ncol(x), y=y, x=x,
                G=362, map=sd$id)
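Note that map must index the groups as consecutive integers from 1 to G; the hard-coded G=362 works because the ids here run up to 362. A more defensive construction (a sketch, not in the original) remaps arbitrary ids:

id <- as.integer(factor(sd$id))   # consecutive 1..G even if ids have gaps
sd_data <- list(N=length(y), K=3, K1=2, P=ncol(x), y=y, x=x,
                G=max(id), map=id)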
The model builds on the previous run. In the data block we add the number of groups G and an integer array map to look up the groups. To define the prior for the random effects we need a K1-vector of zeroes; it turns out the best way to define this is in a transformed data block. Let me put the model out here and then comment on each part.
sd_model <- "
data {
  int K;          // number of outcome categories
  int K1;         // K-1
  int N;          // number of observations
  int P;          // number of predictors
  int y[N];       // outcome, coded 1 to K for each obs
  vector[P] x[N]; // predictors, including constant
  int G;          // number of groups
  int map[N];     // maps obs to groups
}
transformed data {
  vector[K1] zero;
  for(k in 1:K1) {
    zero[k] <- 0;
  }
}
parameters {
  matrix[K1,P] beta;        // fixed effects
  corr_matrix[K1] omega;    // ranef correlations
  vector<lower=0>[K1] tau;  // ranef scales, constrained positive
  vector[K1] u[G];          // random intercepts
}
model {
  matrix[K1,K1] sigma;
  // prior for fixed effects (vectorized)
  for(k in 1:K1) {
    beta[k] ~ normal(0,5);
  }
  // prior/hyperprior for random effects
  sigma <- diag_matrix(tau) * omega * diag_matrix(tau);
  tau ~ cauchy(0,2.5);      // half-Cauchy, given the lower bound on tau
  omega ~ lkj_corr(2);
  for(g in 1:G) {
    u[g] ~ multi_normal(zero,sigma);
  }
  { // local block for linear predictor
    vector[K] xbs;
    vector[K1] xb;
    xbs[1] <- 0;            // baseline category
    for(n in 1:N) {
      xb <- beta*x[n] + u[map[n]]; // a K1 vector
      for(k in 2:K) {
        xbs[k] <- xb[k-1];
      }
      y[n] ~ categorical(softmax(xbs));
    }
  }
}
"
To define correlated multivariate normal random effects I looked carefully at Section 9.9 of the manual, which describes multivariate priors for hierarchical models. The basic idea is that each individual has a random intercept for each equation, drawn from a bivariate normal distribution. The code, however, allows any number of categories, so we work with a multivariate normal. The advice there is to work with a correlation matrix omega and a set of scale factors or standard deviations tau, and then define the covariance matrix sigma as Σ = diag(τ) Ω diag(τ).
I do this by defining the matrices and vectors in the parameters block, and the priors in the model block. Following the advice in the Stan manual I give τ a half-Cauchy(0,2.5) prior, declaring it with a lower bound of zero so the scales are positive, and Ω an lkj_corr(2) prior. See the manual for details on these priors.
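As a quick numerical check of the Σ = diag(τ) Ω diag(τ) construction, here is a small R illustration (not in the original), using values close to the Stata estimates above:

tau   <- c(1.32, 1.95)                    # standard deviations of u1, u2
omega <- matrix(c(1, .66, .66, 1), 2, 2)  # correlation matrix
sigma <- diag(tau) %*% omega %*% diag(tau)
sigma   # diagonal is tau^2; off-diagonal is .66 * 1.32 * 1.95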
The final hurdle is to define the probabilities. Rather than expand both the fixed effects beta and the random effects u to add zeroes for the omitted category, I decided to compute a linear predictor by adding the k-1 fixed and random effects, and then expand that by adding a zero. This is all done in a local block, as shown in the code above and mirrored in R below.
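For concreteness, this R fragment (an illustration, not part of the model) mirrors what the local block computes for one observation n in group g:

probs_n <- function(beta, u, x_n, g) {
  xb <- as.vector(beta %*% x_n) + u[g, ]   # K-1 fixed plus random effects
  exp(c(0, xb)) / sum(exp(c(0, xb)))       # prepend baseline zero, then softmax
}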
Here’s an actual run:
> sf2 <- stan(model_code=sd_model,data=sd_data,iter=100,chains=1)
TRANSLATING MODEL 'sd_model' FROM Stan CODE TO C++ CODE NOW.
COMPILING THE C++ CODE FOR MODEL 'sd_model' NOW.
... output suppressed, including informational messages ...
... warning about non-positive definite matrices ...
Iteration: 100 / 100 [100%] (Sampling)
Elapsed Time: 138.088 seconds (Warm-up)
106.526 seconds (Sampling)
244.614 seconds (Total)
> stan.print(sf2)
Inference for Stan model: sd_model.
1 chains, each with iter=100; warmup=50; thin=1;
post-warmup draws per chain=50, total post-warmup draws=50.
mean se_mean sd 50% n_eff Rhat
beta[1,1] -0.214 0.028 0.195 -0.212 50 0.986
beta[1,2] 0.378 0.041 0.289 0.414 50 0.980
beta[1,3] 1.017 0.018 0.129 1.010 50 1.011
beta[1,4] -0.589 0.027 0.193 -0.582 50 0.980
beta[2,1] -2.017 0.059 0.265 -2.047 20 1.078
beta[2,2] 1.540 0.056 0.398 1.567 50 0.984
beta[2,3] 1.521 0.022 0.132 1.513 37 0.982
beta[2,4] -0.205 0.029 0.208 -0.187 50 0.990
omega[1,1] 1.000 0.000 0.000 1.000 50 NaN
omega[1,2] 0.623 0.013 0.073 0.624 31 1.145
omega[2,1] 0.623 0.013 0.073 0.624 31 1.145
omega[2,2] 1.000 0.000 0.000 1.000 41 0.980
tau[1] 1.345 0.043 0.133 1.354 10 1.114
tau[2] 1.917 0.040 0.141 1.903 12 1.242
u[1,1] 0.747 0.206 1.455 0.755 50 0.990
u[1,2] 1.524 0.232 1.643 1.604 50 0.996
... suppressing the estimates of the random effects ...
u[362,1] 0.880 0.117 0.829 0.815 50 0.980
u[362,2] -0.175 0.159 1.114 -0.158 49 1.011
lp__ -1552.537 20.623 41.317 -1548.684 4 1.698
Samples were drawn using NUTS(diag_e) at Thu May 29 13:29:19 2014.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).
Again, we need to run this model much longer, but the preliminary results are pretty close to the Stata estimates. In particular, note the correlation of 0.623 compared to 0.659, and the standard deviations of 1.345 and 1.917 compared to 1.321 and 1.954, which is remarkable agreement for only 50 samples.
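To put the Stan draws on the same scale as the xtmlogit output, one can convert them to variances and a covariance (a sketch, not in the original):

d <- rstan::extract(sf2, pars = c("tau", "omega"))
colMeans(d$tau^2)                                # cf. var(u1) and var(u2)
mean(d$tau[, 1] * d$tau[, 2] * d$omega[, 1, 2])  # cf. cov(u1,u2)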
Extension to crossed random effects would require defining two maps, one for each grouping factor, and then two sets of random effects, each with its own omega and tau (and hence sigma).
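A hypothetical sketch of the expanded data list, where group2 stands in for a second grouping factor that this dataset does not actually have:

map1 <- as.integer(factor(sd$id))      # first grouping factor (individuals)
map2 <- as.integer(factor(sd$group2))  # group2 is hypothetical
sd_data <- list(N=length(y), K=3, K1=2, P=ncol(x), y=y, x=x,
                G1=max(map1), map1=map1,
                G2=max(map2), map2=map2)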