In this chapter we study log-linear models for count data under the assumption of a Poisson error structure. These models have many applications, not only to the analysis of counts of events, but also in the context of models for contingency tables and the analysis of survival data.
As usual, we start by introducing an example that will serve to illustrate regression models for count data. We then introduce the Poisson distribution and discuss the rationale for modeling the logarithm of the mean as a linear function of observed covariates. The result is a generalized linear model with Poisson response and link log.
Table 4.1, adapted from Little (1978), comes from the Fiji Fertility Survey and is typical of the sort of table published in the reports of the World Fertility Survey. The table shows data on the number of children ever born to married women of the Indian race classified by duration since their first marriage (grouped in six categories), type of place of residence (Suva, other urban and rural), and educational level (classified in four categories: none, lower primary, upper primary, and secondary or higher). Each cell in the table shows the mean, the variance and the number of observations.
Table 4.1. Number of Children Ever Born to Women of Indian Race
By Marital Duration, Type of Place of Residence and Educational Level
| Marr. Dur. | Statistic | Suva N | LP | UP | S\(+\) | Urban N | LP | UP | S\(+\) | Rural N | LP | UP | S\(+\) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0–4 | Mean | 0.50 | 1.14 | 0.90 | 0.73 | 1.17 | 0.85 | 1.05 | 0.69 | 0.97 | 0.96 | 0.97 | 0.74 |
| | Var. | 1.14 | 0.73 | 0.67 | 0.48 | 1.06 | 1.59 | 0.73 | 0.54 | 0.88 | 0.81 | 0.80 | 0.59 |
| | \(n\) | 8 | 21 | 42 | 51 | 12 | 27 | 39 | 51 | 62 | 102 | 107 | 47 |
| 5–9 | Mean | 3.10 | 2.67 | 2.04 | 1.73 | 4.54 | 2.65 | 2.68 | 2.29 | 2.44 | 2.71 | 2.47 | 2.24 |
| | Var. | 1.66 | 0.99 | 1.87 | 0.68 | 3.44 | 1.51 | 0.97 | 0.81 | 1.93 | 1.36 | 1.30 | 1.19 |
| | \(n\) | 10 | 30 | 24 | 22 | 13 | 37 | 44 | 21 | 70 | 117 | 81 | 21 |
| 10–14 | Mean | 4.08 | 3.67 | 2.90 | 2.00 | 4.17 | 3.33 | 3.62 | 3.33 | 4.14 | 4.14 | 3.94 | 3.33 |
| | Var. | 1.72 | 2.31 | 1.57 | 1.82 | 2.97 | 2.99 | 1.96 | 1.52 | 3.52 | 3.31 | 3.28 | 2.50 |
| | \(n\) | 12 | 27 | 20 | 12 | 18 | 43 | 29 | 15 | 88 | 132 | 50 | 9 |
| 15–19 | Mean | 4.21 | 4.94 | 3.15 | 2.75 | 4.70 | 5.36 | 4.60 | 3.80 | 5.06 | 5.59 | 4.50 | 2.00 |
| | Var. | 2.03 | 1.46 | 0.81 | 0.92 | 7.40 | 2.97 | 3.83 | 0.70 | 4.91 | 3.23 | 3.29 | – |
| | \(n\) | 14 | 31 | 13 | 4 | 23 | 42 | 20 | 5 | 114 | 86 | 30 | 1 |
| 20–24 | Mean | 5.62 | 5.06 | 3.92 | 2.60 | 5.36 | 5.88 | 5.00 | 5.33 | 6.46 | 6.34 | 5.74 | 2.50 |
| | Var. | 4.15 | 4.64 | 4.08 | 4.30 | 7.19 | 4.44 | 4.33 | 0.33 | 8.20 | 5.72 | 5.20 | 0.50 |
| | \(n\) | 21 | 18 | 12 | 5 | 22 | 25 | 13 | 3 | 117 | 68 | 23 | 2 |
| 25–29 | Mean | 6.60 | 6.74 | 5.38 | 2.00 | 6.52 | 7.51 | 7.54 | – | 7.48 | 7.81 | 5.80 | – |
| | Var. | 12.40 | 11.66 | 4.27 | – | 11.45 | 10.53 | 12.60 | – | 11.34 | 7.57 | 7.07 | – |
| | \(n\) | 47 | 27 | 8 | 1 | 46 | 45 | 13 | – | 195 | 59 | 10 | – |
In our analysis of these data we will treat the number of children ever born to each woman as the response, and her marriage duration, type of place of residence and level of education as three discrete predictors or factors.
A random variable \( Y \) is said to have a Poisson distribution with parameter \( \mu \) if it takes integer values \( y=0,1,2,\ldots \) with probability
\[\tag{4.1}\Pr\{Y=y\} = \frac{ e^{-\mu} \mu^y } { y! }\]for \( \mu > 0 \). The mean and variance of this distribution can be shown to be
\[ E(Y) = \mbox{var}(Y) = \mu. \]Since the mean is equal to the variance, any factor that affects one will also affect the other. Thus, the usual assumption of homoscedasticity would not be appropriate for Poisson data.
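As a quick numerical check of this identity, the following minimal Python sketch (assuming the scipy package is available; the value \( \mu = 2.5 \) is an arbitrary choice) computes the mean and variance directly from the probabilities in Equation 4.1:

```python
# Compute E(Y) and var(Y) directly from the Poisson probabilities;
# both sums should equal mu (truncating the negligible upper tail).
from scipy.stats import poisson

mu = 2.5
ks = range(200)
mean = sum(k * poisson.pmf(k, mu) for k in ks)
var = sum((k - mean) ** 2 * poisson.pmf(k, mu) for k in ks)
print(mean, var)  # both print as 2.5 up to floating-point error
```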
The classic text on probability theory by Feller (1957) includes a number of examples of observations fitting the Poisson distribution, including data on the number of flying-bomb hits in the south of London during World War II. The city was divided into 576 small areas of one-quarter square kilometer each, and the number of areas hit exactly \( k \) times was counted. There were a total of 537 hits, so the average number of hits per area was 0.9323. The observed frequencies in Table 4.2 are remarkably close to a Poisson distribution with mean \( \mu=0.9323 \). Other examples of events that fit this distribution are radioactive disintegrations, chromosome interchanges in cells, the number of telephone connections to a wrong number, and the number of bacteria in different areas of a Petri plate.
Table 4.2. Flying-bomb Hits on London During World War II
| Hits | 0 | 1 | 2 | 3 | 4 | \(5+\) |
|---|---|---|---|---|---|---|
| Observed | 229 | 211 | 93 | 35 | 7 | 1 |
| Expected | 226.7 | 211.4 | 98.6 | 30.6 | 7.1 | 1.6 |
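The expected frequencies in Table 4.2 are easy to reproduce. The sketch below (again assuming scipy) scales the Poisson probabilities by the 576 areas and pools the final cell as \( \Pr\{Y \geq 5\} \):

```python
# Expected counts for Table 4.2: 576 areas with mu = 537/576 hits each;
# the last cell pools five or more hits via the survival function.
from scipy.stats import poisson

areas, hits = 576, 537
mu = hits / areas  # about 0.9323
expected = [areas * poisson.pmf(k, mu) for k in range(5)]
expected.append(areas * poisson.sf(4, mu))  # Pr(Y >= 5)
print([round(e, 1) for e in expected])
# approximately [226.7, 211.4, 98.5, 30.6, 7.1, 1.6]
# (small rounding differences from Table 4.2 are possible)
```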
The Poisson distribution can be derived as a limiting form of the binomial distribution if we consider the distribution of the number of successes in a very large number of Bernoulli trials with a small probability of success in each trial. Specifically, if \( Y \sim B(n,\pi) \) then the distribution of \( Y \) as \( n \rightarrow \infty \) and \( \pi \rightarrow 0 \) with \( \mu = n \pi \) remaining fixed approaches a Poisson distribution with mean \( \mu \). Thus, the Poisson distribution provides an approximation to the binomial for the analysis of rare events, where \( \pi \) is small and \( n \) is large.
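The convergence is easy to verify numerically. The following minimal sketch (assuming scipy) compares the binomial \( B(n, \mu/n) \) and Poisson probabilities for fixed \( \mu = 2 \) as \( n \) grows:

```python
# As n grows with mu = n * pi held fixed, the binomial pmf approaches
# the Poisson pmf; the maximum absolute difference shrinks toward zero.
from scipy.stats import binom, poisson

mu = 2.0
for n in (10, 100, 10_000):
    max_diff = max(abs(binom.pmf(k, n, mu / n) - poisson.pmf(k, mu))
                   for k in range(20))
    print(n, max_diff)
```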
In the flying-bomb example, we can think of each day as one of a large number of trials where each specific area has only a small probability of being hit. Assuming independence across days would lead to a binomial distribution which is well approximated by the Poisson.
An alternative derivation of the Poisson distribution is in terms of a stochastic process described somewhat informally as follows. Suppose events occur randomly in time in such a way that the following conditions obtain:
1. The probability of at least one occurrence of the event in a given time interval is proportional to the length of the interval.
2. The probability of two or more occurrences of the event in a very small time interval is negligible.
3. The numbers of occurrences of the event in disjoint time intervals are mutually independent.
Then the probability distribution of the number of occurrences of the event in a fixed time interval is Poisson with mean \( \mu = \lambda t \), where \( \lambda \) is the rate of occurrence of the event per unit of time and \( t \) is the length of the time interval. A process satisfying the three assumptions listed above is called a Poisson process.
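A simulation makes the connection concrete. In the sketch below (a minimal illustration using numpy; the values of \( \lambda \) and \( t \) are arbitrary choices), events arrive with exponential waiting times of rate \( \lambda \), and the count of events in \( [0, t] \) behaves like a Poisson variable with mean \( \lambda t \):

```python
# Simulate a Poisson process: exponential inter-arrival times with
# rate lam; the number of arrivals before time t is Poisson(lam * t).
import numpy as np

rng = np.random.default_rng(seed=1)
lam, t, reps = 1.5, 4.0, 100_000

counts = np.empty(reps, dtype=int)
for r in range(reps):
    # 30 arrivals are ample, since lam * t = 6 events are expected
    arrivals = np.cumsum(rng.exponential(1 / lam, size=30))
    counts[r] = np.searchsorted(arrivals, t)  # arrivals before time t

print(counts.mean(), counts.var())  # both close to lam * t = 6.0
```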
In the flying-bomb example these conditions are not unreasonable. The longer the war lasts, the greater the chance that a given area will be hit at least once. Also, the probability that the same area will be hit twice on the same day is, fortunately, very small. Perhaps less obviously, whether an area is hit on any given day is independent of what happens in neighboring areas, contradicting a common belief that bomb hits tend to cluster.
The most important motivation for the Poisson distribution from the point of view of statistical estimation, however, lies in the relationship between the mean and the variance. We will stress this point when we discuss our example, where the assumptions of a limiting binomial or a Poisson process are not particularly realistic, but the Poisson model captures very well the fact that, as is often the case with count data, the variance tends to increase with the mean.
A useful property of the Poisson distribution is that the sum of independent Poisson random variables is also Poisson. Specifically, if \( Y_1 \) and \( Y_2 \) are independent with \( Y_i \sim P(\mu_i) \) for \( i=1,2 \) then
\[ Y_1 + Y_2 \sim P(\mu_1+\mu_2). \]This result generalizes in an obvious way to the sum of more than two Poisson observations.
An important practical consequence of this result is that we can analyze individual or grouped data with equivalent results. Specifically, suppose we have a group of \( n_i \) individuals with identical covariate values. Let \( Y_{ij} \) denote the number of events experienced by the \( j \)-th unit in the \( i \)-th group, and let \( Y_i \) denote the total number of events in group \( i \). Then, under the usual assumption of independence, if \( Y_{ij}\sim P(\mu_i) \) for \( j=1,2,\ldots,n_i \), then \( Y_i \sim P(n_i\mu_i) \). In words, if the individual counts \( Y_{ij} \) are Poisson with mean \( \mu_i \), the group total \( Y_i \) is Poisson with mean \( n_i \mu_i \). In terms of estimation, we obtain exactly the same likelihood function if we work with the individual counts \( Y_{ij} \) or the group counts \( Y_i \).
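A short simulation illustrates this equivalence (a minimal numpy sketch with arbitrary \( \mu_i \) and \( n_i \)): the maximum likelihood estimate of \( \mu_i \) is the same whether we use the individual counts or the group total.

```python
# The Poisson MLE of mu from n individual counts is the sample mean,
# which is exactly the group total divided by n.
import numpy as np

rng = np.random.default_rng(seed=2)
mu, n = 3.0, 50
y_ij = rng.poisson(mu, size=n)  # individual counts Y_ij ~ P(mu)
y_i = y_ij.sum()                # group total Y_i ~ P(n * mu)

print(y_ij.mean())  # MLE of mu from the individual data
print(y_i / n)      # MLE of mu from the group total: identical
```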
Suppose that we have a sample of \( n \) observations \( y_1, y_2, \ldots, y_n \) which can be treated as realizations of independent Poisson random variables, with \( Y_i \sim P(\mu_i) \), and suppose that we want to let the mean \( \mu_i \) (and therefore the variance!) depend on a vector of explanatory variables \( \boldsymbol{x}_i \).
We could entertain a simple linear model of the form
\[ \mu_i = \boldsymbol{x}_i'\boldsymbol{\beta}, \]but this model has the disadvantage that the linear predictor on the right hand side can assume any real value, whereas the Poisson mean on the left hand side, which represents an expected count, has to be non-negative.
A straightforward solution to this problem is to model instead the logarithm of the mean using a linear model. Thus, we take logs, calculating \( \eta_i = \log(\mu_i) \), and assume that the transformed mean follows a linear model, \( \eta_i = \boldsymbol{x}_i'\boldsymbol{\beta} \). In other words, we consider a generalized linear model with link log. Combining these two steps in one we can write the log-linear model as
\[\tag{4.2}\log(\mu_i) = \boldsymbol{x}_i'\boldsymbol{\beta}.\]In this model the regression coefficient \( \beta_j \) represents the expected change in the log of the mean per unit change in the predictor \( x_j \). In other words, increasing \( x_j \) by one unit is associated with an increase of \( \beta_j \) in the log of the mean.
Exponentiating Equation 4.2 we obtain a multiplicative model for the mean itself:
\[ \mu_i = \exp\{ \boldsymbol{x}_i'\boldsymbol{\beta} \}. \]In this model, an exponentiated regression coefficient \( \exp\{\beta_j\} \) represents a multiplicative effect of the \( j \)-th predictor on the mean. Increasing \( x_j \) by one unit multiplies the mean by a factor \( \exp\{\beta_j\} \).
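To make the interpretation concrete, here is a minimal sketch of fitting the log-linear model of Equation 4.2 to simulated data, assuming the numpy and statsmodels packages are available; the exponentiated slope recovers the multiplicative effect of the predictor.

```python
# Fit a Poisson GLM with log link to simulated count data; the true
# coefficients are (0.5, 0.8), so exp(0.8) ~ 2.23 multiplies the mean
# for each unit increase in x.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(seed=3)
x = rng.uniform(0, 2, size=500)
mu = np.exp(0.5 + 0.8 * x)  # log-linear mean, as in Equation 4.2
y = rng.poisson(mu)

X = sm.add_constant(x)  # adds the intercept column
fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
print(fit.params)          # estimates close to (0.5, 0.8)
print(np.exp(fit.params))  # multiplicative effects on the mean
```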
A further advantage of using the log link stems from the empirical observation that with count data the effects of predictors are often multiplicative rather than additive. That is, one typically observes small effects for small counts, and large effects for large counts. If the effect is in fact proportional to the count, working in the log scale leads to a much simpler model.