1 1 Lecture 13 Poisson Regression Using Stata 2 This lecture covers ? T he Univariate Poisson Distribution ? T he Poisson Regression Model ? T he Negative Binomial Distribution ? N egative Binomial Regression ? C omparing the Poisson and Negative Binomial Regression Models 2 3 Introduction ? T his lecture deals with the modeling of dependent variables that are event count or count data. An event count refers to the number of times an event occurs, and is the realization of a nonnegative integer-valued random variable. Variables that count the number of times that something has happened are common in the social scien ce s. 4 ? S ome examples of count variables are the number of times in a year that persons visit the doctor, the number of car accidents that occur each day in a city, the number of love affairs that occur to a university student during the four years, the number of industrial injuries in the workplace in a day, the number of cigarettes a person smokes in a day, etc. 3 5 ? I n demography, popular count variables are the number of children born to a woman, the number of pregnancies a woman has, the number of sexual intercou rse s a p e rson has in a week, the number of sexual partners in a year, the number of abortions a woman has had in her lifetime, the number of residential migrations a person makes in a lifetime, the number of jobs a migrant worker has done since s/he arrived, etc. 6 ? F requently, count variables are treated as though they are continuous and unbounded. OLS models are then used to estimate the effects of X variable s on thei r occurrence. OLS is appropriate if the dependent variable, the count, is independently and identically distributed. However, the use of OLS for count outcomes can result in inefficient, inconsistent and biased estimates if one or more of the OLS assumptions are not met. 4 7 ? T hus, models other than OLS models have been used to handle count data. This lecture will cover two: (1) the Poisson regression model, and (2) the negative binomial regression model. The software used in this lecture is Sta t a . (Poisson regre s sion is also available in SPSS under general log-linear model) Before the discu ssi on of th e Poisso n regre s sions, let’s first take a look at the univariatePoisson distribution. 8 The Univariate Poisson Distribution ? T he univariate Poisson distribution provides the benchmark for Poisson regression. Let Y equal a random variable that represents the number of times that an event has occurred during an interval of time. Y will have a Poisson distribution with a parameter μ greater than 0: 5 9 ? w here th e param e ter μ represents the expected number of counts that have occurred; for the distribution this will also be the mean. Thus, 10 Some properties of a Poisson distribution ?A s μ increases, the mass of the distribution shifts to the right; we’ll see this below in the sample graphs of the univariate Poisson distribution. 6 11 ? T he va ri ance of Y equals μ . The equality of the mean and the variance is known as equidis p ersion . Actually in practice, many co unt varia b les have a varia n ce greater than the mean, which is called overdispersion . Sometimes, therefore, the Poisson regression model is not entirely appropriate, often leading the analyst to the negative binomial regression model (to be discussed later). 12 ?A s μ increases, the pr obability of 0’s decreases. In a Poisson distribution, for μ = 0.8, the probability of an 0 is 0.45; for μ = 1.5, the probability of an 0 is 0.22; for μ = 2.9, the probability of an 0 is 0.05; for μ = 10.5, the probability of an 0 is 0.00002. ?A s μ in crea se s, the Poisso n distribution approximates a normal distribution. 7 13 ? H ere are four examples of univariate Poisson distributions, varying on their values o f μ . ? T he first Poisson distribution has μ =0.8. The second, μ =1.5. The third, μ =2.9. The fourth, μ =10.5. ? T he Stata c ommands to produce the graph are as below: 14 8 1516 ? A critical assumption of the Poisson distribution is that when an event occurs, it does not affect the probability of the event occurring in the future. If the “count” is children born to mothers, the assumption of independence implies that when a woman has a baby born to her, it does not affect the probability of another baby being born to her. In demography, however, future fertility is not independent from past fertility, and particularly in China, the next birth (or abortion) is heavily dependent upon the previous ones in the context of the strict family planning policy. 9 17 The Poisson Regression Model ? I n the Poisson regression model, the number of events (the dependent variable) is a nonnegative intege r; it has a Poisson distribution with a conditional mean that depends on the characteristics (the X variables) of the individuals according to the following structural model: 18 The Poisson regression model is a nonlinear model, predicting for each individual the number of times, μ , that the event has occurred. The X variables are related to μ nonlinearly. 10 19 The Poisson Regression Model Predic t ing the Number of Children Ever Born to Chinese Women ? W e are going to modeling the number of children ever born (CEB) to Chinese women from the 1997 survey. Before doing the Poisson regression, we would be wondering if the count data are Poisson distributed. 20 ? T hus we conducted an analysis of the count dependent variable to compare the observed distribution of the count data with a univariate Poisson distribution with the same mean as the count data. ? T he dependent variable is a count variable, namely, the number of children ever born to a woman. The variable is called “CEB”. Here are descriptive data on this variable for the sample women: 11 21 sum ceb, detail 22 ? T he “CEB” v ariable is a count variable, ranging from 0 to 9, with a mean of 1.855, a standard deviation of 1.125, and a variance of 1.266. The mean and the variance are not the same, as they are in a univariate Poisson distribution, but they are close. Unlike the case with many count variables, there is no overdispersion in the “CEB” v ariable. We will use Stata’s “ prcounts” c ommand to graph the distribution of “CEB” in a graph along with a univariate Poisson distribution that has a mean of 1.855. We can then see how closely the data are Poisson distributed. 12 23 ? F irst, we estimate a Poisson regression without any independent variables, so to be able to fit a univariate P oisson distribution with a mean equal to that of our count dependent variable, CEB, namely 1.855. ? w hen the Poisson regression model has no independent variables, the estimated model is reduced to: 24 poisson c eb, nolog 13 25 ? O bserve that the Poisson intercept in this model, which has no independent variables, is .6178104 . We exponentiatethis value, that is, e .6178104 = 1.854862 , which is indeed the mean of the “CEB”variable. ? N ow we use the “prcounts” c ommand to graph the observed distribution of the CEB variable with a univariate P oisson distribution with a mean of 1.855. 26 Stata c ommand: prcounts c ebprob, plot max(10) label var c ebprobobeq "Observed CEB Distribution" label var c ebprobpreq " Univariate P oisson, mu = 1.855"label var c ebprobval "Number of Children Ever Born" graph twoway connected cebprobobeq cebprobpreq cebprobval, ytitle("Proportion or Probability") ylabel(0(.1).4) xlabel(0(1) 9) title("CEB D istribution and Poisson Distribution with mu = 1.855") 14 27 0 .1 .2 .3 .4 Proportion or Probability 0 1 2 3 4 5 6 7 8 9 Number of Children Ever Born Observed CEB Distribution Univariate Poisson, mu = 1.855 CEB Distribution and Poisson Distribution with mu = 1.855 28 ? T he graph shows that the children ever born variable is pretty much Poisson distributed. The univariate Poisson distribution over- predicts the observed CEB distribution at the count of zero, and under-predicts counts of 1 and 2; the remaining counts are pretty close. The observed CEB distribution, compared to the univariate P oisson distribution with a mean, μ , of 1.855, has substantially fewer 0’s, and more cases in the earlier counts. Even though the two distributions are not perfect, we may conclude that we are correct in estimating the CEB dependent variable with a Poisson model. 15 29 ? O ne reason for the failure of the pure Poisson distribution to perfectly fit the observed CEB data is that the rate of childbearing, i.e., the number of babies produced, μ , differs across the women. The univariate P oisson distribution with a mean of 1.855 does not take into account the heterogeneity of the sample women in their values of μ . So we need to extend the univariate Poisson distribution to the Poisson regression model, in which we assume that the observed CEB count for woman i is drawn from a Poisson distribution with mean μ , where μ i , is estimated from observed characteristics, that is, from X variables of the women. 30 Poisson regression is used to model CEB, with se ven X var i ables, namely: ? w oman’s age at menarche , in ye ars; ? a ge at first marriage ( afm ), in years; ? w oman’s exposure to the risk of childbearing ( fec u nd ), which is calculated in years for each woman, is the difference between her age at menarche and either, her age at sterilization, her age at menopause, or her age when the survey was conducted, whichever is less; 16 31 ? n umber of years of education completed ( eduyrs ); ? w hether the woman lives in an urban ar ea, coded 1 if yes, 0 if no; ? w hether the woman is a Han Chinese, coded 1 if yes, 0 if no; ? w hether the woman’s first pregnancy occurred after 1979; this variable is called policy , and is coded 1 if yes, 0 if no. 32 Here are summary descriptive data for the dependent variable and the seven independent variables:sum ceb m enarche afm f ecund eduyrs urban han policy 17 33 ? W e hypothesize that the older the woman at first menstruation, the greater the number of CEB to her; the greater the number of years of education, the less the CEB; if the first pregnancy occurred after 1980, CEB will be lower; urban women will have fewer CEB than rural women. ? W e will now estimate a Poisson regression model, predicting CEB with the above seven X variables. The Stata c ommand is poiss on followed by the dependent variable and then the seven X variables. 34 poisson ceb m enar che afm f ecund eduyrs urban han policy 18 35 ? W e may ask how well does this Poisson regression model of children ever born improve our ability to predict the probabilities of a woman having each number (i.e., each count) of children. ? W e use the “prcounts” command to calculate the predicted probabilities for each count of CEB. We will call these predicted probabilities “prceb”. 36 Stata c ommand: poisson ceb m enar che afm f ecund eduyrs u rban han p olicy prcount s prceb, plot max(10) label var p rcebpr eq "Prediction from PRM" graph twoway connected cebprobobeq cebprobpr eq prcebpreq c ebprobval, ytitle("Proportion or Probability") ylabel( 0 (.1).4) xlabel (0(1) 9) title("Di s tributions of CEB, Uni v ariate Poisson") sub("and Poisson Regression Model") 19 37 0 .1 .2 .3 .4 Proportion or Probability 0 1 2 3 4 5 6 7 8 9 Number of Children Ever Born Observed CEB Distribution Univariate Poisson, mu = 1.855 Prediction from PRM and Poisson Regression Model Distributions of CEB, Univariate Poisson 38 ? T he predicted probabilities generated by the Poisson regression model (PRM) are only slightly worse at predicting count 0 than the predictions generated by the univariate Poisson distribution; but for the most pa rt the P R M only result s in a modest improvement in the predictions. Both the PRM predictions, and the predictions generated by the univariatePoisson model, are still somewhat off the actual distribution of CEB. 20 39 Poisson Goodness of Fit ? T here is a formal “goodness of fit” test we may calculate, that co mpares the observed empirical distribution with the distribution predicted by the Poisson regression model. The null hypothesis ( H 0 ) is that there is no difference between the observed data and the modeled data, indicating that the model fits the data. So we are looking for a small value of chi-square, wi th a probability > 0.05. If we have a small chi-square, this would mean we have a model that fits the data. 40 ? T he Stata c ommand is poisgof ? T he goodness of fit test is good news for our model. It tells us that the model fits the data very well; specifically, the goodness of fit Chi2 test indicates that giv en the Poisson regression model we cannot reject the null hypothesis that our observed data are Poisson distributed. ? T he Stata p rintout for the Poisson regression equation also shows values of Pseudo R2 and the (likelihood ratio) LR Chi2 , which indicate that we have a statistically significant model. 21 41 Interpretation of Poisson Regression Coefficients ? T he interpretation of the actual Poisson regression coefficients are a little difficult. This owes to the fact that the predicted number of children ever born for each woman, μ i , is nonlinearly related to the X variable s. Thus it is more straightforward and easier to interpret the Poisson b coefficients in a multiplicative way. This is analogous to our discussion of odds ratios and percent change in odds ratios. 42 ? If we exponentiate t he Poisson b coefficient, we get what Stata r efers to as incide nce rate ratios ( IR R ); these are very similar to odds ratios. Stata w ill provide these if you ask for them (ask for irr ) after the comma following the poiss on equation command. ? A n easier way is to use the listcoef command and you get the Poisson b coefficients (1st column of data on slide page 43), the irr c oefficients (4th column of data on slide page 44, e^ b ), and other things. 22 43 poisson c eb menarche afm f ecund eduyrs urban han policy, irr 44 listcoef, help 23 45 The IRR coefficients are interpreted just like odds ratios: ? T he IRR for urban is 0.76497 = e -0.26792 . Being a urban women compared to being a rural woman multiplies the expected number of children by a factor of 0.765, holding the other X variables constant. ? W e can interpret the effect of the woman’s age at menarche as follows: for every additional year of the woman’s age at menarche, her CEB is multiplied by a factor of 1.063, holding the other X variables constant. 46 ? A n even better way to interpret the effects of the X variables is in terms of the percent change in the IRR coefficients; this is calculated just like we did in calculating the percent change in odds ratios. ? T he Stata c ommand listcoef, pe rcent will produce these calculations (in the 4th column of data on slide page 47). 24 47 listcoef, percent help 48 ? T hus for every additional year in age at menarche, a woman’s mean production of children increases by 6.3%, holding all other variables constant. For every additional year in education, a woman’s mean production of children decreases by 2.4%, holding all other variables constant. Being a urban woman decreased the expected production of children by -23.5%, holding all other variables constant. 25 49 ? A lso, we can interpret the Poisson effects in a sta ndardized way, by calculating IRR coefficients standardized on the X variab le. This is done by exponentiating the Poisson b coefficient after it has been multiplied by its standard deviation. These values are reported in the Stata listcoef table (5th column of data on slide page 44). We may also interpre t these standardized IRR coefficients in terms of percent change. These values are reported in the Stata listcoe f, percent table (5th column of data on slide page 47). 50 ? W e may use these standardized coefficients to gauge the relative impacts on fertility of all seven of the X var i ables. ? T hus, from the above table, we see that the woman’s exposure to the risk of childbearing has the greatest impact, followed by the age at first marriage variable, followed by the age at menarche variable, followed by the urban variable, followed by the education variable, followed by the han v ariable, followed by the policy variable. 26 51 Negative Binomial Regression ? P oisson regression model rarely fits in practice since in most applications the variance of the count data is greater than the mean. Often, if there is overdispersion, the Poisson estimates will be consistent, but will be inefficient. The standard errors will be biased downwards, resulting in spuriously large z-values. Thus if there is overdispersion in data an alyzed, t h e z-te st s will tend to over estima te the significance of the X va riables. 52 ? S tatisticians thus recommend that when there is overdispersion in the count data that the dependent variable be estimated with Negative Binomial regre s sion ; this alternate approach is used when the count event that is being estimated has extra-Poisson variation , that is, overdi sp ersion. 27 53 The Negati ve Binomial (NB) Distri bution ? T he expected value of Y for the Negative Binomial (NB) distribution is the same as for the Poisson distribution. But the variance for the NB distribution is larger. The larg er va ria n ce of NB thus increa se s the relative frequency of low and high counts. A comparison of the Poisson and NB distri butions shows t h at the NB distribution corrects a number of sources of poor fit that are often found when the Poisson distribution is used. 54 ? One , the variance of the NB distribution exceeds the variance of the Poisson distribution for a given mean; ? Two , th e increa sed vari ance of t h e NB regression model results in substantially larger probabilities for small counts; and ? Finally , in the NB distribution there are slightly larger probabilities for larger count s. 28 55 ? I n the negative binomial regression model, variation in μ is due both to variation in (the independent variables) among the individuals (in the sample population) and to unobserved heterogeneity introduced by ε . The term ε i s a ran d o m erro r that is assumed to be uncorrelated with the independent variables. ε may be thought of either as the combined effects of unobserved variables that have been omitted from the model or as another sour ce o f pure ra ndomne ss. 56 ? T he negative binomial regression model thus adds to the Poisson regression model the erro r term ε according to the following stru ctu r al equa t ion: ? If there is no overdispersion, the NB r e gr ess i on model r e duces to the P o is s o n r e gr ess i on model. 29 57 Estimating a Negative Binomial Regression Model of Chinese Fertility ? S tata estimates NB regression models, with the command nbre g , followed by the dependent and independent variables. We will estimate an NB regression model, using the same 1997 survey children ever born data set, as used above. 58 ? W e know that there is no overdispersion i n these CEB data; there is actually some underdispersion. Since there is no overdispersion in the data, the NB regression model should not do us any good, and the results should be the same as in the P o is s o n r egr ess i on model. Let’s see if this happens. nbreg ceb m enarche afm f ecund eduyrs urban han policy 30 5960 ? S tata first fits (estimates) the Poisson regression model; see the first panel of output -- these are the initial and final log likelihoods from the Poisson model estimated earlier. ? S tata then estimates the NB constant-only model, with a final log likelihood of -6046.6406. This is the beginning value of the log likelihood; X variables are then introduced, with the objective of reducing the value of this log likelihood. ? T he final log likelihood of the fitted model is shown above as -5560.9586. 31 61 ? It is exactly the same value (and based on the same calculations) as in the Poisson regression model. ? A lso, compare the values of the NB regression coefficients with the coefficients produced in the Poisson regression model. They too are nearly identical. 62 ? N ote near the bottom of the NB regression table values for ln(alpha) and alpha . The alpha value represents the amount of dispersion in the data. The Stataparameterization is in terms of ln(alpha), but the value of alpha is given right below. If alpha is zero, the NB regression model reduces to the Poisson regression model. There is no evidence at all of any overdispersion in the fertility (CEB) data; alpha = 8.62e-12, that is, .00000000000862. Alpha is for all practical purposes, zero. 32 63 ? W e see at the very base of the NR regression table, a likelihood ratio t est of the H 0 that alpha = 0. This is calculated by multiplying 2 times the difference between the final log likelihoods of the NB model and the Poisson model: ? T his value of 0 means that the observed data are Poisson distributed. So, there is no need to estimate a negative binomial regression model; it reduced to the Poisson regression model. 64 Estimating Negative Binomial Regression Models of Scientific Productivity of Graduate Students ? W e’ll now use Stata t o estimate an NB regression model of scientific productivity of graduate students, using data from the examples in “Regression Models for Categorical Dependent Variables Using Stata” by Scott Long and Jeremy Freese. This research on scientific productivity examined the factors ( X variables) affecting the number of papers published during graduate school (in the last 3 years) by a sample of 915 biochemists. 33 65 ? T he dependent variable is number of articles ( art ) published by the 915 graduate students. Five X variables are intro duced as predictors of the number of articles published by the scientists while they were in graduate school. ? T he X variables are: (1) Female (fem), scored 1 if female, 0 if male; (2) Marital status (mar), sco red 1 if marri ed, 0 if not; (3) Number o f children aged 5 or younger (kid5), ranging from 0 to 3; (4) Prestige of the scientist’s PhD department (phd), an interval variable ranging from 0.76 to 4.62; (5) Articles published by the scientist’s mento r (ment ) in last 3 year s, ranging from 0 to 77. 66 Here are summary descriptive data for this count dependent variable and these five X variables: sum art fem mar kid5 phd m ent 34 67 The count dependent variable (art) has a mean of 1.693, and a variance of 3.7097. There is a lot of overdispersion in these data.Here is the Poisson regression model results: 68 poisson a rt fem mar kid5 phd ment 35 69 Since we know there is a lot of overdispersion in the dependent variable, a Poisson regression model is likely not appropriate. We will thus estimate an NB regression model, using the same scientific productivity data set, as just used above. nbreg art fem mar kid5 phd ment 70 36 71 ? N ow, note near the bottom of the NB regression table values for ln(alpha) and alpha . If alpha is zero, the NB regression model reduces to the Poisson regression model. This was the case in the CEB example. But it is not the case here. ? H ere there is strong ev idence that there is overdispersion in the data; alpha = .4416, with a standard error of .0530, with a corresponding z value of 8.3, which is significant at .01. Alpha is significantly different from zero. 72 We see at the very base of the NR regression table, a likelihood ratio test of the H 0 that alpha = 0. This is calculated by multiplying 2 times the difference between and final log likelihoods of the NB model and the Poisson model:This is a highly significant value, meaning that the observed data are not Poisson distributed. 37 73 Interpretation of the Negative Binomial Regression Coefficients ? T hey are interpreted in exactly the same way as in the Poisson model. Strata’s listcoef comman d also works afte r estimating the NB model. 74 listcoef, help 38 75 ? T he IRR coefficients ( e^b) a r e interprete d just like odds ratios. The IRR for fem is 0.8054. Being a female scientist decreases (multiplie s) the expected number of articles by a factor of 0.8, holding other variables constant. Being married increases (mul tiplies) the number of articles by a factor of 1.16, holding other X variables constant. We can interpret the effect of the mentor’s productivity as follows: for every additional article of the mentor, a scientist’s average productivity increases by a factor of 1.03, holding the other X variables constant. 76 listcoef, percent help 39 77 ? T hus for every additional article published by the scientist’s mentor, a scientist’s mean productivity increases by 3%, holding all other variables constant. Being a female scientist decreases the expected productivity by 20%, holding all other variables constant. For every addition kid, a scienti st’s mea n produ ctivity d e crea se s by 16%, holding all other variables con s tant . 78 Comparing the Poisson and Negati ve Binomial Regression Models ? N ow let’s plot the predicted probabilities of scientific publication counts, as produced by the NB regression model, with those produced by the Poisson regression model, along with the observed proportions in the actual data. 40 79 0 .1 .2 .3 .4 Pr o p o r t i o n o r Pr o b a b ilit y 0 1 2 3 4 5 6 7 8 9 10 N u m ber of A r t i c l es O b se rve d P o i sso n NB N u m b e r of A r t i cl es D i st r i bu t i on 80 ? T his chart shows us graphically that the NB model fits the observed publication count data quite well, and does so in a much better way than the Poisson model. ? W e will also compare the scientific productivity resu lts of the Poisson regression with the negative binomial regression (Table on the next slide). For the most part, the Poisson and the NB regre s sion coeffi cients a r e ver y similar; they are not at all far apart. 41 81 Poisson Model Negative Binomial Model 82 ? T he sta n dard errors are uniformly lower i n the Poisson model than in the NB model, resulting in higher z-tests in the Poisson model than in the NB model. The differences in the Poisson and NB models regarding the standard errors (lower standard errors in the Poisson model) are precisely the expected consequence of overdispersion. ? I ndeed the marriage variable ( mar ) is shown to be significant in the Poisson model, but not in the NB model. ? It would be better off using the Negative Binomial model to predict the count of scientific productivity. 42 83 ? T he “rule of thumb” is to first estimate the simpler model, the Poisson regression model, and then the NB model. If there is significant overdispersion, you are safer to report the results of the Negative Binomial model. If there is not, use the Poisson model.