Lecture 12
Ordinal Logistic Regression

This lecture briefly introduces ordinal logistic regression
• The context and data type
• The ordinal logistic regression equation
• Fitting an ordinal logistic regression
• Results and interpretation
• An illustrative example of fertility analysis using Stata

The context
• There are many contexts in which a variable is ordinal and has three or more categories.
• Some typical examples are health status (very good, good, so-so, bad, very bad), political ideology (very liberal, slightly liberal, moderate, slightly conservative, very conservative), and fertility intention (the more the better, two, one, none).
• In these examples, the categories are ordered, but the distance between adjacent categories is unknown and need not be equal.
• There are two common ways of sidestepping the ordinal nature of such a variable, each with a cost:
• Treat the variable as though it were continuous and use OLS regression. This is widely done, particularly when the dependent variable has 5 or more categories. However, it will often result in biased estimates of the regression parameters.
• Ignore the ordering of the categories and treat the variable as nominal, i.e., use the multinomial logit model (MNLM). The key problem is a loss of efficiency. By ignoring the fact that the categories are ordered, you fail to use some of the information available to you, and you may estimate many more parameters than necessary. This increases the risk of getting insignificant results, but your parameter estimates should still be unbiased.

Data type
• As in other logistic regressions, the predictors in ordinal logistic regression may be quantitative, categorical, or a mixture of the two. The dependent variable should be discrete and ordinal with three or more categories.
• In SPSS, discrete (categorical) variables are entered as factors, and continuous variables as covariates.

The Ordered Logit Model (OLM)
• Say Y is an ordinal dependent variable with c categories. Let Pr(Y ≤ j) denote the probability that the response on Y falls in category j or below (i.e., in category 1, 2, …, or j).
  This is called a cumulative probability. It equals the sum of the probabilities in category j and below:
  Pr(Y ≤ j) = Pr(Y = 1) + Pr(Y = 2) + … + Pr(Y = j)
• A dependent variable Y with c categories has c cumulative probabilities: Pr(Y ≤ 1), Pr(Y ≤ 2), …, Pr(Y ≤ c). The final cumulative probability uses the entire scale; therefore, Pr(Y ≤ c) = 1. The order in which the cumulative probabilities are formed reflects the ordering of the dependent variable scale, and the probabilities themselves satisfy:
  Pr(Y ≤ 1) ≤ Pr(Y ≤ 2) ≤ … ≤ Pr(Y ≤ c) = 1
• In ordered logit, an underlying score for an observation is estimated as a linear function of the independent variables, together with a set of threshold points (also called cut points) that separate the response categories.
• The probability of observing response category i corresponds to the probability that the estimated linear function, plus random error, falls within the range of the threshold points estimated for that response.
• For the jth observation, with n independent variables:

$$\Pr(\text{response category for the } j\text{th observation} = i) = \Pr\left(k_{i-1} < b_1 X_{1j} + b_2 X_{2j} + \dots + b_n X_{nj} + u_j \le k_i\right)$$

• One estimates the coefficients b_1, b_2, …, b_n along with the threshold points k_1, k_2, …, k_{c-1}, where c is the number of possible response categories of the dependent variable (the lowest category has no lower threshold and the highest has no upper threshold). All of this is a direct generalization of the binary logistic model.
• The coefficients and threshold points are estimated using maximum likelihood. In the parameterization of SPSS, no constant appears because its effect is absorbed into the threshold points.
• The SPSS output provides single values for the b coefficients. The b coefficients (one for each X variable) are the main items of interest in the ordered logit table. (One advantage of using Stata is that odds ratios are available.)
• When b = 0, X has no effect on Y. The effect of X increases as the absolute value of b increases.
• There are no separate b coefficients for each of the outcomes (or for the number of outcomes minus one, as in multinomial logistic regression, where we considered logistic regression with a nominal dependent variable).
• In the OLM, a particular b coefficient takes the same value in the logit for each cumulative probability. The model assumes that the effect of X is the same for each cumulative probability. This cumulative logit model with common effects is often called a "proportional odds" model.

Estimating an ordered logit model
• The explication of the OLM is facilitated by considering an example using the 1997 data. Suppose that the response variable is the health status of children, which is captured by question 302F:
  F. Health conditions of live births?
  1) Healthy
  2) Basically healthy
  3) Sick but not disabled
  4) Congenitally disabled
  5) Disabled after birth
  6) Dead
  7) N/A

We are going to examine the effect on child health of maternal age at childbearing, residence, ethnicity, education, duration of breastfeeding, and child sex. We recode the health status variable into 4 categories: (1) healthy, (2) basically healthy, (3) sick or disabled, and (4) dead, as shown in the following table (we restrict our sample to children aged 0-5):

HEALTH4
                             Frequency   Percent   Valid Percent   Cumulative Percent
Valid    healthy                  1121      75.8            89.3                 89.3
         basically healthy          90       6.1             7.2                 96.5
         sick or disabled           15       1.0             1.2                 97.7
         dead                       29       2.0             2.3                100.0
         Total                    1255      84.9           100.0
Missing  System                    224      15.1
Total                             1479     100.0

We are going to fit the following equation:

$$\ln\left[\frac{P(Y \le j)}{1 - P(Y \le j)}\right] = a_j + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$$

Dependent variable: health status, denoted as health4 (4 categories: healthy, basically healthy, sick or disabled, and dead).
Independent variables:
  MAC: maternal age at childbearing, an interval variable
  Par_num: parity, an interval variable
  Bfeed: duration of breastfeeding, an interval variable
  Chdsex: child sex, 1 if a girl, 0 otherwise
  Urban: place of residence, 1 if urban, 0 otherwise
  Han: 1 if Han, 0 otherwise
  Primary: 1 if primary school, 0 otherwise
  Junior: 1 if junior middle school, 0 otherwise
  Sencol: 1 if senior middle school and over, 0 otherwise

• Our hypothesis is that both child and maternal characteristics affect child survival. Women in higher socio-economic categories will be more likely to have healthier children. Prolonged breastfeeding is associated with an increased probability of a child being healthy. The practice of discrimination against girls suggests that a girl is more likely to be in worse health than a boy.

The ordinal logistic regression equation in our example:

$$\ln\left[\frac{P(Y \le j)}{1 - P(Y \le j)}\right] = a_j + b_1\,\mathit{Mac} + b_2\,\mathit{Par\_num} + b_3\,\mathit{Bfeed} + b_4\,\mathit{Chdsex} + b_5\,\mathit{Urban} + b_6\,\mathit{Han} + b_7\,\mathit{Primary} + b_8\,\mathit{Junior} + b_9\,\mathit{Sencol}$$

SPSS syntax:

  PLUM health4 WITH mac par_num bfeed chdsex urban han primary junior sencol
    /CRITERIA = CIN(95) DELTA(0) LCONVERGE(0) MXITER(100) MXSTEP(5) PCONVERGE(1.0E-6) SINGULAR(1.0E-8)
    /LINK = LOGIT
    /PRINT = FIT PARAMETER SUMMARY.

A positive and statistically significant coefficient estimate implies that the corresponding explanatory variable significantly increases the probability that the child is healthy, while a negative and statistically significant coefficient estimate implies that the corresponding explanatory variable significantly increases the probability that the child dies after birth. Thus, higher education of mothers significantly increased the likelihood of having a healthy child, as did a longer duration of breastfeeding. The coefficients of the other explanatory variables, however, are not significant.

An example of fertility analysis
• Dependent variable: number of children ever born (three categories: none, few, multiple)
• Independent variables: individual characteristics (age, place of residence, ethnicity, education)

Using Stata to Estimate an Ordered Logit Model of Chinese Fertility
• The dependent variable is called CEB3, an ordinal variable scored 1 if the woman has no births, 2 if the woman has few (1-2) births, and 3 if the woman has multiple (3+) births. Thus, the dependent variable has three outcomes: none, few, multiple.
• Ordinal logistic regression is used to model the CEB3 dependent variable; the X variables are AGE (in years) and six dummy variables representing place of residence, ethnicity and education: URBAN, HAN, PRIMARY, JUNIOR, SENIOR, COLLEGE.
• The Stata command is ologit, followed by the dependent variable and then the independent variables.
• Note that each of the seven logit coefficients has a single value (unlike the situation in the last lecture, where we estimated a multinomial logistic regression).
• Note also the two cut points, cut1 = 0.92 and cut2 = 6.53; these are the so-called ancillary parameters. Their values assist us in calculating, for each woman, the probability of her being in each of the three outcomes of the CEB3 dependent variable; they also assist in interpreting the logit coefficients and their odds ratios.

Model Fit
• From last week's lecture, we already know how to evaluate the adequacy or fit of the overall model. The LR χ² test statistic for the full model has a value of 1429.64, which is the difference between the values of (-2L0) and (-2LF). With 7 degrees of freedom, this statistic has a reported P of 0.0000 (that is, P < 0.0001) for testing the null hypothesis that β1 = β2 = … = β7 = 0.
• The pseudo R² is 0.2384, indicating a fairly good fit of the model to the data.
• All seven logit coefficients have very high z scores, and all are significant at the 0.01 level, meaning that the seven logit coefficients are all significantly different from 0 and that each variable has a significant influence on fertility.

The Ordered Logit Coefficients
• In summary, the coefficients tell us that older women have more children, educated women have fewer children, urban women have fewer children than rural women, and Han women have fewer children than minority women.
• Each coefficient gives the linear change in the log odds of being above either of the first two categories for a one-unit increase in the X variable.
• Other things equal, with each one-year increase in age, there is an increase of 0.17 in the log odds of CEB3 being above either of the two fixed levels, that is, the two fixed levels of none or few.
• Other things equal, the log odds for urban women of having a CEB3 value above either of the two fixed levels are 1.14 lower than for rural women (b = -1.14).
• Other things equal, the log odds for Han women of having a CEB3 value above either of the two fixed levels are 0.66 lower than for minority women (b = -0.66).
• Other things equal, the log odds for women with primary school, junior middle school, senior middle school, and college or higher education of having a CEB3 value above either of the two fixed levels are, respectively, 0.31, 0.64, 1.12, and 1.28 lower than for illiterate women (b = -0.31, -0.64, -1.12, and -1.28).
• Of course, each of these interpretations captures the linear effect of the particular X variable, holding all other variables constant.

Odds Ratios
• Stata's listcoef command will give the odds ratios, along with the ordered logit coefficients.
• Stata will give the percentage change in the odds when listcoef is followed by percent after the comma.

[Stata output: odds ratios (e^b) are in the 4th column of data]
[Stata output: percentage changes in the odds (%) are in the 4th column of data]
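The odds ratios and percentage changes that listcoef reports follow from the logit coefficients by simple arithmetic: the odds ratio is e^b, and the percentage change in the odds is 100(e^b - 1). The following is a minimal Python sketch of that arithmetic, not Stata output. It uses the coefficients quoted above, which are rounded to two decimals, so results may differ slightly from Stata's; the dictionary keys and function names are illustrative assumptions.

```python
import math

# Ordered logit coefficients as quoted in the text (rounded to two decimals).
# The keys are illustrative labels, not necessarily the dataset's variable names.
coefficients = {
    "age": 0.17,
    "urban": -1.14,
    "han": -0.66,
    "primary": -0.31,
    "junior": -0.64,
    "senior": -1.12,
    "college": -1.28,
}


def odds_ratio(b):
    """Odds ratio for a one-unit increase in X: exp(b)."""
    return math.exp(b)


def percent_change_in_odds(b):
    """Percentage change in the odds for a one-unit increase in X."""
    return 100.0 * (math.exp(b) - 1.0)


for name, b in coefficients.items():
    print(f"{name:8s} b={b:6.2f}  OR={odds_ratio(b):.2f}  "
          f"%change={percent_change_in_odds(b):6.1f}")
```

With these inputs, age gives an odds ratio of about 1.19, urban about 0.32 (odds roughly 68% lower), and college about 0.28 (odds roughly 72% lower).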
• With every one-year increase in age, the odds of being in a higher fertility outcome category are 1.19 times as large, that is, they increase by 19%, holding all other variables constant.
• Urban women have odds of being in a higher CEB3 category that are 68% lower than those of rural women, holding all other variables constant.
• Women with college or higher education have odds of being in a higher fertility category that are 72% lower than those of illiterate women, holding all other variables constant.
• What of the relative importance of the effects of these logit coefficients? Which of the seven X variables is the most influential in affecting the odds of a woman being in the next higher category of the fertility variable?
• According to the unstandardized logit coefficients, college is ranked first, followed by urban. The smallest is that of age.

Semi- and fully standardized ordered logit coefficients
• Semi-standardized ordered logit coefficients standardized on the X variable control for the metric of the X variable: for every one standard deviation increase in age, there is an increase of 1.35 in the log odds (i.e., the logit) of CEB3 being above either of the two fixed levels, that is, the two fixed levels of none or few, holding all other variables constant.
• Ordered logit coefficients standardized on the Y variable: for every increase of one year in age, there is an increase of 0.07 standard deviations in the woman's fertility, holding all other variables constant.
• The fully standardized ordered logit coefficient: for every one standard deviation increase in age, there is an increase of 0.556 standard deviations in the woman's fertility, holding all other variables constant.

Odds Ratios Standardized on the X Variable
• For every one standard deviation increase in age, the odds of the woman being in a higher fertility outcome category are 3.85 times as large, holding all other variables constant.
• Equivalently, with every one standard deviation increase in age, the odds of being in a higher fertility outcome category increase by 285%, holding all other variables constant.

Predicted Probabilities
• Use the command predict to calculate the predicted probabilities of each woman having no, few, and multiple births, based on the OLM.
• Use the command prgen to calculate the predicted probabilities of having no, few, and multiple births for women of different groups, based on the OLM.

Syntax for generating graphs of predicted probabilities based on the OLM

[Figure: Predicted probabilities of a woman being in each of the three outcomes of the dependent variable, by age. Panels: Probability of Having No Births, by Age; Probability of Having Few Births, by Age; Probability of Having Multiple Births, by Age; Predicted probabilities, by Age (no births, few births, multiple births). x-axis: age of woman, 15 to 50; y-axis: predicted probability, 0 to 1.]

Syntax for generating graphs of predicted probabilities by place of residence based on the OLM

[Figure: Predicted probabilities by place of residence. Panels: Predicted probabilities of Having No Births; of Having Few Births; of Having Multiple Births, each comparing urban and rural women. x-axis: age of woman, 15 to 50; y-axis: predicted probability, 0 to 1.]

Syntax for generating graphs of predicted probabilities by education based on the OLM

[Figure: Predicted probabilities by education. Panels: Predicted probabilities of Having No Births; of Having Few Births; of Having Multiple Births, each comparing illiterate, primary, junior, senior, and college. x-axis: age of woman, 15 to 50; y-axis: predicted probability, 0 to 1.]

A Last Note
• The category order of the dependent variable is important, while the actual numerical values used to label the three dependent variable categories make no difference as long as they capture the order.
  Original: 1 = none, 2 = few, 3 = multiple
  Re-labeled: 1 = none, 2 = few, 9 = multiple
• Ask Stata to re-score the "multiple" category of CEB3 as 9 instead of 3 and then re-run the ordered logit model; the results will be exactly the same as in the ordered logit table shown earlier, where the "multiple" category was scored as 3.
• Use the following syntax: [syntax not recovered from the slide]
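The reason the relabeling makes no difference can also be checked outside Stata: in the threshold form of the model given earlier, the category labels enter the likelihood only through their rank order. The following is a minimal Python sketch of that idea, not the lecture's Stata syntax. It assumes logistic errors, uses the rounded cut points (0.92, 6.53) and age coefficient (0.17) quoted above, and relies on three made-up women and an age-only linear predictor purely for illustration; all function names are hypothetical.

```python
import math


def logistic_cdf(z):
    """Cumulative distribution function of the standard logistic distribution."""
    return 1.0 / (1.0 + math.exp(-z))


def category_probabilities(xb, cuts):
    """Pr(Y = i) = F(k_i - xb) - F(k_{i-1} - xb), lowest to highest category."""
    cumulative = [logistic_cdf(k - xb) for k in cuts] + [1.0]
    probs, previous = [], 0.0
    for value in cumulative:
        probs.append(value - previous)
        previous = value
    return probs


def log_likelihood(labels, xbs, cuts):
    """Ordered logit log-likelihood; labels are used only to rank the outcomes.

    Assumes every category appears at least once in `labels`.
    """
    rank = {label: i for i, label in enumerate(sorted(set(labels)))}
    return sum(
        math.log(category_probabilities(xb, cuts)[rank[y]])
        for y, xb in zip(labels, xbs)
    )


cuts = [0.92, 6.53]                 # cut points quoted in the text (rounded)
ages = [20, 30, 45]                 # hypothetical women, for illustration only
xbs = [0.17 * age for age in ages]  # age-only linear predictor (a simplification)

original = [1, 2, 3]                # 1 = none, 2 = few, 3 = multiple
relabeled = [1, 2, 9]               # "multiple" re-scored as 9

print(log_likelihood(original, xbs, cuts))
print(log_likelihood(relabeled, xbs, cuts))
```

Both calls print the same value, and this holds for any parameter values, so the maximum likelihood estimates cannot change when an order-preserving relabeling is applied.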