Lecture 10: Logistic Regression

Why use logistic regression?

In linear regression,

    Y = b0 + b1*X1 + b2*X2 + ... + bn*Xn + e,

the dependent variable Y is continuous and unbounded, and we want to identify a set of explanatory (independent, or X) variables that will help us predict its mean value while explaining its observed variability.

In many situations in demography and the social sciences, however, the dependent variable Y is dichotomous rather than continuous: for example, whether or not a woman has had a second birth, whether a second birth is a male or a female baby, whether or not a woman uses any contraceptive method, whether or not a person has migrated in the last 5 years, whether or not People's University staff use public transportation to come to work, or whether or not an undergraduate student who completed study in the Demography Department of People's University last year was awarded a BA degree.

In all these situations the outcome Y takes only two values; usually the value 1 represents yes, or a "success", and the value 0 represents no, or a "failure". The mean of this dichotomous (also called binary) dependent variable, designated p, is the proportion of times that it takes the value 1.

Example: Obtaining an Abortion

The data in the following tables were derived from the 1997 survey, which contains information on abortion and associated characteristics. The table and the chart show the incidence of abortion according to the number of pregnancies. The proportion of women obtaining an abortion rises rapidly, from a very low proportion at the first pregnancy to a large proportion among women having 5 or more pregnancies.

PREG5 * whether abortion  Crosstabulation

             Whether abortion (count)        % within PREG5
  PREG5       no      yes     Total           no       yes
  1          880        9       889         99.0%     1.0%
  2          950      418      1368         69.4%    30.6%
  3          499      461       960         52.0%    48.0%
  4          230      271       501         45.9%    54.1%
  5          102      193       295         34.6%    65.4%
  Total     2661     1352      4013         66.3%    33.7%

[Bar chart: mean of "whether abortion" by PREG5, rising from about 0.01 at the first pregnancy to about 0.65 among women with five or more pregnancies.]

To make a statistical model of this relationship, we could conceivably fit a linear regression line to the cases, with pregnancy number as the explanatory variable and the dichotomous dependent variable (0 = no abortion, 1 = abortion). There are two main problems with this approach.

The estimated probability can be greater than 1 or less than 0

The first problem is that the fitted regression line can cross below zero and/or above one exactly in the range where we do not want it to, and indeed it does so in this case. The fitted line has the form

    p = -0.01314 + 0.13798*PREG

where p is the proportion having an abortion and PREG is the number of pregnancies. This line rises above 1 at about seven pregnancies, implying that more than 100 per cent of pregnancies of that order or higher are aborted, and predicted proportions can equally fall below zero. Apart from the fact that such results are impossible, we might still be inclined to accept the model in the limited range where its predictions are valid.
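To see the first problem concretely, here is a minimal Python sketch (not part of the original analysis) that evaluates the fitted straight line at a range of pregnancy numbers, using the two coefficients quoted above:

    # Linear-probability fit p = -0.01314 + 0.13798*PREG
    b0, b1 = -0.01314, 0.13798

    for preg in range(1, 11):
        p_hat = b0 + b1 * preg
        flag = "  <-- impossible value for a proportion" if not 0 <= p_hat <= 1 else ""
        print(f"PREG = {preg:2d}: predicted proportion = {p_hat:6.3f}{flag}")
    # The predicted proportion climbs steadily with pregnancy number and
    # exceeds 1 at high pregnancy orders, which a proportion cannot do.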
The linearity assumption is seriously violated

Accepting the model even in that limited range would be very dangerous, because of the second problem: the assumptions of linear regression are badly violated in this case. This can be seen clearly in the plots produced by the REGRESSION sub-commands, particularly the final scatterplot of the standardized residuals against the predicted values.

Residual plot

[Scatterplot: standardized residuals (vertical axis, roughly -3 to 3) against standardized predicted values (horizontal axis, roughly -2 to 6).]

Recall that this scatter diagram should show no pattern at all, as if a handful of stones had been dropped at the centre of the diagram. On the contrary, in this case it could not show a pattern more clearly. The pattern of two lines across the diagram arises because the dependent variable can take only two values (1 for having an abortion and 0 otherwise), so the residuals follow a binomial distribution, not a normal distribution.

- Because the linearity assumption is broken, the usual hypothesis-testing procedures are invalid.
- R-square tends to be very low. The fit of the line is poor because the response can only be 0 or 1, so the observed values do not cluster around the line.

Logistic Regression

To get around both problems, we instead fit a curve of a particular form to the data. This type of curve, known as a logistic curve, has the general form

    p = exp(b0 + b1*X) / (1 + exp(b0 + b1*X))

where p is the proportion at each value of the explanatory variable X, b0 and b1 are numerical constants to be estimated, and exp is the exponential function.

The curve has the following shape when the parameter b1 is positive, and its mirror image when the parameter is negative.

[Figure: the logistic curve, an S-shaped curve rising from near 0 to near 1 as X increases.]

Logistic Function

The logistic curve never takes values less than zero or greater than one. The way to fit it is to transform the definition of the logistic curve given above into a linear form:

    ln(p/(1-p)) = b0 + b1*X

The function on the left-hand side of this equation has various names, the most common being the "logit" and the "log-odds". The transformed equation has the general form of a linear model.

Probability and Odds

- A probability is the likelihood that a given event will occur: the frequency of a given outcome divided by the total number of all possible outcomes.
- The odds of an event are the likelihood of the event occurring compared to the likelihood of the same event not occurring:

    odds = p / (1 - p)

Taking the natural logarithm of each side of the odds equation yields an equation with the logit on the left-hand side:

    ln(p/(1-p)) = b0 + b1*X1 + b2*X2 + ... + bn*Xn

- The logit is a linear function of the X variables.
- The probability is a non-linear function of the X variables.

The Logistic Regression Model

    ln[p/(1-p)] = b0 + b1*X1 + b2*X2 + ... + bn*Xn

where:
- ln is the natural logarithm, loge, where e = 2.71828...
- p is the probability that the event Y occurs, p(Y=1)
- p/(1-p) is the "odds"
- ln[p/(1-p)] is the log odds, or "logit"
- all other components are the same as in linear regression.

Using Logistic Regression

We now fit a logistic regression with abortion as the dependent variable and pregnancy number as the explanatory variable.

SPSS Results

The most fundamental part of the output is the final table of coefficients, from which the estimation equation is derived:

Variables in the Equation
                        B     S.E.      Wald    df   Sig.   Exp(B)
  Step 1a  PREG       .691    .031    484.752    1   .000    1.996
           Constant -2.506    .092    741.147    1   .000     .082
  a. Variable(s) entered on step 1: PREG.
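The same model can be fitted outside SPSS. The following is a minimal Python sketch, not the procedure used in the lecture; the file name abortion_1997.csv and the column names abortion and preg are hypothetical stand-ins for the actual data set:

    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical input: one record per pregnancy, with a 0/1 abortion
    # indicator and the pregnancy number (names are assumed, not from the lecture)
    df = pd.read_csv("abortion_1997.csv")

    X = sm.add_constant(df[["preg"]])             # intercept plus pregnancy number
    result = sm.Logit(df["abortion"], X).fit()    # maximum-likelihood logit fit

    print(result.summary())       # coefficients, standard errors, log-likelihood
    print(result.params)          # compare with the B column of the SPSS table
    print(-2 * result.llf)        # the "-2 Log likelihood" figure reported by SPSS

The last line corresponds to the quantity SPSS reports in its Model Summary table, which is used below to judge the goodness of fit of the model.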
The estimation equation is

    ln(p/(1-p)) = -2.506 + 0.691*PREG

For any number of pregnancies (PREG) we can calculate the "log odds" directly from this equation, for example:

    pregnancy 2:  -2.506 + 0.691*2 = -1.1229
    pregnancy 4:  -2.506 + 0.691*4 =  0.2597
    pregnancy 6:  -2.506 + 0.691*6 =  1.6424

(The final decimal places reflect the unrounded coefficients used by SPSS.)

Because log-odds values have no immediate meaning for most people, it is helpful to remember that they correspond directly and uniquely to proportions. As a rough guide, the following table shows the correspondence between log odds, odds (the exponential of the log odds) and proportions, for log odds in the range -3 to +3:

  Log odds     Odds    Proportion
    -3         0.05       0.05
    -2         0.14       0.12
    -1         0.37       0.27
     0         1.00       0.50
     1         2.72       0.73
     2         7.39       0.88
     3        20.09       0.95

Notice that odds and proportions are nearly identical for small values (although to more decimal places the odds are always slightly higher). For our model, we can estimate the corresponding proportions from the log odds already calculated:

    pregnancy 2:  exp(-1.1229)/(1+exp(-1.1229)) = 0.25
    pregnancy 4:  exp(0.2597)/(1+exp(0.2597))   = 0.56
    pregnancy 6:  exp(1.6424)/(1+exp(1.6424))   = 0.84

Goodness of fit

The output gives us a good deal of other information, of which the most important concerns the likelihood ratio χ2 (called "-2 Log likelihood" in the output). We will call this quantity LR χ2, noting that it is the same quantity given by the chi-square statistic in the CROSSTABS procedure. It provides important information about the goodness of fit of the logistic regression model, and it appears in two places in the output:

- before the variable PREG is entered into the model, and
- after PREG is entered into the model.

Omnibus Tests of Model Coefficients
                 Chi-square    df    Sig.
  Step 1  Step     615.480      1    .000
          Block    615.480      1    .000
          Model    615.480      1    .000

Model Summary
  -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
       4512.823                .142                   .197

The initial value of LR χ2 (5128.303) is its value when only a constant term is in the model, that is, when b1 equals zero. After the variable PREG is included, it falls to 4512.823, a decrease of 615.480 on one degree of freedom. This decrease is interpreted as a χ2 statistic with one degree of freedom, and it is highly significant.

Notice that the three rows of the omnibus tests table, headed "Step", "Block" and "Model", all have the same content in this example, because the single explanatory variable was entered in a single step. Generally it is the "Step" row that is used to examine whether a change to the model at the previous step is worthwhile; here it clearly is. An R-square, analogous to that in linear regression, is also reported.

The output also contains a classification table, showing which cases are classified correctly and incorrectly by their predicted values based on pregnancy number:

Classification Table (cut value = .500)
                              Predicted
                          no      yes    Percentage correct
  Observed    no         2329     332          87.5
              yes         888     464          34.3
  Overall percentage                            69.6

Note that the model does not do especially well, getting only 70 per cent of cases correct, and only slightly over one-third of the women who actually had an abortion. This is to be expected if we try to predict abortion on the basis of pregnancy number and nothing else.
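As a quick cross-check of these goodness-of-fit and classification figures, here is a short Python sketch that uses only the numbers quoted from the SPSS output above (scipy is used for the χ2 tail probability):

    from scipy.stats import chi2

    m2ll_null  = 5128.303    # -2 log likelihood with only the constant in the model
    m2ll_model = 4512.823    # -2 log likelihood after PREG is entered

    lr_chi2 = m2ll_null - m2ll_model        # 615.480, as in the omnibus test
    print(f"LR chi-square = {lr_chi2:.3f}, p = {chi2.sf(lr_chi2, df=1):.3g}")

    # Overall accuracy implied by the classification table
    correct = 2329 + 464
    total = 2329 + 332 + 888 + 464          # 4013 cases in all
    print(f"Overall percentage correct = {100 * correct / total:.1f}%")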
Multivariate Logistic Regression

Suppose we want to examine the relationship between educational attainment and the likelihood that a pregnancy is aborted, to test the hypothesis that better-educated women are more likely to abort their pregnancies than uneducated women. (This is one of the effects of "modernization" in many populations.)

In logistic regression no particular distinction is made between covariates and other explanatory variables. However, SPSS cannot ordinarily distinguish true interval variables from categorical or nominal ones, so categorical variables must be explicitly identified in the LOGISTIC REGRESSION procedure. For the abortion model we wish to analyse, the dependent variable ABORTION is dichotomous, the variables AGE and PREG are interval variables, and EDUCAT is ordinal. Ordinal variables should be treated as categorical in a logistic regression.

We therefore carry out a logistic regression to test the hypothesis that abortion prevalence is higher among more educated women, controlling for the woman's age and her number of pregnancies.

When there is a categorical independent variable, a CONTRAST statement must be specified. By default SPSS treats the last category of a categorical variable as the base (reference) category; in the version given here we use the first category instead. If there are two or more categorical variables, a separate CONTRAST statement is required for each of them.

The result is quite satisfactory, because there is a large reduction in LR χ2, from 5128.303 to 3983.040, a reduction of 1145.263 on 6 degrees of freedom. This is highly significant, with p < 0.0001.

- Null model (log odds(ABORTION) = constant):   LR χ2 = 5128.303
- Model including EDUCAT, PREG and AGE:         LR χ2 = 3983.040
- Reduction in LR χ2 = 1145.263 (6 degrees of freedom, p < 0.0001)

All of the variables have parameters significantly different from zero. Note that only four parameters are shown for the five categories of EDUCAT; this is because the coefficient of the first (reference) category has been set to zero.

Variables in the Equation
                          B     S.E.      Wald    df   Sig.   Exp(B)
  Step 1a  AGE         -.040    .006    42.190     1   .000     .960
           PREG        1.090    .044   619.793     1   .000    2.974
           EDUCAT                      395.710     4   .000
           EDUCAT(1)    .850    .118    51.729     1   .000    2.339
           EDUCAT(2)   1.760    .124   200.509     1   .000    5.814
           EDUCAT(3)   2.623    .154   291.916     1   .000   13.778
           EDUCAT(4)   2.914    .237   151.386     1   .000   18.423
           Constant   -3.348    .237   199.051     1   .000     .035
  a. Variable(s) entered on step 1: AGE, PREG, EDUCAT.

We can set out the model as a base-category table, with the reference category of EDUCAT (illiterate) given a coefficient of zero:

  Constant                            -3.348
  AGE (per year of age)               -0.040
  PREG (per pregnancy)                 1.090
  EDUCAT   Illiterate (reference)      0
           Primary school              0.850
           Junior middle school        1.760
           Senior middle school        2.623
           College and above           2.914

For any combination of AGE, PREG and EDUCAT it is now possible to calculate the estimated proportion having an abortion, using the results of the table above. For example, for a woman aged 30 with 3 pregnancies in the lowest education category (illiterate), the estimated log odds are -3.348 - 0.040*30 + 1.090*3 = -1.278, giving an estimated proportion of exp(-1.278)/(1+exp(-1.278)) = 0.22. When the calculation is done for all possible combinations of pregnancy number and level of education (at age 30), a full table of estimated proportions is obtained, as sketched below.
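A minimal Python sketch of that calculation, using the rounded coefficients from the table above (the education labels follow the dummy-variable names used later in the lecture; because the coefficients are rounded, the resulting proportions are approximate):

    import math

    # Rounded coefficients from the fitted multivariate model
    CONSTANT, B_AGE, B_PREG = -3.348, -0.040, 1.090
    B_EDUC = {
        "Illiterate":           0.0,     # reference category
        "Primary school":       0.850,
        "Junior middle school": 1.760,
        "Senior middle school": 2.623,
        "College and above":    2.914,
    }

    def predicted_proportion(age, preg, education):
        """Estimated proportion of pregnancies aborted for one covariate pattern."""
        log_odds = CONSTANT + B_AGE * age + B_PREG * preg + B_EDUC[education]
        return math.exp(log_odds) / (1 + math.exp(log_odds))

    # Estimated proportions at age 30, by education and pregnancy numbers 1 to 6
    for educ in B_EDUC:
        cells = "  ".join(f"{predicted_proportion(30, p, educ):.2f}" for p in range(1, 7))
        print(f"{educ:22s}{cells}")

These modelled proportions are the ones compared with the observed proportions in the cross-tabulation that follows.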
These results look quite reasonable, but how closely do they accord with the actual proportions? If we cross-tabulate ABORTION by EDUCAT by PREG and extract the corresponding proportions for women aged 30, we obtain the following.

number of pregnancies * whether abortion * education categories  Crosstabulation
Percentage of pregnancies aborted (% within number of pregnancies), women aged 30:

  Education category          1st      2nd      3rd      4th     All pregnancies
  Illiterate                  0.0     21.4     40.0      (a)          28.6
  Primary school              0.0     24.0     35.7     57.1          26.2
  Junior middle school        4.8     50.0     50.0      (a)          39.0
  Senior middle school        (a)      (a)      (a)                   50.0
  College and above                    (a)                             (a)

  (a) Single-outcome cells (0 or 100 per cent aborted) based on very few women; the fifth- and sixth-pregnancy cells for primary-school women are of the same kind.

The actual proportions show much more variation than the modelled proportions, but this can generally be ascribed to the small numbers in some cells, particularly at higher-order pregnancies. It is quite possible that the model is a better representation of the underlying probabilities than a tabulation of the observed proportions can give, because observed proportions are subject to chance variation.

Odds Ratios

One reason for calculating the base-category model is that it contains very important information: the estimates known as odds ratios for each explanatory variable. An odds ratio is obtained by taking the exponential of each parameter of a categorical variable. For example, the odds ratio for the second category of EDUCAT in this example is exp(0.84993) = 2.33948.

This is, strictly, the ratio of the odds of having an abortion for women with primary-school education to the odds of having an abortion for women with no education (the base category). Loose interpretation of this ratio is rampant in the literature, usually taking a form such as "the probability of having an abortion for women with primary-school education is 2.3 times the probability for women with no education". This is a very loose statement and one to be avoided, although it is roughly accurate when the dependent variable describes a very rare event, for which odds approximate proportions.

In a logistic regression model, the odds ratios for one explanatory variable are constant over the categories of any other explanatory variable, whereas this is explicitly not true for ratios of proportions or probabilities. The loose statement above should have been expressed as "the odds of having an abortion for women with primary-school education are 2.3 times the odds for women with no education". If it feels less meaningful to discuss odds rather than probabilities, that is unfortunate but necessary in the interests of accuracy.
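The difference between the two phrasings can be verified numerically from the fitted model. A small self-contained Python sketch, again using the rounded coefficients and two arbitrary illustrative covariate patterns:

    import math

    CONSTANT, B_AGE, B_PREG = -3.348, -0.040, 1.090   # rounded coefficients from above
    B_ILLITERATE, B_PRIMARY = 0.0, 0.850              # illiterate is the reference

    def prob(age, preg, b_educ):
        log_odds = CONSTANT + B_AGE * age + B_PREG * preg + b_educ
        return math.exp(log_odds) / (1 + math.exp(log_odds))

    def odds(p):
        return p / (1 - p)

    for age, preg in [(30, 3), (25, 2)]:
        p_ill, p_prim = prob(age, preg, B_ILLITERATE), prob(age, preg, B_PRIMARY)
        print(f"age {age}, pregnancies {preg}: "
              f"odds ratio = {odds(p_prim) / odds(p_ill):.2f}, "
              f"probability ratio = {p_prim / p_ill:.2f}")
    # The odds ratio is exp(0.850) = 2.34 at every covariate pattern, whereas
    # the ratio of the two probabilities changes with age and pregnancy number.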
The odds ratios for the categorical explanatory variable EDUCAT in the abortion example are:

  Illiterate (reference)         1.00
  Primary school                 2.34
  Junior middle school           5.81
  Senior middle school          13.78
  College and above             18.42

This summary of odds ratios illustrates the strength of the association between a woman's education level and the likelihood that a pregnancy is aborted. It is an entirely expected relationship: an extremely strong positive association between abortion and level of education. The higher the respondent's level of education, the more likely it is that she will abort her pregnancy, at any age and at any pregnancy number.

This both confirms the hypothesis we set out to investigate and perhaps adds the further information that abortion is likely to become more prevalent as educational attainment rises in the younger generations.

Using Dummy Variables

Dummy variables can be created for a categorical variable, in which case the CONTRAST sub-command is omitted. I recommend that you explicitly create dummy variables for any categorical independent variable. Compare the following two logistic regressions, which yield exactly the same results.

Variables in the Equation (EDUCAT declared categorical, first category as reference)
                          B     S.E.      Wald    df   Sig.   Exp(B)
  Step 1a  AGE         -.040    .006    42.170     1   .000     .961
           PREG        1.090    .044   619.713     1   .000    2.974
           EDUCAT                      395.549     4   .000
           EDUCAT(1)    .850    .118    51.745     1   .000    2.340
           EDUCAT(2)   1.760    .124   200.447     1   .000    5.814
           EDUCAT(3)   2.622    .154   291.744     1   .000   13.765
           EDUCAT(4)   2.914    .237   151.447     1   .000   18.432
           Constant   -3.331    .239   193.894     1   .000     .036
  a. Variable(s) entered on step 1: AGE, PREG, EDUCAT.

Variables in the Equation (education entered as dummy variables)
                          B     S.E.      Wald    df   Sig.   Exp(B)
  Step 1a  AGE         -.040    .006    42.170     1   .000     .961
           PREG        1.090    .044   619.713     1   .000    2.974
           PRIMARY      .850    .118    51.745     1   .000    2.340
           JUNIOR      1.760    .124   200.447     1   .000    5.814
           SENIOR      2.622    .154   291.744     1   .000   13.765
           COLLEGE     2.914    .237   151.447     1   .000   18.432
           Constant   -3.331    .239   193.894     1   .000     .036
  a. Variable(s) entered on step 1: AGE, PREG, PRIMARY, JUNIOR, SENIOR, COLLEGE.

Using Stata

One advantage of using Stata for logistic regression is that it can report standardized regression coefficients (recall Beta in linear regression). The Stata command is logit, followed by the dependent variable and then the X variables:

    logit abortion age preg primary junior senior college

[Stata output for the logit command]

The listcoef command gives the odds ratios along with the logit coefficients. To obtain standardized logit coefficients, invoke listcoef followed by std after the comma:

    listcoef, std
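For completeness, the dummy-variable model can also be sketched in Python. This is an illustration rather than the lecture's procedure: the file and variable names are hypothetical (chosen to match the Stata command above), and the final line computes x-standardized coefficients, each slope multiplied by the standard deviation of its regressor, which is what listcoef, std reports as bStdX:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Hypothetical data frame with the same variable names as the Stata command
    df = pd.read_csv("abortion_1997.csv")
    predictors = ["age", "preg", "primary", "junior", "senior", "college"]

    X = sm.add_constant(df[predictors].astype(float))
    result = sm.Logit(df["abortion"], X).fit()

    print(np.exp(result.params))            # odds ratios, the Exp(B) column
    b = result.params.drop("const")         # slopes only, without the intercept
    print(b * df[predictors].std())         # x-standardised coefficients (bStdX)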