Lecture 10
Logistic Regression
Why use logistic regression?
In linear regression, Y = b0 + b1*X1 + b2*X2 + ... + bn*Xn + e, the dependent variable Y is continuous and unbounded, and we want to identify a set of explanatory (independent, or X) variables that will assist us in predicting its mean value while explaining its observed variability.
In many situations in demography and the social sciences, however, we have a dependent variable, Y, that is dichotomous rather than continuous, e.g., whether or not a woman has had a second birth, whether the second birth is a male or a female baby, whether or not a woman uses any contraceptive method, whether or not a person has migrated in the last 5 years, whether or not People's University staff use public transportation to come to work, whether or not an undergraduate student who completed study last year in the Demography Department at People's University was awarded a BA degree, etc.
In all these situations, the outcome of Y only assumes two forms; usually, the value 1 represents yes, or a "success," and the value 0, no, or a "failure." The mean of this dichotomous (also referred to as binary) dependent variable, designated p, is the proportion of times that it takes the value 1.
Example: Obtaining Abortion
• The data in the following tables were derived from the 1997 survey, which contains information on abortion use and related characteristics. The tables and the chart show the incidence of abortion according to the number of pregnancies. It can be seen that the proportion of women obtaining an abortion increases rapidly from a very low proportion at the first pregnancy to a large proportion among women having 5 or more pregnancies.
PREG5 * whether abortion Crosstabulation (count and % within PREG5)

PREG5    no              yes             Total
1        880  (99.0%)    9    (1.0%)     889  (100.0%)
2        950  (69.4%)    418  (30.6%)    1368 (100.0%)
3        499  (52.0%)    461  (48.0%)    960  (100.0%)
4        230  (45.9%)    271  (54.1%)    501  (100.0%)
5        102  (34.6%)    193  (65.4%)    295  (100.0%)
Total    2661 (66.3%)    1352 (33.7%)    4013 (100.0%)
[Chart: mean of 'whether abortion' (proportion aborted) by PREG5, rising from near 0 at the first pregnancy to about 0.65 at 5 or more pregnancies]
• To make a statistical model of this relationship, we could feasibly fit a linear regression line to the cases, with pregnancy number as the explanatory variable and a dichotomous dependent variable (0 = not having abortion, 1 = having abortion). There are two main problems with this approach.
• The first problem is that it is possible, and indeed happens in this case, that the fitted regression line will cross below zero and/or above one right in the range where we do not want that to occur. The fitted regression line can be shown to have the form
p = -0.01314 + 0.13798*PREG
where p is the proportion having an abortion and PREG is the number of pregnancies.
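• As a quick illustrative sketch (simply plugging the rounded coefficients of the fitted line into Stata's display command), the straight line already gives impossible values:

* evaluate the fitted straight line p = -0.01314 + 0.13798*PREG
display -0.01314 + 0.13798*0    // at PREG = 0 the intercept is negative
display -0.01314 + 0.13798*8    // at PREG = 8 the "proportion" is about 1.09, above 1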
The estimated probability can be greater than 1 or less than 0
• This line rises above 1 at around 7 pregnancies, meaning that more than 100 per cent of all pregnancies at pregnancy 7 or over are predicted to be aborted. And there are many other cases where the predicted proportions are negative. Apart from the fact that such results are impossible, we might nevertheless be inclined to accept them in the limited range where they are valid.
The linearity assumption is seriously violated
• This would be very dangerous to do, because of the second problem, which is that the assumptions of linear regression are violated badly in this case. This can be seen clearly in the plots obtained with the REGRESSION sub-commands, particularly the final scatterplot of the standardized residuals against the predicted values:
[Residual plot: standardized residuals plotted against standardized predicted values]
• Recall that this scatter diagram should show no pattern at all, as if a handful of stones were dropped at the centre of the diagram. On the contrary, in this case it could not show a pattern more clearly! This pattern of two lines across the diagram is caused by the fact that the dependent variable can only take two values (1 for having abortion and 0 otherwise), and the distribution of the residuals consequently has a binomial distribution, not a normal distribution.
• Because the linearity assumption is broken, the usual hypothesis-testing procedures are invalid.
• R-square tends to be very low. The fit of the line is poor because the response can only be 0 or 1, so the values do not cluster around the line.
Logistic Regression
• To get around both problems, we will instead fit a curve of a particular form to the data. This type of curve, known as a logistic curve, has the following general form:
p = exp(b0 + b1*X) / (1 + exp(b0 + b1*X))
where p is the proportion at each value of the explanatory variable X, b0 and b1 are numerical constants to be estimated, and exp is the exponential function.
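• As a rough numerical sketch (using made-up values b0 = -2.5 and b1 = 0.7, chosen only for illustration), the curve can be evaluated directly, and however extreme X becomes the result stays between 0 and 1:

* evaluate p = exp(b0 + b1*X)/(1 + exp(b0 + b1*X)) for illustrative b0 = -2.5, b1 = 0.7
display exp(-2.5 + 0.7*1)/(1 + exp(-2.5 + 0.7*1))     // X = 1:  about 0.14
display exp(-2.5 + 0.7*5)/(1 + exp(-2.5 + 0.7*5))     // X = 5:  about 0.73
display exp(-2.5 + 0.7*20)/(1 + exp(-2.5 + 0.7*20))   // X = 20: about 0.99999, still below 1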
• This curve has the following form when the parameter b1 is positive, or its mirror image when the parameter is negative.
[Figure: logistic curve rising from 0 towards 1 as X increases]
Logistic Function
• The logistic curve has the property that it never takes values less than zero or greater than one. The way to fit it is to transform the definition of the logistic curve given above into a linear form:
loge(p/(1-p)) = b0 + b1*X
• The function on the left-hand side of this equation has various names, of which the most common are the 'logistic function' and the 'log-odds function'. The log equation has the general form of a linear model.
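• A minimal numerical check (again with the illustrative values b0 = -2.5 and b1 = 0.7 used above): applying the log-odds transformation to points on the logistic curve recovers the straight line b0 + b1*X.

* log odds of the curve's values at X = 1 and X = 5
display ln(0.1419/(1 - 0.1419))     // about -1.80, which equals -2.5 + 0.7*1
display ln(0.7311/(1 - 0.7311))     // about  1.00, which equals -2.5 + 0.7*5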
Probability and Odds
• A probability is the likelihood that a given event will occur. It is the frequency of a given outcome divided by the total number of all possible outcomes.
• A definition of "odds" is the likelihood of a given event occurring, compared to the likelihood of the same event not occurring.
Probability:
p = p(Y=1) = exp(b0 + b1*X1 + ... + bn*Xn) / (1 + exp(b0 + b1*X1 + ... + bn*Xn))
Odds:
p/(1-p) = exp(b0 + b1*X1 + ... + bn*Xn)
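• A small worked example may help fix the two quantities (purely illustrative numbers): if the probability of an event is p = 0.75, the odds are 0.75/0.25 = 3, and the log odds are ln(3), about 1.10.

* probability p = 0.75: odds and log odds
display 0.75/(1 - 0.75)        // odds = 3
display ln(0.75/(1 - 0.75))    // log odds (logit) = 1.0986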
Taking the natural logarithm of each side of the odds equation yields the following:
ln[p/(1-p)] = b0 + b1*X1 + ... + bn*Xn
The above equation has the logit on the left-hand side.
The logit is a linear function of the X variables
The probability is a non-linear function of the X variables
The Logistic Regression Model
ln[p/(1-p)] = b0 + b1*X1 + b2*X2 + ... + bn*Xn
where:
• ln is the natural logarithm, loge, where e = 2.71828…
• p is the probability that the event Y occurs, p(Y=1)
• p/(1-p) is the "odds"
• ln[p/(1-p)] is the log odds, or "logit"
• all other components are the same as before
Using Logistic Regression
• We now proceed to a logistic regression with abortion as the dependent variable and pregnancy number as the explanatory variable.
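• The results below are from the SPSS LOGISTIC REGRESSION procedure; for reference only, the same bivariate model could be fitted in Stata (introduced at the end of this lecture) with a single command, assuming the variables are named abortion and preg as they are later:

logit abortion preg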
SPSS Results
• The most fundamental part is the estimation equation derived from the coefficients in the last table of the display:

Variables in the Equation

Step 1(a)    B        S.E.    Wald       df    Sig.    Exp(B)
  PREG       .691     .031    484.752    1     .000    1.996
  Constant   -2.506   .092    741.147    1     .000    .082

a. Variable(s) entered on step 1: PREG.
ln(p/(1-p)) = -2.506 + 0.691*PREG
• For any number of pregnancies (PREG) we can calculate the 'log odds' directly from this equation (the values shown use the coefficients before rounding), for example:
(pregnancy 2)  -2.506 + 0.691*2 = -1.1229
(pregnancy 4)  -2.506 + 0.691*4 = 0.2597
(pregnancy 6)  -2.506 + 0.691*6 = 1.6424
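• These are straightforward to reproduce as a sketch; the tiny discrepancies against the figures above come from using the rounded coefficients:

* log odds from the fitted equation ln(p/(1-p)) = -2.506 + 0.691*PREG
display -2.506 + 0.691*2    // pregnancy 2: about -1.12
display -2.506 + 0.691*4    // pregnancy 4: about  0.26
display -2.506 + 0.691*6    // pregnancy 6: about  1.64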
• Because these log odds values have no immediate meaning to most people, it is sometimes helpful to remember that they correspond directly and uniquely with proportions. As a rough guide, the following table shows correspondences between log odds, odds (the exponential of log odds) and proportions, for log odds in the range -3 to +3.
[Table: correspondence between log odds, odds and proportions, for log odds in the range -3 to +3]
• Notice that odds and proportions are identical to two decimal places for small values (although to more decimal places the odds are always slightly higher).
• In the case of our model, we can estimate the corresponding proportions as follows from the log odds we have already calculated:
(pregnancy 2)  exp(-1.1229)/(1+exp(-1.1229)) = 0.25
(pregnancy 4)  exp(0.2597)/(1+exp(0.2597)) = 0.56
(pregnancy 6)  exp(1.6424)/(1+exp(1.6424)) = 0.84
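• As a sketch, Stata's built-in invlogit() function performs exactly this conversion from log odds to proportions:

* convert the log odds calculated above into estimated proportions
display invlogit(-1.1229)    // pregnancy 2: about 0.25
display invlogit(0.2597)     // pregnancy 4: about 0.56
display invlogit(1.6424)     // pregnancy 6: about 0.84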
Goodness of fit
• The output gives us quite a lot of other information, of which the most important is the information about the likelihood ratio χ2 (called in the output '-2 log likelihood' for some reason). We will call this parameter LR χ2, noting that it is exactly the same parameter given by the Chi-square statistic in the CROSSTABS procedure. It provides important information about the goodness of fit of the logistic regression model. We find it in two places in the output:
• Before the variable PREG is entered into the model
• After PREG is entered into the model
Omnibus Tests of Model Coefficients

Step 1    Chi-square    df    Sig.
  Step    615.480       1     .000
  Block   615.480       1     .000
  Model   615.480       1     .000

Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       4512.823             .142                    .197
• The initial value of LR χ2 (5128.303) is the value when only a constant term is in the model, that is, when b1 is equal to zero. After the variable PREG is included in the model, it reduces to 4512.823, a decrease of 615.480 on one degree of freedom. This decrease is interpreted as a χ2 statistic with one degree of freedom, and it is a highly significant value.
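• A quick sketch of that significance check: the upper tail of the χ2 distribution with 1 degree of freedom beyond 615.480 is vanishingly small.

* p-value for the decrease in -2 log likelihood (615.480 on 1 degree of freedom)
display chi2tail(1, 615.480)    // effectively 0, i.e. highly significant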
• Notice that the three rows of the omnibus tests table, headed 'Model', 'Block' and 'Step', all have the same content in this example, since the single explanatory variable was entered in a single step. Generally, it is the 'Step' information that is used to examine whether a change to the model at the previous step is worthwhile. It is in this case.
• R-square, similar to that in linear regression, is reported.
• The output also contains a classification table, showing which cases are classified correctly and incorrectly by their predicted values based on pregnancy number. Note that the model does not do extremely well, getting only 70 per cent correct overall, including slightly over one-third of those who actually had an abortion (a quick arithmetic check follows the table). This is to be expected if we try to predict abortion on the basis of pregnancy number and nothing else.
Classification Table(a)

                                    Predicted
                                    whether abortion     Percentage
Observed                            no        yes        Correct
Step 1  whether abortion    no      2329      332        87.5
                            yes     888       464        34.3
        Overall Percentage                               69.6

a. The cut value is .500
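• As a check on the percentages (a sketch using the cell counts above):

* percentage correct among non-aborters, aborters, and overall
display 2329/(2329 + 332)*100    // 87.5 per cent of 'no' cases classified correctly
display 464/(888 + 464)*100      // 34.3 per cent of 'yes' cases classified correctly
display (2329 + 464)/4013*100    // 69.6 per cent correct overall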
Multivariate Logistic Regression
• Suppose that we want to find out the relationship between educational attainment and the likelihood that a pregnancy is aborted, to test the hypothesis that better-educated women are more likely to abort their pregnancies than uneducated women. (This is one of the effects of 'modernization' in many populations.)
• In logistic regression no particular distinction is made between covariates and other explanatory variables. For SPSS, which cannot ordinarily distinguish between true interval variables and categorical or nominal ones, categorical variables must be specifically identified in the LOGISTIC REGRESSION procedure.
• For the abortion model which we wish to analyse, the dependent variable ABORTION is dichotomous, the variables AGE and PREG are interval variables, and EDUCAT is ordinal. Ordinal variables should be treated as if they were categorical in a logistic regression.
• Carry out a logistic regression to test the hypothesis that abortion prevalence is higher among more educated women, controlling for age of woman and number of pregnancies.
• When there is a categorical independent variable, a CONTRAST statement must be specified. SPSS by default regards the last category of the categorical variable as the base category. In the version given here, we use the first category. If there are two or more categorical variables, a separate CONTRAST statement is required for each categorical variable to achieve the desired result.
• This is quite a satisfactory result, because there is a large reduction in LR χ2, from 5128.303 to 3983.040, a reduction of 1145.263 on 6 degrees of freedom. This is a highly significant result, with p < 0.0001.
• Null model (logodds(ABORTION) = constant): LR χ2 = 5128.303
• Model including EDUCAT, PREG and AGE: LR χ2 = 3983.040
• Reduction: LR χ2 = 1145.263 (6 degrees of freedom, p < 0.0001)
• All of the variables show parameters significantly different from zero. Note that there are only four parameters shown for the five categories of EDUCAT. This is because we have set the coefficient of the first category (the reference category) to be zero.
Variables in the Equation

Step 1(a)      B        S.E.    Wald       df    Sig.    Exp(B)
  AGE          -.040    .006    42.190     1     .000    .960
  PREG         1.090    .044    619.793    1     .000    2.974
  EDUCAT                        395.710    4     .000
  EDUCAT(1)    .850     .118    51.729     1     .000    2.339
  EDUCAT(2)    1.760    .124    200.509    1     .000    5.814
  EDUCAT(3)    2.623    .154    291.916    1     .000    13.778
  EDUCAT(4)    2.914    .237    151.386    1     .000    18.423
  Constant     -3.348   .237    199.051    1     .000    .035

a. Variable(s) entered on step 1: AGE, PREG, EDUCAT.
• We can set out the model as a base category table
[Table: base-category form of the model, with the first category of EDUCAT as the reference]
• For any combination of AGE, PREG and EDUCAT, it is now possible to calculate the estimated proportion having an abortion, using the results of the above table. For example, for women aged 30 having 3 pregnancies with the lowest category of education (illiterate) the estimates are:
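• The original slide's worked figures are not reproduced here, but a sketch of the calculation from the coefficients in the 'Variables in the Equation' table (illiterate is the reference category, so its coefficient is zero) is:

* log odds and estimated proportion for AGE = 30, PREG = 3, illiterate (reference category)
display -3.348 - 0.040*30 + 1.090*3            // log odds: about -1.28
display invlogit(-3.348 - 0.040*30 + 1.090*3)  // estimated proportion: about 0.22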
When the calculation is done for all possible combinations of pregnancy number and level of education (at age 30) the results are:
[Table: estimated proportions having an abortion by pregnancy number and level of education, at age 30]
• These results look quite reasonable, but how closely do they accord with the actual proportions? If we do a cross-tabulation of ABORTION by EDUCAT by PREG and extract the corresponding proportions (for age 30) the results are:
[Table: number of pregnancies * whether abortion * education categories crosstabulation, % within number of pregnancies, for women aged 30]
• The actual proportions show much more variation than the modelled proportions, but this can generally be ascribed to small numbers in some cells, particularly at higher-order pregnancies. It is indeed possible that the model is a better representation of the underlying probabilities than tabulation of the observed proportions can give, because observed proportions are subject to chance variation.
Odds Ratios
• One reason for calculating the base category model is that it contains very important information about estimates known as odds ratios for each explanatory variable. These odds ratios are calculated by taking the exponential of each parameter of a categorical variable. For example, the odds ratio for the second category of EDUCAT in the example is
exp(0.84993) = 2.33948
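• This is the same quantity reported in the Exp(B) column of the SPSS output. As a sketch, the exponentials of the remaining EDUCAT coefficients give the other odds ratios relative to the illiterate reference category:

* odds ratios for EDUCAT categories, relative to illiterate (coefficient 0)
display exp(0.850)    // primary school:        about 2.34
display exp(1.760)    // junior middle school:  about 5.81
display exp(2.623)    // senior middle school:  about 13.78
display exp(2.914)    // college and above:     about 18.42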
• This is strictly the ratio of the odds of having an abortion for women with primary school education to the odds of having an abortion for women without any education (the base category). Loose interpretation of this ratio is rampant in the literature, usually taking a form such as 'the probability of having an abortion for women with primary school education is 2.3 times the probability of having an abortion for women without any education'. This is a very loose type of statement and one to be avoided, although it is roughly accurate when the dependent variable describes a very rare event, for which odds approximate proportions.
• In a logistic regression model, odds ratios for one explanatory variable are constant over the categories of any other explanatory variable, while this is explicitly not true for ratios of proportions or probabilities. The loose statement given above should have been expressed as 'the odds of having an abortion for women with primary school education are 2.3 times the odds of having an abortion for women without any education'. If it feels less meaningful to be discussing odds rather than probabilities, this is unfortunate but necessary in the interests of accuracy.
The odds ratios for the categorical explanatory variable EDUCAT in the abortion example are:

Illiterate (reference)     1.000
Primary school             2.339
Junior middle school       5.814
Senior middle school       13.778
College and above          18.423
• This summary of odds ratios illustrates the strength of the association between a woman's education level and the likelihood that a pregnancy is aborted. This is an entirely expected relationship: an extremely strong positive relationship between abortion and level of education. The higher the level of education of the respondent, the more likely it is that she will abort her pregnancy, at any age and at any pregnancy number.
• This both confirms the hypothesis which we set out to investigate, and perhaps adds the additional information that abortion practice is likely to become more prevalent as educational attainment rises in the younger generation.
Using Dummy Variables
• Dummy variables can be created for a categorical variable, and the CONTRAST sub-command is omitted in this case. I recommend that you explicitly use dummy variables for any categorical independent variable (a sketch of how the dummies might be created follows below).
• Compare the following two logistic regressions, which yield exactly the same results.
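• As an illustrative sketch (not the original SPSS syntax), this is how the four dummy variables used below could be created in Stata, assuming an EDUCAT variable coded 1 = illiterate through 5 = college and above; the names primary, junior, senior and college match those in the second output table:

* create dummy variables for EDUCAT, leaving illiterate (code 1) as the base category
generate primary = (educat == 2) if !missing(educat)
generate junior  = (educat == 3) if !missing(educat)
generate senior  = (educat == 4) if !missing(educat)
generate college = (educat == 5) if !missing(educat)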
Variables in the Equation

Step 1(a)      B        S.E.    Wald       df    Sig.    Exp(B)
  AGE          -.040    .006    42.170     1     .000    .961
  PREG         1.090    .044    619.713    1     .000    2.974
  EDUCAT                        395.549    4     .000
  EDUCAT(1)    .850     .118    51.745     1     .000    2.340
  EDUCAT(2)    1.760    .124    200.447    1     .000    5.814
  EDUCAT(3)    2.622    .154    291.744    1     .000    13.765
  EDUCAT(4)    2.914    .237    151.447    1     .000    18.432
  Constant     -3.331   .239    193.894    1     .000    .036

a. Variable(s) entered on step 1: AGE, PREG, EDUCAT.
Variables in the Equation

Step 1(a)      B        S.E.    Wald       df    Sig.    Exp(B)
  AGE          -.040    .006    42.170     1     .000    .961
  PREG         1.090    .044    619.713    1     .000    2.974
  PRIMARY      .850     .118    51.745     1     .000    2.340
  JUNIOR       1.760    .124    200.447    1     .000    5.814
  SENIOR       2.622    .154    291.744    1     .000    13.765
  COLLEGE      2.914    .237    151.447    1     .000    18.432
  Constant     -3.331   .239    193.894    1     .000    .036

a. Variable(s) entered on step 1: AGE, PREG, PRIMARY, JUNIOR, SENIOR, COLLEGE.
Using Stata
• One of the advantages of using Stata to perform the logistic regression is that it provides standardized regression coefficients (recall Beta in linear regression). The Stata command is logit, followed by the dependent variable and then the X-variables. Here is the syntax for the logistic regression using Stata:

logit abortion age preg primary junior senior college
• The listcoef command will give the odds ratios, along with the logit coefficients.
• Calculate standardized logit coefficients by invoking the listcoef command, followed by std after the comma.
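• That is (assuming the listcoef command from the SPost add-on package is installed and the model above has just been fitted):

listcoef           // odds ratios alongside the logit coefficients
listcoef, std      // standardized logit coefficients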