Lecture 8: Simple Linear Regression

Simple Linear Regression (Bivariate Regression)
We have already looked at measuring relationships between two interval variables using correlation. Now we continue the bivariate analysis of the two variables using regression analysis. The advantage of regression over correlation is that we can predict values of one variable based on another variable. So, rather than simply seeing whether the variables are related, we can interpret the effect of one on the other.
Simple Linear Regression
Like correlation, there are two major assumptions:
• The relationship should be linear; and
• The level of data must be continuous.
The regression equation
The purpose of simple linear regression is to fit a line to the two variables. This line is called the line of best fit, or the regression line. When we draw a scatterplot of two variables, it is possible to fit a line which best represents the data.
The regression equation
A regression equation is used to define the relationship between two variables. It takes the form:

ŷ = a + bx    or    y = a + bx + e
The regression equation
They are essentially the same, except that the second includes an error term (e) at the end. This error term indicates that what we have is in fact a model, and hence it won't fit the data perfectly.
Scatterplot and regression line

[Figure: scatterplot with fitted regression line ŷ = 20 + 10x. The x-axis runs from 0 to 5 and the y-axis from 0 to 70. A change of 1 in X produces a change of 10 in Y, and the intercept is 20.]
How do we fit a line to data?
In order to fit a line of best fit we use a method called the Method of Least Squares. This method allows us to determine which line, out of all the lines that could be drawn, has the least amount of difference between the actual values (the data points) and the predicted line.
In the figure above, three data points fall on the line, while the remaining six are slightly above or below it. The differences between these points and the line are called residuals. Some of these differences will be positive (above the line), while others will be negative (below the line). If we simply add up all these differences, the positive and negative values will partly cancel each other out, which has the effect of overestimating how well the line represents the data. Instead, if we square the differences and then add them up, we can work out which line has the smallest sum of squares (that is, the one with the least error).
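This squaring idea is easy to demonstrate in code. The sketch below uses made-up data points (not the ones in the figure) and checks that the least-squares line has a smaller sum of squared differences than a nearby alternative line:

```python
import numpy as np

# Hypothetical data points, made up for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

def sum_of_squares(a, b):
    """Sum of squared residuals for the candidate line yhat = a + b*x."""
    residuals = y - (a + b * x)
    return np.sum(residuals ** 2)

# np.polyfit returns the least-squares coefficients [slope, intercept]
b_ls, a_ls = np.polyfit(x, y, 1)

# The least-squares line minimises the sum of squares; any other line is worse
print(sum_of_squares(a_ls, b_ls))        # smallest possible value
print(sum_of_squares(a_ls, b_ls + 0.5))  # a tilted line does worse
```

The point of the comparison is exactly the criterion described above: of all lines that could be drawn, the least-squares line is the one with the smallest sum of squared residuals.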
Example 1

Prediction
We could now draw a line of best fit through the observed data points.

[Figure: scatterplot of age (x-axis, roughly 15 to 50) against number of children (y-axis, 0.0 to 3.5) with a fitted regression line.]
Inference for Regression
• When a scatterplot shows a linear relationship between a quantitative explanatory variable x and a quantitative response variable y, we can use the least-squares line fitted to the data to predict y for a given value of x. Now we want to do tests and confidence intervals in this setting.
Example 2: Crying and IQ
• Crying easily as an infant may be a sign of higher IQ. We have crying intensity and IQ (intelligence quotient) data on 38 infants.
• Plot and interpret. As always, we first examine the data. Figure 3 is a scatterplot of the crying data. Plot the explanatory variable (crying intensity at birth) horizontally and the response variable (IQ at age 3) vertically. Look for the form, direction, and strength of the relationship as well as for outliers or other deviations. There is a moderate positive linear relationship, with no extreme outliers or potentially influential observations.
• Numerical summary. Because the scatterplot shows a roughly linear (straight-line) pattern, the correlation describes the direction and strength of the relationship. The correlation between crying and IQ is r = 0.455.
• Mathematical model. We are interested in predicting the response from information about the explanatory variable. So we find the least-squares regression line for predicting IQ from crying.
This line lies as close as possible to the points (in the sense of least squares) in the vertical (y) direction. The equation of the least-squares regression line is

ŷ = a + bx = 91.27 + 1.493x

Because r² = 0.207, about 21% of the variation in IQ scores is explained by crying intensity. See the SPSS output.
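The link between the correlation and the proportion of variation explained can be checked directly; this small sketch just squares the r = 0.455 reported above:

```python
# r measures the direction and strength of the linear relationship;
# r squared is the fraction of the variation in y explained by the line.
r = 0.455                    # correlation between crying and IQ (from the text)
r_squared = r ** 2
print(round(r_squared, 3))   # 0.207, i.e. about 21% of IQ variation explained
```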
The regression model
• The slope b and intercept a of the least-squares line are statistics. That is, we calculated them from the sample data. These statistics would take somewhat different values if we repeated the study with different infants. To do formal inference, we think of a and b as estimates of unknown parameters.
Assumptions for regression inference

We have n observations on an explanatory variable x and a response variable y. Our goal is to study or predict the behavior of y for given values of x.
• For any fixed value of x, the response y varies according to a normal distribution. Repeated responses y are independent of each other.
• The mean response has a straight-line relationship with x:

μ_y = α + βx

The slope β and intercept α are unknown parameters.
• The standard deviation of y (call it σ) is the same for all values of x. The value of σ is unknown.
The heart of this model is that there is an "on the average" straight-line relationship between y and x. The true regression line μ_y = α + βx says that the mean response moves along a straight line as the explanatory variable x changes. We can't observe the true regression line. The values of y that we do observe vary about their means according to a normal distribution. If we hold x fixed and take many observations on y, the normal pattern will eventually appear in a histogram.
In practice, we observe y for many different values of x, so that we see an overall linear pattern formed by points scattered about the true line. The standard deviation σ determines whether the points fall close to the true regression line (small σ) or are widely scattered (large σ).
• Figure 4 shows the regression model in picture form. The line in the figure is the true regression line. The mean of the response y moves along this line as the explanatory variable x takes different values. The normal curves show how y will vary when x is held fixed at different values. All of the curves have the same σ, so the variability of y is the same for all values of x. You should check the assumptions for inference when you do inference about regression.
Figure 4 The regression model. The line is the true regression line, which shows how the mean response μ_y changes as the explanatory variable x changes. For any fixed value of x, the observed response y varies according to a normal distribution having mean μ_y.
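The model can also be illustrated by simulation. The sketch below picks hypothetical values for α, β and σ (in practice the true values are unknown; these are made up), holds x fixed, and draws many responses y. Their average settles near μ_y = α + βx and their spread near σ, and a histogram of them would show the normal pattern:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter values, made up for the simulation
alpha, beta, sigma = 91.0, 1.5, 17.5

# Hold x fixed and take many observations on y
x_fixed = 20.0
y_obs = alpha + beta * x_fixed + rng.normal(0.0, sigma, size=100_000)

# The sample mean of y approaches mu_y = alpha + beta * x = 121,
# and the sample spread approaches sigma = 17.5
print(y_obs.mean(), y_obs.std())
```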
Inference about the Model
• The first step in inference is to estimate the unknown parameters α, β, and σ. When the regression model describes our data and we calculate the least-squares line ŷ = a + bx, the slope b of the least-squares line is an unbiased estimator of the true slope β, and the intercept a of the least-squares line is an unbiased estimator of the true intercept α.
• The data in Figure 3 fit the regression model of scatter about an invisible true regression line reasonably well. The least-squares line is ŷ = 91.27 + 1.493x. The slope is particularly important. A slope is a rate of change. The true slope β says how much higher average IQ is for children with one more intensity unit in their crying measurement. Because b = 1.493 estimates the unknown β, we estimate that on the average IQ is about 1.5 points higher for each added crying intensity.
• We need the intercept a = 91.27 to draw the line, but it has no statistical meaning in this example. No child had a crying intensity below 9, so we have no data near x = 0.
• The remaining parameter of the model is the standard deviation σ, which describes the variability of the response y about the true regression line. The least-squares line estimates the true regression line. So the residuals estimate how much y varies about the true line.
• Recall that the residuals are the vertical deviations of the data points from the least-squares line:

residual = observed y − predicted y = y − ŷ

There are n residuals, one for each data point. Because σ is the standard deviation of responses about the true regression line, we estimate it by a sample standard deviation of the residuals. We call this sample standard deviation a standard error to emphasize that it is estimated from data. The residuals from a least-squares line always have mean zero. That simplifies their standard error.
Standard error about the least-squares line

• The standard error about the line is

s = √( Σ residual² / (n − 2) ) = √( Σ (y − ŷ)² / (n − 2) )

Use s to estimate the unknown σ in the regression model.
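The formula translates directly into code. The observed values y and fitted values ŷ below are made up for illustration:

```python
import numpy as np

def standard_error_about_line(y, y_hat):
    """s = sqrt( sum of squared residuals / (n - 2) )."""
    residuals = np.asarray(y) - np.asarray(y_hat)
    return np.sqrt(np.sum(residuals ** 2) / (len(residuals) - 2))

# Tiny made-up example: observed y and fitted y_hat from some line
y     = [10.0, 12.0, 15.0, 19.0]
y_hat = [11.0, 13.0, 14.0, 19.0]

# Residuals are -1, -1, 1, 0; sum of squares is 3; s = sqrt(3 / 2)
print(standard_error_about_line(y, y_hat))
```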
• Because we use the standard error about the line so often in regression inference, we just call it s. Notice that s² is an average of the squared deviations of the data points from the line, so it qualifies as a variance. We average the squared deviations by dividing by n − 2, the number of data points less 2. It turns out that if we know n − 2 of the n residuals, the other two are determined. That is, n − 2 is the degrees of freedom of s. We first met the idea of degrees of freedom in the case of the ordinary sample standard deviation of n observations, which has n − 1 degrees of freedom. Now we observe two variables rather than one, and the proper degrees of freedom is n − 2 rather than n − 1.
• Calculating s is unpleasant. You must find the predicted response ŷ for each x in our data set, then the residuals, and then s. In practice we will use SPSS, which does this arithmetic instantly. Nonetheless, here is an example to help you understand the standard error s.
Example 2 (continued)
• The first infant had a crying intensity of 10 and a later IQ of 87. The predicted IQ for x = 10 is:

ŷ = 91.27 + 1.493x = 91.27 + 1.493 × 10 = 106.2

The residual for this observation is

residual = y − ŷ = 87 − 106.2 = −19.2

That is, the observed IQ for this infant lies 19.2 points below the least-squares line.
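The same arithmetic in code, using the fitted line from the text:

```python
# Fitted line from Example 2: y-hat = 91.27 + 1.493x
a, b = 91.27, 1.493

x, y = 10, 87                # first infant: crying intensity 10, observed IQ 87
y_hat = a + b * x            # 91.27 + 1.493 * 10 = 106.2
residual = y - y_hat         # 87 - 106.2 = -19.2

print(round(y_hat, 1), round(residual, 1))   # 106.2 -19.2
```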
• Repeat this calculation 37 more times, once for each subject. The 38 residuals are:

−19.20  −31.13  −22.65  −15.18  −12.18  −15.15  −16.63   −6.18
 −1.70  −22.60   −6.68   −6.17   −9.15  −23.58   −9.14    2.80
 −9.14   −1.66   −6.14  −12.60    0.34   −8.62    2.85   14.30
  9.82   10.82    0.37    8.85   10.87   19.34   10.89   −2.55
 20.85   24.35   18.94   32.89   18.47   51.32
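We can reproduce the arithmetic on these residuals directly. The sketch below sums their squares and recovers the variance and standard error reported in the text:

```python
import numpy as np

# The 38 residuals listed above
residuals = np.array([
    -19.20, -31.13, -22.65, -15.18, -12.18, -15.15, -16.63,  -6.18,
     -1.70, -22.60,  -6.68,  -6.17,  -9.15, -23.58,  -9.14,   2.80,
     -9.14,  -1.66,  -6.14, -12.60,   0.34,  -8.62,   2.85,  14.30,
      9.82,  10.82,   0.37,   8.85,  10.87,  19.34,  10.89,  -2.55,
     20.85,  24.35,  18.94,  32.89,  18.47,  51.32,
])

ss = np.sum(residuals ** 2)       # sum of squared residuals, about 11023.3
s2 = ss / (len(residuals) - 2)    # variance about the line, about 306.2
s = np.sqrt(s2)                   # standard error about the line, about 17.5
print(ss, s2, s)
```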
• The variance about the line is:

s² = Σ residual² / (n − 2)
   = [(−19.20)² + (−31.13)² + … + (51.32)²] / (38 − 2)
   = 11023.3 / 36
   = 306.20
• Finally, the standard error about the line is:

s = √306.20 = 17.50

The standard error s about the line is the key measure of the variability of the responses in regression. It is part of the standard error of all the statistics we will use for inference.
Confidence intervals for the regression slope

• The slope β of the true regression line is usually the most important parameter in a regression problem. The slope is the rate of change of the mean response as the explanatory variable increases. We often want to estimate β. The slope b of the least-squares line is an unbiased estimator of β. A confidence interval is more useful because it shows how accurate the estimate b is likely to be.
• The confidence interval for β has the familiar form

estimate ± t × SE_estimate

Because b is our estimate, the confidence interval becomes

b ± t × SE_b

Here are the details:
Confidence interval for regression slope

• A level C confidence interval for the slope β of the true regression line is

b ± t × SE_b

In this recipe, the standard error of the least-squares slope b is

SE_b = s / √( Σ (x − x̄)² )
and t is the upper (1 − C)/2 critical value from the t distribution with n − 2 degrees of freedom.
• As advertised, the standard error of b is a multiple of s. Although we give the recipe for this standard error SE_b, you should rarely have to calculate it by hand. Regression software such as SPSS gives the standard error along with b itself.
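A sketch of this recipe in Python, with SciPy standing in for the printed t table, using b = 1.493, SE_b = 0.487 and df = 36 from Example 2:

```python
from scipy import stats

# Numbers taken from Example 2 in the text
b, se_b, df = 1.493, 0.487, 36

# Upper 0.025 critical value of t(36); software gives about 2.028,
# while the notes get 2.0294 by interpolating a printed t table
t_crit = stats.t.ppf(0.975, df)

lower = b - t_crit * se_b
upper = b + t_crit * se_b
print(lower, upper)   # roughly 0.505 to 2.481, matching the SPSS output
```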
Example 2 (continued)
• To calculate a 95% confidence interval for the true slope β, we use the critical value t = 2.0294 when the degrees of freedom are n − 2 = 36, interpolating between the tabled values for df = 30 and df = 40:

df    t
30    2.042
36    2.0294 (interpolated)
40    2.021
• The standard error of b is

SE_b = s / √( Σ (x − x̄)² )
     = 17.50 / √[ (10 − 17.39)² + (12 − 17.39)² + … + (22 − 17.39)² ]
     = 17.50 / √1291.08
     = 17.50 / 35.93
     = 0.487
• The 95% confidence interval is

b ± t × SE_b = 1.493 ± 2.0294 × 0.487 = 1.493 ± 0.988 = 0.505 to 2.481

We are 95% confident that mean IQ increases by between about 0.5 and 2.5 points for each additional unit of crying intensity.
Using SPSS
Predicted y and residuals
SPSS output
Model Summary(b)

Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .455(a)  .207       .185                17.499

a. Predictors: (Constant), CRYING
b. Dependent Variable: IQ
Coefficients(a)

                 Unstandardized    Standardized                    95% Confidence Interval for B
Model            B        Std. Error   Beta      t        Sig.     Lower Bound   Upper Bound
1  (Constant)    91.268   8.934                  10.216   .000     73.149        109.388
   CRYING        1.493    .487         .455      3.065    .004     .505          2.481

a. Dependent Variable: IQ
ANOVA(b)

Model           Sum of Squares   df   Mean Square   F       Sig.
1  Regression   2877.480         1    2877.480      9.397   .004(a)
   Residual     11023.39         36   306.205
   Total        13900.87         37

a. Predictors: (Constant), CRYING
b. Dependent Variable: IQ
Testing the hypothesis of no linear relationship

• We can also test hypotheses about the slope β. The most common hypothesis is

H₀: β = 0

• A regression line with slope 0 is horizontal. That is, the mean of y does not change at all when x changes. So this H₀ says that there is no true linear relationship between x and y.
• Put another way, H₀ says that straight-line dependence on x is of no value for predicting y. Put yet another way, H₀ says that there is no correlation between x and y in the population from which we drew our data.
• The test statistic is just the standardized version of the least-squares slope b. It is another t statistic. Here are the details:
Significance tests for regression slope

• To test the hypothesis H₀: β = 0, compute the t statistic

t = b / SE_b
• In terms of a random variable T having the t(n − 2) distribution, the P-value for a test of H₀ against

H₁: β > 0  is  P(T ≥ t)
H₁: β < 0  is  P(T ≤ t)
H₁: β ≠ 0  is  2P(T ≥ |t|)
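A sketch of the two-sided test in Python, using the Example 2 numbers (SciPy's t distribution supplies the P-value):

```python
from scipy import stats

# Slope, standard error, and degrees of freedom from Example 2
b, se_b, df = 1.493, 0.487, 36
t_stat = b / se_b                        # about 3.07

# Two-sided P-value for H1: beta != 0 is 2 * P(T >= |t|) with T ~ t(36)
p_value = 2 * stats.t.sf(abs(t_stat), df)
print(t_stat, p_value)                   # t about 3.066, P about 0.004
```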
Example 2 (continued)
• The hypothesis H₀: β = 0 says that crying has no straight-line relationship with IQ. Here

t = b / SE_b = 1.493 / 0.487 = 3.066

SPSS gives t = 3.065 with P-value 0.004. There is very strong evidence that IQ is correlated with crying.
Checking the Regression Assumptions

• You can fit a least-squares line to any set of explanatory-response data when both variables are quantitative. If the scatterplot doesn't show a roughly linear pattern, the fitted line may be almost useless. But it is still the line that fits the data best in the least-squares sense. To use regression inference, however, the data must satisfy the regression model assumptions. Before we do inference, we must check these assumptions one by one.
The observations are independent

• The observations are independent. In particular, repeated observations on the same individual are not allowed. So we can't use ordinary regression to make inferences about the growth of a single child over time, for example.
The true relationship is linear
• The true relationship is linear. We can't observe the true regression line, so we will almost never see a perfect straight-line relationship in our data. Look at the scatterplot to check that the overall pattern is roughly linear. A plot of the residuals against x magnifies any unusual pattern. Draw a horizontal line at zero on the residual plot to orient your eye. Because the sum of the residuals is always zero, zero is also the mean of the residuals.
The standard deviation of the response about the true line is the same everywhere

• The standard deviation of the response about the true line is the same everywhere. Look at the scatterplot again. The scatter of the data points about the line should be roughly the same over the entire range of the data. A plot of the residuals against x, with a horizontal line at zero, makes this easier to check.
• It is quite common to find that as the response y gets larger, so does the scatter of the points about the fitted line. Rather than remaining fixed, the standard deviation σ about the line is changing with x as the mean response changes with x. You cannot safely use our inference recipes when this happens. There is no fixed σ for s to estimate.
The response varies normally about the true regression line

• We can't observe the true regression line. We can observe the least-squares line and the residuals, which show the variation of the response about the fitted line. The residuals estimate the deviations of the response from the true regression line, so they should follow a normal distribution.
• Make a histogram of the residuals and check for clear skewness or other major departures from normality. Like other t procedures, inference for regression is not very sensitive to minor lack of normality, especially when we have many observations. Do beware of influential observations, which move the regression line and can greatly affect the results of inference.
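These checks can be sketched in code. The data below are made up to satisfy the model; the point is the mechanics: fit the line, form the residuals, confirm their mean is zero, and look at their distribution (in practice you would also plot them against x and draw a histogram):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data generated to satisfy the model (made up for illustration)
x = rng.uniform(9, 30, size=200)
y = 90 + 1.5 * x + rng.normal(0, 17.5, size=200)

# Fit the least-squares line and form the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# The residuals from a least-squares line always have mean zero
print(abs(residuals.mean()) < 1e-8)   # True

# Rough symmetry check in place of a histogram: the lower and upper
# quartiles of the residuals should sit roughly symmetric about zero
q1, q3 = np.percentile(residuals, [25, 75])
print(q1 < 0 < q3)                    # True
```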