Chapter 12: Simple Linear Regression and Correlation
Chapter Outline
• Types of Regression Models
• Determining the Simple Linear Regression Equation
• Measures of Variation in Regression and Correlation
• Assumptions of Regression and Correlation
• Residual Analysis and the Durbin-Watson Statistic
• Estimation of Predicted Values
• Correlation: Measuring the Strength of the Association
Purpose of Regression and Correlation Analysis
• Regression analysis is used primarily for prediction: a statistical model predicts the values of a dependent (response) variable from the values of at least one independent (explanatory) variable.
• Correlation analysis is used to measure the strength of the association between numerical variables.
The Scatter Diagram
[Figure: scatter diagram plotting all (Xi, Yi) pairs, with X on the horizontal axis and Y on the vertical axis]
Types of Regression Models
Positive Linear Relationship
Negative Linear Relationship
Relationship NOT Linear
No Relationship
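Simple Linear Regression Model
• The straight line that best fits the data
• The relationship between the variables is a linear function

$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

where $Y_i$ is the dependent (response) variable, $X_i$ is the independent (explanatory) variable, $\beta_0$ is the Y intercept, $\beta_1$ is the slope, and $\varepsilon_i$ is the random error.

Population Linear Regression Model
[Figure: the population regression line $\mu_{Y|X} = \beta_0 + \beta_1 X_i$; each observed value $Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$ lies a random error $\varepsilon_i$ away from the line]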
Sample Linear Regression Model

$\hat{Y}_i = b_0 + b_1 X_i$

$\hat{Y}_i$ = predicted value of Y for observation i
$X_i$ = value of X for observation i
$b_0$ = sample Y intercept, used as an estimate of the population $\beta_0$
$b_1$ = sample slope, used as an estimate of the population $\beta_1$
Simple Linear Regression Equation, Example
You wish to examine the relationship between the square footage of produce stores and their annual sales. Sample data for 7 stores were obtained. Find the equation of the straight line that fits the data best.

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760
Scatter Diagram Example
[Figure: scatter diagram of the 7 stores; horizontal axis: Square Feet (0 to 6,000); vertical axis: Annual Sales ($000) (0 to 12,000)]
Equation for the Best Straight Line

$\hat{Y}_i = b_0 + b_1 X_i = 1636.415 + 1.487 X_i$

From Excel Printout:
                Coefficients
Intercept       1636.414726
X Variable 1    1.486633657
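The coefficients in the printout can be reproduced by hand. The following is a minimal Python sketch, assuming that "best" means the least-squares criterion; the formulas b1 = SSXY / SSX and b0 = Ybar - b1 * Xbar are the standard least-squares estimates and are not stated on the slide, and the name `fit_line` is purely illustrative.

```python
# Produce-store data from the example (annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]


def fit_line(x, y):
    """Return (b0, b1) of the least-squares line y_hat = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squared deviations of X and sum of cross-products of X and Y.
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    ssxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = ssxy / ssx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = fit_line(square_feet, annual_sales)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Rounded to three decimals, the result should agree with the Intercept and X Variable 1 rows of the Excel printout.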
Graph of the Best Straight Line
[Figure: the 7 data points with the fitted line drawn through them; horizontal axis: Square Feet (0 to 6,000); vertical axis: Annual Sales ($000) (0 to 12,000)]
Interpreting the Results

$\hat{Y}_i = 1636.415 + 1.487 X_i$

The slope of 1.487 means that for each increase of one unit in X, Y is estimated to increase by 1.487 units.
Because sales are recorded in $000, each additional square foot of store size is estimated to increase expected annual sales by 1.487 thousand dollars, that is, $1,487.
Measures of Variation: The Sum of Squares
SST = Total Sum of Squares
• measures the variation of the $Y_i$ values around their mean $\bar{Y}$
SSR = Regression Sum of Squares
• explained variation attributable to the relationship between X and Y
SSE = Error Sum of Squares
• variation attributable to factors other than the relationship between X and Y
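Measures of Variation: The Sum of Squares

$SST = \sum (Y_i - \bar{Y})^2$
$SSE = \sum (Y_i - \hat{Y}_i)^2$
$SSR = \sum (\hat{Y}_i - \bar{Y})^2$

[Figure: the three deviations shown at a given $X_i$, relative to the regression line and the mean $\bar{Y}$]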
Measures of Variation: The Sum of Squares, Example
Excel Output for Produce Stores

             df    SS
Regression   1     30380456.12   (SSR)
Residual     5     1871199.595   (SSE)
Total        6     32251655.71   (SST)
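As a check on the printout, the three sums of squares can be computed directly from their definitions. This is a small Python sketch, assuming the least-squares fit of the produce-store data; the variable names are illustrative only.

```python
# Produce-store data from the example (annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
n = len(square_feet)

# Least-squares fit (assumed to be what the Excel printout reports).
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n
ssx = sum((x - x_bar) ** 2 for x in square_feet)
b1 = sum((x - x_bar) * (y - y_bar)
         for x, y in zip(square_feet, annual_sales)) / ssx
b0 = y_bar - b1 * x_bar
predicted = [b0 + b1 * x for x in square_feet]

# Total, regression (explained) and error (unexplained) sums of squares.
sst = sum((y - y_bar) ** 2 for y in annual_sales)
ssr = sum((p - y_bar) ** 2 for p in predicted)
sse = sum((y - p) ** 2 for y, p in zip(annual_sales, predicted))

print(f"SSR = {ssr:.2f}, SSE = {sse:.2f}, SST = {sst:.2f}")
```

For a least-squares line, SSR and SSE add up to SST, which is why the Total row of the printout equals the sum of the Regression and Residual rows.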
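The Coefficient of Determination
The coefficient of determination is the bridge that lets regression explain correlation.

$r^2 = \frac{SSR}{SST} = \frac{\text{regression sum of squares}}{\text{total sum of squares}}$

It measures the proportion of the variation in Y that is explained by the independent variable X in the regression model.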
Coefficients of Determination (r²) and Correlation (r)
[Figure: four scatter plots of Y against X, each with the fitted line $\hat{Y}_i = b_0 + b_1 X_i$. Panels: r² = 1, r = +1; r² = 1, r = -1; r² = .8, r = +0.9; r² = 0, r = 0]
Standard Error of Estimate

$S_{YX} = \sqrt{\frac{SSE}{n-2}} = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{n-2}}$

The standard deviation of the variation of observations around the regression line
Measures of Variation, Example
Excel Output for Produce Stores

Regression Statistics
Multiple R           0.9705572
R Square             0.94198129
Adjusted R Square    0.93037754
Standard Error       611.751517
Observations         7

r² = .94 and S_YX = 611.75: 94% of the variation in annual sales can be explained by the variability in the size of the store as measured by square footage.
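Both statistics follow from the sums of squares in the earlier Excel output. A brief Python sketch of the arithmetic, using the printed SSR, SSE and SST values as inputs (variable names are illustrative):

```python
import math

# Sums of squares taken from the Excel output for the produce stores.
ssr = 30380456.12   # regression sum of squares
sse = 1871199.595   # error (residual) sum of squares
sst = 32251655.71   # total sum of squares
n = 7               # number of stores

# Coefficient of determination: share of the variation in Y explained by X.
r_squared = ssr / sst

# Standard error of the estimate, with n - 2 degrees of freedom.
s_yx = math.sqrt(sse / (n - 2))

print(f"r^2 = {r_squared:.4f}, S_YX = {s_yx:.2f}")
```

The two values should match the R Square and Standard Error rows of the Regression Statistics table.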
Linear Regression Assumptions for Linear Models
1. Normality
• Y values are normally distributed for each X
• Probability distribution of error is normal
2. Homoscedasticity (constant variance)
3. Independence of errors
Variation of Errors Around the Regression Line
• Y values are normally distributed around the regression line.
• For each X value, the spread (variance) around the regression line is the same.
[Figure: error distributions f(e) of equal spread centered on the regression line at X1 and X2]
Residual Analysis
• Purposes
  • Examine linearity
  • Evaluate violations of assumptions
• Graphical analysis of residuals
  • Plot residuals vs. Xi values
  • A residual is the difference between the actual Yi and the predicted Yi
  • Studentized residuals allow consideration of the magnitude of the residuals
Residual Analysis for Linearity
[Figure: two plots of residuals e against X; left panel: Not Linear; right panel: Linear]
Residual Analysis for Homoscedasticity
Using Standardized Residuals
[Figure: two plots of standardized residuals (SR) against X; left panel: Heteroscedasticity; right panel: Homoscedasticity]
Residual Analysis, Computer Output Example
Produce Stores, Excel Output
[Figure: residual plot of the residuals against Square Feet (0 to 6,000)]

Observation   Predicted Y    Residuals
1             4202.344417    -521.3444173
2             3928.803824    -533.8038245
3             5822.775103     830.2248971
4             9894.664688    -351.6646882
5             3557.14541     -239.1454103
6             4918.90184      644.0981603
7             3588.364717     171.6352829
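The Predicted Y and Residuals columns can be regenerated from the data. The sketch below is a plain Python illustration, assuming the least-squares fit of the produce-store data; each residual is the actual Yi minus the predicted Yi.

```python
# Produce-store data from the example (annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
n = len(square_feet)

# Least-squares fit (assumed to match the Excel coefficients).
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n
ssx = sum((x - x_bar) ** 2 for x in square_feet)
b1 = sum((x - x_bar) * (y - y_bar)
         for x, y in zip(square_feet, annual_sales)) / ssx
b0 = y_bar - b1 * x_bar

# Predicted value and residual (actual minus predicted) for each store.
predicted = [b0 + b1 * x for x in square_feet]
residuals = [y - p for y, p in zip(annual_sales, predicted)]

for i, (p, e) in enumerate(zip(predicted, residuals), start=1):
    print(f"{i}  {p:12.4f}  {e:12.4f}")
```

Plotting `residuals` against `square_feet` gives the residual plot used to judge linearity and constant variance.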
The Durbin-Watson Statistic
• Used when data are collected over time to detect autocorrelation (residuals in one time period are related to residuals in another period)
• Measures violation of the independence assumption

$D = \frac{\sum_{i=2}^{n} (e_i - e_{i-1})^2}{\sum_{i=1}^{n} e_i^2}$

D should be close to 2. If not, examine the model for autocorrelation.
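The statistic is straightforward to compute from a list of residuals ordered in time. The following Python sketch implements the formula above; the function name `durbin_watson` and the two residual series are made up for illustration and do not come from the produce-store example, which is not time-series data.

```python
def durbin_watson(residuals):
    """Durbin-Watson D for residuals listed in time order."""
    # Numerator: squared differences between successive residuals (i = 2..n).
    numerator = sum((residuals[i] - residuals[i - 1]) ** 2
                    for i in range(1, len(residuals)))
    # Denominator: sum of squared residuals (i = 1..n).
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical residuals that alternate in sign (negative autocorrelation).
alternating = [1.0, -1.0, 1.0, -1.0]
# Hypothetical residuals that stay on one side for a while
# (positive autocorrelation).
persistent = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]

print(durbin_watson(alternating))
print(durbin_watson(persistent))
```

The alternating series gives a value above 2 and the persistent series a value below 2, which illustrates why a D far from 2 in either direction is a reason to examine the model for autocorrelation.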
Residual Analysis for Independence
[Figure: two plots of standardized residuals (SR) against X; left panel: Not Independent; right panel: Independent]
Inferences about the Slope, t Test (test of the regression coefficient)
• t test for a population slope: is there a linear relationship between X and Y?
• Null and alternative hypotheses
  $H_0: \beta_1 = 0$ (no linear relationship)
  $H_1: \beta_1 \neq 0$ (linear relationship)
• Test statistic

$t = \frac{b_1 - \beta_1}{S_{b_1}}$, where $S_{b_1} = \frac{S_{YX}}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$ and df = n - 2
Example, Produce Stores
Data for 7 stores:

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Regression model obtained: $\hat{Y}_i = 1636.415 + 1.487 X_i$
The slope of this model is 1.487. Is there a linear relationship between the square footage of a store and its annual sales?
Inferences about the Slope, t Test Example
$H_0: \beta_1 = 0$
$H_1: \beta_1 \neq 0$
$\alpha = .05$
df = 7 - 2 = 5
Critical values: ±2.5706 (.025 in each rejection tail)

Test statistic, from Excel printout:
                t Stat       P-value
Intercept       3.6244333    0.0151488
X Variable 1    9.009944     0.0002812

Decision: Reject H0, since t = 9.0099 exceeds the critical value 2.5706.
Conclusion: There is evidence of a linear relationship.
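The t statistic for the slope can be rebuilt from the data with the formulas given earlier. This is a hedged Python sketch, assuming the least-squares fit; the critical value 2.5706 is copied from the slide rather than computed, and the variable names are illustrative.

```python
import math

# Produce-store data from the example (annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
n = len(square_feet)

# Least-squares fit.
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n
ssx = sum((x - x_bar) ** 2 for x in square_feet)
b1 = sum((x - x_bar) * (y - y_bar)
         for x, y in zip(square_feet, annual_sales)) / ssx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate, S_YX = sqrt(SSE / (n - 2)).
sse = sum((y - (b0 + b1 * x)) ** 2
          for x, y in zip(square_feet, annual_sales))
s_yx = math.sqrt(sse / (n - 2))

# Standard error of the slope and the t statistic under H0: beta1 = 0.
s_b1 = s_yx / math.sqrt(ssx)
t_stat = (b1 - 0) / s_b1

# Two-tailed critical value for alpha = .05 and df = 5, as given on the slide.
t_critical = 2.5706
reject_h0 = abs(t_stat) > t_critical
print(f"t = {t_stat:.4f}, reject H0: {reject_h0}")
```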
Inferences about the Slope, Confidence Interval Example
Confidence interval estimate of the slope:

$b_1 \pm t_{n-2} S_{b_1}$

Excel Printout for Produce Stores
                Lower 95%     Upper 95%
Intercept       475.810926    2797.01853
X Variable 1    1.06249037    1.91077694

At the 95% level of confidence, the confidence interval for the slope is (1.062, 1.911), which does not include 0.
Conclusion: There is a significant linear relationship between annual sales and the size of the store.
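The slope interval can be reproduced in a few lines. The sketch below is an illustration in Python, assuming the least-squares fit and using the tabulated value t = 2.5706 for df = 5 quoted elsewhere in these slides; because that t value is rounded, the last digits may differ slightly from the Excel printout.

```python
import math

# Produce-store data from the example (annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
n = len(square_feet)

# Least-squares fit.
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n
ssx = sum((x - x_bar) ** 2 for x in square_feet)
b1 = sum((x - x_bar) * (y - y_bar)
         for x, y in zip(square_feet, annual_sales)) / ssx
b0 = y_bar - b1 * x_bar

# Standard error of the estimate and of the slope.
sse = sum((y - (b0 + b1 * x)) ** 2
          for x, y in zip(square_feet, annual_sales))
s_yx = math.sqrt(sse / (n - 2))
s_b1 = s_yx / math.sqrt(ssx)

# 95% confidence interval for the slope: b1 +/- t * S_b1.
t_value = 2.5706  # t table value for df = n - 2 = 5 (rounded)
lower = b1 - t_value * s_b1
upper = b1 + t_value * s_b1
print(f"95% CI for the slope: ({lower:.3f}, {upper:.3f})")
```

Since the interval excludes 0, it leads to the same decision as the t test of H0: β1 = 0.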
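Estimation of Predicted Values
Confidence interval estimate for $\mu_{Y|X}$, the mean of Y given a particular $X_i$:

$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$

Here $t_{n-2}$ is the t value from the table with df = n - 2, and $S_{YX}$ is the standard error of the estimate. The size of the interval varies with the distance of $X_i$ from the mean $\bar{X}$.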
Estimation of Predicted Values (interval prediction)
Interval estimate for an individual response $Y_i$ at a particular $X_i$:

$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}}$

The additional 1 under the square root makes this interval wider than the interval for the mean of Y.
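Interval Estimates for Different Values of X
[Figure: regression line of Y on X with two bands around it at a given X; the narrower band is the confidence interval for the mean of Y and the wider band is the interval for an individual Yi; $\bar{X}$ is marked on the horizontal axis]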
Example, Produce Stores
Data for 7 stores:

Store   Square Feet   Annual Sales ($000)
1       1,726         3,681
2       1,542         3,395
3       2,816         6,653
4       5,555         9,543
5       1,292         3,318
6       2,208         5,563
7       1,313         3,760

Regression model obtained: $\hat{Y}_i = 1636.415 + 1.487 X_i$
Predict the annual sales for a store with 2,000 square feet.
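Estimation of Predicted Values, Example
Confidence Interval Estimate for $\mu_{Y|X}$
Find the 95% confidence interval for the average annual sales for stores of 2,000 square feet.

Predicted sales: $\hat{Y}_i = 1636.415 + 1.487(2000) \approx 4610.4$ ($000)
$\bar{X} = 2350.29$, $S_{YX} = 611.75$, $t_{n-2} = t_5 = 2.5706$, $\sum_{i=1}^{n} (X_i - \bar{X})^2 \approx 13{,}746{,}317$ (from the 7 stores)

$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{\frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.4 \pm 2.5706\,(611.75) \sqrt{\frac{1}{7} + \frac{(2000 - 2350.29)^2}{13{,}746{,}317}} \approx 4610.4 \pm 612.7$

Confidence interval for mean Y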
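Estimation of Predicted Values, Example
Interval Estimate for an Individual Y
Find the 95% interval for the annual sales of one particular store of 2,000 square feet.

Predicted sales: $\hat{Y}_i = 1636.415 + 1.487(2000) \approx 4610.4$ ($000)
$\bar{X} = 2350.29$, $S_{YX} = 611.75$, $t_{n-2} = t_5 = 2.5706$, $\sum_{i=1}^{n} (X_i - \bar{X})^2 \approx 13{,}746{,}317$ (from the 7 stores)

$\hat{Y}_i \pm t_{n-2} S_{YX} \sqrt{1 + \frac{1}{n} + \frac{(X_i - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2}} = 4610.4 \pm 2.5706\,(611.75) \sqrt{1 + \frac{1}{7} + \frac{(2000 - 2350.29)^2}{13{,}746{,}317}} \approx 4610.4 \pm 1687.7$

Interval for individual Y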
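Both interval formulas can be evaluated from the raw data. The following Python sketch assumes the least-squares fit and the table value t = 2.5706 for df = 5; the helper name `interval_half_width` and its `individual` flag are hypothetical, introduced only for this illustration. It uses the unrounded coefficients, so the point prediction can differ slightly from one computed with the rounded slope 1.487.

```python
import math

# Produce-store data from the example (annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
n = len(square_feet)

# Least-squares fit and standard error of the estimate.
x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n
ssx = sum((x - x_bar) ** 2 for x in square_feet)
b1 = sum((x - x_bar) * (y - y_bar)
         for x, y in zip(square_feet, annual_sales)) / ssx
b0 = y_bar - b1 * x_bar
sse = sum((y - (b0 + b1 * x)) ** 2
          for x, y in zip(square_feet, annual_sales))
s_yx = math.sqrt(sse / (n - 2))


def interval_half_width(x0, individual=False, t_value=2.5706):
    """Half-width of the 95% interval at x0.

    individual=False: confidence interval for the mean of Y at x0.
    individual=True:  interval for a single new observation at x0.
    """
    h = 1 / n + (x0 - x_bar) ** 2 / ssx
    extra = 1.0 if individual else 0.0  # the additional 1 under the root
    return t_value * s_yx * math.sqrt(extra + h)


x0 = 2000
y_hat = b0 + b1 * x0
mean_hw = interval_half_width(x0)
indiv_hw = interval_half_width(x0, individual=True)
print(f"predicted sales: {y_hat:.2f} ($000)")
print(f"mean of Y:    +/- {mean_hw:.2f}")
print(f"individual Y: +/- {indiv_hw:.2f}")
```

Calling `interval_half_width` at values of `x0` farther from the mean square footage shows both intervals widening, as the slide on interval estimates for different values of X describes.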
Correlation: Measuring the Strength of Association
• Answers the question: how strong is the linear relationship between 2 variables?
• Coefficient of correlation used
  • Population correlation coefficient denoted ρ
  • Values range from -1 to +1
  • Measures degree of association
• Is the square root of the coefficient of determination, taking the sign of the slope b1
Test of Coefficient of Correlation
• Tests whether there is a linear relationship between 2 numerical variables
• Same conclusion as testing the population slope β1
• Hypotheses
  • H0: ρ = 0 (no correlation)
  • H1: ρ ≠ 0 (correlation)
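To see why the correlation test and the slope test agree, the sample correlation r and its t statistic can be computed for the produce-store data. This Python sketch uses the standard formula t = r·sqrt(n - 2) / sqrt(1 - r²), which is not given on the slides and is included here as an assumption about the intended test.

```python
import math

# Produce-store data from the example (annual sales in $000).
square_feet = [1726, 1542, 2816, 5555, 1292, 2208, 1313]
annual_sales = [3681, 3395, 6653, 9543, 3318, 5563, 3760]
n = len(square_feet)

x_bar = sum(square_feet) / n
y_bar = sum(annual_sales) / n
ssx = sum((x - x_bar) ** 2 for x in square_feet)
ssy = sum((y - y_bar) ** 2 for y in annual_sales)
ssxy = sum((x - x_bar) * (y - y_bar)
           for x, y in zip(square_feet, annual_sales))

# Sample coefficient of correlation.
r = ssxy / math.sqrt(ssx * ssy)

# t statistic for H0: rho = 0, with n - 2 degrees of freedom.
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, t = {t_stat:.4f}")
```

The value of r matches the Multiple R row of the Regression Statistics output, and the t statistic equals the one reported for the slope, so both tests lead to the same decision.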
Chapter Summary
• Described Types of Regression Models
• Determined the Simple Linear Regression Equation
• Provided Measures of Variation in Regression and Correlation
• Stated Assumptions of Regression and Correlation
• Described Residual Analysis and the Durbin-Watson Statistic
• Provided Estimation of Predicted Values
• Discussed Correlation: Measuring the Strength of the Association