Lecture 8: Simple Linear Regression

Simple Linear Regression (Bivariate Regression)

We have already looked at measuring relationships between two interval variables using correlation. Now we continue the bivariate analysis of two variables using regression analysis. The purpose of doing regression rather than correlation is that we can predict results in one variable based on another variable. So, rather than simply seeing whether the variables are related, we can interpret their effect.

Like correlation, there are two major assumptions:
• The relationship should be linear; and
• The level of data must be continuous.

The regression equation

The purpose of simple linear regression is to fit a line to the two variables. This line is called the line of best fit, or the regression line. When we draw a scatterplot of two variables, it is possible to fit a line which best represents the data.

A regression equation is used to define the relationship between the two variables. It takes the form

    ŷ = a + bx    or    y = a + bx + e

These are essentially the same, except that the second includes an error term e at the end. This error term indicates that what we have is in fact a model, and hence it won't fit the data perfectly.

[Figure: scatterplot with a fitted regression line, Y values 0–70 against X values 0–5. The intercept is 20, and a change in X of 1 gives a change in Y of 10, so the plotted line is y = 20 + 10x.]

How do we fit a line to data?

To fit a line of best fit we use a method called the Method of Least Squares. This method allows us to determine which line, out of all the lines that could be drawn, produces the least amount of difference between the actual values (the data points) and the predicted line.

In the figure above, three data points fall on the line, while the remaining six are slightly above or below it. The differences between these points and the line are called residuals. Some of these differences will be positive (above the line), while others will be negative (below the line). If we simply added up all these differences, some of the positive and negative values would cancel each other out, which would have the effect of overestimating how well the line represents the data. Instead, if we square the differences and then add them up, we can work out which line has the smallest sum of squares (that is, the one with the least error).
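As an illustration of the least-squares calculation, here is a minimal Python sketch (not part of the original lecture) that computes the slope and intercept from the usual formulas. The data values are hypothetical, chosen only so the code runs.

    import numpy as np

    # Hypothetical data, for illustration only
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([21.0, 29.0, 42.0, 49.0, 58.0, 71.0])

    # Least-squares slope: b = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
    x_bar, y_bar = x.mean(), y.mean()
    b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)

    # Intercept: a = y_bar - b * x_bar (the line passes through (x_bar, y_bar))
    a = y_bar - b * x_bar

    # The quantity that least squares minimizes: the sum of squared residuals
    ss_res = np.sum((y - (a + b * x)) ** 2)
    print(f"a = {a:.2f}, b = {b:.2f}, sum of squared residuals = {ss_res:.2f}")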
Example 1

[Worked example; the slides for this example consist of figures and are not reproduced here.]

Prediction

We could now draw a line of best fit through the observed data points.

[Figure: scatterplot of Age (15 to 50, horizontal axis) against Number of children (0.0 to 3.5, vertical axis), with a line of best fit drawn through the points.]

Inference for Regression

• When a scatterplot shows a linear relationship between a quantitative explanatory variable x and a quantitative response variable y, we can use the least-squares line fitted to the data to predict y for a given value of x. Now we want to do tests and confidence intervals in this setting.

Example 2: Crying and IQ

• Crying easily in infancy may be a sign of higher IQ. We have crying intensity and IQ data on 38 infants (IQ = intelligence quotient).

• Plot and interpret. As always, we first examine the data. Figure 3 is a scatterplot of the crying data. Plot the explanatory variable (crying intensity at birth) horizontally and the response variable (IQ at age 3) vertically. Look for the form, direction, and strength of the relationship as well as for outliers or other deviations. There is a moderate positive linear relationship, with no extreme outliers or potentially influential observations.

• Numerical summary. Because the scatterplot shows a roughly linear (straight-line) pattern, the correlation describes the direction and strength of the relationship. The correlation between crying and IQ is r = 0.455.

• Mathematical model. We are interested in predicting the response from information about the explanatory variable. So we find the least-squares regression line for predicting IQ from crying. This line lies as close as possible to the points (in the sense of least squares) in the vertical (y) direction. The equation of the least-squares regression line is

    ŷ = a + bx = 91.27 + 1.493x

Because r² = 0.207, about 21% of the variation in IQ scores is explained by crying intensity (see the SPSS output later in this lecture).

The regression model

• The slope b and intercept a of the least-squares line are statistics. That is, we calculated them from the sample data. These statistics would take somewhat different values if we repeated the study with different infants. To do formal inference, we think of a and b as estimates of unknown parameters.

Assumptions for regression inference

We have n observations on an explanatory variable x and a response variable y. Our goal is to study or predict the behavior of y for given values of x.

• For any fixed value of x, the response y varies according to a normal distribution. Repeated responses y are independent of each other.

• The mean response μ_y has a straight-line relationship with x:

    μ_y = α + βx

The slope β and intercept α are unknown parameters.

• The standard deviation of y (call it σ) is the same for all values of x. The value of σ is unknown.

The heart of this model is that there is an "on the average" straight-line relationship between y and x. The true regression line μ_y = α + βx says that the mean response moves along a straight line as the explanatory variable x changes. We can't observe the true regression line. The values of y that we do observe vary about their means according to a normal distribution. If we hold x fixed and take many observations on y, the normal pattern will eventually appear in a histogram.

In practice, we observe y for many different values of x, so that we see an overall linear pattern formed by points scattered about the true line. The standard deviation σ determines whether the points fall close to the true regression line (small σ) or are widely scattered (large σ).

• Figure 4 shows the regression model in picture form. The line in the figure is the true regression line. The mean of the response y moves along this line as the explanatory variable x takes different values. The normal curves show how y will vary when x is held fixed at different values. All of the curves have the same σ, so the variability of y is the same for all values of x. You should check the assumptions for inference when you do inference about regression.

[Figure 4: The regression model. The line is the true regression line, which shows how the mean response μ_y changes as the explanatory variable x changes. For any fixed value of x, the observed response y varies according to a normal distribution having mean μ_y.]
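To make the model concrete, here is a small simulation sketch (not from the lecture; the parameter values α = 20, β = 10, σ = 5 are hypothetical). For each fixed x it draws repeated responses y from a normal distribution whose mean lies on the true regression line:

    import numpy as np

    rng = np.random.default_rng(0)
    alpha, beta, sigma = 20.0, 10.0, 5.0    # hypothetical "true" parameters

    # The model: for fixed x, y ~ Normal(mean = alpha + beta*x, sd = sigma),
    # with the same sigma at every x and independent responses.
    x = np.repeat(np.arange(1, 6), 50)      # 50 responses at each of x = 1..5
    y = rng.normal(alpha + beta * x, sigma)

    # At each x, the sample mean of y should be close to the mean on the line
    for xv in range(1, 6):
        print(xv, round(y[x == xv].mean(), 2), "vs true mean", alpha + beta * xv)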
Inference about the Model

• The first step in inference is to estimate the unknown parameters α, β, and σ. When the regression model describes our data and we calculate the least-squares line ŷ = a + bx, the slope b of the least-squares line is an unbiased estimator of the true slope β, and the intercept a of the least-squares line is an unbiased estimator of the true intercept α.

• The data in Figure 3 fit the regression model of scatter about an invisible true regression line reasonably well. The least-squares line is ŷ = 91.27 + 1.493x. The slope is particularly important. A slope is a rate of change. The true slope β says how much higher average IQ is for children whose crying measurement is one intensity unit higher. Because b = 1.493 estimates the unknown β, we estimate that, on average, IQ is about 1.5 points higher for each added unit of crying intensity.

• We need the intercept a = 91.27 to draw the line, but it has no statistical meaning in this example. No child had a crying intensity below 9, so we have no data near x = 0.

• The remaining parameter of the model is the standard deviation σ, which describes the variability of the response y about the true regression line. The least-squares line estimates the true regression line, so the residuals estimate how much y varies about the true line.

• Recall that the residuals are the vertical deviations of the data points from the least-squares line:

    residual = observed y − predicted y = y − ŷ

There are n residuals, one for each data point. Because σ is the standard deviation of responses about the true regression line, we estimate it by a sample standard deviation of the residuals. We call this sample standard deviation a standard error to emphasize that it is estimated from data. The residuals from a least-squares line always have mean zero, which simplifies their standard error.

Standard error about the least-squares line

• The standard error about the line is

    s = sqrt( Σ residual² / (n − 2) ) = sqrt( Σ (y − ŷ)² / (n − 2) )

Use s to estimate the unknown σ in the regression model.

• Because we use the standard error about the line so often in regression inference, we just call it s. Notice that s² is an average of the squared deviations of the data points from the line, so it qualifies as a variance. We average the squared deviations by dividing by n − 2, the number of data points less 2. It turns out that if we know n − 2 of the n residuals, the other two are determined. That is, n − 2 is the degrees of freedom of s. We first met the idea of degrees of freedom in the case of the ordinary sample standard deviation of n observations, which has n − 1 degrees of freedom. Now we observe two variables rather than one, and the proper degrees of freedom is n − 2 rather than n − 1.

• Calculating s by hand is unpleasant: you must find the predicted response for each x in the data set, then the residuals, and then s. In practice we will use SPSS, which does this arithmetic instantly. Nonetheless, here is an example to help you understand the standard error s.

Example 2 (continued)

• The first infant had a crying intensity of 10 and a later IQ of 87. The predicted IQ for x = 10 is

    ŷ = 91.27 + 1.493x = 91.27 + 1.493 × 10 = 106.2

The residual for this observation is

    residual = y − ŷ = 87 − 106.2 = −19.2

That is, the observed IQ for this infant lies 19.2 points below the least-squares line.
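The residual arithmetic is easy to vectorize. A sketch, assuming arrays x and y hold the 38 (crying intensity, IQ) pairs — only the first infant's values (10, 87) appear in the notes, so the arrays below are truncated placeholders:

    import numpy as np

    a, b = 91.27, 1.493                 # fitted coefficients from the lecture

    # Placeholder data: replace with the full 38 observations
    x = np.array([10.0])                # crying intensity
    y = np.array([87.0])                # IQ at age 3

    y_hat = a + b * x                   # predicted IQ for each infant
    residuals = y - y_hat               # observed minus predicted
    print(residuals)                    # first residual: 87 - 106.2 = -19.2

    # With all n = 38 observations loaded, the standard error about the line is
    # s = sqrt(sum(residuals**2) / (n - 2))
    n = len(x)
    if n > 2:
        print(np.sqrt(np.sum(residuals ** 2) / (n - 2)))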
• Repeat this calculation 37 more times, once for each subject. The 38 residuals are:

    -19.20  -31.13  -22.65  -15.18  -12.18  -15.15  -16.63   -6.18
     -1.70  -22.60   -6.68   -6.17   -9.15  -23.58   -9.14    2.80
     -9.14   -1.66   -6.14  -12.60    0.34   -8.62    2.85   14.30
      9.82   10.82    0.37    8.85   10.87   19.34   10.89   -2.55
     20.85   24.35   18.94   32.89   18.47   51.32

• The variance about the line is

    s² = Σ residual² / (n − 2)
       = [(−19.20)² + (−31.13)² + … + (51.32)²] / (38 − 2)
       = 11023.3 / 36
       = 306.20

• Finally, the standard error about the line is

    s = √306.20 = 17.50

The standard error s about the line is the key measure of the variability of the responses in regression. It is part of the standard error of all the statistics we will use for inference.

Confidence intervals for the regression slope

• The slope β of the true regression line is usually the most important parameter in a regression problem. The slope is the rate of change of the mean response as the explanatory variable increases. We often want to estimate β. The slope b of the least-squares line is an unbiased estimator of β. A confidence interval is more useful than the point estimate alone because it shows how accurate the estimate b is likely to be.

• The confidence interval for β has the familiar form

    estimate ± t × SE_estimate

Because b is our estimate, the confidence interval becomes

    b ± t × SE_b

Here are the details:

Confidence interval for the regression slope

• A level C confidence interval for the slope β of the true regression line is

    b ± t × SE_b

In this recipe, the standard error of the least-squares slope b is

    SE_b = s / √Σ(x − x̄)²

and t is the upper (1 − C)/2 critical value from the t distribution with n − 2 degrees of freedom.

• As advertised, the standard error of b is a multiple of s. Although we give the recipe for this standard error, you should rarely have to calculate SE_b by hand. Regression software such as SPSS gives the standard error along with b itself.

Example 2 (continued)

• To calculate a 95% confidence interval for the true slope β, we use the critical value t = 2.0294 for n − 2 = 36 degrees of freedom, interpolating in the t table between df = 30 and df = 40:

    df    t
    30    2.042
    36    2.0294 (interpolated)
    40    2.021

• The standard error of b is

    SE_b = s / √Σ(x − x̄)²
         = 17.50 / √[(10 − 17.39)² + (12 − 17.39)² + … + (22 − 17.39)²]
         = 17.50 / √1291.08
         = 17.50 / 35.93
         = 0.487

• The 95% confidence interval is

    b ± t × SE_b = 1.493 ± 2.0294 × 0.487 = 1.493 ± 0.988 = 0.505 to 2.481

We are 95% confident that mean IQ increases by between about 0.5 and 2.5 points for each additional unit of crying intensity.

Using SPSS

[Screenshots of the SPSS regression dialog, and of the predicted y values and residuals, are not reproduced here.]

SPSS output

Model Summary (dependent variable: IQ; predictors: (Constant), CRYING)

    R       R Square    Adjusted R Square    Std. Error of the Estimate
    .455    .207        .185                 17.499

Coefficients (dependent variable: IQ)

    Model         B         Std. Error    Beta    t         Sig.    95% CI for B (Lower, Upper)
    (Constant)    91.268    8.934                 10.216    .000    (73.149, 109.388)
    CRYING        1.493     .487          .455    3.065     .004    (.505, 2.481)

ANOVA (dependent variable: IQ; predictors: (Constant), CRYING)

    Model         Sum of Squares    df    Mean Square    F        Sig.
    Regression    2877.480          1     2877.480       9.397    .004
    Residual      11023.39          36    306.205
    Total         13900.87          37
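The interval can be reproduced from the summary statistics in the SPSS output. A sketch in Python using scipy (the exact critical value 2.0281 from software differs slightly from the table interpolation 2.0294, but gives the same interval to three decimals):

    from scipy import stats

    b, se_b, n = 1.493, 0.487, 38       # slope, its standard error, sample size

    # Upper 2.5% critical value of t with n - 2 = 36 degrees of freedom
    t_crit = stats.t.ppf(0.975, 36)

    margin = t_crit * se_b
    print(f"95% CI for the slope: {b - margin:.3f} to {b + margin:.3f}")
    # about 0.505 to 2.481, matching the hand calculation and SPSS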
Testing the hypothesis of no linear relationship

• We can also test hypotheses about the slope β. The most common hypothesis is

    H0: β = 0

• A regression line with slope 0 is horizontal. That is, the mean of y does not change at all when x changes. So H0 says that there is no true linear relationship between x and y.

• Put another way, H0 says that straight-line dependence on x is of no value for predicting y. Put yet another way, H0 says that there is no correlation between x and y in the population from which we drew our data.

• The test statistic is just the standardized version of the least-squares slope b. It is another t statistic. Here are the details:

Significance tests for regression slope

• To test the hypothesis H0: β = 0, compute the t statistic

    t = b / SE_b

• In terms of a random variable T having the t(n − 2) distribution, the P-value for a test of H0 against

    H1: β > 0  is  P(T ≥ t)
    H1: β < 0  is  P(T ≤ t)
    H1: β ≠ 0  is  2P(T ≥ |t|)

Example 2 (continued)

• The hypothesis H0: β = 0 says that crying has no straight-line relationship with IQ.

    t = b / SE_b = 1.493 / 0.487 = 3.066

SPSS gives t = 3.065 with P-value 0.004. There is very strong evidence that IQ is correlated with crying.

Checking the Regression Assumptions

• You can fit a least-squares line to any set of explanatory-response data when both variables are quantitative. If the scatterplot doesn't show a roughly linear pattern, the fitted line may be almost useless, but it is still the line that fits the data best in the least-squares sense. To use regression inference, however, the data must satisfy the regression model assumptions. Before we do inference, we must check these assumptions one by one.

The observations are independent

• In particular, repeated observations on the same individual are not allowed. So we can't use ordinary regression to make inferences about the growth of a single child over time, for example.

The true relationship is linear

• We can't observe the true regression line, so we will almost never see a perfect straight-line relationship in our data. Look at the scatterplot to check that the overall pattern is roughly linear. A plot of the residuals against x magnifies any unusual pattern. Draw a horizontal line at zero on the residual plot to orient your eye. Because the sum of the residuals is always zero, zero is also the mean of the residuals.

The standard deviation of the response about the true line is the same everywhere

• Look at the scatterplot again. The scatter of the data points about the line should be roughly the same over the entire range of the data. A plot of the residuals against x, with a horizontal line at zero, makes this easier to check (see the sketch below).

• It is quite common to find that as the response y gets larger, so does the scatter of the points about the fitted line. Rather than remaining fixed, the standard deviation σ about the line is changing with x as the mean response changes with x. You cannot safely use our inference recipes when this happens: there is no fixed σ for s to estimate.
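Here is a sketch of the residual plot described above, assuming arrays x and y hold the raw data (the values below are placeholders, not the lecture's data set):

    import numpy as np
    import matplotlib.pyplot as plt

    # Placeholder data: substitute the real (crying, IQ) observations
    x = np.array([10.0, 12.0, 9.0, 16.0, 18.0, 15.0, 12.0, 20.0, 17.0, 19.0])
    y = np.array([87.0, 97.0, 103.0, 106.0, 109.0, 114.0, 119.0, 132.0, 121.0, 135.0])

    b, a = np.polyfit(x, y, 1)          # least-squares slope and intercept
    residuals = y - (a + b * x)

    # Residuals against x, with a horizontal line at zero to orient the eye.
    # Look for curvature (non-linearity) and a funnel shape (changing spread).
    plt.scatter(x, residuals)
    plt.axhline(0, linestyle="--")
    plt.xlabel("Crying intensity (x)")
    plt.ylabel("Residual")
    plt.show()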
The response varies normally about the true regression line

• We can't observe the true regression line. We can observe the least-squares line and the residuals, which show the variation of the response about the fitted line. The residuals estimate the deviations of the response from the true regression line, so they should follow a normal distribution.

• Make a histogram of the residuals and check for clear skewness or other major departures from normality. Like other t procedures, inference for regression is not very sensitive to minor lack of normality, especially when we have many observations. Do beware of influential observations, which move the regression line and can greatly affect the results of inference.
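Finally, the quantities computed by hand in this lecture — the slope, intercept, correlation, standard error of the slope, and the two-sided P-value for H0: β = 0 — can all be obtained in one call. A sketch using scipy's linregress, again on placeholder data rather than the lecture's 38 observations:

    import numpy as np
    from scipy import stats

    # Placeholder data: substitute the real 38 (crying, IQ) pairs
    x = np.array([10.0, 12.0, 9.0, 16.0, 18.0, 15.0, 12.0, 20.0, 17.0, 19.0])
    y = np.array([87.0, 97.0, 103.0, 106.0, 109.0, 114.0, 119.0, 132.0, 121.0, 135.0])

    res = stats.linregress(x, y)
    print(f"slope b     = {res.slope:.3f}")      # compare SPSS B for CRYING
    print(f"intercept a = {res.intercept:.3f}")  # compare SPSS B for (Constant)
    print(f"r           = {res.rvalue:.3f}")
    print(f"SE of slope = {res.stderr:.3f}")
    print(f"P-value     = {res.pvalue:.4f}")     # two-sided test of H0: slope = 0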