RESEARCH
METHODS & ANALYSIS
INTRODUCTION TO REGRESSION
WEEK 11 - CHAPTER 17
Purposes of Regression Line
Note that Figure 17.1 shows a good, but not perfect, positive relationship. The
purposes of the regression line are:
Makes the relationship between SAT and GPA easier to see
Identifies the central tendency of the relationship, just as the mean identifies
the central tendency of a set of scores
Can be used for prediction: it establishes a precise relationship between each X
value and a corresponding predicted Y value
Definition
The statistical technique for finding the best-fitting straight line for a set
of data is called regression, and the resulting straight line is called the
regression line.
Introduction to Regression
Developing the regression line involves the use of a linear equation.
All linear equations have the form Y = bX + a, where a and b are constants.
The value of b is the slope of the line and measures how much Y changes when X
is increased by one point.
The value of a is the Y-intercept and identifies the point where the line
crosses the Y axis.
Example
A local tennis club charges a fee of $5 per hour plus an annual membership fee
of $25. With this information, the total cost of playing tennis can be computed
using a linear equation that describes the relationship between the total cost
(Y) and the number of hours (X).
Y = 5X + 25
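As a quick illustration, here is a minimal Python sketch of this linear equation; the 10-hour input is just an arbitrary example value.

```python
# Tennis-club example: total cost Y = bX + a with slope b = 5 (dollars per hour)
# and Y-intercept a = 25 (annual membership fee).

def total_cost(hours):
    """Return the total yearly cost (Y) for a given number of hours played (X)."""
    return 5 * hours + 25

print(total_cost(10))  # 5(10) + 25 = 75 dollars
```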
Developing the Equation
The regression equation is obtained by first finding the error (or distance)
between each actual data point and the corresponding predicted value on the line.
Each error is then squared to make the values consistently positive.
The goal of regression is to find the equation that produces the smallest total
amount of squared error (called the least-squared error solution).
Thus, the regression equation produces the “best fitting” line for the data
points.
Regression Equation
The regression equation is defined by the slope constant, b = SP/SSx, and the
Y-intercept, a = My – bMx, producing a linear equation of the form Ŷ = bX + a.
The equation can be used to compute a predicted Y value for each of the X values
in the data.
Caution:
The predicted value is not perfect (unless r = +1 or -1)
The equation should not be used for predictions of X values that fall outside the
range covered by the original data
Based on these values…
SP = ∑(X – Mx)(Y – My) = 16
SSx = ∑(X – Mx)² = 10
Using the equations for the slope constant, b = SP/SSx, and the Y-intercept, a =
My – bMx, the solutions for b and a are:
b = 16/10 = 1.6
a = 6 – 1.6(5) = –2
The resulting regression equation is:
Ŷ = 1.6X - 2
Using the Regression Equation
For any given value of X, we can use the equation to compute a predicted value
of Y.
Using the previous example, an individual with an X score of X = 5 would be
predicted to have a Y score of:
Ŷ = 1.6X – 2
Ŷ = 1.6(5) – 2
= 8 – 2
= 6
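The same arithmetic can be sketched in a few lines of Python, using only the summary values reported above (SP = 16, SSx = 10, Mx = 5, My = 6); the raw scores themselves are not reproduced in these notes.

```python
# Slope, Y-intercept, and a predicted value for the worked example.

SP, SSx = 16, 10      # sum of products and sum of squares for X
Mx, My = 5, 6         # means of X and Y

b = SP / SSx          # slope: b = SP/SSx = 1.6
a = My - b * Mx       # Y-intercept: a = My - b*Mx = 6 - 1.6(5) = -2

def predict(x):
    """Predicted Y (Y-hat) for a given X, using Y-hat = bX + a."""
    return b * x + a

print(b, a)          # 1.6 -2.0
print(predict(5))    # 1.6(5) - 2 = 6.0
```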
Cautions Reviewed
The predicted value is not perfect (unless r = +1.00 or -1.00). Although the
amount of error varies from point to point, on average the errors are directly
related to the magnitude of the correlation. With a correlation near 1 or -1,
small error, but as correlation nears zero, error increases.
The regression equation should not be used to make predictions for X values that
fall outside the range of values covered by the original data. In the previous
example, X values range from 3 to 7; because no information exists for X values
outside of this range, the equation should not be used to predict Y for any X
lower than 3 or greater than 7.
Standardized Form
Occasionally, researchers transform X and Y values into z-scores; the resulting
equation is simpler because z-scores have standardized characteristics. The mean
is always 0, and the standard deviation is always 1.
Zy = (beta)Zx
Notice:
Z-score for each X is used to predict z-score for Y
Slope is now called beta
Because both sets of z-scores have a mean of zero, the
constant a disappears from the equation
When 1 variable, X, is being used to predict a second
variable, Y, the value of beta is equal to the Pearson correlation for X and Y.
Thus, the standardized form of the regression equation can also be written as:
Zy = rZx
Standard Error of Estimate
A regression equation, by itself, allows you to make predictions, but it does
not provide any information about the accuracy of the predictions. To measure
the precision of the regression, it is customary to compute a standard error of
estimate.
The standard error of estimate gives a measure of the standard distance between
a regression line and the actual data points.
Calculating Standard Error of Estimate
Like a standard deviation, it measures a standard distance. To calculate:
Find the sum of squared deviations (SS) for the residuals; each residual measures
the distance of the actual Y value (raw score) from the predicted Y value (on the
regression line).
Divide the obtained SS by its df (n – 2) to obtain a measure of variance
Take the square root of the variance
Final equation: standard error of estimate = square root of (SSresidual/df)
Standard Error and Correlation
Because r² measures the portion of the variability in the Y scores that is
predicted by the regression equation, we can use (1 - r²) to measure the
unpredicted portion.
Predicted variability = SSregression = r²SSy
Unpredicted variability = SSresidual = (1 - r²)SSy
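As a small illustration, the sketch below applies these formulas, borrowing the values used in the analysis-of-regression example later in these notes (r = .80, SSy = 40, n = 5), and also computes the standard error of estimate described above.

```python
from math import sqrt

# Partition of SSy into predicted and unpredicted parts, plus the
# standard error of estimate for a simple (one-predictor) regression.

r, SSy, n = 0.80, 40, 5

SS_regression = r**2 * SSy          # predicted variability: 0.64(40) = 25.6
SS_residual = (1 - r**2) * SSy      # unpredicted variability: 0.36(40) = 14.4

df_residual = n - 2                 # df for the residual in simple regression
standard_error = sqrt(SS_residual / df_residual)   # sqrt(14.4/3) ≈ 2.19

print(SS_regression, SS_residual, round(standard_error, 2))
```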
Testing the Significance of the Regression Equation: Analysis of Regression
Analysis of regression is similar to analysis of variance (ANOVA); both use an
F-ratio
Also, in a simple linear regression, when a single X variable is being used to
predict a single Y variable, hypothesis testing is much like that of a Pearson
correlation.
Question: Is the amount of variance accounted for by the regression equation
significantly greater than would be expected by chance alone?
The F-ratio is a ratio of two variances, or mean square (MS) values, and each
variance is obtained by dividing an SS value by its corresponding df.
Significance Testing
Remember that SS for Y scores can be separated into the predicted portion and
the unpredicted, or residual, portion.
Predicted variability = SSregression = r²SSy
Unpredicted variability = SSresidual = (1 - r²)SSy
The numerator of the F-ratio is MSregression, which measures the variance that
is predicted by the regression equation.
The denominator is MSresidual, which measures unpredicted variance.
Numerator: MSregression = SSregression/dfregression with df = 1
Denominator: MSresidual = SSresidual/dfresidual with df = n – 2
F = MSregression/MSresidual with df = 1, n - 2
Example
In the previous example, SSy = 40 with a correlation of r = .80, producing r² = 0.64
Predicted variability = SSregression = 0.64(40) = 25.6
Unpredicted variability = SSresidual = (1 – 0.64)(40) = 14.40
Using these SS values and the corresponding df values, we calculate a variance,
or MS, for each component:
MSregression = SSregression/dfregression with df = 1
MSregression = 25.60 / 1 = 25.60
MSresidual = SSresidual/dfresidual with df = n – 2
MSresidual = 14.40 / 3 = 4.8
Finally, the F-ratio for evaluating the significance of the regression equation
is:
F = MSregression/MSresidual with df = 1, n - 2
F = 25.6 / 4.80 = 5.33
With df = 1, 3 and α = .05, the critical value is 10.13 (see Table B.4 on p. 705).
Because the obtained F = 5.33 does not exceed the critical value, we fail to
reject the null hypothesis
Conclusion: the regression equation does not account for a significant portion
of the variance for the Y scores.
(Note: this is consistent with the corresponding significance test for the
Pearson correlation, r = 0.80 with n = 5; see Table B.6 on p. 709.)
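For reference, a short Python sketch of this analysis of regression (SciPy is assumed to be available for the critical value; otherwise Table B.4 gives 10.13):

```python
from scipy import stats

# Analysis of regression for the example above (r = .80, SSy = 40, n = 5).

r, SSy, n = 0.80, 40, 5

MS_regression = (r**2 * SSy) / 1            # SSregression/df, with df = 1
MS_residual = ((1 - r**2) * SSy) / (n - 2)  # SSresidual/df, with df = n - 2 = 3

F = MS_regression / MS_residual             # 25.6 / 4.8 = 5.33

F_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 2)   # critical value for alpha = .05
print(round(F, 2), round(F_crit, 2))           # 5.33 10.13 -> fail to reject H0
```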
Multiple Regression
Multiple regression, even when limited to two predictors, can be relatively
complex.
The simple concept is that each new variable provides more information and allows
for more accurate predictions.
However, additional predictors often overlap with the existing ones, so the new
information gained tends to be minimal after the first two predictors.
Having two predictors in the equation will produce more accurate predictions
(less error and smaller residuals) than can be obtained using either predictor
by itself.
The equation becomes:
Y = b1X1 + b2X2 + a
Figure 17.2 (p. 566)
Y = b1X1 + b2X2 + a
b1 = [(SPx1y)(SSx2) – (SPx1x2)(SPx2y)] / [(SSx1)(SSx2) – (SPx1x2)²]
   = [(52)(64) – (35)(47)] / [(52)(64) – (35)²] = 0.800
b2 = [(SPx2y)(SSx1) – (SPx1x2)(SPx1y)] / [(SSx1)(SSx2) – (SPx1x2)²]
   = [(47)(52) – (35)(52)] / [(52)(64) – (35)²] = 0.297
a = My – b1Mx1 – b2Mx2
= 7 – 0.800(4) – 0.297(6)
= 7 – 3.2 – 1.782 = 2.018
Thus, the regression equation is:
Ŷ = 0.8000X1 + 0.297X2 + 2.018
Percentage of variance accounted for by the multiple regression equation:
R² = SSregression/SSy, or equivalently SSregression = R²(SSy)
For a regression equation with 2 predictor variables:
R² = (b1SPx1y + b2SPx2y)/SSy
R² = [0.8000(52) + 0.297(47)]/90 = 55.559/90 = 0.617 (61.7%)
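A minimal Python sketch of these two-predictor computations, using only the SP, SS, and mean values given above:

```python
# Two-predictor regression coefficients and R² from the SP/SS summary values.

SSx1, SSx2, SSy = 52, 64, 90
SPx1x2, SPx1y, SPx2y = 35, 52, 47
Mx1, Mx2, My = 4, 6, 7

denom = SSx1 * SSx2 - SPx1x2**2                  # (52)(64) - (35)² = 2103

b1 = (SPx1y * SSx2 - SPx1x2 * SPx2y) / denom     # 1683/2103 ≈ 0.800
b2 = (SPx2y * SSx1 - SPx1x2 * SPx1y) / denom     # 624/2103 ≈ 0.297
a = My - b1 * Mx1 - b2 * Mx2                     # ≈ 2.02 (the notes get 2.018
                                                 #  using the rounded b values)
R2 = (b1 * SPx1y + b2 * SPx2y) / SSy             # ≈ 0.617 (61.7%)

print(round(b1, 3), round(b2, 3), round(a, 3), round(R2, 3))
```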
Figure 17.3 (p. 567)
The value of R² can also be obtained indirectly, by computing the residual, or
difference between the predicted Y and the actual Y for each individual, then
computing the sum of the squared residuals.
Unpredicted variability = SSresidual = (1 - R²)SSy
The process of finding and squaring each residual is shown in the table in
Figure 17.3.
The sum of the squared residuals, the unpredicted portion of SSy, is 34.44, which
corresponds to 38.3% of the variability in the Y scores:
SSresidual/SSy = 34.44/90 = 0.383 or 38.3%
Standard Error of Estimate
Standard Error of Estimate = the standard distance between the predicted Y
values (from the regression equation) and the actual Y values (from the data).
For linear regression, SSresidual = (1 - r²)SSy and has df = n – 2
For multiple regression with 2 predictors, SSresidual = (1 - R²)SSy and has df =
n – 3
In each case we use SS and df to compute a variance or MSresidual
MSresidual = SSresidual/dfresidual
The standard error of estimate = square root of MSresidual
Significance Testing
Just as with the linear regression, the F-ratio can be calculated to test the
significance of a multiple regression equation. With 2 predictor variables,
SSregression has df = 2, and SSresidual has df = n – 3
Thus, the two MS values are:
MSregression = SSregression / 2
MSresidual = SSresidual / (n – 3)
In the prior example, the unpredicted portion of Y was 38.3%. The sample had n =
10 people and produced R² = .617 (or 61.7%) and SSy = 90.
SSregression = R²SSy = .617(90) = 55.53
SSresidual = (1 - R²)SSy = .383(90) = 34.47
MSregression = 55.53/2 = 27.77
MSresidual = 34.47/7 = 4.92
F = MSregression/MSresidual = 27.77/4.92 = 5.64
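A short sketch of the same computations, including the standard error of estimate from the previous slide (SciPy is assumed to be available for the critical F value):

```python
from math import sqrt
from scipy import stats

# Significance test and standard error of estimate for the two-predictor
# example (n = 10, R² = .617, SSy = 90).

n, R2, SSy = 10, 0.617, 90

SS_regression = R2 * SSy              # 55.53
SS_residual = (1 - R2) * SSy          # 34.47

MS_regression = SS_regression / 2     # df = 2 (two predictors)
MS_residual = SS_residual / (n - 3)   # df = n - 3 = 7

F = MS_regression / MS_residual       # ≈ 5.64
SE_estimate = sqrt(MS_residual)       # ≈ 2.22

F_crit = stats.f.ppf(0.95, dfn=2, dfd=n - 3)   # ≈ 4.74 for alpha = .05
print(round(F, 2), round(SE_estimate, 2), round(F_crit, 2))
# F exceeds the critical value, so the two-predictor equation accounts for a
# significant portion of the variance in Y.
```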
Analysis of Regression Table
Source          SS      df     MS
Regression     55.53     2    27.77    (F = 5.64)
Residual       34.47     7     4.92
Total          90.00     9
Standardized Equations
If all 3 variables, X1, X2, and Y, have been standardized by transformations
into z-scores, then the standardized form of the multiple regression equation
predicts the z-score for each Y value. The standardized form is
zy = (beta1)zx1 + (beta2)zx2
Researchers rarely transform raw X and Y scores into z-scores before finding a
regression equation; however, the beta values are meaningful and are reported by
SPSS.
Relative Contribution
Is one of the predictors responsible for more of the prediction than the other?
In the raw-score form of the multiple regression equation, we cannot answer this
question; if b1 is larger than b2, it does not necessarily mean that X1 is a
better predictor than X2.
In the standardized form of the equation, the relative size of the beta values
is an indication of the relative contribution of the two variables
zy = (beta1)zx1 + (beta2)zx2
= .608zx1 + .250zx2
In this case, the larger beta value for X1 indicates that X1 predicts more of
the variance than does X2.
Significance of Relative Contributions
Null hypothesis: the multiple regression equation (using both X1 and X2) is not
significantly better at predicting Y than a simple regression using X1 alone.
First, determine the variability predicted by X1 alone: calculate the Pearson
correlation between X1 and Y, square it (the coefficient of determination, or
effect size), and multiply by SSy.
Second, subtract that portion from the total predicted variability for the
multiple regression equation (SSregression); the difference is the additional SS
contributed by X2.
Third, compute MSadditional by dividing the resulting SS value by its df, which
is 1.
Fourth, divide MS by the MSresidual for the multiple regression equation to find
an F-ratio for the additional contribution.
Finally, evaluate significance using the table of critical values for the F
distribution (Table B.4) with df = 1, n – 3.
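A hedged Python sketch of these steps; the Pearson correlation between X1 and Y is not given in these notes, so the value used below is purely hypothetical and serves only to illustrate the procedure.

```python
from scipy import stats

# Testing the additional contribution of X2 beyond X1 alone.
# SSregression and MSresidual come from the two-predictor example above;
# r_x1y is a HYPOTHETICAL value, not taken from the notes.

n, SSy = 10, 90
SS_regression_full = 55.53        # predicted SS for the two-predictor equation
MS_residual_full = 4.92           # MSresidual for the two-predictor equation

r_x1y = 0.70                      # hypothetical Pearson r between X1 and Y

SS_x1_alone = r_x1y**2 * SSy                        # variability predicted by X1 only
SS_additional = SS_regression_full - SS_x1_alone    # extra variability added by X2
MS_additional = SS_additional / 1                   # df = 1 for the added predictor

F = MS_additional / MS_residual_full                # F for the additional contribution
F_crit = stats.f.ppf(0.95, dfn=1, dfd=n - 3)        # critical value with df = 1, n - 3
print(round(F, 2), round(F_crit, 2))
```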