RESEARCH METHODS & ANALYSIS
INTRODUCTION TO REGRESSION

WEEK 11 - CHAPTER 17


Purposes of Regression Line
Note that Figure 17.1 shows a good, but not perfect, positive relationship. Its purposes are:
    Makes relationship between SAT and GPA easier to see
    Identifies the central tendency of the relationship, just as the mean does for a set of scores
    Can be used for prediction: establishes precise relationship between each X value and a corresponding Y value

Definition

The statistical technique for finding the best-fitting straight line for a set of data is called regression, and the resulting straight line is called the regression line.

Introduction to Regression
Developing the regression line involves the use of a linear equation.

All linear equations have the form Y = bX + a, where a and b are constants.

The value of b is the slope of the line and measures how much Y changes when X is increased by one point.

The value of a is the Y-intercept and identifies the point where the line crosses the Y axis.

Example
A local tennis club charges a fee of $5 per hour plus an annual membership fee of $25. With this information, the total cost of playing tennis can be computed using a linear equation that describes the relationship between the total cost (Y) and the number of hours (X).

Y = 5X + 25
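
As a quick illustration, this cost equation can be written as a small Python function (a minimal sketch; the function and parameter names are our own):

    def total_cost(hours, rate=5, membership=25):
        # Y = bX + a, with slope b = hourly rate and Y-intercept a = membership fee
        return rate * hours + membership

    print(total_cost(10))  # 5(10) + 25 = 75 dollars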

Developing the Equation
The regression equation is obtained by first finding the error, or distance, between each actual data point and the corresponding predicted value on the line.

Each error is then squared to make the values consistently positive.

The goal of regression is to find the equation that produces the smallest total amount of squared error (called the least-squared error solution).

Thus, the regression equation produces the “best fitting” line for the data points.

Regression Equation
The regression equation is defined by the slope constant, b = SP/SSx, and the Y-intercept, a = My – bMx, producing a linear equation of the form Y = bX + a.

The equation can be used to compute a predicted Y value for each of the X values in the data.

Caution:
    The predicted value is not perfect (unless r = +1 or –1)
    The equation should not be used for predictions of X values that fall outside the range covered by the original data

Based on these values…
SP = ∑(X – Mx)(Y – My) = 16

SSx = ∑(X – Mx)² = 10

Since the equation for the slope constant is b = SP/SSx and for the Y-intercept is a = My – bMx, the solutions for b and a are:
    b = 16/10 = 1.6
    a = 6 – 1.6(5) = 6 – 8 = –2

The resulting regression equation is:
Ŷ = 1.6X - 2
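
A minimal Python sketch of this computation, using a hypothetical five-score dataset chosen to reproduce the summary values above (Mx = 5, My = 6, SP = 16, SSx = 10):

    X = [7, 4, 6, 3, 5]   # hypothetical X scores with mean Mx = 5
    Y = [11, 3, 5, 4, 7]  # hypothetical Y scores with mean My = 6

    Mx = sum(X) / len(X)
    My = sum(Y) / len(Y)

    SP = sum((x - Mx) * (y - My) for x, y in zip(X, Y))  # sum of products = 16
    SSx = sum((x - Mx) ** 2 for x in X)                  # sum of squares for X = 10

    b = SP / SSx     # slope = 1.6
    a = My - b * Mx  # Y-intercept = -2.0
    print(b, a)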

Using the Regression Equation
For any given value of X, we can use the equation to compute a predicted value of Y.

Using the previous example, an individual with an X score of X = 5 would be predicted to have a Y score of:

Ŷ = 1.6X – 2
Ŷ = 1.6(5) – 2
= 8 – 2
= 6

Cautions Reviewed
The predicted value is not perfect (unless r = +1.00 or –1.00). Although the amount of error varies from point to point, the average error is directly related to the magnitude of the correlation: with a correlation near +1.00 or –1.00 the errors are small, but as the correlation approaches zero, the errors grow larger.

The regression equation should not be used to make predictions for X values that fall outside the range of values covered by the original data. In the previous example, X values range from 3 to 7; because no information exists for X values outside this range, the equation should not be used to predict Y for any X lower than 3 or greater than 7.
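
Both cautions can be folded into a small prediction helper (a sketch; the range check simply refuses to extrapolate beyond the original data):

    def predict_y(x, b=1.6, a=-2, x_min=3, x_max=7):
        # Refuse predictions outside the range covered by the original data
        if not (x_min <= x <= x_max):
            raise ValueError("X falls outside the range covered by the original data")
        return b * x + a  # predicted (not perfect) Y value

    print(predict_y(5))  # 1.6(5) - 2 = 6.0
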
Standardized Form
Occasionally, researchers transform X and Y values into z-scores; the resulting equation is simpler because z-scores have standardized characteristics. The mean is always 0, and the standard deviation is always 1.

Zy = (beta)Zx

Notice:
    Z-score for each X is used to predict z-score for Y
    Slope is now called beta
    Because both sets of z-scores have a mean of zero, the constant a disappears from the equation
    When 1 variable, X, is being used to predict a second variable, Y, the value of beta is equal to the Pearson correlation for X and Y.

Thus, the standardized form of the regression equation can also be written as:
    Zy = rZx
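
A quick numerical check of this identity, reusing the hypothetical dataset from the earlier sketch:

    import statistics

    X = [7, 4, 6, 3, 5]
    Y = [11, 3, 5, 4, 7]

    # Transform both variables into z-scores
    zx = [(x - statistics.mean(X)) / statistics.stdev(X) for x in X]
    zy = [(y - statistics.mean(Y)) / statistics.stdev(Y) for y in Y]

    # Slope of the z-score regression: SP/SSx computed on the z-scores
    SP = sum(u * v for u, v in zip(zx, zy))
    SSx = sum(u ** 2 for u in zx)
    print(SP / SSx)  # 0.8, equal to the Pearson correlation r for X and Y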

Standard Error of Estimate
A regression equation, by itself, allows you to make predictions, but it does not provide any information about the accuracy of the predictions. To measure the precision of the regression, it is customary to compute a standard error of estimate.

The standard error of estimate gives a measure of the standard distance between a regression line and the actual data points.

Calculating Standard Error of Estimate
Like a standard deviation, the standard error of estimate measures a standard distance. To calculate it:

Find the sum of squared deviations (SS) of the residuals, which measure the distance between each actual Y value (raw score) and the predicted Y value (on the regression line).

Divide the obtained SS by its df (n – 2) to obtain a measure of variance.

Take the square root of the variance.

Final equation: standard error of estimate = square root of (SSresidual/df)
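
In Python, the three steps look like this (a sketch, again using the hypothetical dataset and the regression equation Ŷ = 1.6X – 2):

    import math

    X = [7, 4, 6, 3, 5]
    Y = [11, 3, 5, 4, 7]

    predicted = [1.6 * x - 2 for x in X]

    # Step 1: sum of squared residuals (actual Y minus predicted Y)
    SS_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(Y, predicted))  # 14.4

    # Step 2: divide by df = n - 2 to get a variance
    variance = SS_residual / (len(X) - 2)  # 4.8

    # Step 3: take the square root
    print(math.sqrt(variance))  # about 2.19, the standard error of estimate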

Standard Error and Correlation
Because r² measures the portion of the variability in the Y scores that is predicted by the regression equation, we can use (1 - r²) to measure the unpredicted portion.

Predicted variability = SSregression = r²SSy

Unpredicted variability = SSresidual = (1 - r²)SSy
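
These two formulas partition SSy exactly, which a one-line check confirms (values taken from the worked example later in this section, r = .8 and SSy = 40):

    r, SSy = 0.8, 40
    SS_regression = r ** 2 * SSy        # predicted: 25.6
    SS_residual = (1 - r ** 2) * SSy    # unpredicted: 14.4
    print(SS_regression + SS_residual)  # 40.0, the full SSy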

Testing the Significance of the Regression Equation: Analysis of Regression
Analysis of regression is similar to analysis of variance (ANOVA); both use an F-ratio

Also, in a simple linear regression, when a single X variable is being used to predict a single Y variable, hypothesis testing is much like that of a Pearson correlation.

Question: Is the amount of variance accounted for by the regression equation significantly greater than would be expected by chance alone?

The F-ratio is a ratio of two variances, or mean square (MS) values, and each variance is obtained by dividing an SS value by its corresponding df.

Significance Testing
Remember that SS for Y scores can be separated into the predicted portion and the unpredicted, or residual, portion.

Predicted variability = SSregression = r²SSy

Unpredicted variability = SSresidual = (1 - r²)SSy


The numerator of the F-ratio is MSregression, which measures the variance that is predicted by the regression equation.

The denominator is MSresidual, which measures unpredicted variance.

Numerator: MSregression = SSregression/dfregression with df = 1

Denominator: MSresidual = SSresidual/dfresidual with df = n – 2

F = MSregression/MSresidual with df = 1, n - 2

Example
In the previous example, SSy = 40 with a correlation of r = .8, producing r² = 0.64

Predicted variability = SSregression = 0.64(40) = 25.6

Unpredicted variability = SSresidual = (1 – 0.64)(40) = 14.40

Using these SS values and the corresponding df values, we calculate a variance, or MS, for each component:

MSregression = SSregression/dfregression with df = 1
MSregression = 25.60 / 1 = 25.60

MSresidual = SSresidual/dfresidual with df = n – 2
MSresidual = 14.40 / 3 = 4.8

Finally, the F-ratio for evaluating the significance of the regression equation is:
F = MSregression/MSresidual with df = 1, n - 2
F = 25.6 / 4.80 = 5.33

With df = 1, 3 and α = .05, the critical value is 10.13 (see Table B.4 on p. 705).
Thus, we fail to reject the null hypothesis.
Conclusion: the regression equation does not account for a significant portion of the variance for the Y scores.

(Note: this result is consistent with the corresponding significance test for the Pearson correlation, r = 0.80 with n = 5; see Table B.6 on p. 709.)
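
The entire test can be scripted as follows (a sketch reproducing the numbers above):

    n, r, SSy = 5, 0.8, 40

    SS_regression = r ** 2 * SSy        # 25.6
    SS_residual = (1 - r ** 2) * SSy    # 14.4

    MS_regression = SS_regression / 1      # df = 1
    MS_residual = SS_residual / (n - 2)    # df = n - 2 = 3

    F = MS_regression / MS_residual
    print(F)  # about 5.33, below the critical value of 10.13 at alpha = .05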

Multiple Regression

The simple concept is that each new variable provides more information and allows for more accurate predictions.

Having two predictors in the equation will produce more accurate predictions (less error and smaller residuals) than can be obtained using either predictor by itself.

However, multiple regression, even limited to two predictors, can be relatively complex.

In addition, extra predictors often overlap with the existing ones, so the new information gained after the first two predictors is often minimal.



The equation becomes:

Y = b1X1 + b2X2 + a

(See Figure 17.2, p. 566.)

b1 = [(SPx1y)(SSx2) – (SPx1x2)(SPx2y)] / [(SSx1)(SSx2) – (SPx1x2)²]
   = [(52)(64) – (35)(47)] / [(52)(64) – (35)²] = 0.800

b2 = [(SPx2y)(SSx1) – (SPx1x2)(SPx1y)] / [(SSx1)(SSx2) – (SPx1x2)²]
   = [(47)(52) – (35)(52)] / [(52)(64) – (35)²] = 0.297

a = My – b1Mx1 – b2Mx2
  = 7 – 0.800(4) – 0.297(6)
  = 7 – 3.2 – 1.782 = 2.018

Thus, the regression equation is:

Ŷ = 0.800X1 + 0.297X2 + 2.018
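
The same computation in Python, from the summary values used above (a sketch; the variable names are our own):

    SP_x1y, SP_x2y, SP_x1x2 = 52, 47, 35
    SS_x1, SS_x2 = 52, 64
    M_x1, M_x2, M_y = 4, 6, 7

    denom = SS_x1 * SS_x2 - SP_x1x2 ** 2                # 2103
    b1 = (SP_x1y * SS_x2 - SP_x1x2 * SP_x2y) / denom    # 0.800
    b2 = (SP_x2y * SS_x1 - SP_x1x2 * SP_x1y) / denom    # 0.297
    a = M_y - b1 * M_x1 - b2 * M_x2                     # 2.018
    print(b1, b2, a)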

Percentage of variance accounted for by the multiple regression equation:

R² = SSregression/SSy, or equivalently SSregression = R²SSy

For a regression equation with 2 predictor variables:

R² = (b1SPx1y + b2SPx2y)/SSy

R² = [0.800(52) + 0.297(47)]/90 = 55.559/90 = 0.617 (61.7%)
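
The same arithmetic in Python (a sketch using the slopes from the worked example):

    b1, b2 = 0.800, 0.297
    SP_x1y, SP_x2y, SSy = 52, 47, 90

    R2 = (b1 * SP_x1y + b2 * SP_x2y) / SSy
    print(R2)  # about 0.617, i.e., 61.7% of the variability in Y is predicted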
(See Figure 17.3, p. 567, which shows the table of residuals for these data.)

The value of R² can also be obtained indirectly, by computing the residual, or difference between the predicted Y and the actual Y for each individual, then computing the sum of the squared residuals.

Unpredicted variability = SSresidual = (1 - R²)SSy

The process of finding and squaring each residual is shown in the table in Figure 17.3.

The sum of squared residuals, the unpredicted portion of SSy, is 34.44, which corresponds with 38.3% of the variability for the Y scores:

SSresidual/SSy = 34.44/90 = 0.383 or 38.3%

Standard Error of Estimate
Standard Error of Estimate = the standard distance between the predicted Y values (from the regression equation) and the actual Y values (from the data).

For linear regression, SSresidual = (1 - r²)SSy and has df = n – 2

For multiple regression with 2 predictors, SSresidual = (1 - R²)SSy and has df = n – 3

In each case we use SS and df to compute a variance or MSresidual

MSresidual = SSresidual/dfresidual

The standard error of estimate = square root of MSresidual
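
A short sketch of this computation for the two-predictor example (n = 10, R² = .617, SSy = 90):

    import math

    n, R2, SSy = 10, 0.617, 90

    SS_residual = (1 - R2) * SSy     # 34.47
    df = n - 3                       # two predictors
    MS_residual = SS_residual / df   # about 4.92
    print(math.sqrt(MS_residual))    # about 2.22, the standard error of estimate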

Significance Testing
Just as with the linear regression, the F-ratio can be calculated to test the significance of a multiple regression equation. With 2 predictor variables, SSregression has df = 2, and SSresidual has df = n – 3

Thus, the two MS values are:
MSregression = SSregression / 2
MSresidual = SSresidual / (n – 3)

In the prior example, the unpredicted portion of Y was 38.3%. The sample had n = 10 people and produced R² = .617 (or 61.7%) and SSy = 90.

SSregression = R²SSy = .617(90) = 55.53
SSresidual = (1 - R²)SSy = .383(90) = 34.47

MSregression = 55.53/2 = 27.77
MSresidual = 34.47/7 = 4.92

F = MSregression/MSresidual = 27.77/4.92 = 5.64
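
The same test in Python (a sketch reproducing the numbers above):

    n, R2, SSy = 10, 0.617, 90

    MS_regression = R2 * SSy / 2             # df = 2 predictors; 27.77
    MS_residual = (1 - R2) * SSy / (n - 3)   # df = n - 3 = 7; about 4.92

    print(MS_regression / MS_residual)       # F = about 5.64, with df = 2, 7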

Analysis of Regression Table

Source       SS       df    MS       F
Regression   55.53     2    27.77    5.64
Residual     34.47     7     4.92
Total        90.00     9
Standardized Equations
If all 3 variables, X1, X2, and Y, have been standardized by transformations into z-scores, then the standardized form of the multiple regression equation predicts the z-score for each Y value. The standardized form is

zy = (beta1)zx1 + (beta2)zx2

Researchers rarely transform raw X and Y scores into z-scores before finding a regression equation; however, the beta values are meaningful and are reported by SPSS.

Relative Contribution
Is one of the predictors responsible for more of the prediction than the other?

In the raw multiple regression equation form, we cannot answer this question; if b1 is larger than b2, it does not necessarily mean that X1 is a better predictor than X2.

In the standardized form of the equation, the relative size of the beta values is an indication of the relative contribution of the two variables:
zy = (beta1)zx1 + (beta2)zx2
= .608zx1 + .250zx2

In this case, the larger beta value for X1 indicates that X1 predicts more of the variance than does X2.
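
The beta values in this example can be recovered from the raw slopes, since beta = b(sX/sY) for each predictor and the n – 1 terms in the standard deviations cancel, leaving a ratio of SS values (a sketch; this conversion is our own check, not part of the notes):

    import math

    b1, b2 = 0.800, 0.297
    SS_x1, SS_x2, SSy = 52, 64, 90

    beta1 = b1 * math.sqrt(SS_x1 / SSy)  # about 0.608
    beta2 = b2 * math.sqrt(SS_x2 / SSy)  # about 0.250
    print(beta1, beta2)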

Significance of Relative Contributions
Null hypothesis: the multiple regression equation is not significantly better at predicting Y than a simple regression using one predictor alone.

First, determine the variance predicted by X1 alone (calculate the Pearson correlation between X1 and Y and square it; this squared correlation is the coefficient of determination, or effect size).

Second, subtract that portion from the total predicted variance of the multiple regression equation (SSregression); the difference is the SS for the additional contribution of X2.

Third, compute MSadditional by dividing the resulting SS value by its df, which is 1.

Fourth, divide MSadditional by the MSresidual for the multiple regression equation to find an F-ratio for the additional contribution.

Finally, evaluate significance using the table of critical values for the F distribution (Table B.4) with df = 1, n – 3, as sketched below.
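
The five steps can be collected into one function (a sketch; r1² is the squared Pearson correlation between X1 and Y, recoverable here as SPx1y²/(SSx1 × SSy) = 52/90):

    def additional_contribution_F(r1_squared, R2, SSy, n):
        # Steps 1-2: variance predicted by X1 alone, subtracted from the total
        SS_additional = (R2 - r1_squared) * SSy
        # Step 3: MS_additional with df = 1
        MS_additional = SS_additional / 1
        # Step 4: residual MS from the full multiple regression equation
        MS_residual = (1 - R2) * SSy / (n - 3)
        # Step 5: compare this F to the critical value with df = 1, n - 3
        return MS_additional / MS_residual

    print(additional_contribution_F(52 / 90, 0.617, 90, 10))  # about 0.72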