Correlation statistics are useful for describing the relative strength of a relationship between two variables; however, when a researcher wants to establish this relationship as a basis for prediction, a **regression** procedure is used (see Box 24.1). The ability to predict outcomes and characteristics is crucial to effective clinical decision making and goal setting. It also has important implications for efficiency and quality of patient care, especially in situations where resources are limited. Regression analysis provides a powerful statistical approach for explaining and predicting quantifiable clinical outcomes. For example, clinicians have looked at functional assessments in patients with extensive burns to determine which factors are predictive of quality of life outcomes.^{1} Early language and nonverbal skills have been shown to be important predictors of outcome in adaptive behavior in communication and socialization for children with autism.^{2} Researchers have studied patients with stroke to determine the relative contributions of specific impairments toward prediction of discharge function, rehabilitation length of stay, and discharge destination.^{3} Therapists have investigated factors predictive of timely and sustained recovery following multidisciplinary rehabilitation in workmen's compensation claimants with low back pain.^{4} Such analyses help us explain our empirical clinical observations and provide information that can be used to set realistic goals for our patients. The purpose of this chapter is to describe the process of regression and how it can be used to interpret clinical data.

In its simplest form, linear regression involves the examination of two variables, *X* and *Y*, that are linearly related or correlated. The variable designated *X* is the **independent** or **predictor variable**, and the variable designated *Y* is the **dependent** or **criterion variable.** For example, we could look at systolic blood pressure (*Y*) and age (*X*) in a sample of 10 women. Using regression analysis we can use these data as a basis for predicting a woman's blood pressure by knowing her age. If we plot hypothetical data for this example on a scatter plot, as shown in Figure 24.1, we can see that the data tend to fall in a linear pattern, with larger values of *X* associated with larger values of *Y*. The correlation coefficient for these data, *r* = .87, describes a fairly strong association.

If the data were perfectly correlated, all data points would fall along a straight line. This line could then be used to predict values of *Y* by locating the intersection of points on the line for any given value of *X.* With correlations less than 1.00, however, as in this example, a prediction line can only be *estimated*. If we look at the scatter diagram in Figure 24.1 we might try to plot a line that goes through the middle of the data points—but how do we objectively find the middle? We might try drawing a line through a point that represents the mean of *X* and *Y*, but how do we determine its slope? Clearly, we cannot make this determination without statistical help. The process of regression allows us to find the one line that best describes the orientation of all data points in the scatter plot. This line is called the **linear regression line.**

*Sometimes we get so caught up in the application of statistics that we don't stop to think about where these measures came from. Someone had to think them up! Here's some interesting background on the origin of the concept of regression*.

Sir Francis Galton was born near Birmingham, England in 1822. He was a tropical explorer and geographer, meteorologist, psychologist, inventor of fingerprint identification—and pioneer of statistical correlation and regression. He is best known for his study of human intelligence and his belief in eugenics. A cousin of Charles Darwin, Galton became interested in the concept of heredity, and was convinced that "genius" was almost entirely due to hereditary factors, in sharp contrast to the thinking of the day, which basically held that everyone was born with equal abilities.

In 1875 Galton began experimenting with sweet pea seeds, as this was a self-fertilizing plant, and he could look at simple hereditary characteristics. He found that the offspring peas of large seeds were usually smaller than the parent, and the offspring from small seeds were usually larger than the parent—but just a little.

Galton later collected extensive data on the heights of parents and children. Because it was known that taller parents had taller children and shorter parents had shorter children, he noted that it would seem logical that the variance in height should increase over time; that is, we should see people getting taller and shorter based on their parents' heights. However, his data supported the same relationship he had found with the sweet peas. He coined the term **regression** in his report of this phenomenon, *"Regression towards mediocrity in hereditary stature."* As shown in the plate from that work, Galton reasoned that the height of the children depends on the average height of both the father and the mother, and that variance in the height of the population is reduced by "regression" towards the mean by just enough to keep it almost constant over time.

Although we tend to think of regression as an outgrowth of correlation, interestingly, Galton's work on regression was the foundation for Karl Pearson's development of correlation statistics.

See Chapter 5 for additional discussion of regression toward the mean as an issue in reliability.

The process of linear regression involves first determining an equation for the regression line and then using that equation to predict values of *Y*. The algebraic representation of the regression line is given by

*Ŷ* = *a* + *bX*

The quantity *Ŷ* (said "Y-hat") is the *predicted* value of *Y*. The term *a* is the *Y*-intercept, representing the value of *Y* when *X* = 0. Graphically, it is the point at which the line intersects the *Y*-axis (see Figure 24.2). This can be a positive or negative value, depending on whether the line crosses the *Y*-axis above or below the *X*-axis. In regression analysis, *a* is called a **regression constant.** The term *b* is the *slope* of the line, which is the rate of change in *Y* for each one-unit change in *X.* In regression analysis, this term is the **regression coefficient.** When *b* is positive, *Y* increases as *X* increases. When *b* is negative, *Y* decreases as *X* increases. If *b* = 0, the slope of the line is horizontal, indicating no relationship between *X* and *Y* (*Y* is constant for all values of *X*). The positive or negative direction of the slope will correspond to a positive or negative correlation between *X* and *Y*.

We can illustrate these concepts by describing the linear equation *Y* = 1 + 2X. This equation represents a straight line that intersects the *Y*-axis at *Y* = 1. With a slope of 2, *Y* increases two units for every one-unit change in *X.* A line can be drawn from this equation by plotting any two points along the line and connecting them. We can arbitrarily choose any two values along the *X*-axis and solve for the corresponding values of *Y*. Thus, we can plot one point at *X* = 1, *Y* = 1 + 2(1) = 3. The second point, say at *X* = 3, is determined by *Y* = 1 + 2(3) = 7. This process is illustrated in Figure 24.2.
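The plotting of these two points can be sketched in a few lines of Python (a minimal illustration of the worked example, nothing more):

```python
# The linear equation Y = 1 + 2X: intercept a = 1, slope b = 2.
def line(x):
    return 1 + 2 * x

# Any two X values suffice to draw the line; we use the text's choices.
p1 = (1, line(1))   # (1, 3)
p2 = (3, line(3))   # (3, 7)
print(p1, p2)
```

Connecting these two points reproduces the line shown in Figure 24.2.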

Figure 24.3 shows the regression of blood pressure (*Y*) on age (*X*), using the hypothetical data from Figure 24.1. The values that fall on the regression line are the predicted values, *Ŷ*, for any given value of *X*; however, with *r* < 1.00, we can see that this line is only partially useful for predicting *Y*. Some data points are above the line, some are below, and some fall close to the line. Therefore, if we substitute any *X* value in the regression equation and solve for *Ŷ*, we will obtain a predicted value that will probably be somewhat different from the actual value of *Y*. We can visualize this error component in Figure 24.4. The actual *Y* value for each data point is some positive or negative vertical distance from *Ŷ* on the regression line. These distances (*Y* – *Ŷ*) are called **residuals.** Residuals represent the degree of error in the regression line.^{∗}

The regression line, or the **line of best fit** for a set of data points, is the unique line that will minimize this error component and yield the smallest residuals. Conceptually, this involves finding the square of all the residuals (to eliminate minus signs) and summing these squares, *Σ*(*Y* – *Ŷ*)^{2}, for every possible line that could be drawn to these data. The one line that gives the smallest sum of squares is the line of best fit. Any other line, with any other values of *a* and *b*, would yield a larger sum of the squared residuals. This method of "fitting" the regression line is called the **method of least squares.** Of course, we do not actually go through the process of finding residuals for every possible line. Formulas have been developed that allow us to calculate the line of best fit based on the sample data.

We can illustrate the process of regression using the study for predicting systolic blood pressure (SBP) as a function of age. Table 24.1A shows hypothetical data on SBP measurements for a sample of 10 women between 34 and 73 years of age. We calculate the regression coefficient, *b*, and the regression constant, *a*, using the computational formulas shown in Table 24.1B.^{†} These values identify a line that intersects the *Y*-axis at 64.30 with a change of 1.39 units in *Y* for each unit change in *X.* Therefore, the line that best fits these data can be drawn from the regression equation *Ŷ* = 64.30 + 1.39*X* (also see Table 24.3C). This line is superimposed on the scatter plot for these data in Figure 24.3.
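The calculation can be sketched in plain Python using the standard least-squares formulas *b* = Σ(*X* − X̄)(*Y* − Ȳ)/Σ(*X* − X̄)² and *a* = Ȳ − *b*X̄ (a sketch of the standard formulas; the computational form in Table 24.1B is algebraically equivalent):

```python
# Least-squares estimates for the SBP-on-age data in Table 24.1.
ages = [34, 38, 42, 45, 48, 57, 57, 63, 66, 73]
sbp  = [110, 130, 105, 124, 136, 145, 157, 138, 158, 167]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(sbp) / n

# b = sum of cross-products of deviations / sum of squared X deviations
b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, sbp))
     / sum((x - mean_x) ** 2 for x in ages))
a = mean_y - b * mean_x

print(round(a, 2), round(b, 2))   # → 64.3 1.39, matching the text

# Prediction for a 38-year-old, as in the text's worked example:
y_hat_38 = a + b * 38
```

Running this reproduces the regression equation *Ŷ* = 64.30 + 1.39*X* reported in the text.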

We can now calculate the predicted score (*Ŷ*) for each subject using the regression equation, as shown in Table 24.2. For example, if we were presented with a woman who was 38 years old, we would predict that her systolic blood pressure would be *Ŷ* = 64.30 + 1.39(38) = 117.1. The actual blood pressure value for the 38-year-old subject, however, was 130. Therefore, the residual or error component of prediction is (*Y* – *Ŷ*) = 130 – 117.1 = 12.9. Note that the data point for this subject falls above the regression line; therefore, the regression equation underestimates SBP for this subject, and we have a positive residual.

| Subject | Age (X) | SBP (Y) | Ŷ | Residual (Y − Ŷ) | (Y − Ŷ)^{2} |
|---|---|---|---|---|---|
| 1 | 34 | 110 | 112.0 | −2.0 | 4.0 |
| 2 | 38 | 130 | 117.5 | 12.5 | 156.3 |
| 3 | 42 | 105 | 123.1 | −18.1 | 327.6 |
| 4 | 45 | 124 | 127.3 | −3.3 | 10.9 |
| 5 | 48 | 136 | 131.5 | 4.5 | 20.3 |
| 6 | 57 | 145 | 143.5 | 1.5 | 2.3 |
| 7 | 57 | 157 | 143.9 | 13.1 | 171.6 |
| 8 | 63 | 138 | 152.3 | −14.3 | 204.5 |
| 9 | 66 | 158 | 156.4 | 1.6 | 2.6 |
| 10 | 73 | 167 | 166.2 | 0.8 | 0.6 |
| | | | | | Σ(Y − Ŷ)^{2} = 900.7 |

Residuals are shown under the column labeled (*Y* – *Ŷ*) in Table 24.2. For a woman aged 63, we would predict a SBP of 152.3, where the actual score was 138. Therefore, the regression equation overestimates the SBP score for this subject, and we have a negative residual of −14.3. Most of the errors of prediction in this example are relatively small, because the correlation for these data is high (*r* = .87), and the points cluster close to the regression line. Note that the points for subjects aged 34, 66 and 73 have almost negligible residuals, as these points rest very close to the regression line (Figure 24.4).

The sum of the residuals will always be zero, as the regression line is an average for all data points. Therefore, we take the sum of the squares of these error components, (*Y* – *Ŷ*)^{2}, as an estimate of the usefulness of the regression line for prediction. The smaller the sum of squares, the closer the data points are to the regression line and the better the prediction accuracy.
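Both properties can be checked directly (a sketch; because it uses unrounded coefficients, the squared-residual sum comes out slightly below the 900.7 in Table 24.2, which is based on rounded *Ŷ* values):

```python
# Fit the SBP-on-age line, then verify: residuals sum to (numerically) zero,
# so the squared residuals are summed to measure prediction error.
ages = [34, 38, 42, 45, 48, 57, 57, 63, 66, 73]
sbp  = [110, 130, 105, 124, 136, 145, 157, 138, 158, 167]
n = len(ages)
mx, my = sum(ages) / n, sum(sbp) / n
b = (sum((x - mx) * (y - my) for x, y in zip(ages, sbp))
     / sum((x - mx) ** 2 for x in ages))
a = my - b * mx

residuals = [y - (a + b * x) for x, y in zip(ages, sbp)]
ss_res = sum(e ** 2 for e in residuals)

print(sum(residuals))   # effectively zero (floating-point noise only)
print(round(ss_res, 1)) # ≈ 895.2; Table 24.2 shows 900.7 using rounded Y-hats
```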

^{∗}We could more accurately represent the regression equation as *Y* = *a* + *bX* ± error.

^{†}An alternative formula for *b* can be used: *b* = *r*(*s _{y}*/*s _{x}*), where *s _{y}* and *s _{x}* are the standard deviations for the two variables.

In any regression procedure, we recognize that the straight line we fit to sample data is only an approximation of the true regression line that exists for the underlying population. To make inferences about population parameters from sample data, we must consider the statistical assumptions that affect the validity of the regression equation.

For any given value of *X,* we can assume that a random distribution of *Y* scores exists; that is, the observed value of *Y* in a sample for a given *X* is actually one random score from the larger distribution of possible *Y* scores for that *X.* In the example we have been using, the observed SBP for a given age is a random observation from the larger distribution of all possible blood pressure scores at that age. If we had studied several subjects at each age, we would see a range of blood pressure scores for the same value of *X.* Some of these *Y* values would be above the regression line, and some would be below it. For instance, subjects 6 and 7 were 57 years old in our sample, with different blood pressure scores. As shown in Table 24.2, subject 6 has a predicted score very close to the true score, and subject 7 has a larger residual. If we took many measurements for women at 57 years old, the mean of the distribution of *Y* scores would fall on the regression line.

Theoretically, we could obtain such a distribution for every value of *X,* as shown in Figure 24.5. Each of these distributions would have a different mean. If these means were connected, they would fall on a straight line that estimates the population regression line. We assume that each of these distributions is normal and that their standard deviations are equal.

These assumptions help us to understand the relevance of residual error variance to regression analysis. Conceptually, it makes sense that the regression line will contain some degree of error, as it is unlikely that any one score randomly chosen from a distribution will equal the mean. Therefore, we tend to see a scatter of points around the regression line. The least-squares line that is fitted to the sample data is an estimate of the population regression line, and *Ŷ* is an estimate of the population mean for *Y* at each value of *X*.

One way to determine if the assumptions for regression analysis have been met is to examine a plot of residuals, as shown in Figure 24.6. By plotting the residuals (on the *Y*-axis) against the predicted scores (on the *X*-axis), we can appreciate the magnitude and distribution of the residual scores. The central horizontal axis represents the mean of the residuals, or zero deviation from the regression line. When the linear regression model is a good fit, the residual scores will be randomly dispersed close to zero. The wider the distribution of residuals around the zero axis, the greater the error.

Several types of patterns can emerge in the residual plot. If the data meet all the basic assumptions, the pattern should resemble a horizontal band of points, as illustrated in Figure 24.6A. The horizontal orientation suggests that the residuals are evenly, but randomly, distributed around the regression line.

Figures 24.6B and C illustrate problematic residual distributions. The pattern in Figure 24.6B indicates that the variance of the residuals is not consistent, but dependent on the value of the predicted variable. Residual error increases as the predicted value gets larger; that is, the degree of accuracy in the regression model varies with the size of the predicted value. Therefore, the assumptions of normality and equality of variance are not met. The curvilinear pattern, shown in Figure 24.6C, reflects a nonlinear relationship, negating the validity of the linear model. Other deviant residual patterns may be observed, such as diagonal patterns or a run of positive or negative residuals, all indicating some problem in the interpretation of the regression model.

When data do not fall into the horizontal pattern, the researcher may choose to transform one or both sets of data to more closely satisfy the necessary assumptions. Such transformations may stabilize the variance in the data, normalize the distributions, or create a more linear relationship. Methods of data transformation are described in Appendix D. When curvilinear tendencies are observed, polynomial regression models may be used to better represent the data. This approach is discussed later in this chapter.
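A transformation of this kind can be illustrated with a small hypothetical data set (invented for illustration, not taken from the text): when *Y* grows exponentially with *X*, taking the logarithm of *Y* makes the relationship exactly linear, and the correlation improves accordingly.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviation scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

x = list(range(1, 11))
y = [2 * math.exp(0.3 * xi) for xi in x]   # curvilinear (exponential) relationship

r_raw = pearson_r(x, y)                         # good but imperfect linear fit
r_log = pearson_r(x, [math.log(yi) for yi in y])  # log Y is exactly linear in X
print(round(r_raw, 3), round(r_log, 3))         # r_log is 1.0
```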

Most computer programs for linear regression will provide options for calculating, printing and plotting residuals in a variety of formats. **Standardized residuals**, obtained by dividing each residual score by the standard deviation of the residual distribution, are often used instead of observed residuals to normalize the scale of measurement. Standardized residuals are analogous to *z*-scores, allowing the residuals to be expressed in standard deviation units. This approach is especially useful when different distributions are compared.
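Standardizing can be sketched with the residuals from Table 24.2 (note that these tabled values do not sum exactly to zero because of rounding; the standardized versions have mean 0 and standard deviation 1 by construction):

```python
import math

# Residuals from Table 24.2, in subject order.
residuals = [-2.0, 12.5, -18.1, -3.3, 4.5, 1.5, 13.1, -14.3, 1.6, 0.8]
n = len(residuals)
mean = sum(residuals) / n
sd = math.sqrt(sum((e - mean) ** 2 for e in residuals) / n)

# Standardized residuals: z-score-like values in standard deviation units.
standardized = [(e - mean) / sd for e in residuals]
```

For these data every standardized residual lies well within three standard deviations of the mean, so the three-SD rule discussed below would flag no outliers.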

If a set of data points represents a distribution of related scores, the points will tend to cluster around their regression line. Sometimes, one or two deviant scores are separated from the cluster, so that they distort the statistical association. For example, the data points in Figure 24.7A show some variability, but most of the points fall within a definite linear pattern (*r* = .70). In Figure 24.7B, this distribution has one additional point, at *X,* *Y* = 1, 20, that does not seem to fit with the rest of the scores. Such a point is called an **outlier**, because it lies outside the obvious cluster of scores. The correlation for these data with the outlier included is quite low, *r* = .06. One extreme value has significantly altered the statistical description of the data.

What accounts for the occurrence of outliers? Researchers must consider several possibilities. The score may, indeed, be a true score, but an extreme one, because the sample is too small to generate a full range of observations. If more subjects were tested, there might be less of a discrepancy between the outlier and the rest of the scores. There may also be circumstances peculiar to this data point that are responsible for the large deviation. For example, the score may be a function of error in measurement or recording, equipment malfunction, or some miscalculation. It may be possible to go back to the original data to find and correct this type of error. Other extraneous factors may also contribute to the aberrant score, some of which are correctable, others that are not. For instance, the data for the point may have been collected by a different tester who is not reliable. Or the researcher may find that the subject was inappropriately included in the sample; that is, the subject may have characteristics very different from the rest of the sample, accounting for the deviant response.

Outliers should be examined because they can have serious effects on the outcome of regression. Residual plots are often helpful for identifying outliers. Some researchers consider scores beyond three standard deviations from the mean to be outliers. The researcher must determine if the deviant score should be retained or discarded in the analysis. This decision should be made only after a thorough evaluation of the experimental conditions, the data collection procedures, and the data themselves. As a general rule, there is no statistical rationale for discarding an outlier; however, if a causal factor can be identified, the point should probably be omitted, provided that the causal factor is unique to the outlier.^{5} It may be helpful to perform the regression with and without the outlier, to demonstrate how inclusion of the outlier changes the conclusions drawn from the data.
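The distorting effect of a single outlier on *r* can be demonstrated with a small hypothetical data set (invented for illustration; the Figure 24.7 values are not listed in the text, though the pattern is analogous):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from deviation scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sum((x - mx) ** 2 for x in xs)
                           * sum((y - my) ** 2 for y in ys))

# Ten roughly linear points:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
r_clean = pearson_r(x, y)                 # strong association (about .94)

# Add one deviant point, analogous to the (1, 20) point in Figure 24.7B:
r_outlier = pearson_r(x + [1], y + [20])  # correlation collapses (about .11)
print(round(r_clean, 2), round(r_outlier, 2))
```

Running the regression with and without the extreme point, as the text suggests, makes the outlier's influence on the conclusions explicit.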

Once a regression line is derived, it can be used to predict *Y* scores based on values of *X.* It is important to remember that a regression line can be calculated for any set of data, even though it may not represent the data very well. The value of the correlation coefficient, *r*, is a rough indicator of the "goodness of fit" of the regression line. When *r* is close to ±1.00, the regression line provides a strong basis for prediction. As *r* gets smaller, the errors of prediction will increase; however, the value of *r* is limited in its interpretation because it represents only the strength of an association. It will not evaluate the accuracy of prediction from the regression line. Several statistical approaches can be used for this purpose.

Statisticians have shown that the square of the correlation coefficient, *r ^{2}*, represents the percentage of the total variance in the *Y* scores that can be explained by the *X* scores. Therefore, *r ^{2}* is a measure of proportion, indicating the accuracy of prediction based on *X.* This term is called the **coefficient of determination**.

For the regression of blood pressure on age, *r* = .87 and *r ^{2}* = .76 (see Table 24.3). Therefore, 76% of the variance in systolic blood pressure can be accounted for by knowing the variance in age. We have 76% of the information we would need to make an accurate prediction. Obviously, some other unknown or unidentified factors must account for the remaining variance. The complement of *r ^{2}*, or 1 − *r ^{2}*, reflects the proportion of variance that is not explained by the relationship between *X* and *Y*, in this case 24%. Using age as a predictor will result in a reasonable, but not thoroughly accurate, estimate of blood pressure.

Values of *r ^{2}* are more meaningful for conceptualizing the extent of an association between variables than values of *r* alone. For example, with a high correlation like *r* = .70, *r ^{2}* = .49. This means that less than 50% of the variance in *Y* is accounted for by knowing *X,* less than one might think with a correlation coefficient that seems fairly strong. When strength of association is of interest, *r* will be properly interpreted; however, when *Y* is predicted from *X,* *r ^{2}* provides a more meaningful description of the relationship. Values of *r ^{2}* will range between 0.00 and 1.00. No negative ratios are possible as it is a squared value.

Another way to establish the accuracy of prediction is to consider the variance of the errors on either side of the regression line, or the residuals. If the variance in the residuals is high, then the scores are widely dispersed around the regression line, indicating a large error component. The standard deviation of the distribution of errors is called the **standard error of the estimate (SEE).** For the blood pressure data, SEE = 10.61 (see Table 24.3➊).^{‡}

The better the fit of the regression line, the less variability there will be around it and the smaller the standard error of the estimate. The SEE can be thought of as an indicator of the average error of prediction for the regression equation. Therefore, the SEE is helpful for interpreting the usefulness of a regression equation where reliance on a correlation coefficient can be misleading.
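The SEE for the blood pressure example can be checked from the residual sum of squares in Table 24.2 (a sketch using the standard definition SEE = √[Σ(Y − Ŷ)²/(n − 2)]; the n − 2 divisor reflects the two parameters, a and b, estimated from the data):

```python
import math

ss_res = 900.7   # sum of squared residuals, from Table 24.2
n = 10           # number of pairs of scores

# Standard error of the estimate: the SD of the residual distribution.
see = math.sqrt(ss_res / (n - 2))
print(round(see, 2))   # 10.61, matching the text
```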

Researchers can reduce standard error, and thereby improve accuracy of prediction, by including more than one observation at each value of *X* within a single study. This improves the estimation of variability at each *X,* thereby making the regression line a better estimate of the population mean.

^{‡}The standard error of the estimate is defined by SEE = √[Σ(*Y* – *Ŷ*)^{2}/(*n* − 2)], where Σ(*Y* – *Ŷ*)^{2} is the sum of the squared residuals, and *n* represents the number of pairs of scores. For the data in Table 24.2, SEE = √(900.7/8) = 10.61.

Up to this point, we have used regression analysis primarily as a descriptive technique. We can also draw statistical inferences about the regression equation, to document that the observed relationship between *X* and *Y* did not occur by chance. We do this by an **analysis of variance of regression**. In essence, this analysis tests the null hypothesis *H*_{0}: *b* = 0 and is analogous to testing the significance of the correlation between *X* and *Y*. If *H*_{0} is true, the regression line is essentially horizontal, perhaps with some deviation as a result of sampling error. If *H*_{0} is false, *b* is significantly different from zero.^{§}

The variance components in a regression analysis are partitioned similarly to those in a regular analysis of variance. The total variance, represented by the total sum of squares (*SS _{t}*), reflects the variance explained by the regression of *Y* on *X* and the unexplained error variance. These variance components are illustrated in Figure 24.8. For a given *X* we can locate the observed value of *Y* and the predicted score *Ŷ*, which lies on the regression line. We can also establish the value for *Ȳ*, the mean of all *Y* scores. Without the regression line, the best we can do to predict *Y* is the mean of the distribution, *Ȳ*. For example, if we knew that the mean height for men was 5 ft 8 in., and we wanted to predict the height of any random man on the street, our best estimate would be 5 ft 8 in. But if a man's height is related to his parents' height, then we can improve this estimate if we also know the height of this man's mother and father. We know more about his height (*Y*) by knowing his parents' height (*X*). Therefore, by using the regression line we have improved our prediction by the amount *Ŷ* – *Ȳ*, which is the deviation of the predicted score from the mean. This distance tells us how much better we can predict *Y* by knowing *X*.

If we look at Σ(*Ŷ* – *Ȳ*) for all the data points in a distribution, we will be able to determine how much of the total variation in the sample is accounted for by knowing the regression of *Y* on *X*. The sum of the squares of these differences, Σ(*Ŷ* – *Ȳ*)^{2}, is called the **regression sum of squares** (*SS _{reg}*), or that part of the variance in *Y* that is explained by *X*.

The rest of the variance is attributed to the deviation of each observed score from the regression line, (*Y* – *Ŷ*), or the residual. It is that part of *Y* that is not explained by *X*. This value is an indication of how good or poor a fit the regression line is. When the fit is good, the observed scores will fall close to the line, and the residuals will be small. This means that *X* is a good predictor of *Y*. When the fit is poor, *X* and *Y* are not strongly related, and these deviations will be large. The term Σ(*Y* – *Ŷ*)^{2} is called the **residual sum of squares** (*SS _{res}*), or the unexplained variance attributable to the residuals. A linear regression analysis will generate an analysis of variance table that provides these values.

The ANOVA summary table, shown in Table 24.3, represents the regression of systolic blood pressure (*Y*) on age (*X*) for a sample of 10 women (from Figure 24.3). This output follows the format of a standard analysis of variance, with a total of *N* – 1 degrees of freedom. In the linear model, one degree of freedom is always associated with the regression; therefore, *N* – 2 degrees of freedom are attributed to the residuals (the error term). The value of *F* is equal to *MS _{reg}*/*MS _{res}*.

In this example, the observed *F*-ratio for the regression is 24.74 with 1 and 8 degrees of freedom. As shown in Table 24.3➋, this test is significant at .001. This tells us that the relationship between *X* and *Y* is not likely to be the result of chance. It does not indicate how strong this relationship is. When the analysis of variance of regression results in a nonsignificant *F*-test, the researcher concludes that the observed relationship could have occurred by chance; that is, the regression line does not provide a reasonable basis for predicting values of *Y*.
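The partitioning can be sketched directly from the Table 24.1 data (a sketch of the variance decomposition, not the text's computer output; the *F* it yields, about 25.8, differs somewhat from the tabled 24.74, presumably because the worked example carries rounded intermediate values):

```python
ages = [34, 38, 42, 45, 48, 57, 57, 63, 66, 73]
sbp  = [110, 130, 105, 124, 136, 145, 157, 138, 158, 167]
n = len(ages)
mx, my = sum(ages) / n, sum(sbp) / n
b = (sum((x - mx) * (y - my) for x, y in zip(ages, sbp))
     / sum((x - mx) ** 2 for x in ages))
a = my - b * mx

ss_t   = sum((y - my) ** 2 for y in sbp)                         # total SS
ss_reg = sum(((a + b * x) - my) ** 2 for x in ages)              # Σ(Y-hat − Y-bar)²
ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(ages, sbp))  # Σ(Y − Y-hat)²

# F = MS_reg / MS_res, with df = 1 and n - 2 = 8 in the linear model.
f_ratio = (ss_reg / 1) / (ss_res / (n - 2))
print(round(f_ratio, 1))
```

The decomposition *SS _{t}* = *SS _{reg}* + *SS _{res}* holds exactly, and the *F*-ratio is well beyond the critical value for 1 and 8 degrees of freedom, consistent with the significant result in Table 24.3.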

^{§}The slope of the regression line can also be tested using the *t*-test: *t* = *bs _{x}*√(*n* − 1)/SEE, where *s _{x}* is the standard deviation of the *X* scores. The statistical result of this test will be the same as for the analysis of variance of regression, based on the relationship *F* = *t*^{2}.

Regression equations are derived from a set of known scores, and the accuracy of the regression line for prediction for the individuals in the test sample is reflected in the size of the residuals. The ultimate purpose of regression analysis is not, however, to predict scores we already know. The intent is to predict scores for a new sample of observations from the findings on the known data. Therefore, it is important that the reference population for the analysis be clearly specified, because predictions will not be applicable to those who do not meet population criteria. Most importantly, predictions cannot be validly made for values of *X* that go beyond the range of scores that were used to generate the regression line. If we determine a regression line for predicting blood pressure from age based on a sample of women 34 to 73 years, we cannot apply the equation to males or to younger subjects. We cannot know if the shape of the distribution would be altered with the addition of scores at lower age ranges. Therefore, generalization of a regression procedure is inherently limited by the range of scores used to derive the equation.

A second consideration in the interpretation of regression data is the adequacy of a linear fit. Just as with correlation, linear regression procedures are useful only if the distribution of scores demonstrates a linear association between *X* and *Y*. The lack of a significant slope does not necessarily mean that *X* and *Y* are unrelated, but may indicate that the relationship does not follow a straight line. We discuss the application of regression to curvilinear relationships in the next section.

There are obvious limitations inherent in linear regression for describing curvilinear relationships. Because linear regression is the most commonly used regression model, researchers should be wary about interpreting outcomes that demonstrate no relationship between *X* and *Y*. For example, look at the data plotted in Figure 24.9, showing the relationship between psychomotor ability and age for a hypothetical sample of 30 subjects aged 10 to 50 years. Using linear techniques, the correlation coefficient is low (*r* = .32). Based on this information alone, one would assume that *X* and *Y* were not strongly related; however, examination of the scatter plot reveals that the data form a distinctly curved pattern. The measured skill improves until age 30, when a slow decline begins. Therefore, it makes more sense to draw a curve that more accurately reflects the relationship between *X* and *Y*, as shown in Figure 24.10. We can express this curve statistically in the form of a quadratic equation:

*Ŷ* = *a* + *b _{1}X* + *b _{2}X*^{2}

Equation 24.2 defines a parabolic curve, that is, a curve with one turn. This curve is also called a **quadratic curve.**^{∗∗} The process of deriving its equation is called **polynomial regression.** Clearly, this fitted curve is more representative of the data points than the linear regression line.

The method of calculating the regression coefficients for this equation goes beyond this text. It is advisable to use a computer to perform these more complex mathematical manipulations; however, the application of this model is similar to that of linear regression. Polynomial regression is also based on the concept of least squares, so that the vertical distance of each point from the curve is minimized. Therefore, the curve can be used for predicting *Y* scores in the same way as a linear regression line.
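The least-squares idea carries over directly: the quadratic coefficients are the solution of a small system of normal equations. The sketch below uses a hypothetical "rises then declines" data set generated from a known quadratic (invented for illustration, not the Figure 24.9 values), so the fit should recover the generating coefficients:

```python
def solve3(m, v):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            f = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= f * a[col][c]
    x = [0.0] * 3
    for r in (2, 1, 0):
        x[r] = (a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))) / a[r][r]
    return x

def quad_fit(xs, ys):
    """Least-squares fit of Y-hat = a + b1*X + b2*X^2 via the normal equations."""
    n = len(xs)
    s = lambda p: sum(x ** p for x in xs)
    sy = lambda p: sum((x ** p) * y for x, y in zip(xs, ys))
    m = [[n,    s(1), s(2)],
         [s(1), s(2), s(3)],
         [s(2), s(3), s(4)]]
    return solve3(m, [sy(0), sy(1), sy(2)])   # returns [a, b1, b2]

# Hypothetical data generated from 5 + 4x - 0.1x^2 (peaks near x = 20):
xs = [10, 15, 20, 25, 30, 35, 40, 45, 50]
ys = [5 + 4 * x - 0.1 * x ** 2 for x in xs]

a, b1, b2 = quad_fit(xs, ys)
```

In practice a statistical package (or a routine such as a polynomial least-squares fit in a numerical library) would do this; the point of the sketch is that the curve, like the line, minimizes the sum of squared vertical distances.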

Researchers often have to decide whether a linear or polynomial regression model best fits their data. This decision is greatly facilitated by examining a scatter plot of the data. The analysis of variance for regression can be applied to determine if the linear or polynomial regression model provides a better fit for a given set of data.^{††}

Table 24.4 shows the analysis of variance for both a linear (A) and a quadratic (B) regression of psychomotor ability on age. The *F*-ratio for the linear regression is not significant (*p* = .087), as we might expect from looking at the data in Figure 24.9. This tells us that the linear model is not adequate for describing this relationship. In the bottom panel of Table 24.4 we see that the quadratic regression is significant (*p* = .000), indicating that the quadratic curve is a good fit. The fitted curve takes the form of Equation 24.2, with coefficients estimated from the sample data.

A closer look at the analysis of variance helps us see how differently these two approaches explain the data. Note that the total sum of squares for both analyses is the same; that is, the total variability in the sample is the same, regardless of which type of regression is performed. What is different is the amount of that variance that is explained by each of the regression models. The sum of squares attributable to the regression in the linear analysis is 14.404, whereas for the quadratic regression it is 66.477. This demonstrates how a greater proportion of the total variability is explained by the curve.
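The partitioning described above can be made concrete with a short sketch. Using hypothetical rise-and-fall data (not the actual values from Table 24.4), the code below splits the total sum of squares for a straight-line fit into its regression and error components; a quadratic fit of the same data would leave the total unchanged while shifting variability from the error term to the regression term.

```python
# Sketch (hypothetical data): partition of the total sum of squares under a
# straight-line least-squares fit, using only the standard library.

def linear_fit(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

def sums_of_squares(ys, preds):
    """Return (SS_total, SS_regression, SS_error) for a fitted model."""
    my = sum(ys) / len(ys)
    ss_total = sum((y - my) ** 2 for y in ys)
    ss_error = sum((y - p) ** 2 for y, p in zip(ys, preds))
    return ss_total, ss_total - ss_error, ss_error

# Hypothetical rise-and-fall data: the straight line explains almost nothing.
ages = [10, 15, 20, 25, 30, 35, 40, 45, 50]
skill = [42, 55, 64, 70, 72, 70, 65, 57, 47]
a, b = linear_fit(ages, skill)
ss_tot, ss_reg_lin, ss_err_lin = sums_of_squares(skill, [a + b * x for x in ages])
# A quadratic fit of the same data leaves ss_tot unchanged but moves most of
# ss_err_lin into the regression (explained) component.
```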

^{∗∗}A quadratic curve, with one turn, is considered a *polynomial of the second order*. A linear "curve," or straight line, is a polynomial of the first order. See discussion of trends in Chapter 21.

^{††}It is also possible to transform nonlinear data to achieve a linear fit by transforming one or both variables, often using log values (see Appendix D).

The function of experimental design is to explain the effect of an independent variable on a dependent variable while controlling for the confounding effect of extraneous factors. When extraneous factors are not controlled, the results of measurement cannot be attributed solely to the experimental treatment. Statistically, we speak of controlling the *unexplained variance* in the data, that is, the variance in scores that cannot be explained by the independent variable. All experiments will have some unexplained variance, sometimes because of the varied individual characteristics of the subjects and sometimes because of unknown or random factors that affect responses. When we cannot control these factors by purposefully eliminating them or manipulating them, we use principles of experimental design to decrease the error variance they cause.

In Chapter 9 we described several design strategies that can reduce chance variability in data, such as using homogeneous groups or matching. There are times, however, when design strategies are not capable of sufficient control. Even when random assignment is used, there is no guarantee that potentially confounding characteristics will be equally distributed, especially when dealing with small samples. The issue of concern is the ability to equate groups at the outset, so that observed differences following treatment can be attributed to the treatment and not to other unexplained factors. When the research design cannot provide adequate control, statistical control can be achieved by measuring one or more confounding variables in addition to the dependent variable, and accounting for the variability in the confounding factors in the analysis. This is the conceptual basis for **analysis of covariance (ANCOVA)**.

The ANCOVA is actually a combination of analysis of variance and linear regression. It is used to compare groups on a dependent variable, where there is reason to suspect that groups differ on some relevant characteristic, called a **covariate**, before treatment. The variability that can be attributed to the covariate is partitioned out and effectively removed from the analysis of variance, allowing for a more valid explanation of the relationship between the independent and dependent variables.

We can clarify this process with a hypothetical example. Suppose we wanted to compare the effect of two teaching strategies on the clinical performance of students in their first year of clinical training. We hypothesize that training with videotaped cases (Strategy 1) will be more effective than discussion and reading groups (Strategy 2). We randomly assign 12 students to two groups (*n* = 6 per group). We are concerned, however, that the students' academic performance would be a potential confounding factor in making this comparison, based on the assumption that there is a correlation between academic and clinical performance. Therefore, we would want to know if the grade point average (GPA) in the two groups had been evenly distributed. If one group happened to have a higher GPA than the other, our results could be misleading. In this example, teaching strategy is the independent variable, clinical performance is the dependent variable, and GPA is the covariate. By knowing the values of the covariate, we can determine if the groups are different on GPA, and we can use this information to adjust our interpretation of the dependent variable if necessary.

To illustrate how the ANCOVA offers this control, let us first look at a hypothetical comparison between the two teaching groups, without considering GPA. Suppose we obtain the following means for clinical performance on a standardized test (scored 0–100):

Strategy 1: *X̄* = 48.5    Strategy 2: *X̄* = 43.8
The analysis of variance comparing these two groups is shown in Table 24.5A, demonstrating that these two means are not statistically different (*p* = .734).^{‡‡} Based on this result, is it reasonable to conclude that the teaching strategies are not different? Or might we suspect that GPA may be differentially distributed between the two groups, which has biased the results? To answer these questions, we must take a closer look at the data to see how these variables are related.

^{‡‡}An unpaired *t*-test could also have been performed with the same result (*t* = .349, *df* = 10, *p* = .734); however, to adjust scores with a covariate, an analysis of variance must be used. Therefore, we have used the ANOVA here to facilitate comparison of outcomes with the ANCOVA.

Figure 24.11 shows us the distribution of GPA and clinical performance scores for Strategy 1 (•) and Strategy 2 (○) with their respective regression lines. The dependent variable, clinical performance score, is plotted along the *Y*-axis, and the covariate, GPA, is plotted along the *X*-axis. We can see from this scatter plot that these variables are highly correlated for both groups (*r* = .93 and .99), and that the slopes of the two regression lines are fairly similar (*b* = 53.7 and 46.5).

We can also see that the regression line for Strategy 1 is higher than that for Strategy 2, indicating that Group 1 had higher values of clinical performance for any given GPA, even though the sample means for clinical score are not significantly different. There is, however, another important difference. If we look at the mean GPA for each group, we can see that the students using Strategy 1 have substantially lower GPAs than those using Strategy 2 (*X̄*_{1} = 2.55, *X̄*_{2} = 3.11). Knowing that GPA is a correlate of clinical performance, it is reasonable to believe that this difference could have confounded the statistical analysis.

To eliminate this effect, we want to *artificially equate* the two groups on GPA, using the mean GPA for the total sample as the best estimate for both groups. The mean GPA for both groups combined is 2.84. If we assign this value as the mean GPA for each group, we can use the regression lines to predict what the mean score for clinical performance (*Y*) would be at that value of *X*. That is, what average clinical score would we expect for Strategy 1 and Strategy 2 if the groups were equivalent on GPA? As shown in Figure 24.12, we would expect *Ȳ*′_{1} = 62.0 and *Ȳ*′_{2} = 30.4. These are the **adjusted means** for each group.

Note that the adjusted mean for Strategy 1 (62.0) is higher than the observed mean for Strategy 1 (48.5), and the adjusted mean for Strategy 2 (30.4) is lower than the observed mean for Strategy 2 (43.8). These differences reflect variation in the covariate; that is, on average Strategy 2 students had a higher GPA than Strategy 1 students. By setting a common mean GPA, we moved the average GPA up for Strategy 1 (2.55 to 2.84), increasing the corresponding clinical score; and we moved the average GPA down for Strategy 2 (3.11 to 2.84), decreasing the corresponding clinical score. Therefore, we have adjusted scores by removing the effect of GPA differences so we could compare clinical scores as if both groups had the same GPA.
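The adjustment itself involves only the pooled within-group slope and the group means. The sketch below uses hypothetical GPA and clinical-score data (not the values from Table 24.5) to show the computation: each group's mean outcome is predicted at the grand mean of the covariate.

```python
# Sketch (hypothetical data): ANCOVA-style adjusted means, computed from a
# pooled within-group slope and evaluated at the grand covariate mean.

def adjusted_means(groups):
    """groups: list of (covariate_values, outcome_values), one pair per group.

    Returns each group's mean outcome adjusted to the grand covariate mean:
    Y'_g = Ybar_g - b_w * (Xbar_g - grand Xbar).
    """
    all_x = [x for xs, _ in groups for x in xs]
    grand_x = sum(all_x) / len(all_x)
    sxy = sxx = 0.0
    group_means = []
    for xs, ys in groups:
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
        group_means.append((mx, my))
    b_w = sxy / sxx  # pooled within-group slope
    return [my - b_w * (mx - grand_x) for mx, my in group_means]

# Hypothetical GPA (covariate) and clinical score (outcome) for two groups.
strategy1 = ([2.3, 2.5, 2.6, 2.8], [40, 47, 52, 55])
strategy2 = ([3.0, 3.1, 3.2, 3.3], [40, 44, 49, 53])
adj1, adj2 = adjusted_means([strategy1, strategy2])
```

Because the first hypothetical group has the lower mean GPA and the pooled slope is positive, its mean score adjusts upward, mirroring the direction of the adjustment described above.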

This example illustrates the situation where a covariate obscures the true nature of the difference between group means. The process may also work in the opposite direction, however; that is, group means may initially appear significantly different when in fact they are not once a relevant covariate is taken into account. In that case, the analysis of covariance may result in no significant difference. For example, consider a comparison of strength between men and women. We would expect to see a difference between them, with men being stronger. But this difference could be a function of body weight rather than gender alone. If we were to use weight as a covariate, we might find that the groups no longer appear different in strength.

After scores are adjusted according to the regression lines, an analysis of variance is run on the adjusted values. Table 24.5B shows the results of this analysis for the teaching strategy data from Figure 24.12. Recall that the original analysis of variance showed no significant difference between these strategies (see Table 24.5A).

In the summary table for the ANCOVA, the first line of the table represents the variance attributable to the covariate, or the regression of GPA on clinical score (see Table 24.5➋). This component tests the hypothesis that the slope of the regression line is significantly different from zero. If it is not significant, the covariate is not linearly related to the dependent variable, and therefore, the adjusted mean scores will be meaningless. In this example, we can see that the covariate of GPA is significant (*p* = .000). The researcher should always examine the covariate effect first, to determine that the ANCOVA is an appropriate test. The degrees of freedom associated with this factor equal the number of covariates used in the analysis. In this case, with one covariate, we have used one degree of freedom.

The between-groups effect for Strategy is based on a comparison of the adjusted group means (see Table 24.5➎). As in a standard analysis of variance, the degrees of freedom will equal *k* – 1. Now we find that the difference between the strategy groups is significant (*p* = .000), and we can reject the null hypothesis (see Table 24.5➌). We conclude that clinical performance does differ between those exposed to videotaped cases and discussion groups when adjusted for their grade point average. We have, therefore, increased the sensitivity of our test by decreasing the unexplained variance. We have accounted for more of the variance in clinical performance by knowing GPA and teaching strategy than we did by knowing teaching strategy alone.

The third line of the table shows the error variance (see Table 24.5➍), that is, all the variance that is left unexplained after the between-groups and covariate sources have been accounted for. When the covariate is a good linear fit, the error variance will be substantially reduced. This is evident if we compare the error sums of squares in Tables 24.5A and B for the ANOVA and ANCOVA of the same data. In fact, if we look at the error (within groups) sum of squares for the ANOVA (*SS*_{e} = 5360.33), we can see that it is equal to the combined sums of squares for the covariate and the error component in the ANCOVA (4977.26 + 383.07 = 5360.33). By removing the effect of GPA from the unexplained variance, we have left less variance unexplained. Therefore, the ANCOVA allows us to demonstrate a statistical difference between the groups, where the ANOVA did not.

Before running an ANCOVA, several assumptions should be satisfied to assure validity of the analysis.

**Linearity of the Covariate.** The analysis of covariance model is appropriate only if there is a linear relationship between the covariate and the dependent variable. It is most effective when *r* > .60.^{6} For example, it would be unreasonable to use height or weight as a covariate for clinical performance. The researcher should check correlations before starting a study, to be sure that data are being collected on a useful covariate. Relationships that are curvilinear will invalidate the analysis of covariance, although the relationship may be made linear by mathematical transformation.

**Homogeneity of Slopes.** The ANCOVA requires that the slopes of the regression lines for each group be parallel. Unequal slopes indicate that the relationship between the covariate and dependent variable is different for each group. Therefore, the adjusted means will be based on different proportional relationships, and their comparison will be meaningless. A test for **homogeneity of slopes** should be done before the ANCOVA is attempted, to be sure that the procedure is valid.^{6} The null hypothesis for this test states that the regression coefficients (slopes) for the two groups will not be significantly different: *H*_{0}: β_{1} = β_{2}. If GPA is a "good" covariate, then it will allow adjustments based on proportional values that are the same in both strategy groups.
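This test can be framed as a model comparison: fit one model that gives each group its own slope, fit another that forces a common slope, and form an *F*-ratio from the difference in their error sums of squares. A minimal sketch with hypothetical data (the resulting *F* would be referred to an *F* table with *k* − 1 and *N* − 2*k* degrees of freedom):

```python
# Sketch (hypothetical data): F test for homogeneity of slopes, comparing a
# separate-slopes model against a common-slope model.

def _fit_line(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def slope_homogeneity_F(groups):
    """F statistic for H0: all within-group slopes are equal.

    Numerator df = k - 1, denominator df = N - 2k.
    """
    k = len(groups)
    N = sum(len(xs) for xs, _ in groups)
    # Full model: each group keeps its own intercept and slope.
    sse_sep = 0.0
    for xs, ys in groups:
        a, b = _fit_line(xs, ys)
        sse_sep += sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    # Reduced model: group-specific intercepts, one pooled slope.
    sxy = sxx = 0.0
    for xs, ys in groups:
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sxy += sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sxx += sum((x - mx) ** 2 for x in xs)
    b_w = sxy / sxx
    sse_common = 0.0
    for xs, ys in groups:
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        sse_common += sum((y - (my + b_w * (x - mx))) ** 2 for x, y in zip(xs, ys))
    return ((sse_common - sse_sep) / (k - 1)) / (sse_sep / (N - 2 * k))

# Hypothetical groups whose covariate-outcome slopes clearly differ.
g1 = ([2.2, 2.5, 2.8, 3.1], [40, 48, 55, 64])
g2 = ([2.2, 2.5, 2.8, 3.1], [52, 54, 55, 58])
F = slope_homogeneity_F([g1, g2])
```

A large *F* argues against pooling the slopes, and therefore against proceeding with the ANCOVA on these data.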

**Independence of the Covariate.** The variable chosen as the covariate must be related to the dependent variable, but must also be independent of the treatment effect; that is, the independent variable cannot influence the value of the covariate. For example, suppose we wanted to study the effect of a general exercise program on balance, using lower extremity strength as a covariate. If we were to measure the subjects' strength after the treatment was completed, we might find that the exercise program increased the strength of the lower extremities. Therefore, the strength value would not be independent of the treatment effect and would not be a valid covariate. To avoid this situation, covariates should always be measured prior to initiation of treatment.

**Reliability of the Covariate.** The validity of the ANCOVA is also founded on the assumption that the covariate is not contaminated by measurement error.^{6} Any error found in the covariate is compounded when the regression coefficients and adjusted means are calculated. Therefore, justification for using the adjusted scores is based on accuracy of the covariate. Although it may be impossible to obtain totally error-free measurement, every effort should be made to ensure the greatest degree of reliability possible.

The analysis of covariance can be extended to accommodate any number of covariates. There may be several characteristics that are relevant to understanding the dependent variable. For example, if we wanted to compare strength at different age ranges, we might use a combination of height, weight, limb girth, or percentage body fat as covariates. With multiple covariates, the analysis of covariance involves multiple regression procedures, where several *X* variables are correlated with one *Y* variable, and a predicted value for *Y* is determined, based on those covariates that are most highly correlated. Multiple regression techniques are discussed further in Chapter 29.

When several covariates are used, the precision of the analysis can be greatly enhanced, as long as the covariates are all highly correlated with the dependent variable and not correlated with each other. If, however, the covariates are correlated with each other, they provide redundant information and no additional benefit is gained by including them. In fact, using a large number of interrelated covariates can be a disadvantage, because each covariate uses up one degree of freedom in the analysis. This decreases the degrees of freedom left for the error term, which increases the *F* needed for significance between groups. The analysis then loses statistical power. With smaller samples, this could have a biasing effect.

It is important, therefore, to make educated choices about the use of covariates. Previous research and pilot studies may be able to document which variables are most highly correlated with the dependent variable and which are least likely to be related to each other.

The ANCOVA is often used to control for initial differences between groups based on a pretest measure. When intact groups are tested or when randomization is used with small groups, the initial measurements on the dependent variable are often different enough to be of concern for further comparison. For example, suppose we were studying the effect of two exercise programs on strength. We randomly assign subjects to two groups and would like to assume that their initial strength levels are similar; however, after the pretest we find that one group is much stronger on average than the other, a difference that occurred just by chance. We can use the ANCOVA to equate both groups on their pretest scores and adjust posttest scores accordingly. The analysis between groups is then done using the adjusted posttest scores, as if both groups had started out at the same level of strength.

Researchers are often tempted to control for initial differences by using difference scores as the dependent variable in a pretest-posttest design. There are disadvantages to this approach, however, because the potential for measurement error is increased when using difference scores (see Chapter 6). In experimental studies, this situation can reduce the power of a statistical test; that is, the greater the amount of measurement error, the less likely we will find a significant difference between two difference scores, even when the treatment was really effective. Therefore, many researchers prefer the analysis of covariance for statistically controlling initial differences. This approach is not, however, a remedy for a study with poor reliability. Although some research questions may be more readily answered by the use of change scores, the researcher should consider what type of data will best serve the analysis.

The analysis of covariance is a powerful statistical tool that has often been looked on as a cure-all for design imperfections. Although it does have the power to increase the sensitivity of a test by removing many forms of bias, it does not provide a safeguard against problems in the design of a study. The ANCOVA cannot substitute for randomization. Quasi-experimental designs that use intact groups suffer from many interpretive biases, some of which the ANCOVA is able to control better than others. Indeed, unless a covariate is totally reliable, it will introduce some biases of its own. Some researchers have used the ANCOVA to compensate for failures in their design, such as the discovery of uncontrolled variables after data collection has been started, but this is not its intent. The analysis of covariance is correctly used in situations where experimental control of relevant variables is not possible and where these factors are identified and measured at the outset.

The ANCOVA has some limitations that should be considered in this context. One major criticism is that the adjusted means are not real scores, and therefore, the generalization of data from an analysis of covariance is compromised. It is also important to realize that one covariate may be insufficient for removing extraneous effects and that the outcome of an ANCOVA could be significantly altered if different combinations of covariates were used. In addition, researchers must decide which covariates will be most meaningful, and decide early so that data are collected on the proper variables. Covariates that are quantitative variables, such as height, weight and age, provide the most precision for adjusting scores; however, dichotomous variables such as sex and disability can be used as covariates.

Two issues related to generalization of regression analysis should be mentioned here. First, just as with correlation, it is important to refrain from interpreting predictive relationships as causal. Statistical associations by themselves do not provide sufficient evidence of causality. The researcher must be able to establish the methodological, logical and theoretical rationales behind such claims; that is, causal inference is a function of how the data were produced, not how they were analyzed.^{7} Second, it is important to restrict generalization of predictive relationships to the population on which the data were obtained. The characteristics of subjects chosen for a regression study define this population.

Simple linear regression analysis is limited in that it accounts for the effect of only one independent variable on one dependent variable. Most behavioral phenomena cannot be explained so simply. For instance, when we examined the predictive accuracy of the regression of blood pressure on age, we established that *r*^{2} = .76. This indicates that 76% of the variance in blood pressure could be predicted by knowing a woman's age; however, 24% of the variance was unaccounted for. Some other variable or variables must be identified to improve the prediction equation. Multiple regression procedures have been developed that provide an efficient mechanism for studying the combined effect of several independent variables on a dependent variable for purposes of improving predictive accuracy. We present these techniques in Chapter 29.

1. *Plast Reconstr Surg* 2005;116:791–797. [PubMed: 16141817]

2. *J Child Psychol Psychiatry* 2003;44:520–528. [PubMed: 12751844]

3. *Am J Phys Med Rehabil* 2005;84:604–612. [PubMed: 16034230]

4. *Spine* 2005;30:235–240. [PubMed: 15644763]

5. *Statistical Methods* (8th ed.). Ames, IA: Iowa State University Press, 1991.

6. *Using SPSS for Windows and Macintosh: Analyzing and Understanding Data* (4th ed.). Upper Saddle River, NJ: Prentice Hall, 2004.

7. *Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences.* Hillsdale, NJ: Lawrence Erlbaum Associates, 1975.