Multiple regression is an extension of simple linear regression analysis, described in Chapter 24. The multiple regression equation allows the researcher to predict the value Ŷ using a set of several independent variables. It can accommodate continuous and categorical independent variables, which may be naturally occurring or experimentally manipulated. The dependent variable, Y, must be a continuous measure. A common purpose of regression analysis is prognostic, predicting a given outcome based on identified factors. For instance, Stineman and Williams1 developed a model to predict rehabilitation length of stay based on the patient's admitting diagnosis, referral source and admission functional status. A second purpose of regression is to better understand a clinical phenomenon by identifying those factors associated with it. To illustrate this application, Walker and Sofaer2 studied sources of psychological distress in patients attending pain clinics. They identified that 60% of the variance associated with psychological distress was explained by a combination of fears about the future, regrets about the past, age, practical help, feeling unoccupied and personal relationship problems. This type of analysis will often present opportunities for the analysis of theoretical components of constructs.
Recall that the regression equation, Ŷ = a + bX, defines a line that can be used to make predictions, with an inherent degree of random error. This error, or residual variance, represents variance in Y that is not explained by the predictor variable, X. For example, suppose we were interested in predicting cholesterol level using body weight as the independent variable, with r = .48 and r2 = .23. Based on the limited strength of this relationship, we would expect that a regression equation would provide estimates of cholesterol that would be different from actual values, as body weight by itself does not adequately explain cholesterol level. Therefore, the remaining unexplained variance in cholesterol (77%) must be a function of other factors. For instance, cholesterol may also be related to variables such as blood pressure, gender, age or diet. If we were to add these variables to the regression equation, the unexplained portion of variance would probably be decreased (although not necessarily eliminated). This expanded analysis results in a multiple regression equation.
In multiple regression, the regression equation accommodates multiple predictor variables:
Ŷ = a + b1X1 + b2X2 + b3X3 + … + bkXk
where Ŷ is the predicted value for the dependent variable, a is a regression constant, and b1, b2, b3 through bk are regression coefficients for each independent variable. The subscript, k, denotes the number of independent variables in the equation.† Like simple linear regression, multiple regression is also based on the concept of least squares, so that the model minimizes deviations of Ŷ from Y.
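The least squares idea can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used by any particular statistical package: the function names and the toy data are hypothetical, and the constant and coefficients are obtained by solving the normal equations directly.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_least_squares(rows, y):
    """Return [a, b1, ..., bk], the values minimizing the sum of squared (Y - Y_hat)."""
    X = [[1.0] + list(r) for r in rows]  # leading 1.0 carries the constant a
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(XtX, Xty)


# Hypothetical toy data generated from Y = 2 + 3*X1 - 1*X2 with no error.
rows = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8)]
y = [3, 7, 6, 11, 9]
coefficients = fit_least_squares(rows, y)
print([round(c, 3) for c in coefficients])
```

Because the toy data contain no random error, the fitted constant and coefficients reproduce the generating equation. With real data the fit would leave residual variance.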
Once regression coefficients and a constant are obtained, we can predict values of Y by substituting values for each independent variable in the equation. For instance, suppose we wanted to evaluate the predictive relationship between serum blood cholesterol (CHOL) and potential contributing factors including age (AGE), daily dietary fat intake in grams (DIET), gender (GENDER), systolic blood pressure (SBP), and weight (WT). Table 29.1A shows the intercorrelations among these variables. The coefficients for the regression equation are shown in Table 29.1B➍, including the constant:
Ŷ = 19.116 + .012(AGE) + 3.094 (DIET) + .218 (SBP) + 4.158 (GENDER) + .511 (WT)
Based on this equation, for a 34-year-old subject, with DIET = 20.0 g, GENDER = 1 (coded for male), SBP = 100 mmHg and WT = 150 pounds, we can predict cholesterol value as follows:
CHOL = 19.116 + .012(34) + 3.094(20.0) + .218(100) + 4.158(1) + .511(150) = 184.01
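The substitution above can be checked with a short script. The constant and coefficients are the ones given in the equation. The dictionary layout and function name are assumptions made for illustration.

```python
CONSTANT = 19.116
COEFFICIENTS = {
    "AGE": 0.012,
    "DIET": 3.094,
    "SBP": 0.218,
    "GENDER": 4.158,
    "WT": 0.511,
}


def predict_cholesterol(subject):
    """Substitute a subject's values into the regression equation."""
    return CONSTANT + sum(b * subject[name] for name, b in COEFFICIENTS.items())


subject = {"AGE": 34, "DIET": 20.0, "GENDER": 1, "SBP": 100, "WT": 150}
predicted = predict_cholesterol(subject)
print(round(predicted, 2))  # 184.01
```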
TABLE 29.1 OUTPUT FOR MULTIPLE REGRESSION ANALYSIS: PREDICTION OF CHOLESTEROL LEVEL FROM AGE, DIET, BLOOD PRESSURE, GENDER AND WEIGHT (N = 100)
If this person's true cholesterol level was 175, the residual would be 175 − 184.01 = −9.01 (Y − Ŷ). Scatter plots can also be requested to analyze the residuals, typically plotting the predicted values on the X-axis against the residuals on the Y-axis. Visual analysis of residuals can reveal if the assumption of linearity in the data is violated (see Chapter 24, Figure 24.6).
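As a small sketch, the residual and the points for a residual plot can be computed as follows. The function names are hypothetical, and the three observed and predicted pairs in the second part are invented solely to show how the values are paired for plotting.

```python
def residual(observed, predicted):
    """Residual = observed Y minus predicted Y (Y - Y_hat)."""
    return observed - predicted


def residual_plot_points(observed_values, predicted_values):
    """Pair each predicted value (X-axis) with its residual (Y-axis)."""
    return [(p, residual(o, p)) for o, p in zip(observed_values, predicted_values)]


example = residual(175, 184.01)
print(round(example, 2))  # -9.01

# Hypothetical observed and predicted values, only to show the pairing.
points = residual_plot_points([175, 190, 210], [184.01, 186.50, 205.25])
```

A roughly horizontal band of points around zero is what one would hope to see. A curved pattern would suggest that the linearity assumption does not hold.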
Regression coefficients are interpreted as weights that identify how much each variable contributes to the explanation of Y. As part of the regression analysis, a test of significance is performed on each regression coefficient, to test the null hypothesis, H0: b = 0. Depending on the statistical package this will be done using either an F-test or a t-test, as shown in Table 29.1B➏. In this example, the coefficients for AGE, GENDER and SBP are not significant (p > .05). Therefore, these three variables are not making a significant contribution to the prediction of cholesterol level.
Standardized Regression Coefficients
Researchers often want to establish the relative importance of specific variables within a regression equation. The regression coefficients cannot be directly compared for this purpose because they are based on different units of measurement. When it is of interest to determine which variables are more heavily weighted, we must convert the weights to standardized regression coefficients, called beta weights. These standardized values are interpreted as relative weights, indicating how much each variable contributes to the value of Ŷ. For example, the beta weights listed in Table 29.1B➎ show that DIET and WT are the most important variables for predicting cholesterol. The sign of the beta weight indicates the positive or negative relationship between each variable and Y, but only the absolute value is considered in determining the relative weight. Some authors present beta weights in addition to regression coefficients in a research report, to provide the reader with a full and practical interpretation of the observed relationships.
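The conversion to a beta weight can be sketched with the standard relationship beta = b × (SD of X / SD of Y). The coefficients and standard deviations below are hypothetical and are not taken from Table 29.1. They are chosen only to show that a variable with a small unstandardized coefficient can carry the larger relative weight.

```python
def standardized_coefficient(b, sd_x, sd_y):
    """Convert an unstandardized coefficient to a beta weight."""
    return b * sd_x / sd_y


# Hypothetical values: two predictors measured in very different units.
sd_y = 20.0
predictors = {
    "X1": {"b": 0.50, "sd": 30.0},
    "X2": {"b": 4.00, "sd": 2.0},
}

betas = {
    name: standardized_coefficient(p["b"], p["sd"], sd_y)
    for name, p in predictors.items()
}
print(betas)  # X1 has the smaller b but the larger beta weight
```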
A problem occurs in the interpretation of beta weights if the independent variables in the regression equation are correlated with each other. This situation is called multicollinearity. The coefficients assigned to variables within the equation are based on the assumption that each variable provides independent information, contributing a unique part of the total explanation of the variance in Y. If independent variables are related to each other, the information they provide to the model is partially redundant. In that case, one variable may be seen as contributing a lot of information, and the second variable may be seen as contributing little; that is, one variable may have a larger beta weight. Each variable may be highly predictive of Y when used alone, but they are redundant when used together. This situation can be avoided by determining the intercorrelations among predictor variables prior to running a regression analysis and selecting independent variables that are not highly correlated with each other.
The interpretation of multicollinearity is based on the concept of partial correlation; that is, each regression coefficient represents the importance of a single variable after having accounted for the effect of all other variables in the equation. Therefore, the value of a regression coefficient is dependent on which other independent variables are in the equation. With different combinations of variables, it is likely that a particular regression coefficient will vary. It is important to remember, therefore, that the relationships defined by a regression equation can be interpreted only within the context of the specific variables included in that equation.
The overall association between Y and the complete set of independent variables is defined by the multiple correlation coefficient, R. This value will range from 0.00 to 1.00; however, because R represents the cumulative association of many variables, its interpretation is obscure. Therefore, its square (R2) is used more often as an explanation of the functional relationship between Y and a series of X values.
As an analogue of r2, the value of R2 represents the proportion of the total variance in Y that is explained by the set of independent variables in the equation; that is, it is the variance attributable to the regression. R2 is the statistic most often reported in journal articles to indicate the accuracy of prediction of a regression analysis. Higher values of R2 reflect stronger prediction models. The complement, 1 − R2, is the proportion of the variance that is left unexplained, or the variance attributable to deviations from the regression. Table 29.1B➋ shows that R2 = .534 for the cholesterol analysis, indicating that this group of variables accounts for slightly more than half of the variance in cholesterol.
An adjusted R2 is also generated for the regression (Table 29.1B➌). This value represents a chance-corrected value for R2; that is, we can expect some percent of explained variance to be a function of chance. Some researchers prefer to report the adjusted value as a more accurate reflection of the strength of the regression, especially with a large number of variables in the equation.
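A commonly used form of the adjustment is 1 − (1 − R2)(N − 1)/(N − k − 1). The sketch below assumes that formula. A given statistical package may use a different correction, and values printed in the tables may differ slightly because the inputs here are rounded.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjust R2 for sample size n and number of predictors k."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - k - 1)


# Full cholesterol model: R2 = .534, N = 100, k = 5 predictors.
full_model = adjusted_r_squared(0.534, n=100, k=5)
print(round(full_model, 3))  # 0.509

# The penalty grows with the number of predictors for the same R2.
same_r2_two_predictors = adjusted_r_squared(0.534, n=100, k=2)
print(round(same_r2_two_predictors, 3))
```

The adjustment is small when N is large relative to k, and it becomes more noticeable as predictors are added to a small sample.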
Many regression programs will also generate a value for the standard error of the estimate (SEE), as shown in Table 29.1B. This value represents the degree of variability in the data around the multidimensional "regression line," reflecting the prediction accuracy of the equation (see Chapter 24 for discussion of the SEE).
Analysis of Variance of Regression
A multiple regression analysis generates an analysis of variance to test the linear fit of the equation. The ANOVA partitions the total variance in the data into the variance that is explained by the regression and that part that is left unexplained, or the residual error. The degrees of freedom associated with the regression will equal k, where k represents the number of independent variables in the equation. The probability of F associated with the regression will indicate if the equation provides an explanation of Y that is better than chance. The ANOVA in Table 29.1B demonstrates a significant model for the cholesterol data (F = 21.512, p < .001).
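The F ratio for the regression can also be written in terms of R2, with k degrees of freedom for the regression and N − k − 1 for the residual. The sketch below applies that standard relationship to the rounded R2 reported for the cholesterol model, so the result is close to, but not exactly, the F reported in the table.

```python
def f_from_r_squared(r_squared, n, k):
    """F = (R2 / k) / ((1 - R2) / (n - k - 1))."""
    df_regression = k
    df_residual = n - k - 1
    return (r_squared / df_regression) / ((1.0 - r_squared) / df_residual)


f_ratio = f_from_r_squared(0.534, n=100, k=5)
print(round(f_ratio, 2))  # about 21.54; the table reports 21.512 from unrounded values
```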
Stepwise Multiple Regression
Multiple regression can be run by "forcing" a set of variables into the equation, as we have done in the cholesterol example. With all five variables included, the equation accounted for 53% of the variance in cholesterol values, although the results demonstrated that the five independent variables did not all make significant contributions to that estimate. We might ask, then, if the level of prediction accuracy achieved in this analysis could have been achieved with fewer variables. To answer this question, we can use a procedure called stepwise multiple regression, which uses specific statistical criteria to retain or eliminate variables to maximize prediction accuracy with the smallest number of predictors. It is not unusual to find that only a few independent variables will explain almost as much of the variation in the dependent variable as can be explained by a larger number of variables. This approach is useful for homing in on those variables that make the most valuable contribution to a given relationship, thereby creating an economical model.
Stepwise regression is accomplished in "steps" by evaluating the contribution of each independent variable in sequential fashion.‡ First, all proposed independent variables are correlated with the dependent variable, and the one variable with the highest correlation is entered into the equation at step 1. For our cholesterol example, Table 29.1A shows us that DIET has the highest correlation with CHOL (r = .634). Therefore, DIET will be entered on the first step. With this variable alone, R2 = .401 (see Table 29.2➋). The regression coefficients for this first step are shown in Table 29.2➌:
TABLE 29.2 OUTPUT FOR STEPWISE MULTIPLE REGRESSION ANALYSIS: PREDICTION OF CHOLESTEROL
At this point, the remaining variables (those "excluded" from the equation) are examined for their partial correlation with Y, that is, their correlation with CHOL with the effect of DIET removed (see Table 29.2➍). The variable with the highest significant partial correlation coefficient is then added to the equation, in this case, WT (partial r = .462, p < .001). Therefore, WT is added in step 2 (see Table 29.2➎). With the addition of this variable, we have achieved an R2 of .529 (see Table 29.2➏), only slightly lower than the value obtained with the full model. The adjusted R2 is higher, however, because there are fewer variables in this model.
Another criterion for entry of a variable is its tolerance level. Tolerance refers to the degree of collinearity in the data. Tolerance ranges from 0.00, indicating that the variable is perfectly correlated with the variables already entered, to 1.00, indicating that the variable is unrelated to the variables already entered (see Table 29.2➐). The higher the tolerance, the more new information a variable will contribute to the equation. Some computer programs will automatically generate tolerance levels for each variable. Others offer options that must be specifically requested to include tolerance values (collinearity statistics) in the printout.
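Tolerance is conventionally computed as 1 minus the R2 obtained when the candidate variable is regressed on the variables already in the equation. The sketch below covers only the simplest case, one variable already entered, where that R2 is the squared Pearson correlation between the two predictors. The function names and data are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def tolerance(candidate, entered):
    """Tolerance of a candidate predictor against ONE already-entered predictor."""
    return 1.0 - pearson_r(candidate, entered) ** 2


entered = [1, 2, 3, 4]
redundant_candidate = [2, 4, 6, 8]      # perfectly correlated with `entered`
unrelated_candidate = [1, -1, -1, 1]    # uncorrelated with `entered`

print(tolerance(redundant_candidate, entered))  # 0.0, no new information
print(tolerance(unrelated_candidate, entered))  # 1.0, entirely new information
```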
The stepwise regression continues, adding a new variable at each successive step of the analysis if it meets certain inclusion criteria; that is, its partial correlation is highest of all remaining variables, and the test of its regression coefficient is significant. This process continues until, at some point, either all variables have been entered or the addition of more variables will not significantly improve the prediction accuracy of the model. In the current example, Table 29.2➑ shows us that none of the partial correlations of the remaining three variables is significant. Therefore, no further variables were entered after step 2. As shown in Table 29.2➒, the final model for the stepwise regression is
Ŷ = 48.21 + 3.12(DIET) + .508(WT)
Note that the coefficients in the equation have changed with the addition of WT as a variable. There are times when no variables will be entered if none of them satisfy the minimal inclusion criteria. In that case, the researcher must search for a new set of independent variables to explain the dependent variable.
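The forward stepwise procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of any specific package. Entry is decided by an F-to-enter threshold of 4.0 (an assumed value) rather than a p value. Choosing the candidate that yields the largest R2 is equivalent to choosing the one with the highest squared partial correlation. The data and all names are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [v - factor * p for v, p in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def r_squared(columns, y):
    """R2 for y regressed on the given predictor columns plus a constant."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    coef = solve(XtX, Xty)
    y_hat = [sum(c * v for c, v in zip(coef, row)) for row in X]
    mean_y = sum(y) / n
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_resid = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    return 1.0 - ss_resid / ss_total


def forward_stepwise(predictors, y, f_to_enter=4.0):
    """Enter one predictor per step while the F for the change in R2 is large enough."""
    n = len(y)
    entered, r2_current = [], 0.0
    remaining = list(predictors)
    while remaining:
        best_name, best_r2 = None, r2_current
        for name in remaining:
            r2 = r_squared([predictors[v] for v in entered + [name]], y)
            if r2 > best_r2:
                best_name, best_r2 = name, r2
        if best_name is None:
            break
        k = len(entered) + 1
        unexplained = max(1.0 - best_r2, 1e-12)  # guard against a perfect fit
        f_change = (best_r2 - r2_current) / (unexplained / (n - k - 1))
        if f_change < f_to_enter:
            break  # no remaining variable improves prediction enough
        entered.append(best_name)
        remaining.remove(best_name)
        r2_current = best_r2
    return entered, r2_current


# Hypothetical data: Y depends strongly on X1, moderately on X2, not on X3.
x1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
x2 = [5, 3, 6, 2, 7, 1, 8, 4, 9, 2, 6, 3]
x3 = [7, 2, 9, 4, 11, 1, 6, 12, 3, 8, 5, 10]
error = [0.5, -0.4, 0.3, -0.6, 0.2, -0.1, 0.4, -0.3, 0.6, -0.5, 0.1, -0.2]
y = [3 * a + 2 * b + e for a, b, e in zip(x1, x2, error)]

entered, final_r2 = forward_stepwise({"X1": x1, "X2": x2, "X3": x3}, y)
print(entered, round(final_r2, 3))
```

In this toy example X1 enters first because it has the highest correlation with Y, and X2 enters second because it explains most of what X1 leaves unexplained.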
One of the general assumptions for regression analysis is that variables are continuous; however, many of the variables that may be useful predictors for a regression analysis, such as gender, occupation, education and race, or behavioral characteristics such as smoker versus nonsmoker, are measured on a categorical scale. It is possible to include such qualitative variables in a regression equation, although the numbers assigned to categories cannot be treated as quantitative scores. One way to do this is to create a set of coded variables called dummy variables.
In statistics, coding is the process of assigning numerals to represent categorical or group membership. For regression analysis we use 0 and 1 to code for the absence and presence of a dichotomous variable, respectively. All dummy variables are dichotomous. For example, with a variable such as smoker-nonsmoker, we code 0 = nonsmoker and 1 = smoker. For sex, we can code male = 0 and female = 1. In essence we are coding 1 for female and 0 for anyone who is not female. We can use these codes as scores in a regression equation and treat them as interval data.
For instance, we could include gender as a predictor of cholesterol level, to determine if men or women can be expected to have higher cholesterol levels. Assume the following regression equation was obtained, where X is the dummy variable for sex (male = 0, female = 1):
Ŷ = 220 − 27.5X
Using the dummy code for females, Ŷ = 220 − 27.5(1) = 192.5, and for males Ŷ = 220 − 27.5(0) = 220. With only this one dummy variable, these predicted values are actually the means for cholesterol for females and males. The regression coefficient for X is the difference between the means for the groups coded 0 and 1.
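The claim that a single dummy variable reproduces the group means can be verified with a small sketch. The six scores below are hypothetical and are unrelated to the cholesterol example. The constant and coefficient come from the usual simple regression formulas.

```python
def simple_regression(x, y):
    """Return (a, b) for Y_hat = a + bX using least squares."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Hypothetical scores: three subjects coded 0 and three coded 1.
dummy = [0, 0, 0, 1, 1, 1]
scores = [190, 200, 210, 170, 180, 190]

a, b = simple_regression(dummy, scores)
mean_group_0 = sum(s for d, s in zip(dummy, scores) if d == 0) / 3
mean_group_1 = sum(s for d, s in zip(dummy, scores) if d == 1) / 3

print(a, b)                        # constant and coefficient
print(mean_group_0, mean_group_1)  # group means
```

The constant equals the mean of the group coded 0, and the constant plus the coefficient equals the mean of the group coded 1.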
When a qualitative variable has more than two categories, more than one dummy variable is required to represent it. For example, consider the variable of college class, with four levels: freshman, sophomore, junior and senior. We could code these categories with the numbers 1 through 4 on an apparent ordinal scale; however, these numerical values would not make sense in a regression equation, because the numbers have no quantitative meaning. A senior is not four times more of something than a freshman. Therefore, we must create a set of dichotomous dummy variables, as follows:
X1 = 1 if freshman, 0 if not
X2 = 1 if sophomore, 0 if not
X3 = 1 if junior, 0 if not
Each variable codes for the presence or absence of a specific class membership. We do not need to create a fourth variable for seniors, because anyone who has zero for all three variables will be a senior. We can show how this works by defining each class with a unique combination of values for X1, X2 and X3:
Freshman: X1 = 1, X2 = 0, X3 = 0
Sophomore: X1 = 0, X2 = 1, X3 = 0
Junior: X1 = 0, X2 = 0, X3 = 1
Senior: X1 = 0, X2 = 0, X3 = 0
The number of dummy variables needed to define a categorical variable will always be one less than the number of categories.
Suppose we wanted to predict a student's attitude toward the disabled, on a scale of 0 to 100, based on class membership. We might develop an equation such as
Ŷ = 85 − 55X1 − 25X2 − 15X3
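Therefore, the predicted values for each class would be
Freshman: Ŷ = 85 − 55(1) − 25(0) − 15(0) = 30
Sophomore: Ŷ = 85 − 55(0) − 25(1) − 15(0) = 60
Junior: Ŷ = 85 − 55(0) − 25(0) − 15(1) = 70
Senior: Ŷ = 85 − 55(0) − 25(0) − 15(0) = 85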
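The coding scheme and the attitude equation can be combined in a short sketch. It assumes that X1, X2 and X3 code for freshman, sophomore and junior, in that order, with senior as the category that scores zero on all three. The function names are hypothetical.

```python
CLASSES = ["freshman", "sophomore", "junior", "senior"]
REFERENCE = "senior"  # assumed reference category, coded 0 on every dummy variable


def dummy_code(category, categories=CLASSES, reference=REFERENCE):
    """Return k - 1 dummy codes (0 or 1) for one of k categories."""
    if category not in categories:
        raise ValueError(f"unknown category: {category}")
    return [1 if category == c else 0 for c in categories if c != reference]


def predict_attitude(category):
    """Apply Y_hat = 85 - 55*X1 - 25*X2 - 15*X3 to a class membership."""
    x1, x2, x3 = dummy_code(category)
    return 85 - 55 * x1 - 25 * x2 - 15 * x3


predictions = {c: predict_attitude(c) for c in CLASSES}
print(predictions)
# {'freshman': 30, 'sophomore': 60, 'junior': 70, 'senior': 85}
```

Under this coding the constant (85) is the predicted score for seniors, and each coefficient is the difference between one class and the seniors.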
Several dummy variables can be combined with quantitative variables in a regression equation. Because so many variables of interest are measured at the nominal level, the use of dummy variables provides an important mechanism for creating a fuller explanation of clinical phenomena. Some computer programs will automatically generate dummy codes for nominal variables. For others, the researcher must develop the coding scheme.