The technological progress of data management systems has provided clinical researchers with a sophisticated statistical framework within which to examine the complex relationships inherent in many clinical phenomena. **Multivariate analysis** refers to a set of statistical procedures that are distinguished by the ability to examine several response variables within a single study and to account for their potential interrelationships in the analysis of the data. These tests differ from **univariate analysis** procedures, such as the *t*-test and analysis of variance, in that univariate methods accommodate only one dependent variable.

Given the types of questions being asked today and the types of data being used to examine clinical procedures, multivariate statistics have become quite important for those who do research and those who read research reports. The purpose of this chapter is to introduce the basic concepts behind several of the most commonly used multivariate methods: partial correlation, multiple regression, logistic regression, discriminant analysis, factor analysis, multivariate analysis of variance and survival analysis.

The application of multivariate procedures necessitates the use of a computer, and may require the assistance of a statistician for more advanced operations. In a short introduction such as this, it is not possible to cover the full scope of these procedures. Therefore, this discussion focuses on a conceptual understanding of multivariate tests and interpretation of the output a computer analysis will generate.

The product-moment correlation coefficient, *r*, offers the researcher a simple and easily understood measure of the association between two variables, *X* and *Y.* The interpretation of *r* is limited, however, because it cannot account for the possible influence of other variables on that relationship. For instance, in a study of the relationship between age and length of hospital stay, we might find a correlation of .70, suggesting that older patients tend to have longer hospital stays (as shown by the shaded overlapped portion in Figure 29.1A). If, however, older patients also tend to have greater functional limitations, then the observed relationship between hospital stay and age may actually be the result of their mutual relationship with function; that is, the hospital stay may actually be explained by the patient's functional status. We can resolve this dilemma by looking at the relationship between hospital stay and age with the effect of functional status controlled, using a procedure called partial correlation.

###### FIGURE 29.1

Representation of partial correlation between hospital stay (*Y*) and age (*X*), with the effect of function (*Z*) removed. In (**A**), (**B**) and (**C**), the simple correlations between each pair of variables are illustrated. In (**D**) the shaded area represents those parts of hospital stay and age that are explained by function. The black area shows the common variance in hospital stay and age that is not related to function, or their partial correlation.

The **partial correlation coefficient** is the correlation between two variables, *X* and *Y*, with the effect of a third variable, *Z*, statistically removed. For instance, in the preceding example, assume *X* is age, *Y* is hospital stay, and *Z* is functional status. We would want to know how much of the observed relationship between age and hospital stay (*r*_{XY}) can be attributed to the confounding influence of function, and how much is purely the relationship between age and hospital stay. The term *r*_{XY·Z} is used to represent the correlation of *X* and *Y*, with the effect of *Z* eliminated.

For example, suppose we are given the following correlations for a sample of 50 patients:

We "remove" the effect of function from *r*_{XY} by first determining how much of the variance in both hospital stay and age is explained by function, as shown in Figure 29.1B and C. The overlapped, shaded portions represent the correlation between the two variables. Figure 29.1D shows how these relationships intersect. Once we remove the effect of function, the remaining overlap between hospital stay and age is reduced (the black area in Figure 29.1D). This area represents the relationship between hospital stay and age with the effect of function canceled out. This is the partial correlation.^{∗}

For the data in our example, *r*_{XY·Z} = .34. When we compare this partial correlation to the original correlation of *X* and *Y* (*r*_{XY} = .70), we can see that age and hospital stay no longer demonstrate as strong a relationship. A large part of the observed association between them could be accounted for by their common relationship with functional status.

The term *r*_{XY·Z} is called a **first-order partial correlation**, because it represents a correlation with the effect of one variable eliminated. The simple correlation between *X* and *Y* is called a **zero-order correlation**. The significance of a first-order partial correlation can be determined by referring to critical values of *r* in Appendix Table A.4, using *n* − 3 degrees of freedom. Partial correlation can be expanded to control for more than one variable at a time. A *second-order partial correlation*, controlling for two variables *Z* and *W*, is symbolized by *r*_{XY·ZW}. This value can be checked for significance using Table A.4 with *n* − 4 degrees of freedom. This process can continue with higher-order partial correlations.

Partial correlation is a useful analytic tool for eliminating competing explanations for an association, thereby providing a clearer explanation of the true nature of an observed relationship and ruling out extraneous factors.
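The first-order partial correlation can be computed directly from the three zero-order correlations. A minimal sketch in Python, using hypothetical correlation values (the chapter's actual inputs are not reproduced here):

```python
import math

def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation r_XY.Z: the correlation of
    X and Y with the effect of Z statistically removed."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Hypothetical zero-order correlations (illustrative only):
# r_xy = age vs. stay, r_xz = age vs. function, r_yz = stay vs. function
r = partial_r(r_xy=0.70, r_xz=0.80, r_yz=0.75)
print(round(r, 3))  # much smaller than the zero-order .70
```

Note that when *Z* is uncorrelated with both *X* and *Y* (*r*_{XZ} = *r*_{YZ} = 0), the partial correlation reduces to the zero-order correlation.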

^{∗}The partial correlation coefficient is calculated using the formula *r*_{XY·Z} = (*r*_{XY} − *r*_{XZ}*r*_{YZ}) / √[(1 − *r*_{XZ}^{2})(1 − *r*_{YZ}^{2})].

**Multiple regression** is an extension of simple linear regression analysis, described in Chapter 24. The multiple regression equation allows the researcher to predict the value Ŷ using a set of several independent variables. It can accommodate continuous and categorical independent variables, which may be naturally occurring or experimentally manipulated. The dependent variable, *Y*, must be a continuous measure. A common purpose of regression analysis is prognostic, predicting a given outcome based on identified factors. For instance, Stineman and Williams^{1} developed a model to predict rehabilitation length of stay based on the patient's admitting diagnosis, referral source and admission functional status. A second purpose of regression is to better understand a clinical phenomenon by identifying those factors associated with it. To illustrate this application, Walker and Sofaer^{2} studied sources of psychological distress in patients attending pain clinics. They identified that 60% of the variance associated with psychological distress was explained by a combination of fears about the future, regrets about the past, age, practical help, feeling unoccupied and personal relationship problems. This type of analysis will often present opportunities for the analysis of theoretical components of constructs.

Recall that the regression equation, *Ŷ* = *a* + *bX*, defines a line that can be used to make predictions, with an inherent degree of random error. This error, or **residual variance**, represents variance in *Y* that is not explained by the predictor variable, *X.* For example, suppose we were interested in predicting cholesterol level using body weight as the independent variable, with *r* = .48 and *r*^{2} = .23. Based on the limited strength of this relationship, we would expect that a regression equation would provide estimates of cholesterol that would be different from actual values, as body weight by itself does not adequately explain cholesterol level. Therefore, the remaining unexplained variance in cholesterol (77%) must be a function of other factors. For instance, cholesterol may also be related to variables such as blood pressure, gender, age or diet. If we were to add these variables to the regression equation, the unexplained portion of variance would probably be decreased (although not necessarily eliminated). This expanded analysis results in a **multiple regression equation**.

In multiple regression, the regression equation accommodates multiple predictor variables:

*Ŷ* = *a* + *b*_{1}*X*_{1} + *b*_{2}*X*_{2} + *b*_{3}*X*_{3} + … + *b*_{k}*X*_{k}

where *Ŷ* is the predicted value for the dependent variable, *a* is a **regression constant**, and *b*_{1}, *b*_{2}, *b*_{3} through *b*_{k} are **regression coefficients** for each independent variable. The subscript, *k*, denotes the number of independent variables in the equation.^{†} Like simple linear regression, multiple regression is also based on the concept of least squares, so that the model minimizes deviations of *Ŷ* from *Y*.

Once regression coefficients and a constant are obtained, we can predict values of *Y* by substituting values for each independent variable in the equation. For instance, suppose we wanted to evaluate the predictive relationship between serum blood cholesterol (CHOL) and potential contributing factors including age (AGE), daily dietary fat intake in grams (DIET), gender (GENDER), systolic blood pressure (SBP), and weight (WT). Table 29.1A shows the intercorrelations among these variables. The coefficients for the regression equation are shown in Table 29.1B➍, including the constant:

*Ŷ*= 19.116 + .012(AGE) + 3.094 (DIET) + .218 (SBP) + 4.158 (GENDER) + .511 (WT)

Based on this equation, for a 34-year-old subject, with DIET = 20.0 g, GENDER = 1 (coded for male), SBP = 100 mmHg and WT = 150 pounds, we can predict cholesterol value as follows:

*Ŷ* = 19.116 + .012(34) + 3.094(20.0) + .218(100) + 4.158(1) + .511(150) = 184.01

If this person's true cholesterol level was 175, the residual would be 175 − 184.01 = −9.01 (*Y* − *Ŷ*). Scatter plots can also be requested to analyze the residuals, typically plotting the predicted values on the *X*-axis against the residuals on the *Y*-axis. Visual analysis of residuals can reveal if the assumption of linearity in the data is violated (see Chapter 24, Figure 24.6).
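The prediction and residual can be verified with a few lines of arithmetic, using the coefficients from Table 29.1B as reported in the text:

```python
# Regression coefficients and constant from the chapter's cholesterol example
coef = {"AGE": 0.012, "DIET": 3.094, "SBP": 0.218, "GENDER": 4.158, "WT": 0.511}
constant = 19.116

# The 34-year-old subject described in the text
subject = {"AGE": 34, "DIET": 20.0, "SBP": 100, "GENDER": 1, "WT": 150}

y_hat = constant + sum(coef[v] * subject[v] for v in coef)
residual = 175 - y_hat  # Y - Y-hat, given a true cholesterol of 175

print(round(y_hat, 2), round(residual, 2))  # 184.01 -9.01
```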

Regression coefficients are interpreted as *weights* that identify how much each variable contributes to the explanation of *Y.* As part of the regression analysis, a test of significance is performed on each regression coefficient, to test the null hypothesis, H_{0}: *b* = 0. Depending on the statistical package this will be done using either an *F*-test or a *t*-test, as shown in Table 29.1B➏. In this example, the coefficients for AGE, GENDER and SBP are not significant (*p* > .05). Therefore, these three variables are not making a significant contribution to the prediction of cholesterol level.

^{†}The number of independent variables included in the regression equation is effectively limited by the sample size. Power analysis can be done to estimate the number of subjects that would be needed to identify a significant effect, based on the number of independent variables in the equation. See Appendix C.

Researchers often want to establish the relative importance of specific variables within a regression equation. The regression coefficients cannot be directly compared for this purpose because they are based on different units of measurement. When it is of interest to determine which variables are more heavily weighted, we must convert the weights to standardized regression coefficients, called **beta weights**. These standardized values are interpreted as relative weights, indicating how much each variable contributes to the value of *Ŷ*. For example, the beta weights listed in Table 29.1B➎ show that DIET and WT are the most important variables for predicting cholesterol. The sign of the beta weight indicates the positive or negative relationship between each variable and *Y*, but only the absolute value is considered in determining the relative weight. Some authors present beta weights in addition to regression coefficients in a research report, to provide the reader with a full and practical interpretation of the observed relationships.
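Converting a raw coefficient to a beta weight requires only the standard deviations of the predictor and of *Y*. A minimal sketch, using the chapter's DIET coefficient but hypothetical standard deviations (the actual values are not given in the text):

```python
def beta_weight(b, sd_x, sd_y):
    """Standardized regression coefficient (beta weight):
    the raw coefficient rescaled to standard-deviation units."""
    return b * sd_x / sd_y

# b = 3.094 is the chapter's DIET coefficient; the standard deviations
# below are hypothetical, chosen only to illustrate the conversion.
beta_diet = beta_weight(b=3.094, sd_x=6.0, sd_y=30.0)
print(round(beta_diet, 3))
```

Because beta weights are unit-free, their absolute values can be compared across predictors in the same equation, which raw coefficients cannot.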

A problem occurs in the interpretation of beta weights if the independent variables in the regression equation are correlated with each other. This situation is called **multicollinearity**. The coefficients assigned to variables within the equation are based on the assumption that each variable provides independent information, contributing a unique part of the total explanation of the variance in *Y.* If independent variables are related to each other, the information they provide to the model is partially redundant. In that case, one variable may be seen as contributing a lot of information, and the second variable may be seen as contributing little; that is, one variable may have a larger beta weight. Each variable may be highly predictive of *Y* when used alone, but they are redundant when used together. This situation can be avoided by determining the intercorrelations among predictor variables prior to running a regression analysis and selecting independent variables that are not highly correlated with each other.

The interpretation of multicollinearity is based on the concept of partial correlation; that is, each regression coefficient represents the importance of a single variable after having accounted for the effect of all other variables in the equation. Therefore, the value of a regression coefficient is dependent on which other independent variables are in the equation. With different combinations of variables, it is likely that a particular regression coefficient will vary. It is important to remember, therefore, that the relationships defined by a regression equation can be interpreted only within the context of the specific variables included in that equation.

The overall association between *Y* and the complete set of independent variables is defined by the **multiple correlation coefficient, R.** This value will range from 0.00 to 1.00; however, because *R* represents the cumulative association of many variables, its interpretation is obscure. Therefore, its square (*R*^{2}) is used more often as an explanation of the functional relationship between *Y* and a series of *X* values.

As an analogue of *r*^{2}, the value of *R*^{2} represents the proportion of the total variance in *Y* that is explained by the set of independent variables in the equation; that is, it is the *variance attributable to the regression. R*^{2} is the statistic most often reported in journal articles to indicate the accuracy of prediction of a regression analysis. Higher values of *R*^{2} reflect stronger prediction models. The complement, 1 − *R*^{2}, is the proportion of the variance that is left unexplained, or the variance attributable to deviations from the regression. Table 29.1B➋ shows that *R*^{2} = .534 for the cholesterol analysis, indicating that this group of variables accounts for slightly more than half of the variance in cholesterol.

An *adjusted R*^{2} is also generated for the regression (Table 29.1B➌). This value represents a chance-corrected value for *R*^{2}; that is, we can expect some percent of explained variance to be a function of chance. Some researchers prefer to report the adjusted value as a more accurate reflection of the strength of the regression, especially with a large number of variables in the equation.

Many regression programs will also generate a value for the **standard error of the estimate (SEE)**, as shown in Table 29.1B. This value represents the degree of variability in the data around the multidimensional "regression line," reflecting the prediction accuracy of the equation (see Chapter 24 for discussion of the SEE).

A multiple regression analysis generates an analysis of variance to test the linear fit of the equation. The ANOVA partitions the total variance in the data into the variance that is explained by the regression and that part that is left unexplained, or the residual error. The degrees of freedom associated with the regression will equal *k*, where *k* represents the number of independent variables in the equation. The probability of *F* associated with the regression will indicate if the equation provides an explanation of *Y* that is better than chance. The ANOVA in Table 29.1B demonstrates a significant model for the cholesterol data (*F* = 21.512, *p* < .001).
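*R*^{2}, the adjusted *R*^{2}, and the regression *F* are linked by simple formulas. A sketch of those formulas, assuming a sample of *n* = 100 (the chapter does not state *n* for this example), approximately reproduces the tabled *F* of 21.512 for *R*^{2} = .534 with *k* = 5 predictors:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n subjects and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

def regression_f(r2, n, k):
    """Overall F test of the regression, with k and n-k-1 df."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

r2, n, k = 0.534, 100, 5   # n = 100 is an assumption, not stated in the text
print(round(adjusted_r2(r2, n, k), 3))
print(round(regression_f(r2, n, k), 2))
```

Note how the adjustment shrinks *R*^{2} more aggressively as *k* grows relative to *n*, which is why the adjusted value is preferred with many predictors.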

Multiple regression can be run by "forcing" a set of variables into the equation, as we have done in the cholesterol example. With all five variables included, the equation accounted for 53% of the variance in cholesterol values, although the results demonstrated that the independent variables did not all make significant contributions to that estimate. We might ask, then, if the level of prediction accuracy achieved in this analysis could have been achieved with fewer variables. To answer this question, we can use a procedure called **stepwise multiple regression**, which uses specific statistical criteria to retain or eliminate variables to maximize prediction accuracy with the smallest number of predictors. It is not unusual to find that only a few independent variables will explain almost as much of the variation in the dependent variable as can be explained by a larger number of variables. This approach is useful for homing in on those variables that make the most valuable contribution to a given relationship, thereby creating an economical model.

Stepwise regression is accomplished in "steps" by evaluating the contribution of each independent variable in sequential fashion.^{‡} First, all proposed independent variables are correlated with the dependent variable, and the one variable with the highest correlation is entered into the equation at step 1. For our cholesterol example, Table 29.1A shows us that DIET has the highest correlation with CHOL (*r* = .634). Therefore, DIET will be entered on the first step. With this variable alone, *R*^{2} = .401 (see Table 29.2➋). The regression coefficients for this first step are shown in Table 29.2➌:

*Ŷ*= 121.65 + 3.55(DIET)

At this point, the remaining variables (those "excluded" from the equation) are examined for their partial correlation with *Y*, that is, their correlation with CHOL with the effect of DIET removed (see Table 29.2➍). The variable with the highest significant partial correlation coefficient is then added to the equation, in this case, WT (partial *r* = .462, *p* = .000). Therefore, WT is added in step 2 (see Table 29.2➎). With the addition of this variable, we have achieved an *R*^{2} of .529 (see Table 29.2➏), only slightly lower than the value obtained with the full model. The adjusted *R*^{2} is higher, however, because there are fewer variables in this model.

Another criterion for entry of a variable is its **tolerance level**. Tolerance refers to the degree of collinearity in the data. Tolerance ranges from 0.00, indicating that the variable is perfectly correlated with the variables already entered, to 1.00, which means that the variable is unrelated to the others (see Table 29.2➐). The higher the tolerance, the more new information a variable will contribute to the equation. Some computer programs will automatically generate tolerance levels for each variable. Others offer options that must be specifically requested to include tolerance values (collinearity statistics) in the printout.
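Tolerance for a given predictor is 1 − *R*^{2} from regressing that predictor on all the others. A minimal NumPy sketch (the data are synthetic, chosen to show one perfectly collinear column):

```python
import numpy as np

def tolerance(X, j):
    """Tolerance of column j: 1 - R^2 from regressing X[:, j]
    on the remaining columns (with an intercept)."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])   # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return np.sum(resid**2) / np.sum((y - y.mean())**2)  # 1 - R^2

x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = 2 * x1 + 1                                  # perfectly collinear with x1
x3 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])    # only loosely related
X = np.column_stack([x1, x2, x3])

print(tolerance(X, 1))  # near 0: x2 adds no new information
print(tolerance(X, 2))  # well above 0: x3 contributes something new
```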

The stepwise regression continues, adding a new variable at each successive step of the analysis if it meets certain *inclusion criteria*; that is, its partial correlation is highest of all remaining variables, and the test of its regression coefficient is significant. This process continues until, at some point, either all variables have been entered or the addition of more variables will not significantly improve the prediction accuracy of the model. In the current example, Table 29.2➑ shows us that none of the partial correlations of the remaining three variables is significant. Therefore, no further variables were entered after step 2. As shown in Table 29.2➒, the final model for the stepwise regression is

*Ŷ*= 48.21 + 3.12(DIET) + .508(WT)

Note that the coefficients in the equation have changed with the addition of WT as a variable. There are times when no variables will be entered if none of them satisfy the minimal inclusion criteria. In that case, the researcher must search for a new set of independent variables to explain the dependent variable.

^{‡}Stepwise procedures may be classified as *stepwise, forward* or *backward* inclusion. Forward inclusion means that the model starts with no variables, and adds variables one by one until the inclusion criterion is satisfied. This procedure is differentiated from stepwise regression in many statistical programs. While both proceed using a forward selection method, adding a new variable at each step, the stepwise procedure can also remove a variable at any step, if that variable no longer contributes significantly to the model, given the current variables in the equation. The procedure will specify a significance criterion to enter variables as well as to remove them. In the backward inclusion method, the model starts with all variables in the equation, and partial correlations are calculated as if each one were the last variable to be entered. Using criteria for removal, the variable with the smallest partial correlation is taken out. Steps proceed until no remaining variables are qualified for removal.
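A bare-bones forward selection can be sketched as follows: at each step, add the candidate variable that most improves *R*^{2}, and stop when the gain falls below a threshold. This is a simplification of the procedures above (real programs use partial correlations and significance tests as entry criteria), run here on synthetic data:

```python
import numpy as np

def r_squared(X_cols, y):
    """R^2 from an ordinary least-squares fit with intercept."""
    A = np.column_stack([np.ones(len(y))] + X_cols)
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1 - np.sum(resid**2) / np.sum((y - y.mean())**2)

def forward_select(X, y, min_gain=0.01):
    """Greedy forward inclusion: add the best remaining variable
    each step until the improvement in R^2 drops below min_gain."""
    selected, r2 = [], 0.0
    while True:
        gains = {j: r_squared([X[:, k] for k in selected + [j]], y) - r2
                 for j in range(X.shape[1]) if j not in selected}
        if not gains:
            break
        best = max(gains, key=gains.get)
        if gains[best] < min_gain:
            break
        selected.append(best)
        r2 += gains[best]
    return selected, r2

# Synthetic data: only the first two of four candidates actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200)
selected, r2 = forward_select(X, y)
print(selected, round(r2, 3))  # the two true predictors are kept
```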

One of the general assumptions for regression analysis is that variables are continuous; however, many of the variables that may be useful predictors for a regression analysis, such as gender, occupation, education and race, or behavioral characteristics such as smoker versus nonsmoker, are measured on a categorical scale. It is possible to include such qualitative variables in a regression equation, although the numbers assigned to categories cannot be treated as quantitative scores. One way to do this is to create a set of coded variables called **dummy variables**.

In statistics, **coding** is the process of assigning numerals to represent categorical or group membership. For regression analysis we use 0 and 1 to code for the absence and presence of a dichotomous variable, respectively. All dummy variables are dichotomous. For example, with a variable such as smoker-nonsmoker, we code 0 = nonsmoker and 1 = smoker. For sex, we can code male = 0 and female = 1. In essence we are coding 1 for female and 0 for anyone who is not female. We can use these codes as scores in a regression equation and treat them as interval data.

For instance, we could include gender as a predictor of cholesterol level, to determine if men or women can be expected to have higher cholesterol levels. Assume the following regression equation was obtained:

*Ŷ* = 220 − 27.5*X*

Using the dummy code for females, *Ŷ* = 220 − 27.5(1) = 192.5, and for males *Ŷ* = 220 − 27.5(0) = 220. With only this one dummy variable, these predicted values are actually the means for cholesterol for females and males. The regression coefficient for *X* is the difference between the means for the groups coded 0 and 1.

When a qualitative variable has more than two categories, more than one dummy variable is required to represent it. For example, consider the variable of college class, with four levels: freshman, sophomore, junior and senior. We could code these categories with the numbers 1 through 4 on an apparent ordinal scale; however, these numerical values would not make sense in a regression equation, because the numbers have no quantitative meaning. A senior is not four times more of something than a freshman. Therefore, we must create a dichotomous dummy variable for each category, as follows:

Each variable codes for the presence or absence of a specific class membership. We do not need to create a fourth variable for seniors, because anyone who has zero for all three variables will be a senior. We can show how this works by defining each class with a unique combination of values for *X*_{1}, *X*_{2} and *X*_{3}:

| | X_{1} | X_{2} | X_{3} |
|---|---|---|---|
| Freshman | 1 | 0 | 0 |
| Sophomore | 0 | 1 | 0 |
| Junior | 0 | 0 | 1 |
| Senior | 0 | 0 | 0 |

The number of dummy variables needed to define a categorical variable will always be one less than the number of categories.
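The coding scheme can be expressed directly in code. Each class is a unique combination of the three dummies, and (anticipating the attitude equation discussed next, *Ŷ* = 85 − 55*X*_{1} − 25*X*_{2} − 15*X*_{3}) the predicted score for each class follows mechanically:

```python
# Dummy codes (X1, X2, X3) for each class; seniors are the reference group
dummies = {
    "Freshman":  (1, 0, 0),
    "Sophomore": (0, 1, 0),
    "Junior":    (0, 0, 1),
    "Senior":    (0, 0, 0),
}

def predict_attitude(x1, x2, x3):
    """Y-hat = 85 - 55*X1 - 25*X2 - 15*X3 (the chapter's equation)."""
    return 85 - 55 * x1 - 25 * x2 - 15 * x3

for cls, code in dummies.items():
    print(cls, predict_attitude(*code))
```

Because seniors are coded (0, 0, 0), the constant 85 is the predicted score for the reference group, and each coefficient is that class's deviation from it.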

Suppose we wanted to predict a student's attitude toward the disabled, on a scale of 0 to 100, based on class membership. We might develop an equation such as

*Ŷ* = 85 − 55*X*_{1} − 25*X*_{2} − 15*X*_{3}

Therefore, the predicted values for each class would be *Ŷ* = 30 for freshmen (85 − 55), *Ŷ* = 60 for sophomores (85 − 25), *Ŷ* = 70 for juniors (85 − 15), and *Ŷ* = 85 for seniors.

Several dummy variables can be combined with quantitative variables in a regression equation. Because so many variables of interest are measured at the nominal level, the use of dummy variables provides an important mechanism for creating a fuller explanation of clinical phenomena. Some computer programs will automatically generate dummy codes for nominal variables. For others, the researcher must develop the coding scheme.

Many questions of prediction or explanation involve outcomes that are categorical. For example, we might ask why some individuals experience recurrent falls. VanSwearingen et al^{3} identified mobility and functional characteristics that could predict whether a person did or did not have a history of falls. We might look for factors related to whether or not a patient returns to work following rehabilitation. Cifu et al^{4} examined several measures of physical and psychological function as predictors of successful return to work one year after traumatic brain injury. These examples illustrate the application of **logistic regression**, where the dependent variable has only two values—the occurrence or nonoccurrence of a particular event, or the presence or absence of a condition, typically coded 0 and 1.^{§} We cannot use multiple regression for this purpose, as a categorical dependent variable cannot meet the assumption of a normal distribution (see Chapter 24, Figure 24.5). The independent variables in logistic regression may be continuous, ordinal or categorical. Logistic regression can be run using a full set of independent variables, or it may be run using a stepwise procedure.

In logistic regression, rather than predicting the value of an outcome variable, we are actually predicting the probability of an event occurring. Using the regression equation, we determine if the independent variables can predict whether an individual is likely to belong to the group coded 0 (the reference group) or the group coded 1 (the target group). Consider the following hypothetical example. Suppose we wanted to predict the discharge disposition for patients following rehabilitation, as either "return to home" (coded 0) or "long-term care" (coded 1). We would like to set appropriate goals and begin suitable discharge planning as soon as possible, and we would like to determine if characteristics upon admission will be useful predictors of discharge status. We will use the following variables, coded as present (1) or absent (0), except for age, which is continuous:

| Variable | Label | Coding |
|---|---|---|
| Functional status | ADL | 0 = independent; 1 = limited |
| Age | AGE | Continuous |
| Marital status | MAR | 0 = married; 1 = not married |
| Gender | GENDER | 0 = male; 1 = female |

We will examine data from 100 patients, 46 of whom went to long-term care (LTC). The statistical question is: What is the likelihood that an individual will be discharged to LTC given this combination of factors? The results of a logistic regression for these variables are shown in Table 29.3.

^{§}Logistic regression can be used when the outcome variable has more than two categories, an approach that is beyond the scope of this text.

We can think of the **logistic function** as a linear combination of these variables, similar to the linear regression equation. The likelihood of the predicted outcome is based on the odds of being discharged to LTC, or more accurately, the logarithm of the odds:

*Z* = *a* + *b*_{1}*X*_{1} + *b*_{2}*X*_{2} + … + *b*_{k}*X*_{k}

where *Z* is the natural logarithm of the odds, called a **logit**, *a* is a regression constant, and *b*_{1} through *b*_{k} are the regression coefficients. Even though we are using a different mathematical base (logarithms), this equation is conceptually the same as the multiple regression equation—but with two major differences. First, the dependent variable (the logit) is a dichotomous outcome, resulting in prediction of group membership. Second, where multiple regression uses the least squares criterion for finding the equation with the smallest residuals, logistic regression uses the concept of **maximum likelihood**, which means that the equation will present the "most likely" solution that demonstrates the best odds of achieving accurate prediction of group membership.

Coefficients for the logistic regression for our discharge status question are shown in Table 29.3➋. This logistic regression equation would, therefore, be written:

We can use the coefficients in the logistic regression equation to predict the probability that an individual belongs to the target group, as follows:

*p* = *e*^{Z} / (1 + *e*^{Z})

where *e* is the base of the natural logarithm.^{∗∗} The probability associated with the outcome will be 0 if the subject is discharged home, and 1 if long-term care. We can expect, however, that the logistic regression will yield probabilities between 0 and 1. A value closer to 1.0 (above .5) will suggest a probability in favor of discharge to long-term care, and a value closer to zero (below .5) would predict that this event is not likely to occur; that is, the subject is likely to be discharged home. A probability of .5 would mean that the individual has an equal likelihood of either outcome.
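The logit-to-probability conversion can be checked numerically. A minimal sketch using *Z* = 2.264, the value implied by the footnote's *e*^{2.264} = 9.62:

```python
import math

def prob_from_logit(z):
    """Probability of the target outcome from a logit Z:
    p = e^Z / (1 + e^Z)."""
    return math.exp(z) / (1 + math.exp(z))

p = prob_from_logit(2.264)
print(round(p, 3))  # 0.906: above .5, so LTC is the predicted outcome
```

A logit of 0 yields *p* = .5 exactly, the point at which either outcome is equally likely.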

When this model is applied to an individual's data, we obtain the probability of that individual being discharged to long-term care. Consider, for example, a subject who ultimately was discharged to LTC, who had the following scores: ADL = 1, AGE = 78, MAR = 1 and GENDER = 0, for which the model yields *Z* = 2.264.

Therefore,^{††}

*p* = *e*^{2.264} / (1 + *e*^{2.264}) = 9.62/10.62 = .906

Using this model, we would have correctly predicted that this individual would be discharged to LTC, as the probability is greater than .5.

Let's look at another example for a subject who was also discharged to LTC, with the following data: ADL = 1, AGE = 80, MAR = 0, and GENDER = 0:

Therefore,

We would incorrectly predict that this individual would be discharged home because the probability is less than .5.

A histogram helps us visually understand how these predictions are interpreted. In Figure 29.2 we see such a graph of the predictions for subjects in this example, where the symbol "0" represents those who were actually discharged home, and the symbol "1" represents those who went to long-term care. The *X* axis shows the predicted probabilities associated with each individual's scores. In this instance, probabilities above .5 are assigned to group 1, whereas probabilities of .5 or below are assigned to the group coded 0. Therefore, on the left half of the graph we can see that nine of those who actually went to long-term care (coded 1) were predicted to go home, whereas on the right half we find that eight of those who went home (coded 0) were predicted to go to long-term care. These incorrect classifications are shaded on the graph. Classification results are also given in Table 29.3➊. A total of 83% of the sample was correctly classified using this logistic model. Over 80% of those who actually went to long-term care (1) were correctly classified; approximately 85% of those who went home (0) were correctly assigned using the logistic regression model.

###### FIGURE 29.2

Histogram (*N* = 100) of estimated probabilities of being discharged home (0) versus discharged to long-term care (1), derived from logistic regression. Each symbol represents one subject. Shaded symbols represent misclassifications using a cutoff score of .50. (*Histogram obtained using SPSS 8.0 logistic regression procedure.*)

The histogram also allows us to see the effect of using this model and the consequences of misclassification. For instance, in this example we can see that most of the misclassifications occur in the region around .5. In setting discharge plans, we might want to reserve judgment for this group. We would be more confident, however, in setting up home discharge plans for those with probabilities below .25, and similarly confident in securing a bed in a skilled nursing facility for those with probabilities above .75.

^{∗∗}Find the key marked *e ^{x}* on your scientific calculator.

^{††}The values *e*^{2.264} = 9.62 and *e*^{−2.264} = .104.

It is generally more useful to interpret regression coefficients in terms of *odds* rather than probability. Odds tell us how much more likely it is that an individual belongs to the target group than the reference group. If the odds are 1.00, then either outcome is equally likely. With odds greater than 1.00, the individual is more likely to belong to the target group; conversely, with odds less than 1.00, the individual is more likely to belong to the reference group.

The **odds ratio** is used to estimate the odds of membership in the target group, given the presence of specific independent variables (see Chapter 28 for discussion of the odds ratio). The regression coefficient in the equation is the logarithm of the odds for each independent variable. Therefore, an odds ratio can be computed for each variable by using the regression coefficient as the exponent of *e* (see Table 29.3➍). For a subject who is limited in ADL (ADL = 1), the odds of going to LTC are *e*^{2.384} = 10.848. This number represents the odds of going to LTC with a one-unit change in the value of *X.* With a dichotomous variable, this means that an individual who is limited in ADL is almost 11 times more likely to go to LTC as compared to one who is independent (a change from 0 to 1 for ADL). Confidence intervals can also be determined for each odds ratio (see Table 29.3➎). A significant odds ratio will not contain the null value, 1.0, within the confidence interval. We can see that this is true for the odds ratios associated with ADL and MARital status.
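These calculations are easy to reproduce by hand. The sketch below converts the ADL coefficient into an odds ratio and a 95% confidence interval; the standard error is a hypothetical value standing in for the one that would be read from the computer output:

```python
import math

# ADL regression coefficient from Table 29.3; the standard error here is a
# hypothetical illustration (it would be read from the same output table)
b, se = 2.384, 0.75

odds_ratio = math.exp(b)              # e^2.384, approximately 10.85
lower = math.exp(b - 1.96 * se)       # 95% confidence limits, computed on the
upper = math.exp(b + 1.96 * se)       # log-odds scale and then exponentiated

significant = not (lower <= 1.0 <= upper)  # CI excluding 1.0 -> significant
```

Because the interval is built on the log-odds scale, it is symmetric around the coefficient but asymmetric around the odds ratio itself.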

When the logistic regression equation includes several independent variables, as in our example, each odds ratio is actually corrected for the influence of the other variables. Just as independent variables in multiple regression exhibit collinearity, independent variables in logistic regression will affect each other. This is an important consideration for prediction models. For instance, if we were to look at the simple association between discharge status and ADL, we would find an odds ratio of 15.909 (see Table 29.4). This means that individuals who are limited in ADL are almost 16 times more likely to be discharged to long-term care than those who are independent. However, if we look at the results of the logistic regression in Table 29.3➍, we find that the odds ratio associated with ADL is 10.848. This discrepancy is a function of the other variables in the equation; that is, the odds ratio for ADL is *adjusted* for the influence of the other factors. Therefore, the odds ratios shown in Table 29.3➍ are considered **adjusted odds ratios**.

When an independent variable is continuous, the interpretation of logistic regression is more complex. Consider the effect of AGE on discharge status, with an odds ratio of 1.11. Remember that an odds ratio of 1.0 indicates that either outcome is equally likely. Because the odds ratio relates to the relative increase in odds with a one-unit increase in *X*, we can interpret this value as the odds associated with a 1-year difference in age, such as from 87 to 88, or any other 1-year difference. Therefore, with a 1-year difference in age, the odds of going home or to long-term care are essentially even. As the unit difference increases, however, we must multiply the regression coefficient for age (B = .104, Table 29.3➋) by the number of units before exponentiating to obtain the odds ratio. With a 2-year difference in age, then, we determine the odds ratio by *e*^{(2×.104)} = 1.23. Not much of a change. To determine the odds related to a 10-year difference in age, we find *e*^{(10×.104)} = 2.83. Now the odds of going to long-term care are almost three times greater for someone who is 80 as compared to someone who is 70, or for someone who is 75 compared to someone who is 65. Many researchers choose to categorize continuous variables to simplify this interpretation.
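These conversions all follow a single rule: the odds ratio for a difference of Δ units is *e*^{(Δ×B)}. A minimal sketch using the age coefficient from Table 29.3:

```python
import math

b_age = 0.104  # regression coefficient for AGE (Table 29.3)

def age_odds_ratio(delta_years):
    """Odds ratio for a difference of delta_years in age: e^(delta * B)."""
    return math.exp(delta_years * b_age)

or_1 = age_odds_ratio(1)    # roughly 1.11, a 1-year difference
or_2 = age_odds_ratio(2)    # roughly 1.23
or_10 = age_odds_ratio(10)  # roughly 2.83
```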

The presentation of results from a logistic regression will depend on the research question. In many research situations, the investigator is actually interested in one particular variable, but wants to control for potential confounders. Using our discharge study, we might be specifically interested in the effect of function on discharge status, but we would want to account for the influence of demographic factors. In that case we might report that the odds ratio for ADL was 10.848, adjusted for age, marital status and gender.

Alternatively, we could approach this analysis using a broader question, asking which of these four factors is related to discharge status. For this approach, we would summarize results, suggesting that ADL and MARital status are most influential in predicting discharge status, adjusted for age and gender. In addition to the increased likelihood of going to LTC if the patient is functionally limited, those who are not married are almost 19 times more likely to be sent to LTC than those who are married.

Another consideration in presenting results is the significance associated with each independent variable. In the current example, only ADL and MARital status have significant regression coefficients (Table 29.3➌). Some authors will present coefficients and odds ratios for all independent variables, regardless of their significance. Others will provide odds ratios only for significant variables.

**Discriminant analysis** is another analogue of multiple regression, also used when the dependent variable is categorical. It is a technique for distinguishing between two or more groups based on a set of characteristics that are predictors of group membership. Based on the equation generated by the discriminant analysis, subjects are classified according to their scores, and the model is then examined to see if the classifications were correct. Discriminant analysis has an important distinction from logistic regression, in that the independent variables are assumed to be normally distributed, and variances are assumed to be equal across groups. Dichotomous independent variables can be used, but with a mixture of continuous and dichotomous variables, discriminant analysis may be less than optimal.

The ability to classify individuals into distinct groups can be useful in many areas of clinical and behavioral science, for purposes of prevention, evaluation, screening, and diagnosis. For example, Ermer and Dunn^{5} studied three groups of children: children with autism, children with attention deficit disorder, and children without disabilities. The researchers conducted a discriminant analysis to determine if these groups could be differentiated on the basis of their scores on nine factors of a Sensory Profile. Nearly 90 percent of the cases were correctly classified using the resulting model, supporting its validity.

The discriminant analysis develops a statistical model, called a **discriminant function**, that will allow us to describe the existing groups and to assign new individuals to a group when it is not known to which group they belong. Discriminant analysis can be performed using a fixed set of variables or in a stepwise manner to reduce the discriminant function to a minimum of relevant variables.

To demonstrate this process, consider a hypothetical example in which we are interested in distinguishing between athletes who are likely to sustain an injury over the course of a season (designated group 1) versus those who will remain uninjured (designated group 0). Using a group of athletes from one school, we will consider overall strength, flexibility, balance and time in play as risk factors. To illustrate these relationships, consider only the first two variables for a moment. In Figure 29.3 we have plotted scores representing strength (*Y*) and flexibility (*X*) for injured and noninjured groups. In Figure 29.3A, the variables clearly discriminate between the groups, with those who were not injured demonstrating greater strength and flexibility; however, even with this degree of separation, we can see that discrimination will not be totally accurate because there is some overlap between the groups. Figure 29.3B represents a different situation, where there is much less differentiation between the groups, and it is likely that the independent variables would not be successful in distinguishing between them. When we incorporate many more variables into the analysis, we cannot visualize discrimination in a two-dimensional plot, but we can extend this illustration conceptually to visualize the discrimination between groups in multiple planes.

Any number of predictor variables can be used to develop the **discriminant function**, which is analogous to the multiple regression prediction equation. The equation takes the form:

*D* = *a* + *d*_{1}*X*_{1} + *d*_{2}*X*_{2} + ⋯ + *d*_{k}*X*_{k}

where *D* is the *discriminant score*, *a* is a constant, *d* is the *discriminant function coefficient,* and *k* is the number of predictor variables in the equation. The discriminant score for each subject is calculated by substituting scores for each predictor variable into the equation (see Table 29.5➎). The purpose of the discriminant function is to determine the linear combination of variables that makes the groups as statistically distinct as possible; that is, it provides maximum discrimination between the groups. Discriminant function coefficients are often expressed as *standardized coefficients*, without a constant in the equation, similar to a beta weight in linear regression.
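Computing a discriminant score for one subject is just the weighted sum this equation describes. The constant, coefficients and predictor scores below are all hypothetical:

```python
# Discriminant score for one subject: D = a + d1*X1 + ... + dk*Xk.
# The constant, coefficients and predictor scores are all hypothetical.
a = -4.2                     # constant
d = [0.8, 0.5, 0.3, 0.6]     # discriminant function coefficients
x = [3.0, 2.0, 4.0, 1.5]     # one subject's scores on the four predictors

D = a + sum(di * xi for di, xi in zip(d, x))  # 1.3 for these values
```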

When more than two criterion groups are used, discriminant analysis becomes more complex, necessitating the development of more than one discriminant function. With *k* groups, we will require *k* − 1 discriminant functions. For example, in the study by Ermer and Dunn,^{5} where three groups were used, two discriminant functions were generated, one distinguishing normal children from the two disabled groups, and the second distinguishing the two disabled groups from each other.

The ability of the discriminant function to distinguish between groups can be assessed in several ways. The statistics associated with the equation are shown in Table 29.5. An **eigenvalue** (see Table 29.5➋) is a measure of variance, indicating how well the discriminant function discriminates between the groups; the higher the eigenvalue, the greater the discrimination.^{‡‡} This value is difficult to interpret, however, as it has no upper limit. Therefore, it is usually preferable to use a measure of correlation that ranges from 0 to 1, similar to the interpretation of *R*^{2}. The **canonical correlation** expresses this relationship, conceptually serving as a correlation of group membership with the discriminant function (see Table 29.5➌). The square of the canonical correlation reflects the extent to which the variance in scores on the discriminant function accounts for differences among the groups. In this example, with a canonical correlation of .809, approximately 65% of the variability in scores is accounted for by the differences between injured and noninjured athletes. A chi-square test is used to determine the significance of this relationship (see Table 29.5➍).
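For a two-group analysis, the eigenvalue and the canonical correlation are two expressions of the same quantity, related by Rc² = eigenvalue/(1 + eigenvalue). A short sketch using the canonical correlation from this example:

```python
import math

# For a two-group discriminant function, the squared canonical correlation
# equals eigenvalue / (1 + eigenvalue)
canonical_r = 0.809
variance_explained = canonical_r ** 2                # about .65 of score variance

eigenvalue = variance_explained / (1 - variance_explained)
rc_check = math.sqrt(eigenvalue / (1 + eigenvalue))  # recovers .809
```

This is why the unbounded eigenvalue and the bounded canonical correlation always lead to the same conclusion about discrimination; the latter is simply easier to interpret.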

^{‡‡}An eigenvalue is analogous to an *F* ratio, the ratio of the between-groups sum of squares to the within-groups sum of squares that would be generated in an analysis of variance, with group as the independent variable and the discriminant function as the dependent variable (the discriminant function is interpreted as a weighted sum of the values on the predictor variables).

Probably the most useful test of the discriminant function is the degree to which it accurately predicts group membership. Obviously, when we calculate D it will not be exactly equal to 1 or 0. Therefore, a cutoff score must be defined, below which subjects are assigned to group 0 and above which they are assigned to group 1. The discriminant analysis will establish the coefficients and cutoff score that will maximize accuracy of classification. Unless the predictor variables are completely different from each other, with no overlapping variance (correlation), we can anticipate that this classification will not be 100% correct. A summary of classification results is included as the final step in the discriminant analysis. Because we know the true group assignment for each subject, we can determine if the discriminant function has correctly classified each individual. For example, Table 29.5➐ shows the results of the discriminant analysis for classifying athletes who were and were not injured. This summary shows that of those who actually had no injury, 94.9% were correctly classified, and of those who were injured, 86.0% were correctly classified. In the entire sample of 109 subjects, 90.8% were placed in the correct group by the discriminant function. This would be considered excellent discrimination. Therefore, based on these hypothetical data, measures of strength, flexibility, balance and time in play will be useful predictors of an athlete's risk of injury.
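The classification step can be sketched with Fisher's original two-group approach: weight the predictors by the inverse of the pooled within-group scatter, then split the projected scores at a cutoff midway between the group means. The data below are synthetic stand-ins for two of the predictors (strength and flexibility), not the study data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data standing in for two predictors (strength, flexibility):
# uninjured athletes (group 0) score higher on both, on average
uninjured = rng.normal([60.0, 40.0], 8.0, size=(50, 2))
injured = rng.normal([45.0, 28.0], 8.0, size=(50, 2))
X = np.vstack([uninjured, injured])
y = np.array([0] * 50 + [1] * 50)

# Fisher's linear discriminant: weight vector w = Sw^-1 (m1 - m0)
m0, m1 = uninjured.mean(axis=0), injured.mean(axis=0)
Sw = np.cov(uninjured.T) + np.cov(injured.T)   # within-group scatter
w = np.linalg.solve(Sw, m1 - m0)

# Cutoff halfway between the projected group means
scores = X @ w
cutoff = ((uninjured @ w).mean() + (injured @ w).mean()) / 2
predicted = (scores > cutoff).astype(int)

accuracy = (predicted == y).mean()             # proportion correctly classified
```

Because we know each athlete's true group, comparing `predicted` with `y` yields exactly the kind of classification summary shown in Table 29.5➐.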

In essence, analysis of variance and the *t*-test for independent samples are special forms of discriminant analysis. Questions that are analyzed using these tests would often be equally well suited to discriminant analysis, and results would be identical; that is, where groups are significantly different using an analysis of variance, the discriminant analysis would show that the predictor variables are capable of discriminating among the groups. For instance, using the current example, we could have done four separate *t*-tests to determine if the injured and noninjured athletes were different from each other on each of the four identified risk factors. The discriminant analysis approach is more useful, however, when several measured variables are studied, because it accounts for their interdependence in the analysis and avoids the inflation of Type I error that accompanies multiple univariate analyses.

The technique of **factor analysis** is quite different from any of the statistical procedures we have examined thus far. Rather than using data for comparison or prediction, factor analysis takes an exploratory approach to data analysis. Its purpose is to examine the structure within a large number of variables, in an attempt to explain the nature of their interrelationships. This procedure is more controversial than other analytic methods because it leaves room for subjectivity and judgment; however, factor analysis makes an important contribution to multivariate methods because it can provide insights into the nature of abstract constructs and allows us to superimpose order on complex phenomena.

The concept of factor analysis is illustrated in Figure 29.4. The larger set of variables at the top is composed of several overlapping circles with various degrees of "gray" or "green." We can assume that there is some relationship among circles that have similar shades. Through factor analysis, these variables are reorganized into two relatively independent circles, each one representing a set of related variables. Each set of green and gray variables represents a unique *factor.* A factor consists of a cluster of variables that are highly correlated among themselves, but poorly correlated with items on other factors. Therefore, we assume that circles with green shades are related to other circles with green, but not to circles with gray, and vice versa.

In real terms, we use factor analysis to examine a large set of variables that represents elements of an abstract construct, and to reduce it to a smaller, more manageable set of underlying concepts. For example, we could examine a large set of behaviors within an individual and categorize them as representing different conceptual elements of the person's psychological state. Loss of appetite, lack of motivation and withdrawal might reflect underlying "depression." Sleeplessness, inability to concentrate, and nail biting might be indicative of "anxiety." Depression and anxiety would each be composed of a set of related elements, with each set of elements unrelated to the other set. The inter-correlation of variables within a factor suggests that those variables, taken together, represent a singular concept that can be distinguished from other factors. Therefore, depression can be distinguished from anxiety.

We might also be interested in the relative strength of the association between each of the variables within a factor and the concept that the factor represents. For instance, what is the relationship between sleeplessness and the concept of anxiety? In addition to grouping variables into factors, factor analysis also weights each variable within a factor. These coefficients, called **factor loadings**, are measures of the correlation between the individual variable and the overall factor.

Which variables make up a factor is not determined a priori. The factor analysis approaches a set of data by looking at the intercorrelations among all the variables and arranging them into sets of statistically related variables. Through a complex series of manipulations that can only be carried out by a computer, the analysis derives the factors and shows which variables fit best into each factor.

To demonstrate this application using a practical example, suppose we are interested in studying behaviors that are related to chronic pain in a sample of 150 patients with low back pain. For this hypothetical example, we will examine seven variables (although many more would probably be of interest in such a study). These variables have all been measured on a 5-point Likert scale, based on the frequency with which each behavior is observed, from 1 = "never observed" to 5 = "almost always observed." The seven variables are (1) COMPLAINs about pain; (2) CHANGES position frequently while sitting; (3) GROANS, moans, or sighs; (4) RUBS painful body parts; (5) ISOLATES herself or himself; (6) MOVES rigidly and stiffly; and (7) drags feet when WALKING. We will interpret a computer printout for a factor analysis on these seven variables.

The first step in a factor analysis is the creation of a correlation matrix for all the test items. On the basis of these correlations, the factor analysis attempts to identify the **principal components**^{§§} of the data; that is, the analysis proceeds to identify sets of variables that are linearly correlated with each other. Conceptually, this method looks at the data in a multidimensional space and configures the variables in all possible combinations to determine groupings that "go together" statistically; that is, they demonstrate strong correlations. These clustered variables represent "components" of the total data set and are derived through a process called *extraction.* The process is as mathematically complex as it sounds.

Principal components analysis "extracts" a factor from the overall data matrix by determining what combination of variables shows the strongest linear relationship and accounts for a large portion of the total variance in the data. The first factor that is "extracted" will account for as much of the variance in the data as possible. The second factor represents the extraction of the next highest possible amount of variance from the remaining variance. Each successive factor that is identified "uses up" another component of the total variance, until all the variance within the test items has been accounted for. These factors are abstract statistical entities only. This process does not indicate which variables are related to which factors.

As shown in the printout in Table 29.6➊, this analysis has extracted seven factors. The number of factors derived from a set of variables will always equal the number of variables, as it does here. These factors are statistical representations of variance and cannot be interpreted as any real concept yet. The computer is simply looking at patterns within the data and manipulating numbers. It will not be until the end of the analysis that these "factors" will make sense.

Even though seven factors have been identified, several of these factors account for small amounts of variance, and do not really contribute to an understanding of the structure of the data. We can usually characterize the data most efficiently using only the first few components. Therefore, we need to establish a cutoff point to limit the number of factors for further analysis. The statistic used to set this cutoff is called an **eigenvalue** (Table 29.6➋). Eigenvalues tell us how much of the total variance is explained by a factor. Factor 1 will always account for more variance than the other factors (in this example 27.1%). The most common approach retains only those factors with an eigenvalue of at least 1.00. Using this criterion, then, we limit further analysis to the first four factors, which taken together account for 72.5% of the variance in the data (see Table 29.6➌). Alternatively, the researcher may specify the number of factors to be used.
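The extraction and cutoff steps can be illustrated directly. The eigenvalues of a correlation matrix sum to the number of variables, so each eigenvalue divided by that total is the proportion of total variance a factor explains. The correlation matrix below is a small hypothetical example with two clusters of related items:

```python
import numpy as np

# A hypothetical correlation matrix for four items, two clusters of two
R = np.array([[1.0, 0.7, 0.1, 0.1],
              [0.7, 1.0, 0.1, 0.1],
              [0.1, 0.1, 1.0, 0.6],
              [0.1, 0.1, 0.6, 1.0]])

eigenvalues = np.linalg.eigvalsh(R)[::-1]   # sorted largest first

# Eigenvalues sum to the number of variables, so each divided by that
# total is the proportion of total variance a factor explains
proportions = eigenvalues / eigenvalues.sum()

retained = eigenvalues[eigenvalues >= 1.0]  # Kaiser criterion
```

For this matrix, the two clusters produce two eigenvalues above 1.00, so the Kaiser criterion retains two factors, mirroring the logic applied to Table 29.6.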

The result of a principal components analysis is a factor matrix (see Table 29.6➍), which contains the factor loadings for each variable on each factor. Loadings are interpreted like correlation coefficients, and range from −1.00 to +1.00. Ideally we want each variable to have a loading close to ±1.00 on one factor and loadings close to 0.00 on all other factors.^{∗∗∗} Factor loadings greater than .30 or .40 are generally considered indicative of some degree of relationship. We consider only the absolute value of the loading in this interpretation; the sign indicates whether the variable is positively or negatively correlated with the factor.

Unfortunately, this factor matrix is usually difficult to interpret because it does not provide the most unique structure possible; that is, several variables may be "loaded" on more than one factor. For instance, if we look across the row for COMPLAIN, we can see that factor loadings are moderately strong for both Factors 1 and 4. Therefore, the next step is to develop a unique statistical solution so that each variable relates highly to only one factor. This process is called *factor rotation*.

^{§§}There are actually several different approaches to factor analysis, of which *principal components analysis* (PCA) is one. As this is the more common approach reported in the literature, we have chosen to present it here. Those interested in other approaches should consult manuals for different statistical packages, as well as references listed at the end of this chapter.

^{∗∗∗}The ideal outcome of a factor analysis would be the generation of factors that are composed of variables with high loadings on only that one factor. These would be considered "pure" factors. This does not always happen, however. When one variable loads heavily on two factors, those factors do not represent unique concepts, and there is some correlation between them. The researcher must then reconsider the nature of the variables included in the analysis and how they relate to the construct that is being studied.

Factor rotation is also a complex, multidimensional concept. Envision multiple axes in space, all intersecting at a central point, each one representing one factor. In this example, we would imagine four planes, or axes, one for each of the four factors we have identified. Each of the seven variables sits somewhere in this four-dimensional space, with factor loadings that identify its location relative to each of the four axes. The factor loadings can be considered multidimensional coordinates. In the ideal solution to this analysis, each of the variables would be located directly on one of the axes, which would indicate that the variable was "loaded" on that factor. We would then be able to identify which variables "belonged" to each factor.

We can illustrate this concept more simply using a two-dimensional example. Assume we have identified only two factors, Factor 1 and Factor 2. We could plot each of the seven variables against these two axes, as shown in Figure 29.5A.^{†††} The vertical axis represents Factor 1 and the horizontal axis represents Factor 2. As we can see, none of the variables sits directly on either of the axes. Some variables are located close to the origin, indicating that they are not related to either factor (their factor loading is small). The other variables sit in space somewhere between the two factors. This plot does not present a clear "structure" in the data in terms of specific factor assignments.

If, however, we could rearrange the orientation of axes and variables, we might be able to create a structure that will help us interpret these relationships. We do this by *rotating* the two axes in such a way as to maximize the orientation of variables near one of the axes. There are actually several ways that factor axes can be statistically rotated to arrive at this solution. In this example, we have used the most common approach, called **varimax rotation**, which tries to minimize the complexity of the loadings within each factor.^{‡‡‡}

This rotation is shown in Figure 29.5B. The rotation improves the spatial structure of the variables so that distinct factors are now visible; that is, several of the variables lie directly on or close to one of the axes. We find that variables 2, 6 and 7 now have the closest orientation to Factor 1, and variables 1 and 5 have the closest orientation to Factor 2. Variables 3 and 4, still clustered around the origin, show little or no relationship to either factor. This type of two-dimensional plot can be requested as part of a computer analysis for combinations of factors.
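Varimax rotation itself can be sketched with the standard SVD-based iteration used by common statistical packages. The loadings below are hypothetical: a clean two-factor structure deliberately smeared across both axes by a 30-degree rotation, which the algorithm should undo:

```python
import numpy as np

def varimax(loadings, n_iter=100, tol=1e-8):
    """Rotate a (variables x factors) loading matrix toward simple structure."""
    p, k = loadings.shape
    rotation = np.eye(k)
    var = 0.0
    for _ in range(n_iter):
        rotated = loadings @ rotation
        # Gradient-like matrix for the varimax criterion
        tmp = rotated * ((rotated ** 2).sum(axis=0) / p)
        u, s, vt = np.linalg.svd(loadings.T @ (rotated ** 3 - tmp))
        rotation = u @ vt
        if s.sum() - var < tol:
            break
        var = s.sum()
    return loadings @ rotation

# Hypothetical loadings: a clean two-factor structure, smeared by a
# 30-degree rotation so that every variable loads on both factors
simple = np.array([[0.8, 0.0],
                   [0.7, 0.0],
                   [0.0, 0.8],
                   [0.0, 0.6]])
theta = np.radians(30)
spin = np.array([[np.cos(theta), -np.sin(theta)],
                 [np.sin(theta),  np.cos(theta)]])
smeared = simple @ spin

rotated = varimax(smeared)  # should approximate the simple structure again
```

After rotation, each row of `rotated` again has one strong loading and one near zero, which is exactly the "cleaner statistical picture" the rotated factor matrix provides.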

This form of rotation is called **orthogonal rotation** because the axes stay perpendicular to each other as they are rotated. This means that the two factors are independent of each other (orthogonal means independent); that is, they maintain maximal separation. **Oblique rotation**, used less often, allows the axes to change their orientation to each other. Therefore, some variables could be close to both factors, and the factors would be correlated. This might lead to a more realistic solution in some cases; however, the orthogonal solution will typically be easier to interpret, and in many cases will provide a comparable solution to oblique rotation.

In the actual factor analysis, this rotation process is carried out for all four planes simultaneously. Clearly, it would be impossible to conceive of this type of analysis without a computer. We must visualize a spatial solution that provides the one best linear combination for all variables.

This process results in the creation of a *rotated factor matrix* shown in Table 29.6➎. This matrix provides new factor loadings that represent the spatial coordinates of each variable in the reoriented multiaxial rotated solution. This new configuration should provide a cleaner statistical picture. We interpret this information by looking across each row of the matrix to determine which factor has the highest loading for that variable. We have highlighted the one loading for each variable that shows the strongest relationship to one of the factors. MOVES and WALKING load highest on Factor 1; COMPLAIN and GROAN load highest on Factor 2; CHANGES and RUBS load highest on Factor 3; and ISOLATE loads highest on Factor 4.

^{†††}For this illustration, the factor loadings are hypothetical. We cannot use the factor loadings given in Table 29.6, as these represent coordinates in a four-dimensional space.

^{‡‡‡}Other forms of rotation that are used less often are *quartimax rotation*, which is based on simplifying row loadings, and *equimax rotation*, which simplifies loadings on rows and columns. Each of these methods will result in a slightly different positioning of the axes. Varimax rotation is used most often because it generally presents the clearest factor structure. For some analyses it may be necessary to try different solutions to develop the one that best differentiates factors. Fortunately, these processes are easily requested in a computer analysis. It is important to recognize that different mathematical solutions can be generated, depending on the rotation approach used.

The final solution to a factor analysis is the naming of factors according to a common theme or theoretical construct that characterizes the important variables in the factor. This is a subjective and sometimes difficult task, especially in situations where the variables within a factor do not have obvious ties. The computer runs the analysis without any preconceived judgments as to what "should" go together or what combinations "make sense." The researcher must look for commonalities and theoretical relationships that will explain the statistical outcome. When the factor labels are not so obvious, it may be necessary to re-examine the very nature of the construct being studied.

Table 29.7 shows how we have assigned the seven variables to four factors, using the strongest factor loadings for each variable as the criterion. Factor 1 could be called "mobility." Factor 2 could be labeled "verbal complaints." Factor 3 is concerned with "nonverbal complaints," and Factor 4 is associated with "nonsocial behavior." We are able to specify the percentage of the total variance in the data that each factor explains, using the information given in Table 29.6. Together, these four factors account for 72.5% of the total variance. Table 29.7 illustrates the type of information that would be included in a published report of factor analysis.

What we have, then, is a set of variables that contribute to a construct we are calling "chronic pain behavior." The variables demonstrate different components of this construct. We can begin to understand the structure of pain behavior by focusing on four elements that we have called mobility, verbal complaints, nonverbal complaints, and nonsocial behavior. As we move forward in this research, we can explore how each of these elements contributes to a patient's reactions to treatment, interactions with family, participation in social activities, and so on. The factor analysis has provided a framework from which we can better understand these types of theoretical relationships.

Factor analysis can be used to answer many types of research questions. As an exploratory approach, it can be used to sort through a large number of variables in an effort to reveal patterns of relationships that were not obvious before. This type of analysis may represent early stages of inquiry, when concepts and relationships are not yet sufficiently understood to propose relevant hypotheses. A classic example of this approach was presented by Thurstone and Thurstone^{6} in their studies of intelligence. They factor analyzed 60 tests and identified six primary abilities: verbal, number, spatial, word fluency, memory, and reasoning. Through repeated testing, these have come to be accepted as some of the elements that underlie the construct of intelligence, and are used as the basis for many intelligence tests.

Factor analysis can also be used to simplify a test battery, by determining which elements of the test are evaluating the same concepts. This approach can result in reducing the number of items that are used, or it may provide the basis for creating composite summary scores for each concept. For example, Jette^{7} used factor analysis to look at a set of 45 items on a functional capacity evaluation, with the intent of reducing the number of items without sacrificing the comprehensiveness of the assessment. The test items were structured into factors that identified distinct functional constructs, such as physical mobility, personal care, home chores, transfers and kitchen chores. Jette suggested that two or more items from each functional category should be assessed as part of the evaluation, substantially reducing the time needed to complete the test, while maintaining the validity of the information it produces. This method of sorting through a large number of items is preferable to the intuitive or empirical classification of functional tasks into categories.

One of the most interesting uses of factor analysis is the creation of a smaller set of composite scores, to be used as evaluative data or to be used as data in a statistical analysis. Subscores are created for each factor by multiplying each variable value by a weighting, and then summing the weighted scores for all variables within the factor. This result is called a **factor score**. The advantage of using composite scores is that the total number of variables needed for further analysis is decreased. This, in turn, will improve variance estimates for analyses such as regression or discriminant analysis. For example, Warren and Davis^{8} used a discriminant analysis to differentiate patients with running-related injuries. They started with 72 anatomical variables and performed a factor analysis to reduce these data to nine factors. Factor scores for each factor were then used as predictors in a discriminant analysis to predict membership in six pain groups. This simplified the analysis, which would have been quite cumbersome with 72 variables. Unfortunately, their classification was successful for only 29.1% of their cases, and they concluded that the identified factors were not good predictors of type of pain.
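The arithmetic behind a factor score is simply a weighted sum. As a minimal sketch, the variable values and factor-score weights below are hypothetical, standing in for what a factor analysis program would report:

```python
import numpy as np

# Hypothetical standardized scores for one patient on four variables
# that load on a single factor (illustrative values only).
scores = np.array([0.8, -0.2, 1.1, 0.5])

# Hypothetical factor-score weights for those variables, as a factor
# analysis program might report them.
weights = np.array([0.40, 0.10, 0.35, 0.15])

# The factor score: each variable value multiplied by its weighting,
# then summed across all variables within the factor.
factor_score = np.sum(weights * scores)
print(round(factor_score, 3))
```

The single composite value then replaces the four original variables in any subsequent analysis, which is how the Warren and Davis study reduced 72 anatomical variables to nine predictors.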

Many behavioral and clinical constructs, such as intelligence or motor development, cannot be measured directly. Therefore, they must be defined by relevant measurable variables that together form a conceptual package, indicative of the construct. Most tests of this sort contain many items that supposedly evaluate different components of the construct. These components can be considered factors, each one addressing a separate concept within the total construct. The construct validity of these tests must be established to document that they are indeed measuring the abstract behavior they supposedly define. This approach is basically one of theory testing; that is, the results of testing should conform to the theoretical premise for the construct. For example, suppose we developed a new intelligence test for use with learning-disabled children. If we accept the theoretical premise of intelligence defined by Thurstone and Thurstone,^{6} then we could hypothesize, a priori, which variables or test items should go together to reflect each of the six primary abilities. After the test is administered to a large sample, the scores can be factor analyzed, and we should see factors emerge that fit with this theory. If the factors do not match the hypothesized variable groupings, the test items are probably not measuring what they were intended to measure. This approach to construct validity testing is an important one that should be replicated on several samples before any conclusions are drawn about the appropriate or inappropriate inclusion of test items.

Factor analysis can also be used to support research hypotheses, when the focus of treatment or intervention is a set of behaviors that define a construct. For instance, educators could evaluate the effects of changes in professional curricula, such as moving from a fact-based to a problem solving approach, by examining differences in factor structure before and after program changes. One would expect to find different loadings and combinations of variables following this type of change.^{9} Because of the complex and interactive nature of curriculum characteristics, it would be difficult to evaluate change using only individual variables that represent small pieces of overall performance.

Although factor analysis has a unique statistical role in multivariate analysis, its subjectivity is often the basis for serious criticism. Researchers must be cautious about how "factors" are interpreted, as they are not real measurement entities, but only hypothetical statistical concepts. Giving a factor a name does not make it real. Similar analyses on different samples may organize data differently, as will other approaches to a single analysis, such as different methods of extraction or rotation. These differences can alter a factor's essential meaning. Indeed, factor analysis may generate factors that are totally uninterpretable within the framework of the research question. Because of the subjective and judgmental nature of some decisions, we recommend consulting an experienced statistician to document the rationale for using particular methods under specific research conditions.

Researchers are often interested in the underlying structures in a set of data. We can look at such structures in two ways. Factor analysis is one approach to determine how clusters of correlated variables contribute to that structure. In an analogous process, **cluster analysis** looks for groupings of people that demonstrate similar characteristics. Rather than generating factors of variables, this analysis generates homogeneous clusters of subjects.

To illustrate this approach, Michel et al^{10} studied the prognosis of functional recovery in patients who had experienced a hip fracture. They looked at prefracture characteristics as well as function and mobility measures 1 year postsurgery. Their analysis generated 4 clusters of patients with similar profiles in terms of 13 predictor variables and 7 outcome variables.

Figure 29.6 shows the hierarchical structure, or cluster tree, that was generated for this study. The researchers started with 207 patients. Cluster analysis moves in a hierarchical fashion, reorganizing the data in steps to determine how the patients' characteristics relate to each other. In the first iteration, two groups were created, with 79 and 128 subjects. These groupings were further reduced in successive steps. The authors noted that the first cluster *(n* = 79) was clearly homogeneous, as it stayed intact until the seventh step. The second cluster *(n* = 128) was sufficiently heterogeneous to form two smaller groupings of 89 and 39 subjects in the third step. The smaller of these two clusters stayed intact through the next step, whereas the larger cluster was further classified into groupings of 27 and 62 subjects. These two clusters stayed intact through one more step, indicating a reasonable level of homogeneity in these subjects.
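The stepwise merging that produces such a cluster tree can be sketched with scipy's hierarchical clustering routines. The data below are synthetic, built from two distinct subject profiles; they are not the study's data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Synthetic data: 30 "patients," each measured on 3 variables,
# drawn from two distinct profiles (illustrative only).
group_a = rng.normal(loc=0.0, scale=0.5, size=(15, 3))
group_b = rng.normal(loc=3.0, scale=0.5, size=(15, 3))
data = np.vstack([group_a, group_b])

# Agglomerative (hierarchical) clustering: Ward linkage builds the
# cluster tree step by step, merging the most similar groups first.
tree = linkage(data, method="ward")

# Cut the tree to obtain two homogeneous clusters of subjects.
labels = fcluster(tree, t=2, criterion="maxclust")
print(sorted(np.bincount(labels)[1:].tolist()))  # cluster sizes
```

Cutting the tree at different heights yields different numbers of clusters, which mirrors the judgment the researchers exercised in settling on four clusters for their sample.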

###### FIGURE 29.6

Number and size of clusters at successive steps in a study of functional recovery following hip fracture. (From Michel JP, Hoffmeyer P, Klopfenstein C, et al. Prognosis of functional recovery 1 year after hip fracture: Typical patient profiles through cluster analysis. *J Gerontol* 2000;55:M508–515, Figure 1, p. 511. Used with permission of the Gerontological Society of America.)

This pattern led the researchers to determine that the classification using four clusters was the best organization to describe this sample. Beyond four clusters, the groupings became too small to allow meaningful descriptions. Just as with factor analysis, this statistical technique has room for judgment in exploring the data.

Table 29.8A shows a small portion of the data that were generated to describe these clusters. We can see, for instance, that Cluster 1 was younger, had better mobility, and had the shortest hospital stay; Cluster 2 had a longer hospital stay and low mobility; Cluster 3 had a larger number of patients in a nursing home, but no one who was disoriented; and patients in Cluster 4 were most likely to live in a nursing home, be disoriented, and have poor mobility.

The researcher is responsible for classifying the clusters by describing the characteristics that distinguish them. For this example, the researchers looked specifically at measures of ambulation and function prior to fracture and 1 year following surgery, to show how the members of each cluster varied. In Table 29.8B we can see that the patients in Cluster 1 were high functioning before and after their hip fracture. Those in Cluster 2 were functional prior to the fracture, but showed limitations 1 year later. Those in Cluster 3 were already limited prior to their fracture, and declined even further in ambulation 1 year later. And those in Cluster 4 started with some limitations and declined in both function and ambulation 1 year later. By understanding how these profiles emerge, clinicians can develop specific management strategies that are appropriate to their patients.

Many clinical research designs incorporate tests for more than one dependent variable. For example, if we were interested in the physiological effects of exercise, we might measure heart rate, blood pressure, respiration, oxygen consumption, and other related variables on each subject at the same time. Or if we wanted to document muscle activity during a particular exercise, we might record electromyographic data from several muscles in the upper and lower extremities simultaneously. It makes sense to do this because it is efficient to collect data on as many relevant variables as possible at one time, and because it is useful to see how one person's responses vary on all these parameters concurrently. These types of data are usually analyzed using *t*-tests or analyses of variance, with each dependent variable being tested in a separate analysis.

This approach to data analysis presents two major problems. First, the use of multiple tests of significance within a single study can increase the probability of a Type I error. This means that the more tests we perform, the more likely we are to find significant differences, just by chance. The second problem is related to the univariate basis of the *t*-test and analysis of variance. The validity of these tests is based on the assumption that each test represents an independent event; however, if we measure heart rate, blood pressure, and respiration on one person, we cannot assume that the responses are unrelated. Most likely, changes in one variable will influence the others. Therefore, these responses are not independent events and should not be analyzed as if they were.
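The first problem can be quantified. If *k* tests are each performed at α = .05 and, contrary to the second point, are assumed to be independent, the probability of at least one Type I error (the familywise error rate) is 1 − (1 − α)^{k}:

```python
# Familywise error rate for k independent tests at alpha = .05:
# the chance of at least one false-positive finding grows with k.
alpha = 0.05
for k in (1, 3, 5, 10):
    fwer = 1 - (1 - alpha) ** k
    print(k, round(fwer, 3))
```

With ten tests, the chance of at least one spurious "significant" result reaches about 40%; correlated dependent variables complicate this figure further, which is part of the rationale for a multivariate test.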

The purpose of a **multivariate analysis of variance (MANOVA)** is to account for the relationship among several dependent variables when comparing groups. This test can be applied to all types of experimental designs, including repeated measures, factorial designs, and analyses of covariance. In many situations, a MANOVA can be more powerful than multiple analyses of variance if the dependent variables are correlated.

To illustrate the concept of multivariate analysis, suppose we wanted to measure systolic blood pressure (SBP) and diastolic blood pressure (DBP) to study the effects of three different medications for reducing hypertension. Hypothetical means for such a study are shown in Table 29.9. If we were to use a standard analysis of variance for this study, we would perform two separate analyses, one for systolic and one for diastolic pressure. In each analysis, we would compare means across the three treatment groups. In a multivariate model, we no longer look at a single value for each treatment group, but rather we are concerned with the overall effect on both dependent variables. We conceptualize this effect as a multidimensional value, called a **vector**. The mean vector, Ȳ, for each group represents the means of all dependent variables for that group. In statistical terms, a vector can be thought of as a list of group means. In this example, there would be two values in each of the three vectors, representing the diastolic and systolic blood pressure measures for each medication group. Therefore, Ȳ_{1} = (50, 120), Ȳ_{2} = (60, 110), and Ȳ_{3} = (90, 135). Figure 29.7 illustrates how these values would be oriented in a two-dimensional framework. The center point in each group, called the **group centroid**, represents the intersection of the means for both dependent variables, or the spatial location of the mean vector. The purpose of the MANOVA is to determine if there is a significant difference among the group centroids.

| | Treatment Group 1 | Treatment Group 2 | Treatment Group 3 |
| --- | --- | --- | --- |
| Systolic | 120 | 110 | 135 |
| Diastolic | 50 | 60 | 90 |
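The idea of a group centroid can be made concrete with a few lines of numpy. This minimal sketch uses the Table 29.9 means to form each group's mean vector and the grand centroid, the overall point against which a MANOVA measures between-groups variability:

```python
import numpy as np

# Group mean vectors from Table 29.9, listed as (diastolic, systolic).
centroids = np.array([
    [50, 120],   # medication group 1
    [60, 110],   # medication group 2
    [90, 135],   # medication group 3
])

# The grand centroid is the overall mean vector; between-groups
# variability in a MANOVA is measured against this point.
grand_centroid = centroids.mean(axis=0)
print(grand_centroid)  # mean DBP and mean SBP across groups
```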

The multivariate null hypothesis states

*H*_{0}: **μ**_{1} = **μ**_{2} = **μ**_{3}

where **μ** represents the population mean vector for each group. The alternative multivariate hypothesis states that at least one group has a population centroid that is different from the others. Just as with an ANOVA, follow-up tests are necessary to explain significant differences.

The concept behind the multivariate analysis of variance is really the same as that for the analysis of variance. The total variance in the sample is partitioned into parts that represent between-groups and error effects, although in the multivariate case, variability is measured against centroids rather than individual group means.

The statistics associated with multivariate analysis of variance are not as clear cut as using *F* or *t* in univariate models. When two groups are compared, *Hotelling's T^{2}* can be used, which is a multivariate extension of Student's *t*-test. With more than two groups, four statistical procedures are usually reported in a computer analysis: Wilks' lambda, the Hotelling-Lawley trace, the Pillai-Bartlett trace, and Roy's maximum characteristic root (MCR). Each of these tests is a variance ratio, although each has a slightly different interpretation. For the sake of consistency in generating critical values for these statistics, most programs convert these values to *F*-values.
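Wilks' lambda can be computed directly from the between-groups (H) and error (E) sum-of-cross-products matrices as Λ = det(E)/det(H + E), the ratio of error to total generalized variance. A minimal numpy sketch with synthetic blood-pressure-style data (the values are illustrative, not taken from the text):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic (SBP, DBP)-style data: three groups of 10 subjects,
# with clearly different group means (illustrative values only).
groups = [rng.normal(loc=m, scale=5.0, size=(10, 2))
          for m in ([120, 50], [110, 60], [135, 90])]

data = np.vstack(groups)
grand_mean = data.mean(axis=0)

H = np.zeros((2, 2))  # between-groups (hypothesis) cross-products
E = np.zeros((2, 2))  # within-groups (error) cross-products
for g in groups:
    d = (g.mean(axis=0) - grand_mean).reshape(-1, 1)
    H += len(g) * (d @ d.T)
    resid = g - g.mean(axis=0)
    E += resid.T @ resid

# Wilks' lambda: values near 0 indicate strong group differences,
# values near 1 indicate none.
wilks = np.linalg.det(E) / np.linalg.det(H + E)
print(round(wilks, 3))
```

Because the synthetic group centroids are far apart relative to the within-group spread, Λ here comes out close to zero; a statistical package would then convert it to an *F*-value for significance testing.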

Unfortunately, statisticians are not in agreement as to which of these procedures should be used. In most cases, the tests yield similar results. The rationale for choosing one test over the others is based on a complex consideration of statistical power and how well the assumptions underlying each test are met. Rather than attempt to define these rationales, a task that goes beyond the scope of this text, we encourage researchers to consult with a statistician to make these decisions based on the specific research situation. We note that Wilks' lambda is reported most often and will probably be the easiest to interpret.

When the MANOVA demonstrates a significant effect, follow-up analyses are usually based on univariate analyses of variance or a discriminant analysis. The latter procedure is considered preferable, because it maintains the integrity of the multivariate research question. Discriminant analysis will show if the values for the response variables, in this example SBP and DBP, can discriminate among the treatment groups. Some MANOVA programs will offer discriminant analysis or univariate analyses of variance as an optional part of the output.

Many research questions focus on the effectiveness of intervention, generally in comparison to a placebo or standard treatment. Measurement of short-term effects may be important for assessing immediate benefit. Long-term outcomes, however, may be of greater interest in relation to survival or prevention because they better reflect the intervention's true effectiveness. Long-term effects are typically evaluated with reference to survival time to an identified "event." The concept of **survival analysis** is important to understanding prognosis and treatment effectiveness. It answers questions relating to time: "How long will it be before I am better?" "How long am I going to live?" "When in the future will the risk of recurrence of my disorder decrease?"

For many diseases, such as cancer or cardiovascular disease, the terminal event of interest is death. Life expectancy can also be examined in relation to functional conditions. For example, Strauss et al^{11} looked at decline in function and life expectancy in older persons with cerebral palsy. They were able to demonstrate that survival rates of ambulatory older adults were only moderately worse than the general population, but were much poorer for those who had lost mobility. Survival time can also refer to other events, such as time to relapse, injury or loss of function. For instance, Ruland et al^{12} used survival analysis to examine time to recurrence of stroke. Grossman and Moore^{13} followed the longitudinal course of aphasia to determine how grammatical and working memory factors contribute to decline of sentence comprehension. Researchers have also looked at the prognosis of walking capacity in patients with rheumatoid arthritis who underwent multiple arthroplasty.^{14} They found that within the first 5 years after the first surgery, 92% of patients were still able to walk independently. This decreased to 79% in the 10th year, and 60% in the 15th year.

Estimates of survival present a special dilemma for analysis because it is not possible to follow all subjects to the event of interest. Even in long-term studies, there will be an end to data collection, and some patients will not have reached the terminal event at that point. Therefore, we cannot know how long these subjects will "survive." Subjects may drop out of a study, leaving their end point undocumented. There may also be variation in the onset of disease or treatment, often resulting in patients entering a study at different times.

When individuals are followed for this type of analysis, those who have not yet reached the terminal event by the end of the study are considered **censored observations**. These censored survival times will underestimate the true (but unknown) time to the event because it will occur beyond the end of the study.^{15} Therefore, special methods of analysis are needed to account for censored data.

Techniques such as analysis of variance and regression are often used to follow a subject's responses over time. Because of censored observations, however, they are not appropriate for survival analysis. Taking a mean survival time for a cohort of patients will be misleading because the mean will continually change as different individuals reach the terminal event. A mean survival time can actually only be accurate when all subjects in the cohort have reached that end point.

The oldest method of analyzing survival was developed in the 17th century using actuarial or **life tables** (see Box 29.1). In this approach, time intervals are created to provide estimates of an individual's probable survival, a technique still used by insurance companies to establish premiums. Within each interval, several indices can be computed.

- The **number of cases at risk** is the number of individuals who enter the time interval (those who have survived) minus half the number of cases lost to follow-up within that interval.
- The **proportion failing** is the ratio of the number of cases who did not survive into the interval, divided by the number of cases at risk. The **proportion surviving** is 1 minus the proportion failing.
- The **probability density** is the probability of reaching the terminal event in the given time interval per unit of time. It is computed as the proportion surviving at the start of the interval minus the proportion surviving at the end of the interval, divided by the width of the interval.
- The **hazard rate** is the probability that an individual who has survived to the beginning of a time interval will reach the terminal event during that interval. It is the number of individuals who reach the event divided by the mean number of surviving cases at the midpoint of the interval.
- The **survival function** is a cumulative proportion of cases surviving up to the given interval. It is computed by multiplying the probabilities of survival across all previous intervals.
- The **median survival time** is the point at which the cumulative survival function is equal to 0.5, or the 50th percentile. Because of censored observations, this will not necessarily be the same as the time up to which 50% of the sample survived.
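The arithmetic behind several of these indices can be sketched in a few lines of Python. The interval counts below are hypothetical:

```python
# Life-table indices for hypothetical one-year intervals:
# each tuple is (entered, deaths, lost to follow-up).
intervals = [(100, 10, 4), (86, 12, 6), (68, 15, 8)]

cumulative_survival = 1.0
for entered, deaths, lost in intervals:
    at_risk = entered - lost / 2          # number of cases at risk
    p_fail = deaths / at_risk             # proportion failing
    p_survive = 1 - p_fail                # proportion surviving
    cumulative_survival *= p_survive      # survival function
    print(round(p_survive, 3), round(cumulative_survival, 3))
```

Note that each interval's count of entrants equals the previous interval's entrants minus its deaths and losses, and that the survival function is the running product of the interval survival proportions.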

The most common method of determining survival time is the **Kaplan-Meier product limit method**, which does not depend on grouping data into specific time intervals.

Edmund Halley (1656–1742) was a 17th-century mathematician and astronomer, famous for identifying the recurrence of a comet approximately every 76 years, now called Halley's Comet. But did you know that he was also responsible for developing the process of estimating life insurance premiums?

In 1693 Halley published the first life table that presented mortality data for the city of Breslau, Germany based on specific ages for the five years between 1687 and 1691. With a city population of 34,000, he documented 6,193 births and 5,869 deaths per year. He showed that on average 348 died yearly in the first year of life, and another 198 died between 1 and 6 years of age. His table (below) showed mortality from ages 7 through 100, with the mortality figures listed below each age (11 people died at age 7; 11 people died at age 8, and so on). He noted how deaths in teen years decreased markedly, and that after age 70 the number increased, with a gradual decline in later years until "there be none left to die."

Halley cited several uses for his table. First, it could be used to determine the number of men in the city eligible to bear arms between the ages of 18 and 56, with the assumption that those under 18 were "too weak to bear the *Fatigues of War* and the *Weight of Arms*, and [those over 56 were] too crasie [sic] and infirm from *Age*, notwithstanding particular *Instances* to the contrary" (italics from original text).

Second, the table identified different mortality rates in specific age groups. And third, the data could be used to estimate the price of life insurance and the valuation of annuities, based on the probability that the person would survive to collect the installment. Halley's work was considered the founding of actuarial science, and resulted in a profitable insurance practice for the British government.

*Source:* Halley E. An estimate of the degrees of mortality of mankind, drawn from curious tables of the births and funerals at the city of Breslaw, with an attempt to ascertain the price of annuities on lives. *Philosophical Transactions of the Royal Society of London* 1693;17:596–610, 654–656.

This approach generates a step function, changing the survival estimate each time a patient dies (or reaches the terminal event). Graphic displays of survival functions computed with this technique provide a useful visual understanding of the survival function as a series of steps of decreasing magnitude. This method can account for censored observations over time. Confidence intervals can also be calculated.
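The product-limit computation can be sketched as follows, using hypothetical survival times where event = 1 marks the terminal event and event = 0 marks a censored observation:

```python
# Kaplan-Meier product-limit estimate for a small hypothetical sample.
# Each observation is (time in months, event indicator).
observations = [(2, 1), (3, 0), (4, 1), (5, 1), (6, 0), (7, 1)]

survival = 1.0
curve = []
at_risk = len(observations)
for time, event in sorted(observations):
    if event:  # the estimate "steps down" only at event times
        survival *= (at_risk - 1) / at_risk
        curve.append((time, round(survival, 3)))
    at_risk -= 1  # censored subjects still leave the risk set

print(curve)
```

Censored subjects contribute to the risk set up to the point they are lost, but never trigger a step, which is exactly how this method accounts for censored observations.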

The Kaplan-Meier estimate can also be used to compare groups of patients. Figure 29.8 shows survival curves over a 5-year period for a cohort of elderly men and women who participated in an aging study.^{16} Subjects were differentiated on the basis of their gait abnormalities. The graph shows that subjects had a greater risk of death or institutionalization if they exhibited abnormal gait characteristics, than if they had a normal gait. The distinction between the groups became most evident after the first year of follow-up. By looking at the survival rate along the *Y*-axis, we can see that the median survival time for those with abnormal gait was approximately 3 years, and for those with normal gait approximately 4.5 years.

###### FIGURE 29.8

Kaplan-Meier survival curve comparing risk of death and institutionalization over 5 years for subjects with abnormal gaits and those with normal gaits. The small vertical tick marks represent censored observations. The median survival time is determined by looking at the 50% cumulative survival rate along the *Y*-axis (horizontal line). (Adapted from Verghese J et al. Epidemiology of gait disorders in community-residing older adults. *J Am Geriatr Soc* 2006; 54:255–261, Figure 1, p. 259. Used with permission of Blackwell Publishing.)

Survival time is often dependent on many interrelated factors that can contribute to increased or decreased probabilities of survival or failure. A regression model can be used to adjust survival estimates on the basis of several independent variables. Standard multiple regression methods cannot be used because survival times are typically not normally distributed—an important assumption in least squares regression. And of course, the presence of censored observations presents a serious problem. The most commonly used method is the **Cox proportional hazards model**, which is conceptually similar to multiple regression, but without assumptions about the shape of distributions. For this reason, this analysis is often considered a nonparametric technique.

The proportional hazards model is based on the **hazard function**, which is related to the survival curve. This function represents the risk of dying (or the terminal event) at a given point in time, assuming that one has survived up to that point. The dependent variable is the hazard (risk), and the independent variables, or covariates, are those factors thought to explain or influence the outcome. Variables may be continuous, dichotomous or ordinal.^{17} Factors such as age, gender, occupation and so on, are typically used for this purpose. When treatment is used as an independent variable, the model allows for comparison of the hazard, and therefore survival time, associated with placebo compared to treatment.

Like the odds ratios generated from a logistic regression, a **hazard ratio (HR)** can be generated from the coefficients in the hazard function. An HR of 1.0 indicates that there is no excess risk associated with the covariates. A value greater than 1.0 indicates that a covariate is positively associated with the probability of the terminal event, thereby decreasing survival. An HR less than 1.0 indicates that the covariate is protective, decreasing the probability of the terminal event and thereby increasing survival time. Confidence intervals can be expressed for the hazard ratio to indicate significance, with a null value of 1.0. Table 29.10 shows a portion of the data from the study of gait abnormalities. We can see that those who had moderate to severe gait abnormalities were 3.7 times more likely to die and 2.6 times more likely to become institutionalized than those with normal or mildly abnormal gaits. The confidence intervals for these hazard ratios do not contain 1.0 and are therefore significant. These values were generated from a Cox regression that included age and sex as covariates.

*Hazard Ratio (95% CI)*

| Gait | N = 468 | Institutionalization (n = 75) | Death (n = 30) | Institutionalization or Death (n = 99) |
| --- | --- | --- | --- | --- |
| Normal | 300 | 1.0 (reference) | 1.0 (reference) | 1.0 (reference) |
| Mild Abnormality | 118 | 1.99 (1.18–3.36) | 0.89 (0.32–2.51) | 1.76 (1.01–2.84) |
| Moderate-Severe | 50 | 2.67 (1.47–4.84) | 3.66 (1.62–8.29) | 3.18 (1.94–5.21) |
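The conversion from a Cox regression coefficient to a hazard ratio and its 95% confidence interval, HR = exp(*b*) with CI = exp(*b* ± 1.96 × SE), can be sketched as follows. The coefficient and standard error below are hypothetical values, chosen only to roughly reproduce the moderate-to-severe row for death in Table 29.10:

```python
import math

# Hypothetical Cox model output for one covariate:
# regression coefficient b and its standard error.
b, se = 1.2975, 0.4166

# Hazard ratio and 95% confidence interval.
hr = math.exp(b)
ci = (math.exp(b - 1.96 * se), math.exp(b + 1.96 * se))

# The interval excludes the null value of 1.0, so this covariate
# significantly increases the hazard (decreasing survival).
print(round(hr, 2), tuple(round(x, 2) for x in ci))
```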

Multivariate analyses have become popular in behavioral research because of the increased availability of computer programs to implement them. Their applications are, however, not well understood by many clinical researchers, and many studies using multivariate designs are still analyzed using univariate methods.

Multivariate techniques can accommodate a wide variety of data and are able to account for the complex interactions and associations that exist in most clinical phenomena. Many research questions could be investigated more thoroughly if investigators considered multivariate models when planning their studies. We have limited this chapter to a discussion of the conceptual elements of multivariate analysis, but with enough of an introduction to terminology and application that the beginning researcher should be able to communicate effectively with a statistician and follow the computer output. This information will also facilitate understanding research reports that present the results of these analyses.

Although we have emphasized the potential for improving explanations of clinical data using multivariate methods, we must also include the caveat that clinical research need not be complicated to be meaningful. A problem is not necessarily better solved by a complex analysis, nor should such an approach be taken just because computer programs are available. The indiscriminate use of multiple measurements is not a useful substitute for a well defined study with a select number of variables. To be sure, the results of multivariate analyses are harder to interpret and involve some risk of judgmental error, such as in factor analysis. In addition, multivariate tests require the use of larger samples. Many important and concise research questions can be answered using simpler methods and designs. Many clinical variables can be studied effectively using a single criterion measure. On the other hand, simple analysis is not necessarily better just because the interpretation of results will be easier and clearer. The choice of analytic method should be based on the research question and the theoretical foundation behind it. When dealing with constructs that reflect several abstract phenomena, multivariate methods offer the most powerful means for developing and explaining theory. The purpose of this chapter was to present alternatives that provide the researcher with useful choices for planning the most effective study possible.

*Arch Phys Med Rehabil*1990;71:881–887. [PubMed: 2222156]

*J Adv Nurs*1998;27:320–326. [PubMed: 9515642]

*J Gerontol*1998;53A:M457–M464.

*Arch Phys Med Rehabil*1997;78:125–131. [PubMed: 9041891]

*Am J Occup Ther*1998;52:283–290. [PubMed: 9544354]

Thurstone LL, Thurstone TG. *Factorial Studies of Intelligence.* Chicago: University of Chicago Press, 1941.

*Arch Phys Med Rehabil*1980;61:85–89. [PubMed: 7369845]

*Phys Ther*1988;68:647–651. [PubMed: 3362976]

*Multivariate Analysis of Variance.*Beverly Hills, CA: Sage Publications, 1986.

Michel JP, Hoffmeyer P, Klopfenstein C, et al. Prognosis of functional recovery 1 year after hip fracture: Typical patient profiles through cluster analysis. *J Gerontol* 2000;55:M508–515.

*Neuro Rehabilitation*2004;19:69–78. [PubMed: 14988589]

*Neurology*2006;67:567–571. [PubMed: 16924005]

*J Neurol Neurosurg Psychiatry*2005;76:644–649. [PubMed: 15834020]

*Mod Rheumatol*2005;15:241–248. [PubMed: 17029072]

*Br J Cancer*2003;89:232–238. [PubMed: 12865907]

Verghese J, et al. Epidemiology of gait disorders in community-residing older adults. *J Am Geriatr Soc* 2006;54:255–261. [PubMed: 16460376]

*Brit J Cancer*2003;89:431–436. [PubMed: 12888808]