As knowledge and clinical theory have developed, clinical researchers have proposed more complex research questions, necessitating the use of elaborate multilevel and multifactor experimental designs. The **analysis of variance (ANOVA)** is a powerful analytic tool for analyzing such designs, where three or more conditions or groups are compared. The analysis of variance is used to determine if the observed differences among a set of means are greater than would be expected by chance alone. The ANOVA is based on the *F* statistic, which is similar to *t* in that it is a ratio of between-groups treatment effects to within-group variability. The test can be applied to independent groups or repeated measures designs.^{∗}

The purpose of this chapter is to describe the application of the analysis of variance for a variety of experimental research designs. An introduction to the basic concepts underlying analysis of variance is most easily addressed in the context of a single-factor experiment (one independent variable) with independent groups. We then follow with discussions of more complex models, including factorial designs and repeated measures designs.

^{∗}As with all parametric tests, the ANOVA is based on the assumption that samples are drawn randomly from normally distributed populations with equal variances. Tests for homogeneity of variance can be performed to validate the latter assumption. With samples of equal size, the analysis of variance is considered "robust" in that reasonable departures from the assumptions of normality and homogeneity will not seriously affect the validity of inferences drawn from the data.^{1} With unequal sample sizes, gross violations of homogeneity of variance can increase the chance of Type I error. In such cases, a nonparametric analysis of variance can be applied (see Chapter 22), or data can be transformed to a different scale that improves homogeneity of variance within the sample distribution (see Appendix D).

In a single-factor experiment, the one-way analysis of variance is applied when three or more independent group means are compared. The descriptor "one-way" indicates that the design involves one independent variable, or factor, with three or more levels.

Although the ANOVA can be applied to two-group comparisons, the *t*-test is generally considered more efficient for that purpose.^{†}

The null hypothesis for a one-way multilevel study states that there is no significant difference among the group means, represented by

*H*_{0}: *μ*_{1} = *μ*_{2} = *μ*_{3} = … = *μ*_{k}

where *k* is the number of groups or levels of the independent variable. The alternative hypothesis (*H*_{1}) states that *at least two means* will differ.

^{†}The results of a *t*-test and analysis of variance with two groups will be the same. The *t*-test is actually a special case of the analysis of variance, with the relationship *F* = *t*^{2}.

In the last chapter we established that mean differences can be evaluated using a statistical ratio that relates the treatment effect to experimental error. The analysis of variance uses the same process, except that the ratio must now account for the relationships among several means. The *F*-test (named for Sir Ronald Fisher, who developed the test) is used to determine how much of the total observed variability in scores can be explained by differences among several treatment means and how much is attributable to unexplained differences among subjects. To analyze this variability with several groups, we must refer to the concept of **sum of squares (SS)**, introduced in Chapter 17. The sum of squares is calculated by subtracting the sample mean from each score (*X* – X̄), squaring those values, and taking their sum (*SS* = Σ(*X* — X̄)^{2}). The larger the sum of squares, the greater the variability of scores within a sample.
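As a minimal sketch of this calculation in Python (the five scores below are invented for illustration, not drawn from the chapter's data):

```python
# Sum of squares: SS = sum of (X - X_bar)^2 for every score in the sample.
scores = [35, 40, 32, 38, 45]             # hypothetical ROM values, in degrees
mean = sum(scores) / len(scores)          # sample mean X_bar = 38.0
ss = sum((x - mean) ** 2 for x in scores)
print(ss)  # 98.0 -- a larger SS means greater variability within the sample
```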

To illustrate how this concept is applied to analysis of variance, consider a hypothetical study of the effect of using different modalities for 10 days to gain pain-free range of motion (ROM) in patients with tendonitis. Through random assignment, we create four independent groups: one to get ultrasound (US), a second to get ice, a third to get massage, and a fourth group to serve as a control (see Figure 20.1). We use a lowercase *n* to indicate the number of subjects in each group (*n* = 11) and an uppercase *N* to represent the total number of subjects in the study (*N* = 44). The independent variable, type of modality, has four levels (*k* = 4). Therefore, this is a single-factor, multilevel design. The dependent variable is elbow ROM, measured in degrees. Hypothetical data for this study are reported in Table 20.1A.

To estimate the total variability in these data, consider the set of 44 scores as one total sample, ignoring group assignment. We can calculate a mean for this total sample, called the **grand mean**, X̄_{G}, around which all 44 scores will vary. For the data in Table 20.1, the sum of all 44 scores is 1,638, and X̄_{G} = 37.23. The sum of squares for this total sample (Σ(*X* − X̄_{G})^{2}) represents the deviations of each individual score from the grand mean. This *total sum of squares* (*SS*_{t}) reflects the *total variability* that exists within this set of 44 scores. This variability is illustrated in Figure 20.2A, showing the entire distribution of scores above and below the grand mean.

###### FIGURE 20.2

Scores from tendonitis study (Table 20.1). **A**. The total variance in the sample is reflected in the distribution of scores from all four groups around the grand mean (X̄_{G}). **B**. The between-groups variance is determined by the distribution of the four group means. The error variance reflects the variability of scores within each of the groups around the group mean.

As we have described before, total variability in a set of data can be attributed to two sources: a treatment effect (*between* the groups), and unexplained sources of variance, or **error variance**, among the subjects (*within* the groups). As its name implies, the analysis of variance partitions the total variance within a set of data (*SS*_{t}) into these two components. The *between-groups sum of squares* (*SS*_{b}) reflects the spread of group means around the grand mean. The larger this effect, the greater the separation between the groups. The within-groups or *error sum of squares* (*SS*_{e}) reflects the spread of scores within each group around the group mean, or the differences among subjects. In Figure 20.2B, we can see that the means for groups 1 and 2 are close together, and both appear separated from groups 3 and 4. The spread of scores in group 4 appears to be less than in the other groups.

Because hand calculations are complex, computer programs will most often be used to obtain results for an ANOVA. For those who like to see the math, computational formulae for calculating total, between-groups, and error sums of squares are shown in Table 20.2.
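The partitioning of total variability into between-groups and error components can be sketched directly. The three small groups below are invented, chosen only to show that the two components sum exactly to the total:

```python
# Partition SS_t into SS_b (between groups) and SS_e (within groups, error).
groups = [[4, 6, 5], [8, 9, 10], [12, 11, 13]]   # hypothetical scores
all_scores = [x for g in groups for x in g]
grand_mean = sum(all_scores) / len(all_scores)

# Total SS: deviation of every score from the grand mean.
ss_t = sum((x - grand_mean) ** 2 for x in all_scores)

# Between-groups SS: deviation of each group mean from the grand mean,
# weighted by the group's size.
ss_b = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)

# Error SS: deviation of each score from its own group mean.
ss_e = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

print(ss_t, ss_b + ss_e)  # the two components account for all variability
```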

The total degrees of freedom (*df*_{t}) within a set of data will always be one less than the total number of observations, in this case *N* – 1. In our example, *N* = 44 and *df*_{t} = 43. The number of degrees of freedom associated with the between-groups variability (*df*_{b}) is one less than the number of groups (*k* – 1), in this case *df*_{b} = 3. There are *n* – 1 degrees of freedom within each group, so that the number of degrees of freedom for the within-groups error variance (*df*_{e}) for all groups combined will be (*n*_{1} − 1) + (*n*_{2} − 1) + … + (*n*_{k} − 1), or *N* – *k*. For the data in Table 20.1, *df*_{e} = 44 – 4 = 40. The degrees of freedom for the separate variance components are additive, so that (*k* − 1) + (*N* − *k*) = (*N* – 1).

The concepts of between-groups and within-groups variability are once again used to define a statistical ratio. These sources of variability are defined as between-groups and error sums of squares. We convert the sums of squares to a variance estimate, or **mean square (MS)**, by dividing each sum of squares by its respective degrees of freedom. A mean square can be calculated for the between- and error-variance components as follows:

*MS*_{b} = *SS*_{b}/*df*_{b}    *MS*_{e} = *SS*_{e}/*df*_{e}

Mean square values are used to calculate the ***F* statistic** as a ratio of the between-groups variance to the error variance:

*F* = *MS*_{b}/*MS*_{e}

When *H*_{0} is true and no treatment effect exists, the total variance in a sample is due to error, and *MS*_{e} is equal to or larger than *MS*_{b}, yielding an *F*-ratio of approximately 1.0 or less. When *H*_{0} is false and the treatment effect is significant, the between-groups variance is large, yielding an *F*-ratio greater than 1.0. The larger the *F*-ratio, the greater the difference between the group means relative to the variability within the groups. In our example, *F* = 11.89, as shown in Table 20.2C.
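Continuing the sketch with invented data (not the Table 20.1 values), the mean squares and F-ratio follow directly from the sums of squares and their degrees of freedom:

```python
# From sums of squares to F: MS_b = SS_b/df_b, MS_e = SS_e/df_e, F = MS_b/MS_e.
groups = [[24, 26, 23, 27], [30, 33, 31, 32], [25, 24, 26, 25]]  # hypothetical
k = len(groups)                        # number of groups
N = sum(len(g) for g in groups)        # total observations
grand_mean = sum(sum(g) for g in groups) / N

ss_b = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_e = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

ms_b = ss_b / (k - 1)   # between-groups mean square, df_b = k - 1
ms_e = ss_e / (N - k)   # error mean square, df_e = N - k
F = ms_b / ms_e
print(round(F, 2))      # a large ratio suggests a real treatment effect
```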

Like *t*, the calculated *F*-ratio is compared to a critical value to determine its significance. Table A.3 in the Appendix contains critical values of *F* at *α* = .05. Because mean squares are based on squared values, the *F*-ratio cannot be a negative number, and therefore, we do not distinguish tails for an *F* test.

The critical value of *F* for the desired *α* is located in the table by the degrees of freedom associated with the between-groups and error variances, with *df*_{b} across the top of the table and *df*_{e} along the side. For our example, *df*_{b} = 3 and *df*_{e} = 40 (always given in that order). Therefore, from Table A.3, the critical value is *F*_{(3,40)} = 2.84.

We compare this critical value with our calculated value, *F* = 11.89. The calculated value must be *greater than or equal to* the critical value to achieve statistical significance. In this case, we can reject *H*_{0}.
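Printed tables like Table A.3 can also be bypassed in software. A sketch using SciPy (assuming it is installed; `scipy.stats.f` is the F distribution), with the degrees of freedom from our example:

```python
from scipy.stats import f

df_b, df_e = 3, 40                  # degrees of freedom from our example
crit = f.ppf(1 - 0.05, df_b, df_e)  # critical F at alpha = .05, about 2.84
p = f.sf(11.89, df_b, df_e)         # probability of F >= 11.89 under H0

print(round(crit, 2), p < 0.001)    # 11.89 far exceeds the critical value
```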

A significant *F*-ratio does not indicate that each group is different from all other groups. Actually, it only tells us that there is a significant difference between at least two of the means (largest versus smallest). At this point, a separate test must be done to determine exactly where the significant differences lie. Various **multiple comparison tests** are described for this analysis in the next chapter. When the *F*-ratio is smaller than the critical value, *H*_{0} is not rejected and no further analyses are appropriate.

Computer-generated output will present the results of an analysis of variance in a summary table that provides sums of squares and mean square data for determination of the *F* ratio. The table presents data for the between-groups and error sources of variance, as shown in Table 20.1B. The probability level associated with the *F*-ratio is given in the last column of the summary table. This table may be included in the results section of a research report. Terminology used in the table will vary among computer programs and research reports. Rather than listing "between groups" as a source of variance, some programs list the name of the independent variable. The error variance may be called the within-groups variance, residual or between-subjects variance.

In reporting the results of an ANOVA, some researchers may simply indicate if the *F* ratio has achieved significance, indicating *p* < .05, although most reports will include the exact probability obtained by computer analysis. Some authors do not include summary tables in their research reports, choosing instead to report *F*-ratios in the body of the text. When this is done, the calculated value of *F* is given, along with the associated degrees of freedom and probability. For example, for the data in Table 20.1, we would say:

There was a significant difference among the four experimental groups (*F* = 11.89, *df* = 3,40, *p* < .001).

Because of the complexity of human behavior and physiological function, many clinical investigations are designed to study the simultaneous effects of two or more independent variables. This approach is often more economical than testing each variable separately and provides a stronger basis for generalization of results to clinical practice.

As an example, let us assume we wanted to compare the effect of prolonged versus quick stretch for improving ankle range of motion against a control (Factor A). At the same time we are interested in determining if the position of the knee during stretch (flexed or extended) will affect the outcome (Factor B). Instead of looking at each of these factors separately, we can examine their combined influence using a two-way factorial design. This design involves two independent variables: type of stretch (with three levels) and knee position (with two levels). Within the 3 × 2 framework, there are six treatment combinations. As shown in Figure 20.3, we can arrange the design in a table with six cells, so that rows correspond to type of stretch and columns to positions. Each cell represents a unique combination of levels for A and B. We could allocate 10 subjects per cell, for a total of 60 subjects. The design of factorial experiments was discussed in Chapter 10.

###### FIGURE 20.3

Two-way (3 × 2) factorial design testing the effects of (**A**) stretch (*k* = 3) and (**B**) knee position (*k* = 2) on ankle range of motion. Sixty subjects are randomly assigned to each of six experimental conditions (*n* = 10). The marginal means for each independent variable are obtained by pooling data across the second variable.

The appropriate statistical analysis for this design is a **two-way analysis of variance**. The descriptor "two-way" indicates a two-dimensional analysis, involving two independent variables. In this example, each variable is an independent factor (not repeated). The two-way ANOVA is an extension of the one-way analysis. It, too, partitions the total variance in the set of scores into between-groups and error components. The between-groups variance explains the independent variable effects, and the error variance accounts for all sources of variation unexplained by treatment; however, because the design incorporates two independent variables, the between-groups component must be further partitioned to account for the separate and combined effect of each independent variable. Therefore, we can ask three questions of these data:

What is the effect of variable A, independent of variable B?

What is the effect of variable B, independent of variable A?

What is the joint effect or interaction of variables A and B?

These components are called main effects and interaction effects, each explaining part of the total treatment effect.

In a two-way design, the effect of each independent variable can be examined separately, essentially creating two single-factor experiments. These effects are called **main effects**, illustrated in Figure 20.4. For instance, using the preceding example, we can study the main effect of stretch (Factor A) by collapsing or pooling data for the two knee positions. With 10 subjects in each of the original cells, we would now obtain a mean for 20 scores at each level of stretch (X̄_{A1}, X̄_{A2}, and X̄_{A3} in Fig. 20.4A). These three means represent the average between-groups effect of stretch, independent of the effect of knee position. The sum of squares associated with this main effect accounts for the separation among groups that received different forms of stretch.

Similarly, we can collapse the levels of stretch to obtain two means for the main effect of knee position (Factor B). There will be 30 scores at each level (X̄_{B1} and X̄_{B2}), as shown in Figure 20.4B. These two means reflect the average between-groups effect of knee position, independent of type of stretch. A second sum of squares will be calculated to account for the separation between these two groups.

The means for levels of the main effects are called **marginal means**. They represent the average separate effect of each independent variable in the analysis. Comparison of the marginal means within each factor indicates how much of the variability in all 60 scores can be attributed to the overall effect of stretch alone or knee position alone.

In addition to the analysis of main effects, the factorial experiment has the added advantage of being able to look at combinations of levels of each independent variable. Statistically, these are referred to as **interaction effects**. Interaction is present when the effects of one variable are not constant across different levels of the second variable, that is, when various combinations of levels cause differential effects.

To illustrate this concept, consider the hypothetical means given for the six treatment groups in Figure 20.5. Each mean represents a unique combination of stretch and knee position. We can plot these means to more clearly illustrate these relationships. In Figure 20.5A, we have represented range of motion, the dependent variable, along the *Y*-axis. The three stretch groups are represented along the *X*-axis. The means for range of motion for each knee position are plotted at each level of stretch, with lines connecting the means. Note that in this example, the lines are parallel, which means that the pattern of response at each knee position is consistent across all levels of stretch. We can reverse the plot, as shown in Figure 20.5B, with knee position on the *X*-axis, demonstrating a constant pattern for each level of stretch across both knee positions. These graphs are called interaction plots, in this case demonstrating a situation where there is no interaction; that is, prolonged stretch (*A*_{1}) will generate the highest response under both knee conditions and knee extension scores are higher across all levels of stretch.

Now consider a different set of results for the same study, given in Figure 20.6. The interaction plots for these data show lines that are not parallel; that is, the pattern of the baseline variable across all levels of the second variable is not constant. For example, in Figure 20.6A, the plot for knee flexion indicates little difference across levels of stretch. On the other hand, the line for knee extension shows a distinct difference for prolonged stretch. In Figure 20.6B, we see that the three flexion measures are fairly close (between 8 and 10 degrees), and two of the extension measures are also close (between 3 and 4 degrees), but the effect of prolonged stretch with knee extension is quite different. When lines are not parallel or when they cross, interaction is present. In this example, it is not the use of prolonged stretch alone that makes the treatment more effective. It is the combination of prolonged stretch with knee extension. Therefore, there is an interaction between the two independent variables. The analysis of variance will account for this difference among the interaction means as a third component of the between- groups sum of squares.

###### FIGURE 20.6

Plots of data showing interaction between stretch treatment and knee position. Lines that are not parallel or that cross indicate that responses on one variable will vary depending on the level of the second variable. In this example, prolonged stretch with knee extension consistently produces a greater response than other combinations of the two variables.
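Interaction can be checked numerically as well as graphically: with only two levels of knee position, the lines in the plot are parallel exactly when the flexion-extension difference is the same at every level of stretch. The cell means below are hypothetical values patterned on the description of Figure 20.6, not the actual figure data:

```python
# Hypothetical cell means (degrees of ROM) patterned on Figure 20.6.
cell_means = {
    ("prolonged", "flexed"): 9,  ("prolonged", "extended"): 18,
    ("quick",     "flexed"): 8,  ("quick",     "extended"): 4,
    ("none",      "flexed"): 10, ("none",      "extended"): 3,
}

# Effect of extension versus flexion at each level of stretch.
effects = {s: cell_means[(s, "extended")] - cell_means[(s, "flexed")]
           for s in ("prolonged", "quick", "none")}

print(effects)  # unequal differences mean non-parallel lines: interaction
```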

When two independent variables are examined in a single experiment, three statistical hypotheses are usually proposed, one for each main effect and one for the interaction effect. For example, for a 3 × 2 factorial design, the following null hypotheses would be proposed:

*H*_{0}: *μ*_{A1} = *μ*_{A2} = *μ*_{A3} (no main effect of Factor A)
*H*_{0}: *μ*_{B1} = *μ*_{B2} (no main effect of Factor B)
*H*_{0}: there is no interaction between Factors A and B

An alternative hypothesis can be proposed for each null hypothesis. These hypotheses may be general statements of difference, or they may specify differences between specific means. An *F*-ratio is calculated to test each null hypothesis.

We have chosen not to include a mathematical example for a two-way analysis of variance, as we expect that all such analyses will be done by a computer. Those interested in details of the computations should refer to advanced statistical texts. We will examine the format for presentation of results of a two-way ANOVA, as shown in Table 20.3.

| Source | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| STRETCH | 1080.729 | 2 | 540.364 | 41.843 | .000➊ |
| POSITION | 2.017 | 1 | 2.017 | .156 | .694 |
| STRETCH × POSITION | 707.722 | 2 | 353.861 | 27.401 | .000 |
| Error➋ | 697.362 | 54 | 12.914 | | |
| Total | 2487.830 | 59 | 42.167 | | |

➊ Computer programs will generate *p* values with a specified level of precision, that is, to a set number of decimal places. Therefore, a *p* value of .000 does not indicate zero probability. It simply means that the probability is <.001, but the precision of the output does not allow the exact value to be printed.

➋ Error variance may also be called residual variance.

*Note:* Some elements of data generated by SPSS, which are not essential to understanding the output, are not included in this table.

Note that there are three between-groups sources of variance listed, two main effects and the interaction effect. These are usually listed in the summary table according to the name of the independent variable. Thus, for our example, type of "stretch" and "knee position" are listed as main effects. The interaction between two variables is signified by ×, such as Stretch × Knee Position, read "stretch by knee position." The error term represents the unexplained variability between subjects within all combinations of stretch and knee position.

The number of degrees of freedom associated with each main effect is one less than the number of levels of that independent variable (*k* – 1). To clarify this notation, we use (*A* – 1) degrees of freedom for Factor A, and (*B* – 1) for Factor B, where the letters *A* and *B* represent the number of levels of each factor. Therefore, for stretch with three levels, *df* = 2. For knee position with two levels, *df* = 1. The number of degrees of freedom for the interaction between these variables is the product of their respective degrees of freedom, (*A* – 1)(*B* – 1). Therefore, the interaction effect in this example has 2 × 1 = 2 degrees of freedom.

The total degrees of freedom associated with an experiment will always be one less than the total number of observations, *N* – 1. In this study, with *n* = 10 per group (*N* = 60), *df*_{t} = 59. The error degrees of freedom can be determined by using (*A*)(*B*)(*n* – 1) with equal-size groups, or by subtracting the combined between-groups degrees of freedom from the total degrees of freedom. For this example, *df*_{e} = (3)(2)(9) = 54, or equivalently 59 − 2 − 1 − 2 = 54.
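The degrees-of-freedom bookkeeping for this 3 × 2 design can be verified in a few lines:

```python
# Degrees of freedom for the 3 x 2 factorial example (n = 10 per cell).
A, B, n = 3, 2, 10
N = A * B * n                 # 60 total observations

df_A = A - 1                  # 2, for stretch
df_B = B - 1                  # 1, for knee position
df_AxB = (A - 1) * (B - 1)    # 2, for the interaction
df_e = A * B * (n - 1)        # 54, for error
df_t = N - 1                  # 59

print(df_A + df_B + df_AxB + df_e == df_t)  # components sum to the total
```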

Calculation of *F* is based on the ratio of between-groups to error mean squares. Mean square values are determined by dividing the sum of squares for each effect by its associated degrees of freedom. Each between-groups effect generates an *F*-ratio, based on its own mean square divided by the mean square for the common error term, *MS*_{e}. For example, for the data shown in Table 20.3, the *F*-ratios for the main effects of stretch (*A*) and knee position (*B*)^{§} are obtained by

*F*_{A} = *MS*_{A}/*MS*_{e} = 540.364/12.914 = 41.84    *F*_{B} = *MS*_{B}/*MS*_{e} = 2.017/12.914 = 0.156

Similarly, the *F*-ratio for the interaction term, *A* × *B*, is calculated according to

*F*_{A×B} = *MS*_{A×B}/*MS*_{e} = 353.861/12.914 = 27.40

Each *F*-ratio is compared with a critical value from Appendix Table A.3. The degrees of freedom associated with the specific between-groups effect (main effect or interaction) are located across the top and the degrees of freedom associated with the error term are listed along the side. The critical values for each effect shown in Table 20.3 are approximately *F*_{(2,54)} = 3.17 for stretch and the interaction, and *F*_{(1,54)} = 4.02 for knee position.

Therefore, this ANOVA demonstrates a significant main effect for type of stretch and a significant interaction effect between stretch and knee position.

^{§}The use of the subscript *B* to denote Factor B should not be confused with the use of subscript *b* to denote "between-groups" in previous examples. In this example, both *A* and *B* represent between-groups sources of variance for the two independent variables.
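The arithmetic behind the F column of Table 20.3 can be reproduced directly from the table's sums of squares and degrees of freedom (a sketch of the hand calculation, not SPSS output):

```python
# Each effect's F = (SS / df) / MS_error, using values from Table 20.3.
effects = {
    "stretch":            (1080.729, 2),
    "position":           (2.017, 1),
    "stretch x position": (707.722, 2),
}
ms_error = 697.362 / 54   # error mean square = 12.914

f_ratios = {name: (ss / df) / ms_error for name, (ss, df) in effects.items()}
for name, f_val in f_ratios.items():
    print(name, round(f_val, 3))
```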

The information contained in an ANOVA table provides a convenient summary of a study's design and results. For example, from Table 20.3 we can tell that there are two independent variables, stretch and knee position, with three and two levels, respectively (by looking at degrees of freedom); that there are 60 subjects in the study (*df*_{t} = *N* – 1 = 59); and that the outcome was dependent on which type of stretch was used in a particular knee position.

In most cases, researchers develop factorial designs with the expectation of specific patterns of interaction between the independent variables; that is, they hypothesize that certain combinations of treatments will be most effective. If this were not the case, the researcher could just as easily design separate one-way studies. Clinical interpretation of interaction is often facilitated by dividing the factorial design into several smaller "single-factor" experiments, each represented by the rows and columns in the design, as shown in Figure 20.7. These separate effects are called **simple effects**. Interaction is defined as a significant difference between simple effects.^{2} Each line in an interaction plot (see Figures 20.5 and 20.6) represents a simple effect. Simple effects are distinguished from main effects, which are based on averaged values across a second variable. An analysis of simple effects will reveal differential patterns within each of the independent variables. Such an analysis can be carried out on row effects, column effects or both. With simple effects, the researcher can inspect the data to determine which levels of either variable contribute most to the observed differences. The analysis of simple effects is similar to carrying out several single-factor analyses of variance, with between-groups effects extracted from the larger factorial design.^{∗∗}

^{∗∗}See Keppel^{2} and Green^{3} for detailed discussion of statistical procedures for analyzing simple effects.

When there is no interaction effect in an experiment, the main effects are easily interpreted by referring to the outcome of the *F*-test for each independent variable. In that case, the analysis is essentially reduced to a one-way design, and combinations of treatments are ignored. If an interaction effect is present and main effects are not significant, interpretation is also straightforward; however, when an interaction effect is present, significant main effects are more difficult to interpret. For example, look again at the interaction plots in Figure 20.5. With no interaction present, it is easy to see that range of motion was consistently higher with the knee in extension and with prolonged stretch.

In Figure 20.6, where there is a significant interaction, the separate effects of knee position and type of stretch must be examined more carefully. In Figure 20.6A, with type of stretch (Factor A) along the baseline, we can see that the level of response at different knee positions changes at different levels of stretch. Therefore, we cannot draw any general conclusions about the main effect of knee position. This is called a *disordinal interaction* and the main effect of knee position is ignored in the interpretation of results. In Figure 20.6B, however, where knee position is plotted on the baseline, we can see that although prolonged stretch with knee extension shows the largest difference, it is also true that prolonged stretch is consistently above all other levels of stretch. This illustrates an *ordinal interaction*, where the relative ranking of the levels of Factor A does not change at different levels of Factor B. Therefore, it would be appropriate to conclude that, in general, treatment with prolonged stretch consistently results in greater range of motion than treatment with quick stretch or no stretch.

**Multiple comparison tests** are also used to compare means for each significant effect following an analysis of variance. For significant main effects, the marginal means are compared. For example, we would compare X̄_{A1}, X̄_{A2}, and X̄_{A3} (see Figure 20.4) to examine the main effect of stretch. When a main effect has only two levels, as with knee position in this example, a multiple comparison is unnecessary; the *F*-test functions like a *t*-test, so if *F* is significant, one need only look at the two means to determine which is greater.

For significant interaction effects, the individual group means are compared. For example, we could determine which of the six combinations of stretch and knee position would produce the greatest changes in ankle range of motion. Based on the data shown in Figure 20.6, we might expect to find that prolonged stretch with knee extension (*A*_{1}*B*_{2}) elicits a more effective response than the other five combinations.^{††}

^{††}See Tables 21.6 and 21.7 in the next chapter for results of multiple comparison tests for the two-way analysis of variance shown in Table 20.3.

A multifactor analysis of variance can be performed with any number of independent variables, although we rarely see analyses beyond three dimensions. For example, we could expand the preceding study to look at the effects of stretch, knee position, and three forms of exercise for increasing ankle range of motion.

The analysis of a three-way design is a direct extension of the two-way ANOVA. With three independent variables, *A, B* and C, the total variability in the data is divided into seven parts: three main effects (one for each independent variable), three double interactions testing each pair of independent variables in combination (*A × B*, *A* × *C*, and *B* × *C*) and a triple interaction (*A* × *B* × *C*) testing all possible combinations of the three variables.^{‡‡} A sum of squares is calculated for each of these effects, to account for their contribution to the total variance in the sample. As in other analyses, each main effect has *k* – 1 degrees of freedom, and degrees of freedom for the interaction terms are the product of the degrees of freedom for each effect in the interaction. The total degrees of freedom will be *N* – 1, and the error term will have (*A*)(*B*)(*C*)(*n* – 1) degrees of freedom. An *F*-ratio is calculated for each main effect and each interaction effect, using the mean square for the error term in the denominator.
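As with the two-way case, the degrees of freedom for all components must sum to *N* − 1. A sketch with hypothetical numbers of levels (the chapter's example does not specify them):

```python
# Degrees of freedom for a three-way (A x B x C) factorial design.
A, B, C, n = 3, 2, 3, 5       # hypothetical levels and subjects per cell
N = A * B * C * n             # 90 total observations

dfs = {
    "A": A - 1, "B": B - 1, "C": C - 1,        # main effects
    "AxB": (A - 1) * (B - 1),                  # double interactions
    "AxC": (A - 1) * (C - 1),
    "BxC": (B - 1) * (C - 1),
    "AxBxC": (A - 1) * (B - 1) * (C - 1),      # triple interaction
    "error": A * B * C * (n - 1),
}
print(sum(dfs.values()) == N - 1)  # seven treatment effects plus error = df_t
```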

The advantage of higher order factorial designs is the ability to examine how combinations of several variables influence behavior. Because treatment variables rarely exist in isolation, this approach can greatly enhance the construct validity and generalization of research results to practice. Unfortunately, such designs can also become overly complex, requiring large numbers of treatment groups and subjects. In addition, because the statistical analysis breaks down the total variance into so many components, interaction tests for the ANOVA will generally have lower power.

^{‡‡}These interactions are illustrated in Figure 10.6 in Chapter 10.

Up to now we have discussed the analysis of variance only as it is applied to completely randomized designs. These designs, where subjects are randomly assigned to treatment groups, are also called *between-subjects designs* because all sources of variance represent differences between subjects (within a group and between groups). Clinical investigators, however, often use repeated factors to evaluate the performance of each subject under several experimental conditions. The repeated measures design is logically applied to study variables where practice or carryover effects are minimal and where differences in an individual's performance across treatment levels are of interest. This type of study can involve one or more independent variables.

In a repeated measures design, all subjects are tested under *k* treatment conditions. The analysis of variance is modified to account for the correlation among successive measurements on the same individual. For this reason, such designs are also called *within-subjects designs*. The statistical hypotheses proposed for repeated measures designs are the same as those for independent samples, except that the means represent treatment conditions rather than groups.

The statistical advantage of using repeated measures is that individual differences are controlled. When independent groups are compared, it is likely that groups will differ on extraneous variables and that these differences will be superimposed on treatment effects; that is, both treatment differences and error variance will account for observed differences between groups. With repeated measures designs, however, we have only one group, and differences between treatment conditions should primarily reflect treatment effects. Therefore, error variance in a repeated measures analysis will be smaller than in a randomized experiment. Statistically, this has the effect of reducing the size of the error term in the analysis of variance, which means that the *F*-ratio will be larger. Therefore, the test is more powerful than when independent samples are used.

The simplest repeated measures design involves one independent variable, where all levels of treatment are administered to all subjects. To illustrate this approach, let us consider a single-factor experiment designed to look at differences in isometric elbow flexor strength with the forearm in three positions: pronation, neutral and supination. The independent variable, forearm position, has three levels (*k* = 3). Logically, this question warrants a repeated measures design, where each subject's strength is tested in each position (see Figure 20.8).

In a repeated measures design, we are interested in a comparison across treatment conditions *within each subject*. It is not of interest to look at averaged group performance at each condition. Therefore, statistically, each subject is considered a unique block in the design. We can represent the design diagrammatically as shown in Figure 20.9, with rows corresponding to subjects (*n* = 9), and columns representing experimental conditions. Note that this diagram resembles a two-way factorial design, with forearm position as one independent variable and "subjects" as the other. Using this interpretation, each cell in the design has a sample size of *n* = 1. Each individual subject is considered a separate level of the independent variable *subjects*.

Using the format of a two-way analysis, the **repeated measures analysis of variance** will look at the main effect of forearm position, the main effect of subjects, and the interaction between these two factors. Because each cell in the design has only one score, there can be no variability within a cell. Therefore, the error term for this analysis is actually the interaction between subjects and treatment; that is, interaction reflects the inconsistency of subjects across the levels of treatment. This interaction represents the variance that is unexplained by the treatment variable and will serve as the denominator for the *F*-ratio.

The total degrees of freedom associated with a repeated measures design will equal one less than the total number of observations made, or *nk* – 1. In our example, *df*_{t} = (9)(3) − 1 = 26.

As in other analyses, the number of degrees of freedom associated with the main effects will be *k* – 1 for the independent variable, and *n* – 1 for subjects. The degrees of freedom for the error term are determined as they are for an interaction, so that *df*_{e} = (*k* – 1)(*n* – 1). Table 20.4C shows these values in a summary table for the current example.

The sums of squares for the treatment effect and the error effect are divided by their associated degrees of freedom to obtain the mean squares. These mean square values are then used to calculate the *F*-ratio for treatment according to

*F*_{A} = *MS*_{A}/*MS*_{A×S}

where *MS*_{A} is the mean square for the treatment variable, and *MS*_{A×S} is the mean square for the interaction of treatment and subjects, or the error term. The resulting ratio for the data in Table 20.4 is shown in the summary table.

We can calculate an *F*-ratio for the effect of subjects, using *F*_{S} = *MS*_{S}/*MS*_{A×S}; however, this is not a meaningful test. We expect subjects to differ from each other, and it is generally of no experimental interest to establish that they are different. The *F*-ratio for subjects is not given in most computer printouts (Table 20.4➍), and this effect is generally ignored in the interpretation of data.^{§§}

The critical value for the *F*-ratio for treatment is located in Appendix Table A.3, using the degrees of freedom for treatment (*df*_{b}) and the degrees of freedom for the error term (*df*_{e}). Therefore, the critical value for this effect will be *F*_{(2,16)} = 3.63. The calculated *F*-ratio exceeds this critical value and, therefore, is significant. The null hypothesis for treatment effects is rejected. The summary table shows that this difference is significant at *p* < .001 (Table 20.4➌). We conclude that elbow flexor strength does differ across forearm positions. It will be appropriate at this point to perform a multiple comparison test on the three means to determine which forearm positions are significantly different from the others.^{∗∗∗}
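The full partitioning for a one-way repeated measures analysis can be sketched by hand with NumPy. The strength scores below are simulated stand-ins, not the values in Table 20.4; only the structure (9 subjects × 3 forearm positions) matches the example:

```python
import numpy as np

# Simulated strength scores: 9 subjects (rows) x 3 forearm positions (columns).
rng = np.random.default_rng(42)
X = rng.normal(loc=[18.0, 22.0, 20.0], scale=2.0, size=(9, 3))

n, k = X.shape
grand = X.mean()

# Partition the total sum of squares into treatment, subjects, and error.
ss_total = ((X - grand) ** 2).sum()
ss_treat = n * ((X.mean(axis=0) - grand) ** 2).sum()   # forearm position
ss_subj = k * ((X.mean(axis=1) - grand) ** 2).sum()    # subjects
ss_error = ss_total - ss_treat - ss_subj               # treatment x subject interaction

df_treat, df_subj, df_error = k - 1, n - 1, (k - 1) * (n - 1)

# F-ratio for treatment: MS_A / MS_AxS
F = (ss_treat / df_treat) / (ss_error / df_error)
```

Note that the three sums of squares reproduce the total exactly, and the degrees of freedom (2, 8, and 16) sum to *nk* – 1 = 26, as in the summary table.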

^{§§}The one-way repeated measures ANOVA is used to generate MS values for calculation of models 2 and 3 of the ICC reliability coefficient (see Chapter 26). For interpretation of the ICC, it is useful to determine that the between-subjects effect is significant. These computations are easily done by hand if they are not generated in the computer analysis.

^{∗∗∗}See Table 21.7 in the next chapter for the multiple comparison for the repeated measures analysis of variance shown in Table 20.4.

We have previously discussed the fact that the analysis of variance is based on an assumption about the homogeneity of variances among treatment groups. This assumption is also made with repeated measures designs; however, with repeated measures we cannot examine variances of different groups because only one group is involved. Instead, the variances of interest reflect difference scores across treatment conditions within a subject. For example, with three repeated treatment conditions, A_{1}, A_{2}, A_{3}, we will have three difference scores: A_{1} − A_{2}, A_{1} − A_{3}, and A_{2} − A_{3}. When used in this way with repeated measures, the homogeneity of variance assumption is called the **assumption of sphericity**, which states that the variances within each of these sets of difference scores will be relatively equal and correlated with each other.
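The difference scores just described can be formed directly from the data matrix. In this sketch the array and values are hypothetical; sphericity concerns whether the variances of the pairwise difference scores are approximately equal:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(12, 3))          # hypothetical: 12 subjects under A1, A2, A3

# Difference scores between each pair of repeated conditions.
diffs = {
    "A1-A2": X[:, 0] - X[:, 1],
    "A1-A3": X[:, 0] - X[:, 2],
    "A2-A3": X[:, 1] - X[:, 2],
}

# Sphericity requires these variances to be (approximately) equal.
variances = {pair: d.var(ddof=1) for pair, d in diffs.items()}
```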

We have also established that reasonable departures from the variance assumption would not seriously affect the validity of the analysis of variance, except in situations where sample sizes were grossly unequal. One might think, then, that violations of the variance assumption would be unimportant for repeated measures, where treatment conditions must have equal sample sizes. This is not the case, however. Because the repeated measures test examines correlated scores across treatment conditions, it is especially sensitive to variance differences, biasing the test in the direction of Type I error. In other words, the repeated measures test is considered too liberal when the sphericity assumption is violated, increasing the chances of finding significant differences above the selected *α* level.

To address this concern, most computer programs will run a repeated measures ANOVA in two different ways, using multivariate and univariate statistics. Multivariate tests are preferable in that they do not require the assumption of sphericity. Several multivariate tests are usually run simultaneously, with unfamiliar names such as *Pillai's Trace, Wilks' Lambda, Hotelling's Trace* and *Roy's Largest Root*. Because these tests are all based on different procedures, they are usually converted to a common reference, an *F*-ratio. These tests examine all possible sets of difference scores, and determine if there is a significant difference among them. If they are significant, multiple comparison tests should follow. Because researchers are generally less familiar with these multivariate tests, they tend not to be reported, even though they appear prominently in computer output.

The second approach, used more often in clinical research, involves the standard repeated measures *F*-test, but with an adjustment to the value of *p* to account for possible violations of sphericity. A test called **Mauchly's Test of Sphericity** (Table 20.4B) is performed first to determine if the adjustment is needed.^{†††} If the sphericity test is significant, correction is achieved by decreasing the degrees of freedom used to determine the critical value of *F*, thereby making the critical value larger. If the critical value is larger, then the calculated value of *F* must be larger to achieve significance. This compensates for bias toward Type I error by making it harder to demonstrate significant differences. Note that there is no difference in how the ANOVA is run, and the generated *F*-ratio with its associated degrees of freedom for the ANOVA remains unchanged. Only the probability associated with that *F* will change. This adjustment is only relevant, however, when the *F*-ratio is significant.

The degrees of freedom for the *F*-ratio are adjusted by multiplying them by a correction factor given the symbol **epsilon** (Table 20.4➋). Two different versions of epsilon are used: the **Greenhouse-Geisser correction**^{4} and the **Huynh-Feldt correction**.^{5} The Greenhouse-Geisser correction is usually considered first. If it results in a significant *F*, agreeing with the original analysis, then the probability associated with the Greenhouse-Geisser correction is used. When it does not result in a significant outcome, disagreeing with the original analysis, then the Huynh-Feldt correction is applied.
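The Greenhouse-Geisser epsilon can be computed from the covariance matrix of the repeated conditions. The sketch below uses the double-centering form of Box's formula; this particular computation is an assumption on our part, not a procedure taken from the text:

```python
import numpy as np

def gg_epsilon(S):
    """Greenhouse-Geisser epsilon from the k x k covariance matrix of the
    repeated conditions. Epsilon = 1 when sphericity holds; its lower
    bound is 1/(k - 1)."""
    k = S.shape[0]
    # Double-center the covariance matrix (subtract row and column means,
    # add back the grand mean).
    S_dc = S - S.mean(axis=0) - S.mean(axis=1)[:, None] + S.mean()
    return np.trace(S_dc) ** 2 / ((k - 1) * (S_dc ** 2).sum())

# A perfectly spherical covariance (identity) gives epsilon = 1;
# unequal variances pull epsilon below 1.
eps_spherical = gg_epsilon(np.eye(3))
eps_unequal = gg_epsilon(np.diag([4.0, 1.0, 1.0]))
```

In practice the covariance matrix would come from the data themselves, e.g. `S = np.cov(X, rowvar=False)` for an *n* × *k* array of repeated measures.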

These correction factors are shown in Table 20.4➋ for the one-way repeated measures analysis for the comparison of elbow flexor strength across three forearm positions. Because the test for sphericity is not significant (*p* = .239), we are not concerned about this adjustment. If the test for sphericity had been significant, however, the probabilities generated in the computer analysis for the ANOVA table would be the corrected ones.

^{†††}The power of Mauchly's test will vary with sample size.^{3} With small samples it loses power. With large samples it may be significant even though the impact of violating the sphericity assumption is minor.

The concepts of repeated measures analysis can also be applied to multifactor experiments. Such designs can include all repeated factors or a combination of repeated and independent factors. When all factors are repeated, the design is referred to as a repeated measures or within-subjects design. When a single experiment involves at least one independent factor and one repeated factor, the design is called a **mixed design**. We present the general concepts behind these types of analyses and describe the format for presentation of results. We base our examples on a two-factor design, although these concepts can be easily expanded to accommodate more complicated designs.

With two repeated factors, the design is an extension of the single-factor repeated measures design. Suppose we redesigned our previous example to study isometric elbow flexor strength with the forearm in three positions and with the elbow at two different angles. We would then be able to see if the position of the elbow had any influence on strength when combined with different forearm positions. In this 3 × 2 repeated measures design, if *n* = 8, each subject would be tested six times, for a total of 48 measurements.

With two repeated factors, variance is partitioned to include a main effect for subjects and for each treatment variable, as well as for subject by treatment interactions (forearm × subjects, elbow × subjects, and forearm × elbow × subjects). These interactions represent the random or chance variations among subjects for each treatment effect. The mean squares for these interaction terms are used to calculate an error term for each repeated main effect, as shown in Table 20.5. The assignment of degrees of freedom for each of these variance components follows the rules used for the regular two-way analysis of variance: for each main effect *df* = *k* − 1; for each interaction effect *df* = (*A* − 1)(*B* − 1).
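The error-term logic can be sketched for the forearm-position effect with simulated data (the values are hypothetical); the elbow effect and the forearm × elbow interaction would be tested the same way against their own subject interactions:

```python
import numpy as np

# Simulated scores: 8 subjects x 3 forearm positions x 2 elbow angles.
rng = np.random.default_rng(5)
n, a, b = 8, 3, 2
X = rng.normal(loc=20.0, scale=2.0, size=(n, a, b))
grand = X.mean()

subj = X.mean(axis=(1, 2))          # subject means, shape (n,)
A = X.mean(axis=(0, 2))             # forearm position means, shape (a,)

ss_A = n * b * ((A - grand) ** 2).sum()

# The A x subjects interaction serves as the error term for the test of A.
AS = X.mean(axis=2)                 # n x a table of subject-by-position means
ss_AxS = b * ((AS - subj[:, None] - A[None, :] + grand) ** 2).sum()

# F-ratio for forearm position, tested against its own error term.
F_A = (ss_A / (a - 1)) / (ss_AxS / ((a - 1) * (n - 1)))
```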

Each treatment effect in this study (forearm, elbow, and forearm × elbow) is tested by the ratio *F* = *MS*/*MS*_{e}, where the error term is the interaction of that particular treatment effect with subjects. As shown in Table 20.5, each repeated factor is essentially being tested as it would be in a single-factor experiment, with its own error term. By separating out an error component for each treatment effect, we have created a more powerful test than we would have with one common error term; that is, the error component is smaller for each separate treatment effect than it would be with a combined error term. Therefore, *F*-ratios tend to be larger. In this example, only the main effect of forearm position is significant (*p* = .013).

Once again, researchers will generally ignore ratios for the effect of subjects (Table 20.5➋). The effect of subjects is only important insofar as it is used to determine the error terms for the treatment effects. This effect will often be omitted from the summary table.

In a two-factor analysis, where only one factor is repeated, the overall format for the analysis of variance is a combination of between-subjects (independent factors) and within-subjects (repeated factors) analyses. In a mixed design, the independent factor is analyzed as it would be in a regular one-way analysis of variance, pooling all data for the repeated factor. The repeated factor is analyzed using techniques for a repeated measures analysis (see Table 20.6).

For example, suppose we wanted to look at the effect of ice applied to the biceps brachii on elbow flexor strength in three forearm positions. Ice is an independent factor, and forearm position is a repeated factor. Assume we have three levels of ice (ice pack, placebo and control), and three levels of forearm position, as before. We randomly assign eight subjects (*n* = 8) to each ice group, for a total of 24 subjects (*N* = 24), each tested in three forearm positions.

The first part of the analysis for this study is the *within-subjects* analysis, or the analysis of all factors that include the repeated factor (Table 20.6➊). This section lists the main effect for forearm position, the interaction between forearm position and ice, and a common error term to test these two effects. In this example, the main effect of forearm position is significant (*p* < .001), as is the interaction effect (*p* = .004).

The second part of the analysis addresses the independent factor, ice. Each level of this factor is assigned to eight different subjects. Comparison across these groups is a *between-subjects* analysis, shown in Table 20.6➋. This is actually a one-way analysis of variance for the effect of ice, with two sources of variance: the between-groups effect (ice) and the within-groups variance, or error term. In this example, there is also a significant difference among the three levels of ice (*p* = .051).
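The full variance partition for this mixed design can be sketched by hand. The scores below are simulated stand-ins for the data of Table 20.6; only the structure (3 ice groups × 8 subjects × 3 forearm positions) matches the example:

```python
import numpy as np

# Simulated data: 3 ice groups x 8 subjects per group x 3 forearm positions.
rng = np.random.default_rng(3)
g, n, k = 3, 8, 3
X = rng.normal(loc=20.0, scale=2.0, size=(g, n, k))

grand = X.mean()
ss_total = ((X - grand) ** 2).sum()

# Between-subjects partition: ice effect and subjects-within-groups error.
subj_means = X.mean(axis=2)                       # g x n
group_means = X.mean(axis=(1, 2))                 # g
ss_between_subj = k * ((subj_means - grand) ** 2).sum()
ss_ice = n * k * ((group_means - grand) ** 2).sum()
ss_subj_within = ss_between_subj - ss_ice         # error term for the ice test

# Within-subjects partition: position, position x ice, and their error.
pos_means = X.mean(axis=(0, 1))                   # k
cell_means = X.mean(axis=1)                       # g x k
ss_pos = g * n * ((pos_means - grand) ** 2).sum()
ss_interaction = n * ((cell_means - group_means[:, None]
                       - pos_means[None, :] + grand) ** 2).sum()
ss_within_error = ss_total - ss_between_subj - ss_pos - ss_interaction

# Degrees of freedom for each term.
df = {"ice": g - 1, "subj_within": g * (n - 1),
      "position": k - 1, "pos_x_ice": (g - 1) * (k - 1),
      "within_error": g * (n - 1) * (k - 1)}

F_ice = (ss_ice / df["ice"]) / (ss_subj_within / df["subj_within"])
F_pos = (ss_pos / df["position"]) / (ss_within_error / df["within_error"])
```

Note that the between-subjects test (ice) uses subjects-within-groups as its error term, while the repeated factor and the interaction share the within-subjects error, mirroring the two parts of Table 20.6.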

The analysis of variance provides researchers with a statistical tool that can adapt to a wide variety of design situations. We have covered only the most common applications in this chapter. Many other designs, such as nested designs, randomized blocks and studies with unequal samples, require mathematical adjustments in the analysis that are too complex for us to cover here. Fortunately, computer packages are readily available for performing analyses of variance, and are generally flexible enough to accommodate all the design variations that researchers might require in clinical research. The **general linear model (GLM)** is usually used to accommodate the variety of design options for the ANOVA.

The *t*-test and analysis of variance are based on several assumptions about the nature of data, which we have reviewed in this and previous chapters. In general, these tests are robust to violations of these assumptions (with the exception of repeated measures designs), so that they can be used with confidence in most research situations. When clinical experiments are performed with very small samples, however, the data may violate these assumptions sufficiently to warrant transforming the data to a different scale of measurement that better reflects the appropriate characteristics for statistical analysis (see Appendix D), or it may be appropriate to use nonparametric statistics that do not make the same demands on the data. In Chapter 22 we describe several nonparametric tests that can be used in place of the *t*-test and the single-factor analysis of variance.

When the analysis of variance results in a significant finding, researchers are usually interested in pursuing the analysis to determine which specific levels of the independent variables are different from each other. Multiple comparison tests, designed specifically for this purpose, are described in the next chapter. At that time we look at some of the data presented here, and show how those data can be analyzed further using multiple comparison techniques.

As we continue to discuss statistical tests in subsequent chapters, many readers will find it helpful to refer to the chart provided in Appendix B, which presents an overview of statistical tests and criteria for choosing a particular test for analyzing different types of data and designs.

*Statistical Analysis in Psychology and Education* (5th ed.). New York: McGraw-Hill, 1981.

*Design and Analysis: A Researcher's Handbook* (4th ed.). Englewood Cliffs, NJ: Prentice Hall, 2004.

*Using SPSS for Windows and Macintosh: Analyzing and Understanding Data* (4th ed.). Upper Saddle River, NJ: Prentice Hall, 2004.

*F* distribution in multivariate analysis. *Ann Math Statist* 1958;29:885–891.

*J Educ Statist* 1976;1:69–82.