When an analysis of variance results in a significant *F*-ratio, the researcher is justified in rejecting the null hypothesis and concluding that not all *k* population means are equal; however, this outcome tells us nothing about which means are significantly different from which other means. In this chapter we describe the most commonly used statistical procedures for deciding which means are significantly different. These procedures are called **multiple comparison tests**.

Several multiple comparison procedures are available, most named for the individuals who developed them. Each test involves the rank ordering of means and successive contrasts of pairs of means. The pairwise differences between means are tested against a critical value to determine if the difference is large enough to be significant. The major difference between the various tests lies in the degree of protection offered against Type I and Type II error. A conservative test will protect against Type I error, requiring that means be far apart to establish significance. A more liberal test will find a significant difference with means that are closer together, thereby offering greater protection against Type II error.

Most multiple comparison procedures are classified as **post hoc** because specific comparisons of interest are decided *after* the analysis of variance is completed. These are considered **unplanned comparisons**, in that they are based on exploration of the outcome. Therefore, these tests are most useful when a general alternative hypothesis has been proposed. We will describe the three most commonly reported *post hoc* multiple comparison procedures: Tukey's honestly significant difference method, the Newman-Keuls test, and the Scheffé comparison. Other *post hoc* tests used less often include *Duncan's Multiple Range Test*^{1} and *Fisher's Least Significant Difference*.^{2} These tests are generally considered too liberal, resulting in too great a risk of Type I error (see Figure 21.1).

Other multiple comparison tests are classified as **a priori**, or **planned comparisons**, because specific contrasts are planned *prior* to data collection based on the research rationale. Technically, these comparisons are appropriate even when an *F*-test is not significant, as they are planned before data are collected, and therefore, the overall null hypothesis is not of interest. Although several planned comparison tests are available, we will describe one commonly used method called the Bonferroni *t*-test.^{∗}

As some statistical computer packages do not include multiple comparison tests for the analysis of variance, it is useful to be able to perform these tests by hand. Fortunately, most multiple comparison procedures are simple enough to be carried out efficiently with a hand calculator once the analysis of variance data are obtained.

^{∗} Other types of planned comparisons include *orthogonal contrasts,* which allow for comparison of specific combinations of means, and *Dunnett's test*^{3}, which focuses on comparison of a control group with each of several experimental groups.

At the end of Chapter 19 we discussed the inappropriate use of multiple *t*-tests when more than two comparisons are made within a single set of data. This issue is based on the desired protection against Type I error in an experiment, which is specified by *α*. At *α* = .05, we limit ourselves to a 5% chance that we will experience the random event of finding a significant difference when none exists. We take a 5% risk that we will be in error if we say that group means are different for any single comparison. We must differentiate this **per comparison error rate (*α*_{PC})** from the situation where *α* is set at .05 for each of several comparisons in one experiment. Although it is true that *α* = .05 for each individual comparison, the potential cumulative error for the set of comparisons is actually greater than .05. This cumulative probability has been called the **familywise error rate (*α*_{FW})**^{†} and represents the probability of making at least one Type I error in a set or "family" of statistical comparisons.

The Type I error rate for a family of comparisons, where each individual comparison is tested at *α* = .05, is equal to

*α*_{FW} = 1 − (1 − *α*_{PC})^{*c*}

where *c* represents the total number of comparisons. The maximum number of pairwise contrasts for any set of data will be *k*(*k* − 1)/2. If we want to compare three means, testing each comparison at *α* = .05, we will perform *c* = 3(3 − 1)/2 = 3 comparisons. Therefore,

*α*_{FW} = 1 − (1 − .05)^{3} = 1 − (.95)^{3} = .143

This means that if we perform three *t*-tests and find three significant differences, we risk a greater than 14% chance that at least one of these significant differences occurred by chance. This exceeds the generally accepted standard of 5% risk for Type I error.

As the number of comparisons increases, so does the probability that at least one significant difference will occur by chance. For example, with *α*_{PC} set at .05, tests involving four, five and six means (*c* = 6, 10 and 15 comparisons, respectively) will result in familywise probabilities of Type I error of .26, .40 and .54.

Clearly, the likelihood of finding significant differences among a set of means, even when *H*_{0} is true for all comparisons, will be extremely high as the number of comparisons increases.
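The arithmetic behind these familywise error rates can be sketched in a few lines of Python (the function names here are illustrative, not part of the chapter):

```python
# Familywise error rate for c comparisons, each tested at a per
# comparison rate alpha_pc: alpha_fw = 1 - (1 - alpha_pc)^c.

def n_pairwise(k):
    """Maximum number of pairwise contrasts among k means: k(k - 1)/2."""
    return k * (k - 1) // 2

def familywise_alpha(alpha_pc, c):
    """Probability of at least one Type I error in c independent tests."""
    return 1 - (1 - alpha_pc) ** c

# Reproduce the chapter's example: k = 3 means gives c = 3 comparisons,
# and a familywise error rate of about .14 at alpha_pc = .05.
for k in (3, 4, 5, 6):
    c = n_pairwise(k)
    print(k, c, round(familywise_alpha(0.05, c), 3))
```

Note that this treats the *c* comparisons as independent, which is the standard simplifying assumption behind the 1 − (1 − *α*)^{*c*} formula.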

Several of the multiple comparison procedures we describe base their critical values on per comparison error rates; others base their Type I error rate on the entire family of comparisons. There is no consensus about preferences for one approach over the other. Use of a per comparison error rate will result in greater statistical power, but with the potential for more Type I errors. Conversely, use of the familywise error rate will produce fewer Type I errors, but will result in fewer significant differences. Researchers must determine if Type I or Type II error is of greater concern in a particular study and apply statistical tests accordingly. In most cases, one hopes to strike a balance between the two types of statistical error.

*Post hoc* comparisons are usually tested using two-tailed probabilities. Where specific contrasts are not specified in advance, it follows that directions of difference cannot be predicted. One-tailed tests can be performed for planned comparisons; however, unless evidence in favor of directional hypotheses is quite strong, it is generally statistically safer to perform two-tailed tests. Contrasts involving two-tailed tests will always be based on the absolute difference between means. One-tailed tests must result in a statistical ratio that supports the directional hypothesis.

^{†}Some statistical references use the term *experimentwise error rate* to indicate the error for all effects within an experiment, whereas *familywise error rate* is used to indicate specific sets of effects, such as main effects and interaction effects in an analysis of variance.^{4}

To illustrate the concept of multiple comparison tests, we will use a hypothetical study introduced in Chapter 20, comparing the effects of ultrasound (US), ice and friction massage for relieving pain in 44 patients with elbow tendonitis. The four group means, shown in Table 21.1A, represent the change in pain-free range of motion for three treatment groups and a control group. Eleven subjects were tested in each group. The plot of means shows how these group means are distributed.

The null hypothesis for this study states that no differences exist among the four group means:

*H*_{0}: *μ*_{1} = *μ*_{2} = *μ*_{3} = *μ*_{4}

The analysis of variance for these data, shown in Table 21.1B, is significant (*p* < .001), and it is now of interest to examine individual differences among means.

The process of testing differences among several means is fairly consistent for all multiple comparison procedures. In each test, means are first arranged in *ascending order of size*, and differences between pairs of means are obtained, as shown in Table 21.1C. This table shows the absolute differences between all pairs of means, using a triangular format. With *k* = 4, there will be a total of 4(4 − 1)/2 = 6 comparisons. The entries in the body of the table are the pairwise mean differences. Values are not entered below the diagonal to avoid redundancies. Each pairwise comparison, or contrast, is tested against a **minimum significant difference (MSD)**. If the absolute difference between a pair of means is *equal to or greater than* the minimum significant difference, then the contrast is considered significant.

If the pairwise difference is smaller than the minimum significant difference, the means are not significantly different from each other.

Calculation of the minimum significant difference is based on the error mean square, *MS*_{e}, taken from the analysis of variance, and a critical value taken from a statistical table. The *MS*_{e} reflects the degree of variance within groups (between subjects). Logically, the greater the variance within groups, the less likely we will see a significant difference between means. Critical values for the MSD are used differently, depending on the number of means being compared and the type of error rate used (per comparison or familywise). The relevant critical values are located according to the degrees of freedom associated with the error term, *df*_{e}, in the analysis of variance.^{‡} For the example we are using, the error mean square is 88.54, with 40 degrees of freedom (see Table 21.1B).

^{‡}It is not uncommon to find that the exact value for the error degrees of freedom is not listed in these tables. In that case, it is usually sufficient to refer to the closest value for degrees of freedom for an approximate critical value. To be conservative, the next lowest value for degrees of freedom should be used.

Tukey developed one of the simplest multiple comparison procedures, which he called the **honestly significant difference (HSD)** method.^{5} Tukey's procedure sets a familywise error rate, so that *α* identifies the probability that one or more of the pairwise comparisons will be falsely declared significant. Therefore, this test offers generous protection against Type I error.

Tukey's HSD test is calculated using the **studentized range statistic**, given the symbol *q*. Critical values of *q* are found in Appendix Table A.6. The *q* statistic is influenced by the overall number of means that are being compared. At the top of Table A.6, the number or "range" of means being compared is given the symbol *r*.^{§} Logically, as the number of sample means increases, the size of the difference between the largest and smallest means will also increase, even when *H*_{0} is true. The *q* statistic provides a mechanism for adjusting critical values to account for the effect of larger numbers of means.

The minimum significant difference for Tukey's HSD procedure is given by

HSD = *q*√(*MS*_{e}/*n*)

where *MS*_{e} is the mean square error, *n* is the number of subjects *in each group* (assuming equal sample sizes^{∗∗}), and *q* is taken from Appendix Table A.6 for the desired level of *α*, *df*_{e}, and the number of means, *r*. For the example we are using, *q* = 3.79 for *α* = .05, *r* = 4, and *df*_{e} = 40. Therefore,

HSD = 3.79√(88.54/11) = 3.79(2.84) = 10.75

This minimum significant difference is compared with each pairwise mean difference in Table 21.2B. Absolute differences that are equal to or greater than this value are significant. For example, the difference between the largest and smallest means is equal to 21.09. This value exceeds the minimum significant difference and is, therefore, significant. To present these results in a clear format, an asterisk denotes those differences that are significant in Table 21.2B. According to these results, the three experimental groups are different from the control, but the treatment groups are not different from each other.

A computer analysis of these data is presented in terms of *homogeneous subsets of means* (Table 21.2C). In this output, each subset (listed in the same column) represents means that are not significantly different. Means that are listed in separate columns are significantly different from one another. These results show that the mean for the control group is significantly different from the three treatment means.
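For readers who want to verify these values by hand, the HSD computation can be sketched in Python, using the tabled *q* and the ANOVA quantities from the example (the function name and pair labels are illustrative):

```python
import math

# Tukey's HSD for the tendonitis example: q = 3.79 is the tabled
# studentized range value (alpha = .05, r = 4 means, df_e = 40);
# MS_e = 88.54 and n = 11 per group come from the ANOVA.

def tukey_hsd(q, ms_error, n):
    """Minimum significant difference: HSD = q * sqrt(MS_e / n)."""
    return q * math.sqrt(ms_error / n)

msd = tukey_hsd(3.79, 88.54, 11)  # about 10.75

# A few of the absolute pairwise differences from Table 21.1C;
# a contrast is significant when it meets or exceeds the MSD.
differences = {"Control-Ice": 21.09, "Massage-Ice": 10.00, "US-Ice": 1.09}
significant = {pair: d >= msd for pair, d in differences.items()}
```

Note that Massage–Ice (10.00) falls just short of 10.75, which is why the treatment groups are not declared different from each other by this test.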

^{§}In this case, *r* stands for *range*. This symbol should not be confused with the use of the *r* for the correlation coefficient.

^{∗∗}When samples are not of equal size, the *harmonic mean* of the sample sizes is used in calculations of the minimum significant difference. The harmonic mean, *n′*, is equal to *n′* = *k*/Σ(1/*n*_{j}), where *k* is the number of groups, and *n*_{j} is the sample size for each group. For example, if there are two groups, with *n*_{1} = 10 and *n*_{2} = 5, *n′* = 2/(1/10 + 1/5) = 6.67. This procedure can be used with all multiple comparison tests.

The **Newman-Keuls (NK) test** (sometimes called Student-Newman-Keuls test) is similar to the Tukey method, except that it uses a per comparison error rate.^{5} Therefore, *α* specifies the Type I error rate for each pairwise contrast, rather than for the entire set of comparisons. Overall, then, as the number of comparisons increases, the chances of committing a Type I error are greater using this procedure than using Tukey's test.

The Newman-Keuls method is also based on the studentized range *q*; however, values of *q* are used differently for each contrast, depending on the number of *adjacent* means, *r*, within an ordered **comparison interval**. To illustrate how this is applied, consider the four sample means for the tendonitis study, ranked in ascending size order: (4) Control, (3) Massage, (1) US, (2) Ice (see Figure 21.2). If we compare the two smaller means, the comparison interval for Control → Massage includes two adjacent means. Therefore, *r* = 2 for that comparison. If we compare the largest and smallest means, the interval for Control → Ice contains four adjacent means (4-3-1-2), and so *r* = 4. Similarly, if we compare means for Massage and Ice, the comparison interval contains three adjacent means (3-1-2), so *r* = 3.

###### FIGURE 21.2

Comparison intervals for a set of four group means, arranged in size order. Based on data from Table 21.1.

Therefore, a comparison interval represents the steps between ordered means for a given comparison. As shown in Figure 21.2, with four means we will have intervals of two, three, and four means. In contrast to Tukey's approach which uses one critical difference for all comparisons, the Newman-Keuls test will use a larger critical difference as *r* increases. This adjusts for the fact that larger differences are expected with a greater range of means, even when *H*_{0} is true.

The minimum significant difference for the Newman-Keuls comparison is

MSD = *q*(*r*)√(*MS*_{e}/*n*)

where values of *q*(*r*) are obtained from Table A.6 for each comparison interval. For the example we are using, we find *q* for *α* = .05 and *df*_{e} = 40 for comparison intervals of *r* = 2, 3, and 4:

*q*(2) = 2.86, *q*(3) = 3.44, *q*(4) = 3.79

With *MS*_{e} = 88.54 and *n* = 11, we find the corresponding minimum significant differences:

MSD(2) = 2.86√(88.54/11) = 8.11, MSD(3) = 3.44√(88.54/11) = 9.76, MSD(4) = 3.79√(88.54/11) = 10.75

These minimum significant differences are compared with the appropriate mean differences in Table 21.3B. Significant differences are noted with an asterisk. For example, the difference between means for Control and Ice is 21.09, which exceeds the critical difference 10.75 for *r* = 4. Therefore, these two means are significantly different. The difference between means for Massage and Ice is 10.00, which exceeds the critical difference 9.76 for *r* = 3. These two means are also significantly different from each other. The difference between US and Ice is 1.09, which does not exceed the critical difference 8.11 for *r* = 2. These means are not significantly different. Of the six comparisons, five are significant. This test demonstrates that the three experimental groups are different from the control, and US and Ice are different from massage.

This result is also shown in Table 21.3C for subsets of means. Because the mean for the Control group is listed in a column by itself, it is different from all other means. The same is true for the mean for the Massage group. The means for US and Ice are listed in the same column, indicating that they are not different from each other.

The reader may note that the minimum difference for the Newman-Keuls test with *r* = 4 is the same as the minimum difference used for Tukey's test (in this case 10.75). The Tukey procedure uses this one minimum difference for all comparisons, whereas the Newman-Keuls test adjusts the minimum differences for smaller comparison intervals. Therefore, the minimum differences will be lower for some contrasts using the Newman-Keuls method. Consequently, the Newman-Keuls test can result in more significant differences (as it did here), and is the more powerful of the two comparisons; however, because the Newman-Keuls procedure does not control for the familywise error rate, it will produce a greater number of Type I errors than the Tukey method over the long run.
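The dependence of the Newman-Keuls critical difference on the comparison interval *r* can be sketched as follows (the *q* values are the tabled values cited above; the function name is illustrative):

```python
import math

# Newman-Keuls critical differences for the tendonitis example.
# The tabled q value (alpha = .05, df_e = 40) grows with the
# comparison interval r, the number of adjacent ordered means
# spanned by a contrast.
q_table = {2: 2.86, 3: 3.44, 4: 3.79}
ms_error, n = 88.54, 11

def nk_msd(r):
    """Minimum significant difference for a comparison interval of r means."""
    return q_table[r] * math.sqrt(ms_error / n)

# Control -> Ice spans all four ordered means (r = 4); Massage -> Ice
# spans three (r = 3); adjacent means are compared with r = 2.
for r in (2, 3, 4):
    print(r, round(nk_msd(r), 2))
```

With these thresholds, Massage–Ice (10.00) exceeds the *r* = 3 critical difference of 9.76, which is why Newman-Keuls declares it significant while Tukey's single threshold of 10.75 does not.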

The **Scheffé comparison** is the most flexible and most rigorous of the post hoc multiple comparison tests.^{6} It is based on the familiar *F*-distribution. It is a conservative test because it adopts a familywise error rate that applies to all contrasts. This provides strong protection against Type I error, but it also makes the procedure much less powerful than the other tests we have described. Scheffé has recommended that a less stringent level of significance be used, such as *α* = .10, to avoid excess Type II error.^{7}

The minimum significant difference for the Scheffé comparison is given by

MSD = √((*k* − 1)(*F*)(2*MS*_{e}/*n*))

where *k* is the total number of means involved in the set of comparisons, and *F* is the critical value for *df*_{b} and *df*_{e} obtained from Appendix Table A.3 (not the calculated value of *F* from the ANOVA). For the example we are using, *k* = 4 and *F* = 2.84 for 3 and 40 degrees of freedom at *α* = .05. Therefore,

MSD = √((3)(2.84)(2(88.54)/11)) = √137.16 = 11.71
All differences between means must meet or exceed this value to be significant. Therefore, as denoted by asterisks in Table 21.4B, this analysis results in two significant comparisons (fewer than with the Newman-Keuls or Tukey method), demonstrating the lower power associated with the Scheffé comparison. According to this test, the Control and Massage groups are not significantly different from each other, where they were considered significantly different with the other tests.^{††}

^{††}Note that the Control group is significantly different from US and Ice, but not from Massage. However, Massage is not different from US and Ice. This overlap may seem illogical. It occurs because of variance components from the different variables that are not independent of each other. This result suggests that the Scheffé comparison is not the most useful approach to understand the relationships in these data.
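The Scheffé threshold above can be checked with a short sketch (again, the function name is illustrative):

```python
import math

# Scheffé minimum significant difference for the tendonitis example.
# F = 2.84 is the tabled critical value for 3 and 40 df at alpha = .05;
# k = 4 means, MS_e = 88.54, n = 11 per group.

def scheffe_msd(k, f_crit, ms_error, n):
    """MSD = sqrt((k - 1) * F * (2 * MS_e / n))."""
    return math.sqrt((k - 1) * f_crit * (2 * ms_error / n))

msd = scheffe_msd(4, 2.84, 88.54, 11)  # about 11.71
```

Because 11.71 is larger than both the Tukey (10.75) and all the Newman-Keuls thresholds, only the two largest mean differences in this example reach significance.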

Researchers often designate specific contrasts of interest prior to data collection. These contrasts usually relate to theoretical expectations of the data. When comparisons are planned in advance and when they are relatively limited in number,^{‡‡} *a priori* tests can be used. The rationale for valid application of planned comparisons must be established before data are collected, so that the choice of specific hypotheses cannot be influenced by the data. Because the researcher is not necessarily interested in all possible contrasts, it is actually unnecessary to test the overall null hypothesis with the analysis of variance. Regardless of whether the ANOVA demonstrates a significant *F*-ratio, planned comparisons can be made.

The **Bonferroni comparison** (also called *Dunn's multiple comparison procedure*) is a planned comparison, using a familywise error rate that is the sum of the per comparison significance levels. Therefore, *α*_{FW} is dependent on the number of planned comparisons, *c*:

*α*_{FW} ≤ *c*(*α*_{PC})

For example, with four planned comparisons, each tested at *α* = .01, the probability of one or more Type I errors for the entire family of contrasts is not greater than *α* = .04. Essentially, the procedure splits *α* evenly among the set of planned contrasts, so that each contrast is tested at *α*_{FW}/*c*. Therefore, if a researcher wants an overall probability of .05 for a set of four contrasts, each individual comparison will have to achieve significance at .05/4, or *p* = .013. This process of adjusting *α*, called **Bonferroni's adjustment (or correction)**, is used as a protection against Type I error.

The Bonferroni test is based on Student's *t*-distribution, with adjustments made for the number of contrasts being performed within a set of data. To facilitate these adjustments, a special table of critical values has been developed for Bonferroni's *t* (given the symbol *t*(*B*)).^{9}

The minimum significant difference for the Bonferroni test can be computed using

MSD = *t*(*B*)√(2*MS*_{e}/*n*)

where *t*(*B*) is taken from Appendix Table A.7 for *α*_{FW}, *df*_{e}, and *c*, where *c* is the total number of comparisons in the experiment. Continuing with the example we have been using, for six comparisons performed at *α*_{FW} = .05, with *df*_{e} = 40, we find *t*(*B*) = 2.77. Therefore,

MSD = 2.77√(2(88.54)/11) = 2.77(4.01) = 11.11

All pairwise differences are compared with this one minimum significant difference, as shown in Table 21.5. In this case, three of the six comparisons are significant. According to these results, the three intervention groups are different from the control.
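The Bonferroni minimum significant difference follows the same pattern as the earlier tests; a minimal sketch using the values from the example:

```python
import math

# Bonferroni (Dunn) minimum significant difference for the example.
# t(B) = 2.77 is the tabled Bonferroni t for alpha_fw = .05, df_e = 40,
# and c = 6 comparisons; MS_e = 88.54 and n = 11 as before.

def bonferroni_msd(t_b, ms_error, n):
    """MSD = t(B) * sqrt(2 * MS_e / n)."""
    return t_b * math.sqrt(2 * ms_error / n)

msd = bonferroni_msd(2.77, 88.54, 11)  # about 11.11
```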

^{‡‡}Glass defines a small number of comparisons as less than *k*(*k* − 1)/4.^{8}

Multiple comparison procedures are applicable to all analysis of variance designs. So far, we have described their use following an analysis with only one independent variable. When multifactor experiments are analyzed, the multiple comparison procedures can be used to compare means for main effects and interaction effects.

To illustrate this application, let us refer back to a study presented in Chapter 20, involving the comparison of stretch and knee position for increasing ankle range of motion. Stretch (Factor A) had three levels: prolonged, quick and control. Knee position (Factor B) had two levels: flexion and extension. This design is shown in Figure 21.3. Ten subjects were tested in each of the six treatment combinations. Recall that the marginal means represent main effects for each independent variable separately. The six cells of the design (*A*_{1}*B*_{1} through *A*_{3}*B*_{2}) represent all combinations of the two independent variables, or the interaction means.

The outcome of the analysis of variance for this study is shown in Table 21.6A. The main effect of stretch is significant, as is the interaction effect. In practice, we would usually ignore the main effects because of the significant interaction, and proceed to analyze the six individual cell means. For purposes of illustration, however, we will look at the main effect of stretch using a multiple comparison procedure. If the variable of knee position had been significant, we would not have to perform a multiple comparison because it has only two levels. Therefore, a significant effect could be interpreted by simply looking at the marginal means, as with a *t*-test.

The analysis of a significant main effect requires examination of differences among marginal means. For the main effect of stretch, we compare the marginal means for the prolonged, quick and control conditions. The application of multiple comparison tests to marginal means is the same as in previous examples, except that *n* must reflect the total number of subjects contributing to each mean in a contrast. Therefore, if *n* = 10 for each cell in the design, then *n* = 20 for each marginal mean for stretch (see Fig. 21.3). The values of *MS*_{e} and *df*_{e} used for calculations are taken from the analysis of variance summary table. In this case, *MS*_{e} = 12.91 and *df*_{e} = 54 (we will use *df*_{e} = 60 for locating tabled values).

To apply Tukey's HSD to these data, we calculate the minimum significant difference:

HSD = 3.40√(12.91/20) = 2.73

with *q* taken from Table A.6 for *α* = .05, *df*_{e} = 60, and *r* = 3. Note that *n* = 20 represents the pooled sample size for each marginal mean. The pairwise differences between the marginal means are shown in Table 21.6C. Differences that exceed 2.73 are significant. Results in Table 21.6C and D show that prolonged stretch is significantly different from quick stretch and the control, but that the latter two means are not different from each other.

When an interaction effect is significant, multiple comparison tests are usually performed on pairwise contrasts of individual cell means. Formulas are used exactly as they were for the one-way design. In this example, we would be comparing six means.

For the means in Figure 21.3, if we choose to analyze all pairwise differences with *k* = 6, we will obtain 6(6 − 1)/2 = 15 comparisons. To use Tukey's HSD as an example, we calculate the minimum significant difference:

HSD = 4.16√(12.91/10) = 4.73

with *q* obtained from Table A.6 for *α* = .05, *df*_{e} = 60, and *r* = 6. Note that *n* = 10 reflects the sample size for each of the six individual cell means.

The mean differences, shown in Table 21.7, must exceed this minimum significant difference to be considered significant. Results demonstrate that range of motion achieved with prolonged stretch with knee extension (*A*_{1}*B*_{2}) is significantly greater than with all other treatment combinations. In addition, prolonged stretch with knee flexion (*A*_{1}*B*_{1}) is greater than quick stretch and control with knee extension. These effects are illustrated in Figure 20.6 in Chapter 20.
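The two applications of the HSD formula in this section differ only in the *q* value and the sample size per mean; a sketch of both, using the tabled *q* values cited above:

```python
import math

# Tukey's HSD applied to the two-way stretch example (MS_e = 12.91,
# df_e = 60 for tabled values). For marginal means n is the pooled
# sample size (20); for individual cell means n = 10.

def tukey_hsd(q, ms_error, n):
    """Minimum significant difference: HSD = q * sqrt(MS_e / n)."""
    return q * math.sqrt(ms_error / n)

ms_error = 12.91
hsd_marginal = tukey_hsd(3.40, ms_error, 20)  # r = 3 marginal means
hsd_cells = tukey_hsd(4.16, ms_error, 10)     # r = 6 cell means
```

The larger *q* and smaller *n* for the cell-mean comparisons both push the critical difference upward, which is why interaction contrasts require larger differences to reach significance than main-effect contrasts.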

Interpretation of pairwise differences for interactions will often be more meaningful by limiting contrasts to row or column effects, eliminating comparisons that move diagonally within the design. In other words, we would not be interested in the contrast of prolonged stretch in flexion with the other forms of stretch in extension, which are diagonal comparisons (see Figure 21.3). This type of comparison is confounded, because it involves different levels of both variables. We are more interested in the contrasts across *A*_{1}, across *A*_{2}, and across *A*_{3}, and the three contrasts within *B*_{1} and within *B*_{2}. This would result in a total of 9 contrasts, rather than 15. When using tests such as Bonferroni's *t*, where the number of comparisons is the basis for adjusting critical values, this process can significantly improve statistical power as well as clarify explanations.

The standard *post hoc* multiple comparisons procedures just described are not generally run for repeated measures analyses. Because repeated measures involve within-subject comparisons, the multiple comparison procedures do not fit logically, as they are based on overall group differences. Therefore, the paired *t*-test has been used as a reasonable approach for looking at differences between pairs of means within a repeated measures design.^{10,11} Each pairwise comparison is entered as a difference score, and the analysis will determine which means are significantly different. For example, let us reconsider the hypothetical study described in Chapter 20 that looked at elbow flexor strength in three forearm positions for nine subjects. The mean for pronation was 17.33, for neutral 27.56, and for supination 29.11 pounds. The results of the repeated measures analysis of variance are shown in Table 21.8A, indicating that a significant difference existed among the three forearm positions. Therefore, a *post hoc* multiple comparison test is warranted to compare the three means.

Table 21.8 presents the results of the paired *t*-test for three pairwise comparisons in this example. The differences for neutral-pronation and pronation-supination are significant (*p* = .000), but the difference for neutral-supination is not (*p* = .127). This analysis presents a problem, however, in terms of familywise error rate. Because several analyses are being run on the same sample, we risk inflating the value of *α* if each test is performed at the same .05 criterion. Therefore, this approach requires the use of the **Bonferroni adjustment**, whereby the overall value for *α* is divided by the number of comparisons. For instance, with three comparisons *α*_{FW} = .05/3 = .017. This means that the *p* value for each individual comparison must be .017 or less to be considered significant. In our current example, with the two significant effects at *p* = .000, we have clearly achieved this criterion. However, if we had found a difference for any one comparison at .02, for example, where it would typically be considered significant at *α* = .05, we would not consider it different for this multiple comparison.
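This decision rule is straightforward to express in code; a sketch using the *p* values reported above (the comparison labels are ours):

```python
# Bonferroni adjustment for the repeated measures example: with three
# paired t-tests, each p value must reach alpha_fw / 3 (about .017)
# to be declared significant.

alpha_fw = 0.05
p_values = {
    "neutral-pronation": 0.000,
    "pronation-supination": 0.000,
    "neutral-supination": 0.127,
}
threshold = alpha_fw / len(p_values)  # .05 / 3
significant = {pair: p <= threshold for pair, p in p_values.items()}
```

Under this rule a comparison with *p* = .02, nominally significant at *α* = .05, would not be declared significant, exactly as described in the text.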

An investigator is interested in comparing three treatments: experimental treatment A, experimental treatment B, and control treatment C, using three independent groups. He performs a one-way analysis of variance on the data, and finds a significant *F* test at *p* = .05. He therefore concludes that there is a significant difference among the three means and proceeds to perform a multiple comparison test to determine where those differences lie. But when he gets the results of the multiple comparison test, he finds that none of the means are significantly different! How can this be?

Now, in another laboratory, three different researchers are conducting three different experiments, each comparing two treatments using an unpaired *t*-test. One is comparing A with B, one is comparing B with C, and the third is comparing A with C. The first two of these researchers find no significant difference between their groups. The third researcher, however, does find a significant difference at *p* = .05. He is now able to report that his experimental treatment A worked.

Why should the investigator who analyzed all three treatments at once be unable to find a significant difference when the investigator who ran a single experiment can claim a successful outcome?

In the first case the investigator proposed a question that required the comparison of two treatments with respect to a control. His hypothesis is, "There will be a difference among these three groups." The comparison of the three groups is an important part of the rationale for this study, to account for the potential theoretical connections of these treatments. The ANOVA looks at the entire sample as part of this analysis, partitioning the variance across all three groups. In this case, the overall variance showed a significant effect, but this effect could not be attributed to any one specific comparison. Sometimes the differences among the individual groups are not large enough for any single contrast to reach significance, even when the overall *F* test does.

The investigator who studied the single comparison, on the other hand, is only concerned with the variance of two groups, and can narrow his statistical search for a difference. His hypothesis is, "There will be a difference between these two groups." And so there was!

Adapted from Dallal GE. Multiple comparison procedures. Available at: http://www.tufts.edu/∼gdallal/mc.htm. Accessed October 29, 2007.

Multiple comparison tests are most often used in studies where the independent variable is qualitative or nominal, and where the researcher's interest focuses on determining which categories are significantly different from the others. When an independent variable is quantitative, the treatment levels no longer represent categories, but differing amounts of something, such as age, duration or intensity of a modality, dosage of a drug, or time intervals for repeated testing. When the levels of an independent variable are ordered along a continuum, the researcher is often interested in examining the shape of the response rather than just differences between levels. This approach is called a **trend analysis**.

The purpose of a trend analysis is to find the most reasonable description of continuous data based on the number of turns, or "ups and downs" seen across the levels of the independent variable. For example, if we wanted to study the changes that occur in strength as one ages, we might study 10 blocks of subjects, each representing a different age category from 8 to 80 years old. A hypothetical plot of such data is shown in Figure 21.4. A multiple comparison of means will not tell us about the directions of change across age, but a trend analysis will.

Basically, trends are classified as either linear or nonlinear. In a **linear trend**, all data rise or fall at a constant rate as the value of the independent variable increases. This trend is characterized by a straight line, as shown in Figure 21.5A. For example, we might use this function to represent the relationship between height and age in children. As a child grows older, height tends to increase proportionally.

A *nonlinear trend* demonstrates "bends" or changes in direction. A **quadratic trend**, shown in Figure 21.5B, demonstrates a single turn upward or downward, creating a concave shape to the data. This means that following an initial increase or decrease in the dependent variable, scores change in direction or rate. Learning curves can be characterized as quadratic. Performance generally increases at a sharp rate through early trials and then plateaus.

Higher order nonlinear trends are more complex and are often difficult to interpret. As shown in Figure 21.5C and D, a *cubic trend* involves a second change of direction, and a *quartic trend* a third turn. As the number of levels of the independent variable increases, the number of potential trend components will also increase. There can be a maximum of *k* – 1 turns, or trend components, within any data set.

The curves in Figure 21.5 are examples of pure trends. Real data seldom conform to these patterns exactly. Even with data that represent true trends, chance factors will produce dips and variations that may distort the observed relationship. The purpose of a trend analysis is to describe the overall tendency in the data using the least number of trend components possible. Some data can be characterized by a single trend; others demonstrate more than one pattern within a single data set. The hypothetical data for strength and age illustrate this possibility (see Figure 21.4). The portion of the data from 8 to 20 years shows that individuals tend to get stronger as they grow within this age range. We can see the quadratic component within this curve after age 20. Strength appears to plateau at age 30, after which a gradual dropoff is evident.
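The distinction between linear and quadratic patterns can be illustrated with a small sketch. The numbers below are invented to mimic the rise, plateau, and decline shape of Figure 21.4; the comparison simply shows that a polynomial with one "turn" leaves far less residual variation than a straight line:

```python
import numpy as np

# Invented group means mimicking Figure 21.4: strength rises through early
# adulthood, plateaus around age 30, then gradually declines.
ages = np.array([8, 16, 24, 32, 40, 48, 56, 64, 72, 80], dtype=float)
strength = np.array([20, 38, 52, 58, 57, 54, 50, 45, 40, 34], dtype=float)

def residual_ss(degree):
    """Sum of squared residuals after fitting a polynomial of the given degree."""
    coefs = np.polyfit(ages, strength, degree)
    return float(np.sum((strength - np.polyval(coefs, ages)) ** 2))

# A straight line misses the bend in the data; adding a quadratic term
# captures the single turn and sharply reduces the leftover variation.
print("linear residual SS:   ", round(residual_ss(1), 1))
print("quadratic residual SS:", round(residual_ss(2), 1))
```

This informal curve fitting describes shape only; the formal test of each trend component is carried out within the ANOVA, as described below.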

Trends are tested for significance as part of an analysis of variance. The mathematical basis for analyzing trends is beyond the scope of the present discussion. Most statistical computer packages are able to run a trend analysis.^{§§}

The results of trend analyses are listed as part of an ANOVA summary table. An example of this type of output for an independent samples test is given in Table 21.9, based on the hypothetical age and strength data in Figure 21.4. The top portion of the table shows how the standard analysis of variance is presented. In the bottom portion, the trend analysis is added. Note that the between-groups sum of squares for the effect of age has been partitioned into a linear trend and a quadratic trend. Because there are 10 measurement intervals, we have the potential for 9 trend components; however, testing beyond the quadratic component usually yields uninterpretable results. Therefore, variance attributable to all higher order trends is included in the error term (called deviation here).

Each specific trend component is tested by an *F*-ratio, calculated using the mean square for that trend and the error term. In this example, only the quadratic trend is significant. When a trend component is statistically significant, subjective examination of graphic patterns of the data is usually sufficient for further interpretation.
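The single-df contrast arithmetic behind such a table can be sketched in Python. The data below are simulated with a built-in quadratic pattern (they are not the Figure 21.4 values), and the orthogonal polynomial coefficients are generated with a QR factorization rather than taken from a published table. The sum of squares for each trend is SS = n(Σc_j x̄_j)² / Σc_j², and its *F*-ratio divides that single-df SS by the pooled error term:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulated data: 10 equally spaced levels, n = 5 per group, with a
# built-in rise-then-decline (quadratic) pattern plus random noise.
k, n = 10, 5
levels = np.arange(k, dtype=float)
true_means = 60 - 1.5 * (levels - 4.5) ** 2
data = true_means[:, None] + rng.normal(0, 4, size=(k, n))

# Orthogonal polynomial contrast coefficients, built with QR so the
# linear and quadratic contrasts are exactly orthogonal for these levels.
V = np.vander(levels, 3, increasing=True)      # columns: 1, x, x^2
Q, _ = np.linalg.qr(V)
linear, quadratic = Q[:, 1], Q[:, 2]

means = data.mean(axis=1)
ms_error = data.var(axis=1, ddof=1).mean()     # pooled within-group MS
df_error = k * (n - 1)

def trend_F(c):
    """F-ratio for a single-df trend contrast: SS = n*(c @ means)^2 / (c @ c)."""
    ss = n * (c @ means) ** 2 / (c @ c)
    return ss / ms_error

for name, c in [("linear", linear), ("quadratic", quadratic)]:
    F = trend_F(c)
    p = stats.f.sf(F, 1, df_error)
    print(f"{name:9s} F(1,{df_error}) = {F:8.2f}, p = {p:.4f}")
```

Because the simulated pattern is symmetric about its peak, the quadratic contrast captures nearly all the between-groups variation while the linear contrast captures essentially none, mirroring the outcome described for Table 21.9.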

^{§§}Some computer packages will refer to trend analyses as *orthogonal decomposition* or *orthogonal polynomial contrasts*.

Two important limitations should be considered when interpreting trend analyses. First, the number and spacing of intervals between levels of the independent variable can make a difference to the visual interpretation of the curve. Obviously, with only two levels of an independent variable no trend can be established. A linear trend requires a minimum of three points, a quadratic trend a minimum of four points, and so on. With larger spans in the quantitative variable, more intervals may be necessary.

Most investigators try to use equally spaced intervals to achieve consistency in the interpretation. Others will purposefully create unequal intervals to best represent the samples of interest. For instance, trends that are established over time may involve some intervals of hours and others of days. Most computer packages that perform trend analyses will accommodate equal or unequal intervals, but distances between unequal intervals must be specified.

The second caution for interpreting trend analysis is to avoid extrapolating beyond the upper and lower limits of the selected intervals. For example, based on Figure 21.4, if we had tested only individuals between 20 and 80, we might conclude that strength declines linearly with age. Conversely, if we looked only at ages 8 through 20, we might conclude that strength increases linearly with age. By limiting the range of intervals we would have missed the quadratic function that more accurately describes the relationship between strength and age across the lifespan. Therefore, the nature of the relationship between the independent and dependent variables should be examined within and across the ranges that will allow the most complete interpretation.

There are no widely accepted criteria for choosing one multiple comparison test over another, and the selection of a particular procedure is often made either arbitrarily or on the basis of available software; however, two basic issues should guide the choice of a multiple comparison procedure.

The first issue relates to the decision to conduct either planned or unplanned contrasts. This decision rests with the researcher during the planning stages of the study, in response to theoretical expectations. With planned comparisons, the researcher asks, "Is *this* difference significant?" With *post hoc* tests the question shifts to, *"Which* differences are significant?" When the researcher is interested in exploring all possible combinations of variables, unplanned contrasts should be used.

The second issue concerns the relative importance of Type I and Type II error. Each multiple comparison test controls these errors differently, depending on its use of per comparison or familywise error rates. Of the three *post hoc* comparisons described here, the Newman-Keuls test is the most powerful. Scheffe's comparison gives the greatest control over Type I error, but at the expense of power. Researchers often prefer Tukey's HSD because it offers both reasonable power and protection against Type I error. The power of the Newman-Keuls procedure comes from its use of a different critical value for each step size between the ordered means, but its reliance on a per comparison error rate increases the risk of Type I error.
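The mechanics of Tukey's HSD can be sketched briefly. The scores below are made up for illustration (three groups of n = 10, with group C shifted well above the others); the critical value q comes from the studentized range distribution, here via scipy:

```python
import numpy as np
from scipy import stats

# Made-up scores for three groups (n = 10 each); group C is clearly higher.
base = np.array([48, 52, 50, 49, 51, 50, 47, 53, 50, 50], dtype=float)
groups = {"A": base, "B": base + 1, "C": base + 10}
k, n = 3, 10
df_error = k * (n - 1)

# Pooled error term, as it would come from the one-way ANOVA.
ms_error = np.mean([np.var(g, ddof=1) for g in groups.values()])

# Tukey's HSD: a pair of means differing by more than q * sqrt(MS_error / n)
# is significant, where q is the studentized range critical value.
q_crit = stats.studentized_range.ppf(0.95, k, df_error)
hsd = q_crit * np.sqrt(ms_error / n)

means = {name: g.mean() for name, g in groups.items()}
for a, b in [("A", "B"), ("A", "C"), ("B", "C")]:
    diff = abs(means[a] - means[b])
    verdict = "significant" if diff > hsd else "not significant"
    print(f"{a} vs {b}: |difference| = {diff:5.2f} vs HSD = {hsd:.2f} -> {verdict}")
```

A more conservative procedure such as Scheffe's would substitute a larger critical difference, and the Newman-Keuls test would shrink q for means that are fewer steps apart in the rank ordering, which is exactly the trade-off between Type I and Type II error described above.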

Researchers must examine the research question to determine which multiple comparison test is most appropriate in terms of the research design. These decisions should be based on the research question, not on which test is most likely to find significant differences. The decision to run planned or unplanned comparisons and simple or complex contrasts should be made before the data are analyzed. Other than these rather straightforward criteria, when there is no overriding concern for either Type I or Type II error, there may be no obvious choice for a specific test. The researcher is obliged to consider the rationale for comparing treatment conditions or groups and to justify the basis for making these comparisons.

*J Am Statist Assoc* 1955;50:1096–1121.


*Design and Analysis: A Researcher's Handbook* (4th ed.). Englewood Cliffs, NJ: Prentice Hall, 2004.

*Biometrika* 1953;40:87–104.

*Statistical Methods in Education and Psychology* (3rd ed.). Boston: Allyn and Bacon, 1996.

*J Educ Stat* 1980;5:269–287.


*Using SPSS for Windows: Analyzing and Understanding Data.* Upper Saddle River, NJ: Prentice Hall, 1997.