In previous chapters we have presented several statistical tests that are based on certain assumptions about the parameters of the population from which the samples were drawn. These **parametric tests** require that the assumptions of normality and homogeneity of variance be met to a reasonable degree for the analysis to be valid. In this chapter, we present a set of statistical procedures classified as **nonparametric**, which test hypotheses for group comparisons without normality or variance assumptions. For this reason, these methods are sometimes referred to as *distribution-free tests.*

Nonparametric methods are similar to parametric methods in that both test hypotheses and both involve the use of a statistical ratio or test statistic, with an associated probability. Similarly, the outcomes of these tests are evaluated according to a predetermined alpha level of significance. In this chapter we describe five nonparametric procedures that are the most commonly used analogs of the parametric *t*-test and *F*-test: the Mann-Whitney U-test, sign test, Wilcoxon signed-ranks test, Kruskal-Wallis one-way analysis of variance by ranks, and the Friedman two-way analysis of variance by ranks (see Table 22.1). Although these tests are easily computed with a hand-held calculator, they are also included in most statistical packages for computer analysis. We will present both the hand calculations and sample computer output.

Comparison | Parametric Test | Nonparametric Test |
---|---|---|
Two independent groups | Unpaired *t*-test | Mann-Whitney *U* test (*U*) |
Two related scores | Paired *t*-test | Sign test (*x*); Wilcoxon signed-ranks test (*T*) |
Three or more independent groups | One-way analysis of variance (*F*) | Kruskal-Wallis analysis of variance by ranks (*H* or χ^{2}) |
Three or more related scores | One-way repeated measures analysis of variance (*F*) | Friedman two-way analysis of variance by ranks (χ^{2}_{r}) |

Two major criteria are generally adopted for choosing a nonparametric test over a parametric procedure. The first is that assumptions of population normality and homogeneity of variance cannot be satisfied. Many clinical investigations involve variables that have not been studied sufficiently to support these assumptions. In all likelihood, most pathological conditions are represented by skewed distributions rather than symmetrical ones. In addition, small clinical samples and samples of convenience cannot automatically be considered representative of larger normal distributions.

The second criterion for choosing a nonparametric test is that data are measured on the nominal or ordinal scale. Many assessment tools have been developed around these scales. Nonparametric tests provide an objective mechanism for supporting statistical hypotheses when these levels of measurement are used.

Although nonparametric tests require fewer statistical assumptions than parametric procedures, they still place some restrictions on data. Some type of randomization procedure should be used in forming groups. This allows the researcher to make assumptions about the equality of groups before the independent variable is administered. In addition, the nonparametric tests described in this chapter apply to data that are at least at the ordinal level (see Chapter 25 for tests appropriate to nominal level data); that is, the variable of interest has an underlying continuous distribution that can be ranked, even if it cannot be measured quantitatively. For instance, strength can be measured using discrete manual muscle test grades on an ordinal scale, even though strength truly exists along a continuum. Ordinal scales are often used to measure relative changes in clinical variables such as sitting balance, function, or sensation. Analysis of these types of variables represents the most appropriate use of nonparametric statistics.

The major disadvantage of nonparametric tests is that they do not accommodate complex clinical designs. There are many newer tests, however, that have been developed to allow the application of regression procedures or tests of interaction effects. These are beyond the scope of this book, but can be found in other recent texts.^{1,2}

Many researchers prefer to use parametric tests because they are generally more powerful. Nonparametric tests are less sensitive than parametric tests because most of them involve ranking scores rather than comparing precise metric changes. Nonparametric and parametric methods have been compared on the basis of their *power-efficiency,* which is a test's relative ability to identify significant differences for a given sample size. Generally, an increase in sample size is needed to make a nonparametric test as powerful as a parametric test. For instance, a nonparametric test may require a sample size of 50 to achieve the same degree of power as a parametric test with 30 subjects. This relationship can be expressed as a percentage that indicates the relative power-efficiency of the nonparametric test. For example, if power-efficiency is 60%, then with equal sample sizes, the nonparametric test is 60% as powerful as the parametric test. In other words, to achieve equal power with the nonparametric test, we would need 10 subjects for every 6 used with the parametric procedure.

With equal sample sizes, nonparametric tests will generally be less powerful than their parametric counterparts; however, with larger samples this discrepancy is minimized. Most of the nonparametric tests described here can achieve approximately 65% to 95% power-efficiency in comparison to their most powerful parametric analogs.^{3} These figures apply to calculations based on comparisons of normal populations. With very small samples, such as six subjects or fewer, many nonparametric tests will be as powerful as their parametric counterparts. With larger nonnormal populations, the nonparametric statistics may actually be more powerful.^{4} As power is an issue only when significant results are not obtained, a researcher need not be concerned with the relative power of nonparametric tests when the null hypothesis is rejected.

Most nonparametric tests are based on rank ordering of scores. The procedure for ranking will be illustrated using the two samples shown in Table 22.2. Scores are always ranked from smallest to largest, with the rank of 1 assigned to the smallest score. Algebraic values are taken into account, so that the lowest ranks are assigned to the largest negative values, if any. The highest rank will equal *n*. As shown in Sample A, the rank of 1 is assigned to the smallest score (−3), the rank of 2 goes to the next smallest (0), and so on, until the rank of 8 is assigned to the highest score (16).

SAMPLE A (n = 8) | Rank | SAMPLE B (n = 8) | Rank |
---|---|---|---|
6 | 4 | 8 | 3 |
2 | 3 | 11 | 5 |
8 | 5 | 3 | 1.5 |
9 | 6 | 17 | 8 |
−3 | 1 | 11 | 5 |
0 | 2 | 3 | 1.5 |
16 | 8 | 11 | 5 |
12 | 7 | 12 | 7 |
When two or more scores in a distribution are tied, they are each given the same rank, which is the average of the ranks they occupy. For instance, in Sample B, there are two scores with the smallest value (3). They occupy ranks 1 and 2. Therefore, they are each assigned the average of their ranks: (1 + 2)/2 = 1.5. The next highest value (8) receives the rank of 3, as the first two ranks are filled. The next highest value is 11, which appears three times. As we have already filled ranks 1, 2 and 3, we average the next three ranks: (4 + 5 + 6)/3 = 5. Each score of 11 is assigned the rank of 5. Having filled the first 6 rank positions, the last two values in the distribution are assigned ranks 7 and 8.
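The averaging of tied ranks described above can be sketched in Python. This is a minimal illustration of the procedure; in practice a library routine such as `scipy.stats.rankdata` performs the same computation:

```python
def rank_with_ties(scores):
    """Rank scores from smallest to largest, assigning tied scores
    the average of the ranks they jointly occupy."""
    ordered = sorted(scores)
    ranks = []
    for x in scores:
        first = ordered.index(x) + 1    # first (1-based) rank position occupied by x
        count = ordered.count(x)        # how many scores are tied at this value
        occupied = range(first, first + count)
        ranks.append(sum(occupied) / count)
    return ranks

# Sample B from Table 22.2
print(rank_with_ties([8, 11, 3, 17, 11, 3, 11, 12]))
# [3.0, 5.0, 1.5, 8.0, 5.0, 1.5, 5.0, 7.0]
```

The two scores of 3 share ranks 1 and 2 (average 1.5), and the three scores of 11 share ranks 4, 5, and 6 (average 5), exactly as in the worked example.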

The **Mann-Whitney U test** is one of the more powerful nonparametric procedures, designed to test the null hypothesis that two independent samples come from the same population.^{∗} This test is analogous to the parametric *t*-test for independent samples. Like the unpaired *t*-test, the *U* test does not require that groups be of the same size. It is, therefore, an excellent alternative to the *t*-test when parametric assumptions are not met.

A researcher is interested in the effect of body position on a person's ability to relax, as measured by EMG biofeedback from the frontalis muscle. To study this question, 11 subjects are randomly assigned to two groups in a pretest-posttest design, with one group positioned supine, the other sitting. Results are recorded as changes in microvolt activity. The researcher hypothesizes that the positions will facilitate different levels of relaxation (a nondirectional hypothesis).

Hypothetical data for this example are given in Table 22.3A. The first step is to combine both groups and rank all the scores in order of increasing size. The sum of the ranks assigned to each group is designated *R*_{1} or *R*_{2}. Under the null hypothesis, we would expect the groups to be equally distributed with regard to high and low ranks, and the mean of the ranks would be equal for both groups. Any differences between the ranks should be the result of chance. The test will determine if the difference between the sums of ranks is sufficiently large to be considered significant. An alternative hypothesis can be directional or nondirectional.

^{∗}Some statisticians prefer to use the Wilcoxon rank sum test to test the difference between two independent samples. This test is equivalent to the Mann-Whitney *U*-test.

The test statistic, *U*, is calculated using each group as a reference, as follows:

*U*_{1} = *n*_{1}*n*_{2} + *n*_{1}(*n*_{1} + 1)/2 − *R*_{1}

*U*_{2} = *n*_{1}*n*_{2} + *n*_{2}(*n*_{2} + 1)/2 − *R*_{2}

where *n*_{1} is the smaller sample size, *n*_{2} is the larger sample size, and *R*_{1} and *R*_{2} are the sums of the ranks for the groups. Designation of *n*_{1} or *n*_{2} is arbitrary if groups are of equal size. Obviously, these formulas will yield different values of *U*. For example, using calculations shown in Table 22.3B, we obtain *U*_{1} = 27, with Group 1 as the reference group. Using Group 2 as the reference group, we obtain *U*_{2} = 3. We can show that these values are mathematically related as

*U*_{1} = *n*_{1}*n*_{2} − *U*_{2}

and vice versa. For example, for the data in Table 22.3, we can demonstrate this relationship:

*U*_{1} = (5)(6) − 3 = 27

*U*_{2} = (5)(6) − 27 = 3

*The smaller of these two values is assigned to the test statistic U.* In this case, then, *U* = 3.

Critical values of *U* are given in Appendix Table A.8 for one- and two-tailed tests at several levels of significance. These values are compared with the *smaller value* of either *U*_{1} or *U*_{2}. The appropriate critical values are located in the table for *n*_{1} and *n*_{2}. The calculated value of *U* must be *equal to or less than* the tabled value to be significant. (*Note:* This is opposite to the way we have used critical values with parametric tests.)

For the current example, at *α*_{2} = .05, with *n*_{1} = 5 and *n*_{2} = 6, the critical value of *U* is 3. Because the calculated value, *U* = 3, is equal to this critical value, we can reject *H*_{0}. Our conclusion is then based on visual examination of the mean ranks, which shows that greater relaxation (higher mean rank) is attained in the supine position (Table 22.3E).
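The calculation of *U* can be sketched in Python. The rank sums *R*_{1} = 18 and *R*_{2} = 48 used here are back-calculated from the *U* values reported in the text, since Table 22.3B is not reproduced:

```python
def mann_whitney_u(n1, n2, r1, r2):
    """Compute U1 and U2 from the sample sizes and rank sums.
    The test statistic U is the smaller of the two values."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return u1, u2

# Rank sums consistent with the example (n1 = 5, n2 = 6)
u1, u2 = mann_whitney_u(5, 6, 18, 48)
print(u1, u2, min(u1, u2))  # 27.0 3.0 3.0
# The two values are complementary: U1 + U2 = n1 * n2
assert u1 + u2 == 5 * 6
```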

When sample size exceeds 25, Table A.8 cannot be used. In this situation, the value of *U* is converted to *z* and tested against the standard normal distribution:

*z* = (*U* − *n*_{1}*n*_{2}/2) / √[*n*_{1}*n*_{2}(*n*_{1} + *n*_{2} + 1)/12]

Even though the present example does not warrant it, we have used the data to illustrate this application in Table 22.3C. In this formula it does not matter if *U*_{1} or *U*_{2} is used. The absolute value of *z* will be the same either way.

Critical values of *z* (in Appendix Table A.1) are used to determine if this ratio is significant.^{†} For a two-tailed test at .05 (we proposed a nondirectional hypothesis), *z* = 1.96. Our calculated value exceeds this critical value, and the null hypothesis is rejected. This outcome agrees with the results obtained using Table A.8. We can also determine the exact probability associated with the test by finding the tail probability for *z* in Table A.1. For *z* = 2.19, the one-tailed probability is .0143. Because we have proposed a nondirectional hypothesis, we double this value for a two-tailed test. Therefore, *p* = .0286. These findings are shown in the computer output in Table 22.3E.
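The conversion can be sketched as follows, using *U* = 3 with *n*_{1} = 5 and *n*_{2} = 6 from the example:

```python
import math

def u_to_z(u, n1, n2):
    """Normal approximation for the Mann-Whitney U statistic,
    used when n exceeds the range of the exact table."""
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / sd_u

z = u_to_z(3, 5, 6)
print(round(abs(z), 2))  # 2.19
```

Using *U*_{1} = 27 instead of *U*_{2} = 3 flips only the sign of *z*, which is why the absolute value is the same either way.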

^{†}For one-tailed tests at .05 and .01, the critical values are 1.645 and 2.326, respectively. For two-tailed tests, these critical values are 1.96 and 2.576, respectively. The calculated value of *z* must be *greater than or equal to* the critical value to be considered significant.

When three or more groups are compared (*k* ≥ 3), a nonparametric analysis of variance is appropriate for the same reasons that an *F*-test is used with parametric data. The **Kruskal-Wallis one-way analysis of variance by ranks** is a nonparametric analog of the one-way analysis of variance. It is a powerful alternative to the *F*-test when variance and normality assumptions for parametric tests are not met. It is also the most appropriate way to handle ordinal level data when more than two groups are compared. With *k* = 2, this test is equivalent to the Mann-Whitney *U*-test. Multiple comparison procedures can also be applied.

We want to study the effect of three modalities for relieving chronic low back pain. We randomly assign 14 subjects (*N* = 14) to receive ice (*n* = 5), hot pack (*n* = 5), or ultrasound (*n* = 4). Pain is measured on a visual analog scale from 0 mm (pain-free) to 100 mm (severe pain). Scores are recorded as the change in level of pain from pretreatment to posttreatment levels.

Hypothetical data are reported in Table 22.4A. The procedures for the Kruskal-Wallis ANOVA are similar to those used for the Mann-Whitney *U*-test. The first step is to combine data for all groups and rank scores from the smallest to the largest. The smallest score receives the rank of 1, and the largest score is assigned the rank of *N*. Ties are assigned average ranks.

The ranks are then summed for each group separately, as shown in Table 22.4A. If the null hypothesis is true, we would expect an equal distribution of ranks under the three conditions.

The test statistic for the Kruskal-Wallis test is *H*, calculated according to

*H* = [12/(*N*(*N* + 1))] Σ(*R*^{2}/*n*) − 3(*N* + 1)

where *N* is the number of cases in all samples combined, *n* is the number of cases in each individual sample, and *R* is the sum of ranks for each individual sample. This calculation is illustrated in Table 22.4B. For this example, *H* = 7.243.

The *H* statistic is tested using the chi-square distribution with *k* – 1 degrees of freedom (Table A.5).^{‡} With three groups we will have 2 *df*. Therefore, we test *H* against the critical value of 5.99. Our calculated value of *H* = 7.243 is significant, and we can reject *H*_{0}. The output for this analysis is shown in Table 22.4D.
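A minimal sketch of the *H* calculation, applied to small hypothetical tie-free data (not the chapter's Table 22.4 values, which are not reproduced here); the formula omits the tie correction, as the chapter treats that adjustment separately:

```python
def kruskal_wallis_h(*groups):
    """Kruskal-Wallis H from raw scores: pool the data, rank the
    pooled scores (average ranks for ties), then apply
    H = 12/(N(N+1)) * sum(R_i**2 / n_i) - 3(N+1)."""
    pooled = [x for g in groups for x in g]
    ordered = sorted(pooled)

    def rank(x):
        first = ordered.index(x) + 1
        count = ordered.count(x)
        return (2 * first + count - 1) / 2  # average of the occupied ranks

    n_total = len(pooled)
    sum_term = sum(sum(rank(x) for x in g) ** 2 / len(g) for g in groups)
    return 12 / (n_total * (n_total + 1)) * sum_term - 3 * (n_total + 1)

# Three hypothetical groups with completely separated scores
h = kruskal_wallis_h([1, 2], [3, 4], [5, 6])
print(round(h, 4))  # 4.5714
```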

Some researchers will stop here, basing their final decision on a subjective comparison of the mean ranks for each group. In this example, it is fairly clear that scores for the ultrasound group are higher than the other two groups. However, when such judgments are not sufficient, a multiple comparison procedure can be used to determine which groups are different. This will be described shortly.

^{‡}When samples are very small, with five subjects or fewer per group, alternative tables can be used to obtain critical values of *H*. See Siegel and Castellan.^{5}

A substantial number of ties can have a conservative effect on the value of *H*, making the test less powerful. This may be a concern when the test result is not significant and when greater than 25% of the scores are tied. A correction factor can be applied to increase the value of *H* under these conditions. Unless the number of ties is substantial, however, the effect of the correction will be minimal. Obviously, if *H* is significant without the correction, there is no point in making the adjustment. Procedures for this correction can be found in the text by Siegel and Castellan.^{5}

When *H* is significant, it is usually of interest to determine which specific groups are different from each other. The Mann-Whitney *U* test is often used as a multiple comparison procedure; however, a Bonferroni correction should be applied to control for the increased risk of Type I error, using the same rationale that applies to multiple *t*-tests. Siegel and Castellan^{5} present a multiple comparison procedure to protect against this increased error rate.

A multiple comparison for the Kruskal-Wallis ANOVA tests the significance of pairwise differences between conditions, based on the *mean of the ranks* for each sample (*R*/*n*). Mean ranks are obtained for each of the three groups in Table 22.4. The total number of pairwise comparisons associated with an analysis will be equal to *k*(*k* − 1)/2. With three mean rankings (*k* = 3), we will have 3(3 − 1)/2 = 3 comparisons.

Each pairwise comparison is tested against a minimum significant difference (MSD) based on the formula

MSD = *z* √[(*N*(*N* + 1)/12)(1/*n*_{1} + 1/*n*_{2})]

where *N* is the total number of subjects in all samples combined, and *n*_{1} and *n*_{2} are the respective sample sizes for the two groups involved in the specific pairwise comparison. Any absolute difference between mean ranks that is *equal to or larger than* the minimum significant difference is considered significant.

The value of *z* in Equation (22.5) is based on the total number of comparisons to be made and the desired level of significance for the overall test. We obtain *z* from Table 22.5. The *α* level selected in the table is based on the desired *familywise error rate* (*α*_{FW}), that is, the overall probability associated with the entire set of comparisons. Researchers may choose to keep *α*_{FW} at .05, which is considered a conservative practice, or they may accept higher probability levels, such as .15 or .20, when the risk of Type I error is not of great concern.^{§} Typically, a larger *α* is chosen as *k* increases.^{6}

Number of Comparisons | *α*_{FW} = .25 | *α*_{FW} = .20 | *α*_{FW} = .15 | *α*_{FW} = .10 | *α*_{FW} = .05 |
---|---|---|---|---|---|
1 | 1.150 | 1.282 | 1.440 | 1.645 | 1.960 |
2 | 1.534 | 1.645 | 1.780 | 1.960 | 2.241 |
3 | 1.732 | 1.834 | 1.960 | 2.128 | 2.394 |
4 | 1.863 | 1.960 | 2.080 | 2.241 | 2.498 |
5 | 1.960 | 2.054 | 2.170 | 2.326 | 2.576 |
6 | 2.037 | 2.128 | 2.241 | 2.394 | 2.638 |
7 | 2.100 | 2.189 | 2.300 | 2.450 | 2.690 |
8 | 2.154 | 2.241 | 2.350 | 2.498 | 2.734 |
9 | 2.200 | 2.287 | 2.394 | 2.539 | 2.773 |
10 | 2.241 | 2.326 | 2.432 | 2.576 | 2.807 |

^{§}The actual probability associated with each individual comparison is *α*_{FW}/*k*(*k* − 1). Therefore, with *k* = 3, and *α*_{FW} = .05, the per comparison error rate is .05/3(3 – 1) = .008. At *α*_{FW} = .20, the per comparison error rate would be .20/3(3 − 1) = .03.

We can illustrate this procedure using the data in Table 22.4. To compare Groups 1 (*n*_{1} = 5) and 2 (*n*_{2} = 5), we first specify our desired familywise error rate, say *α*_{FW} = .15. Next, we determine that there will be a total of three comparisons. According to Table 22.5, at *α*_{FW} = .15, *z* = 1.96 for three comparisons. We can now compute the minimum significant difference for this comparison using Equation (22.5):

MSD = 1.96 √[(14(15)/12)(1/5 + 1/5)] = 1.96 √7 = 5.19

We compare this minimum difference with the absolute difference between the mean ranks for Groups 1 and 2:

Because this difference is less than the minimum significant difference, it is not considered significant. There is no significant difference between ice and hot packs for relieving pain.

We compare Groups 1 and 3 (*n*_{1} = 5, *n*_{3} = 4) using

MSD = 1.96 √[(14(15)/12)(1/5 + 1/4)] = 1.96 √7.875 = 5.50

The difference between the mean ranks is greater than this minimum significant difference, and, therefore, this represents a significant effect. Ultrasound is more effective than ice.

Finally, we compare Groups 2 and 3 (*n*_{2} = 5, *n*_{3} = 4) using the minimum significant difference of 5.50 (obtained earlier for the same sample sizes):

This comparison is also significant. We can now conclude that ultrasound is more effective for reducing low back pain than either ice or hot packs.
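The minimum significant difference computations above can be sketched as:

```python
import math

def msd_kruskal(z, n_total, n_a, n_b):
    """Minimum significant difference between mean ranks for a
    Kruskal-Wallis pairwise comparison (Equation 22.5)."""
    return z * math.sqrt(n_total * (n_total + 1) / 12 * (1 / n_a + 1 / n_b))

# z = 1.96 for three comparisons at alpha_FW = .15 (Table 22.5)
print(round(msd_kruskal(1.96, 14, 5, 5), 2))  # 5.19  (Groups 1 vs 2)
print(round(msd_kruskal(1.96, 14, 5, 4), 2))  # 5.5   (Groups 1 vs 3, and 2 vs 3)
```

Because the MSD depends only on the two sample sizes involved, the comparison of Groups 2 and 3 reuses the 5.50 value obtained for Groups 1 and 3.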

When all *k* samples are of equal size, one minimum significant difference can be used for all comparisons, using the formula

MSD = *z* √[*k*(*N* + 1)/6]

Two procedures are commonly used for testing the difference between correlated samples: the sign test and the Wilcoxon signed-ranks test. These tests are used with two-level repeated measures designs. They are analogous to the parametric *t*-test for correlated or paired samples.

The **sign test** is one of the simplest nonparametric tests because it requires no mathematical calculations. It is used with binomial data, and does not require that measurements be quantitative. As its name implies, the data are analyzed using plus and minus signs rather than numerical values. Therefore, this test provides a mechanism for testing relative differentiations such as more-less, higher-lower, or larger-smaller. It is particularly useful when quantification is impossible or infeasible and when subjective ratings are necessary.

We are interested in the effect of knee angle on knee extensor strength. Using a manual muscle test (MMT), we will study 10 patients, six months following a total knee replacement. MMT grades are recorded from 0 (no muscle activity) to 12 (normal strength). We hypothesize that knee extensor strength will be different with the knee in 90° and 15° of flexion.

Hypothetical data are shown in Table 22.6A. The sign test is applied to the differences between each pair of scores, based on whether the direction of difference is positive or negative. In this example, we will use the grades measured at 15 degrees as the reference and record whether the grade at 90 degrees is greater (+), the same (0), or less (−) than the reference grade, always maintaining the same direction of comparison. It does not matter which value is used as the reference, as long as the order is consistent. In the fourth column in Table 22.6A, the signs of the differences are listed. When no difference is obtained, a zero is recorded.

Under the null hypothesis, we would expect half the differences to be positive and the other half to be negative. We will reject H_{0} if one sign occurs sufficiently less often. If we propose a directional alternative hypothesis, we must be sure that the direction of comparison supports the predicted direction of change. For this illustration, we have proposed a nondirectional hypothesis.

To proceed with the test, we count the number of plus signs and the number of minus signs. Ties, recorded as zeros, are discarded from the analysis, and *n* is reduced accordingly. In this example, 7 of the 10 subjects showed differences, with three ties. Therefore, *n* = 7. There are 6 plus signs and 1 minus sign (see Table 22.6A). We take the smaller of these two values, the *number of fewer signs*, and assign it the test statistic, *x*. In this case, *x* = 1, the number of minus signs.

To determine the probability of obtaining *x* under *H*_{0}, we refer to Appendix Table A.9. This table lists one-tailed probabilities associated with *x* for values up to *n* = 30, where *n* is the number of pairs whose differences showed direction. Two-tailed tests require doubling the probabilities given in the table.

For *x* = 1 and *n* = 7, the table shows *p* = .062. Because we have proposed a nondirectional hypothesis, we double this value for a two-tailed probability of *p* = .124. This is greater than the acceptable level of .05, and we cannot reject *H*_{0}. The probability that the difference in the number of plus and minus signs occurred by chance is too great. We conclude that there is no significant difference in knee extensor strength with the knee at 90 and 15 degrees.

The determination of the probability associated with *x* is based on a theoretical distribution called the *binomial probability distribution*. A binomial outcome is one that can take only two forms, in this case either positive or negative. The binomial test determines the likelihood of getting the smaller number of plus or minus signs out of the total number of differences just by chance.
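The binomial probability underlying the table value can be computed directly. A minimal sketch; note that the exact one-tailed probability for *x* = 1, *n* = 7 is .0625, which the table rounds to .062:

```python
from math import comb

def sign_test_p(x, n):
    """One-tailed probability of observing x or fewer of the less
    frequent sign among n nonzero differences (binomial, p = 0.5)."""
    return sum(comb(n, i) for i in range(x + 1)) / 2 ** n

p_one = sign_test_p(1, 7)   # 1 minus sign among 7 differences
print(p_one, 2 * p_one)     # 0.0625 0.125
```

Doubling for the nondirectional hypothesis gives *p* = .125, which exceeds .05, matching the decision not to reject *H*_{0}.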

With sample sizes greater than 30, *x* is converted to *z* and tested against the normal distribution according to the formula

*z* = (|*D*| − 1)/√*n*

where |D| is the absolute difference between the number of plus and minus signs.

This calculation is illustrated in Table 22.6B for data with six plus signs and one minus sign, resulting in *z* = 1.51. Using the critical value of *z* = 1.96 for *α*_{2} = .05, this outcome does not achieve significance. The output for this analysis is also shown.

The sign test evaluates differences within paired scores based solely on whether one score is larger or smaller than the other. This is often the best approach with subjective clinical variables that offer no greater precision; however, if data are able to provide information on the relative magnitude of differences, the more powerful **Wilcoxon signed-ranks test** can be used. This test examines both the direction of difference and the relative amount of difference.

Consider the example presented in the previous section. In Table 22.6A, we have listed the manual muscle test grades as ordinal values, based on a scale of 0 to 12. We obtain a difference score for each subject, labeled *d*. When *d* = 0, the subject is dropped from the analysis, and *n* is reduced, as it was in the sign test.

We proceed by ranking the difference scores, *without regard to sign*, and discarding any pairs with no difference. We then attach the sign of the difference to the obtained ranks. For instance, in our example, the rank of 1 is given to the smallest difference score (Subject 2), and then assigned −1 because it reflects a negative difference. Tied difference scores are given the mean of their ranks. Therefore, ranks 2, 3 and 4 are taken by Subjects 4, 5 and 7, who all have a difference score of 2. These scores are each assigned the average rank of 3. Subjects 8 and 10 are tied with difference scores of 3, filling ranks 5 and 6, which are averaged to rank 5.5. The final rank of 7 is assigned to Subject 6.

If the null hypothesis is true, we would expect to find an equal representation of positive and negative signs among the larger and smaller ranks; that is, the sum of the positive ranks should be equal to the sum of the negative ranks. We reject *H*_{0} if either of these sums is too small.

We determine if there are fewer positive or negative ranks, and then sum the ranks for the *less frequent sign*. This sum is assigned the test statistic, *T*. In this example, there are fewer ranks with negative signs, with the sum of −1. Therefore, *T* = −1. Only the absolute value of *T* is used to determine significance. The sign of *T* is of concern only when performing a one-tailed test.

Critical values of *T* are given in Appendix Table A.12 for one- and two-tailed tests, where *n* is the number of pairs with nonzero differences. The absolute calculated value of *T* must be *less than or equal to* the critical value to achieve significance. Note once again that this is opposite to the way most critical values are used. For this analysis, at *α*_{2} = .05, with *n* = 7, the critical value of *T* is 2. Therefore, our calculated value of *T* = 1 is significant (see Table 22.6C). We can reject *H*_{0} and conclude that knee extensor strength is different with the knee at 90 and 15 degrees. Visual examination of the data tells us that strength is greater with the knee at 90 degrees.

It is interesting to note the difference between the outcome of this analysis and the outcome of the sign test on the same data. We were able to substantiate a significant difference using the Wilcoxon procedure, because it is sensitive to relative differences, not just direction. Therefore, if data achieve adequate precision, the Wilcoxon test is recommended over the sign test.

With sample sizes over 25, the absolute value of *T* can be converted to *z* according to

*z* = [*T* − *n*(*n* + 1)/4] / √[*n*(*n* + 1)(2*n* + 1)/24]

where *n* is the number of paired observations. For this analysis, *z* = −2.20 (see Table 22.6C). The absolute value of *z* is greater than the critical value 1.96, which represents a significant difference at *α*_{2} = .05. According to Appendix Table A.1, the two-tailed significance associated with *z* = 2.20 is .0278. This is illustrated in the output for the *z* test.
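A sketch of this conversion for the example's values (*T* = 1, *n* = 7):

```python
import math

def wilcoxon_t_to_z(t, n):
    """Normal approximation for the Wilcoxon signed-ranks T,
    where n is the number of pairs with nonzero differences."""
    mean_t = n * (n + 1) / 4
    sd_t = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (t - mean_t) / sd_t

z = wilcoxon_t_to_z(1, 7)
print(round(z, 2))  # -2.2
```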

In this section we present a nonparametric test to analyze data from a single-factor repeated measures design with three or more experimental conditions. The **Friedman two-way analysis of variance by ranks** is a powerful alternative to the parametric repeated measures ANOVA when ordinal data are used or when parametric assumptions are not tenable. The test is given the designation "two-way" based on the interpretation that "subjects" is treated as an independent variable with *n* = 1 per cell of the design. It is assumed that the number of measurements in each experimental condition will be the same.

We are interested in measuring the effect of changing body position on blood pressure in six patients with chronic pulmonary disease. Each patient will be placed in three positions—level, head down and head elevated—in random order. Blood pressure will be measured within 1 minute of assuming the position. We may choose to use a nonparametric form of analysis for this study because the sample is small, and because we do not have sufficient reason to assume that blood pressure for a population of patients with this disease will be normally distributed. In addition, although blood pressure measurements can be considered ratio level data, we can rationalize that the lack of reliability in the data warrants using a nonparametric test.

Hypothetical data for this study are reported in Table 22.7A. Data are arranged so that rows represent subjects (*n*) and columns represent experimental conditions (*k*). In this example, *n* = 6 and *k* = 3. We begin by converting all scores to ranks; however, the ranking process for this test is different from that used with the Kruskal-Wallis ANOVA. Here the ranks are assigned across each row (within a subject). Ties are assigned average ranks within a row. The highest rank within a row will equal *k*.

The next step is to sum the ranks within each column. If the null hypothesis is true, we would expect the distribution of ranks to be a matter of chance, and high and low ranks should be evenly distributed across all treatment conditions. Therefore, the rank sums within each column should be equal. If the alternative hypothesis is true, at least one pair of conditions will show a difference.

The test statistic for the Friedman ANOVA is χ^{2}_{r} (read "chi-square r"). It is computed using the formula

χ^{2}_{r} = [12/(*nk*(*k* + 1))] Σ*R*^{2} − 3*n*(*k* + 1)

where *n* is the number of subjects (rows), *k* is the number of treatment conditions (columns) and Σ*R*^{2} is the sum of the squared rank totals for the columns. Calculation of χ^{2}_{r} is illustrated in Table 22.7B. For this analysis, χ^{2}_{r} = 9.25.

The distribution of χ^{2}_{r} follows the standard χ^{2} distribution with *k* – 1 degrees of freedom, where *k* is the number of experimental conditions (Appendix Table A.5). With 2 *df*, we compare our calculated value of 9.25 against the critical value of 5.99 (at *α* = .05). The calculated value must be *equal to or larger than* the critical value to be significant. Therefore, our test is significant (see Table 22.7C).
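A minimal sketch of the χ^{2}_{r} formula. The rank totals used here (17, 12.5, and 6.5) are hypothetical values chosen only to be consistent with the reported χ^{2}_{r} = 9.25, since Table 22.7 is not reproduced:

```python
def friedman_chi2r(rank_totals, n):
    """Friedman chi-square-r from column rank totals, with n
    subjects and k = len(rank_totals) conditions."""
    k = len(rank_totals)
    sum_r2 = sum(r ** 2 for r in rank_totals)
    return 12 / (n * k * (k + 1)) * sum_r2 - 3 * n * (k + 1)

# Hypothetical rank totals consistent with the example's result (n = 6, k = 3)
print(round(friedman_chi2r([17, 12.5, 6.5], 6), 2))  # 9.25
```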

When χ^{2}_{r} is significant, we can test all pairwise differences using a multiple comparison procedure. Although the Wilcoxon signed-ranks test is often used for this purpose, a specific multiple comparison has been developed for the Friedman ANOVA.^{5} We propose a familywise error rate as an overall level of significance for the combined set of contrasts in the experiment.

The expression used to determine the minimum significant difference (MSD) for all pairwise contrasts is

MSD = *z* √[*nk*(*k* + 1)/6]

where *R*_{1} and *R*_{2} are the rank totals for each treatment condition, *n* is the number of subjects, and *k* is the number of treatment conditions. The value of *z* is taken from Table 22.5 for the appropriate number of comparisons (*k*(*k* – 1)/2) and the desired familywise *α* level for the combined set of comparisons. For the current example, we have a total of three comparisons, and we propose a familywise *α* level of .10. Therefore, *z* = 2.128.

We compute the minimum significant difference for *n* = 6 subjects and *k* = 3 conditions:

MSD = 2.128√[(6)(3)(4)/6] = 2.128√12 = 7.37

For this analysis, contrasts are made between *rank totals* for each treatment condition, not mean ranks. The absolute value of differences between rank sums for each pair of treatment conditions must be *greater than or equal to* the obtained critical value. Because we are dealing with repeated measures, and all subjects are represented under each treatment, there is only one critical value for all contrasts. The three pairwise comparisons for this study are

The only difference score that exceeds the critical value of 7.37 is obtained from the second comparison, between conditions 1 and 3. Therefore, there is a significant difference in blood pressure when an individual is positioned level versus head down, with higher pressures obtained in the head-down position. No other contrasts are significant.
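The pairwise screening can be sketched as follows. The design size (*n* = 6, *k* = 3) follows from the numbers in the text, but the rank totals below are hypothetical values chosen only to illustrate the comparison logic, with just the first-versus-third contrast exceeding the MSD:

```python
import numpy as np

# Friedman multiple-comparison MSD: z * sqrt(n*k*(k+1)/6),
# with z = 2.128 for 3 comparisons at familywise alpha = .10.
n, k = 6, 3
z = 2.128
msd = z * np.sqrt(n * k * (k + 1) / 6)   # 2.128 * sqrt(12), about 7.37

# Hypothetical rank totals (not the study's values), constructed so that
# only the contrast between conditions 1 and 3 exceeds the MSD.
R = {1: 7.0, 2: 13.0, 3: 16.0}
for i, j in [(1, 2), (1, 3), (2, 3)]:
    diff = abs(R[i] - R[j])
    print(f"|R{i} - R{j}| = {diff:.1f}, significant: {diff >= msd}")
```

Note that the same critical value applies to every contrast, because each subject contributes a rank under every condition in a repeated measures design.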

Nonparametric procedures offer clinical researchers a powerful and easily understood statistical mechanism for analyzing changes measured with subjective tools. Because of the nature of many clinical assessments, the ability to analyze ordinal data is important. There is still some debate among statisticians and researchers concerning the appropriate application of parametric versus nonparametric statistics with ordinal data. The classical view is that only nonparametric procedures should be used with ordinal measurements; however, many researchers do apply parametric tests to ordinal data, presumably because parametric tests have greater statistical power. This practice has been justified by assuming that the ordinal intervals are consistent, even though sensitivity of measurement may be unable to document this. Under that assumption, the analysis would not conceptually violate the assumptions of the parametric test.^{7,8}

Although some assessment scales can be constructed in such a way as to define intervals as precisely as possible, it is probably unreasonable to assume that constructs such as function, manual resistance, and sensation, typically measured as ranks, can be measured with sufficient reliability that intervals can be considered equal. It is also likely that many of these scales are nonlinear, so that intervals at the extremes of the scale will differ from those toward the center. Those who apply parametric tests to ordinal data, or who interpret the findings of others who have done so, must consider the potential for jeopardizing the validity of statistical outcomes by treating ordinal data as interval data.^{9}

Nonparametric methods are also appropriate for use with interval or ratio data when distributions are skewed or when sample sizes are too small to assume representation of a normal distribution; however, nonparametric procedures can be wasteful of information when used with data on the interval or ratio scales, because precise data are reduced to ranks. Therefore, when the criterion for using nonparametric tests is based on violations of normality only, it may be useful to transform data using a logarithmic transformation to achieve a normal distribution (see Appendix D) and to apply a parametric test.
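The two options described above can be sketched side by side. The lognormal samples below are simulated purely for illustration; a log transformation restores approximate normality for such data, permitting a parametric test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Simulated positively skewed (lognormal) samples for two hypothetical
# groups; all values here are invented for illustration.
group_a = rng.lognormal(mean=1.0, sigma=0.8, size=15)
group_b = rng.lognormal(mean=1.6, sigma=0.8, size=15)

# Option 1: a nonparametric test on the raw, skewed data.
u, p_u = stats.mannwhitneyu(group_a, group_b)

# Option 2: log-transform the data (exactly normalizing for lognormal
# samples), then apply a parametric unpaired t-test.
t, p_t = stats.ttest_ind(np.log(group_a), np.log(group_b))

print(f"Mann-Whitney p = {p_u:.4f}, t-test on log data p = {p_t:.4f}")
```

The trade-off mirrors the text: the Mann-Whitney test discards the interval information by reducing scores to ranks, while the transformed t-test retains it once the normality violation has been repaired.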

The tests that have been included in this chapter are only a sampling of available nonparametric procedures. Statisticians continue to develop and refine these tests and to expand the capabilities of nonparametric methods into areas such as regression and factorial designs. Many tests have been developed with very specific purposes, such as comparing several treatment groups with a single control or looking at differences in variables that have an inherent order. Nonparametric statistics can also be used for correlation procedures and for testing nominal scale data. These procedures are presented in Chapters 23 and 25.

*Logistic Regression Models for Ordinal Response Variables.* Thousand Oaks, CA: Sage Publications, 2005.

*Statistical Principles in Experimental Design* (3rd ed.). New York: McGraw-Hill, 1991.

*Technometrics* 1968;10:509–522.

*Nonparametric Statistics for the Behavioral Sciences* (2nd ed.). New York: McGraw-Hill, 1988.

*Nurs Res* 1999;48:226–229. [PubMed: 10414686]

*Psychol Bull* 1980;87:564–567.