Many research questions in clinical and behavioral science involve categorical variables that are measured on a nominal or ordinal scale. These questions usually deal with the analysis of proportions or frequencies within various categories. For instance, surveys often code responses that represent frequencies, such as the number of Yes-No responses to a series of items or the number of respondents who fall into certain age groups. We can then ask questions about the proportion of respondents that fall into each category. In descriptive studies we are often interested in how certain nominal variables are distributed. For example, we might want to determine the proportion of patients with right-sided or left-sided strokes who are functionally dependent or independent at discharge or the proportion of therapists who work in private practice versus institutional settings.

These types of categorical data are analyzed by determining if there is a difference between the proportions *observed* within a set of categories and the proportions that would be *expected* by chance. For example, if therapists are equally likely to work in private or institutional settings, then theoretically we would expect an equal proportion, or 50%, to fall into each category. The null hypothesis states that no difference exists between the actual proportions measured in a sample and this theoretical distribution. If the observed data depart significantly from these expected null values, we reject the null hypothesis.

The purpose of this chapter is to describe the use of several statistics that can be used to analyze frequencies or proportions. These statistics are based on **chi-square, χ^{2}**, which is a nonparametric statistic used to determine if a distribution of observed frequencies differs from theoretical expected frequencies. Chi-square has many applications in clinical research, in both experimental and descriptive analysis. We concentrate on two general uses of the test. A test of goodness of fit is used to determine if a set of observed frequencies differs from a given set of theoretical frequencies that define a specific distribution. A test that compares the proportion of therapists in private and institutional settings fits this model, based on a theoretical distribution of 50 : 50. Tests of independence are used to determine if two classification variables are independent of each other, that is, to examine the degree of association between them. For example, we could study the frequency of left- and right-sided stroke in terms of functional level at discharge to determine if these variables are related or independent of each other. We also discuss the use of a related procedure called the McNemar test, for examining frequencies of correlated samples. In addition, several other coefficients of association for categorical data will be described.

As we discuss the different applications of the *χ*^{2} statistic, it is important to keep in mind two general assumptions: (1) *Frequencies represent individual counts*, not ranks or percentages. This means that data in each category represent the actual number of persons, objects, or events in those categories, not a summary statistic. (2) *Categories are exhaustive and mutually exclusive*. Therefore, every subject can be assigned to an appropriate category, but only one. Repeated measurement or assignment is not appropriate; that is, no one individual should be represented in more than one category. The characteristics being measured should be defined with enough specificity to avoid any overlaps in group assignment.

Chi-square is defined by^{∗}

*χ*^{2} = Σ (*O* – *E*)^{2}/*E*   (25.1)

where *O* represents the **observed frequency** and *E* represents the **expected frequency.** As the difference between observed and expected frequencies increases, the value of *χ*^{2} will increase. If observed and expected frequencies are the same, *χ*^{2} will equal zero.

We illustrate the application of this statistic using a simple example. Suppose we tossed a coin 100 times. The null hypothesis states that no bias exists in the coin, and we would expect a theoretical outcome of 50 heads and 50 tails. We observe 47 heads and 53 tails. Does this deviation from the null hypothesis occur because the coin is biased, or is it only a matter of chance? In other words, is the difference between the observed and expected frequencies sufficiently large to justify rejection of the null hypothesis?

We calculate *χ*^{2} by substituting values in the term (*O* – *E*)^{2}/*E* for each category. For heads,

(47 – 50)^{2}/50 = 9/50 = 0.18

For tails,

(53 – 50)^{2}/50 = 9/50 = 0.18

The sum of these terms for all categories is the value of *χ*^{2}. Therefore,

*χ*^{2} = 0.18 + 0.18 = 0.36

We analyze the significance of this value using critical values of *χ*^{2} found in Appendix Table A.5. Along the top of the table we identify the desired *α* level, say .05. Along the side we locate the appropriate degrees of freedom. In this case, *df* = 1. We will discuss rules for determining degrees of freedom for different statistical models shortly. Chi-square tests do not distinguish between one- and two-tailed tests because no negative values are possible.

The calculated value of *χ*^{2} must be *greater than or equal to* the critical value to be significant. In this example, the calculated value, *χ*^{2} = 0.36, is less than _{(.05)}*χ*^{2}_{(1)} = 3.84. Therefore, *H*_{0} is not rejected, and we would conclude that the coin toss was fair.
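The coin-toss arithmetic is easy to verify in a few lines of Python (a minimal sketch, not part of the original text):

```python
# Chi-square goodness of fit (Eq. 25.1): chi2 = sum((O - E)^2 / E).
observed = [47, 53]        # heads, tails
expected = [50.0, 50.0]    # fair-coin expectation

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 2))      # 0.36, below the critical value 3.84 (df = 1)
```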

^{∗}Although the definitional formula for *χ*^{2} (Eq. 25.1) is used most often, there is a computational formula that may be useful: *χ*^{2} = Σ (*O*^{2}/*E*) – *N*.

In tests for **goodness of fit**, the researcher compares observed frequency counts with a known or theoretical distribution. The classical studies of heredity performed by Mendel illustrate this concept. He observed the color and shape of several generations of peas and compared the frequencies of specific color and shape combinations with a theoretical distribution based on his predictions about the role of dominant and recessive genes. When the observed distributions matched the theoretical model, his genetic theory was supported. Similarly, the coin toss described earlier is essentially a test of goodness of fit to a probability distribution. Chi-square will test the null hypothesis that the proportion of outcomes within each category will not significantly differ from the expected distribution; that is, the observed proportions will fall within random fluctuation of the expected proportions.

There are many models for testing goodness of fit. The two most common applications involve testing observed data against a **uniform distribution** across all categories and a **known distribution** within the underlying population.

Sample size for goodness of fit tests should be large enough that no expected frequencies are less than 1.0; that is, every category in the theoretical distribution of interest should expect at least one count. When this criterion is not met, sample size should be increased or categories combined to create an appropriate distribution. Note that this criterion applies to the expected frequencies, not the observed counts.

Consider a study designed to determine if the incidence of stroke is greater on the right or the left in people over 70 years of age. If we assume that the causative factors of stroke are not biased to one side, then theoretically we would expect to see a uniform distribution, 50% right-sided and 50% left-sided strokes, in the population. This is the null hypothesis, representing chance occurrence. Suppose we obtain data from a broad sample of 130 patients, and find that 71 were affected on the right and 59 on the left. Is this distribution significantly different from the 50% ratio we expect by chance?

We use chi-square to determine if the observed frequencies fit the uniform distribution model by comparing the observed and expected frequencies using Equation (25.1).

First we must establish the expected frequencies. For a uniform distribution we do this simply by dividing the total sample equally among the categories. Therefore, if chance is operating, we would expect 50% of our sample, or 65 people, to have right-sided strokes, and 50%, or 65 people, to have left-sided strokes. We calculate (*O* – *E*)^{2}/*E* for each category, as shown in Table 25.1.

In the uniform distribution goodness of fit model, degrees of freedom equal *k* − 1, where *k* is the number of categories. With two categories (right and left), *df* = 1. Therefore, we compare the calculated value, *χ*^{2} = 1.10, with the critical value _{(.05)}*χ*^{2}_{(1)} = 3.84, obtained from Appendix Table A.5. The calculated value is less than the critical value, and we do not reject the null hypothesis. The difference between the observed and expected frequencies can be attributed to chance, and our sample fits the expected uniform distribution. According to these hypothetical data, the incidence of right- and left-sided strokes can be considered a random event.
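The uniform model generalizes to any number of categories: divide *N* equally among the *k* cells and apply Equation (25.1). A brief illustrative sketch in Python:

```python
# Uniform-distribution goodness of fit: E = N / k for each of k categories.
observed = [71, 59]          # right- and left-sided strokes
n, k = sum(observed), len(observed)
expected = [n / k] * k       # 65 expected per category

chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 1))        # 1.1, matching the chi-square of 1.10 in the text
```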

By definition, the expected frequencies for a uniform distribution will be evenly divided among the categories. Therefore, if we studied a sample with three or four categories, we would test the observed frequencies in each category against expected frequencies of 33.3% and 25%, respectively. For example, if we assigned 130 cases to three different categories, we would expect 43.33 cases in each category. If we had four categories, we would expect 32.5 cases per category. It may seem strange to be dealing with fractions of a count in expected frequencies, as we obviously cannot have a fraction of an individual in a category; however, these values represent only theoretical values based on an infinite number of possible scores, and cannot be interpreted as representing actual expected counts.

The second goodness of fit model compares a sample distribution with a known distribution within an underlying population. This is one way to document how well a sample represents its parent population. In many cases, the variable of interest is normally distributed in the population and the goodness of fit test for normal distributions should be used. In other situations, the population shows a unique distribution that can be tested against observed frequencies.

For example, suppose an investigator hypothesizes that thromboembolism is more common in individuals with certain blood types. If this is true, then we can expect to see those blood types represented among patients with thromboembolism in higher percentages than in the overall population. Suppose we study a sample of 85 patients who have experienced thromboembolism. The null hypothesis states that the disorder is not associated with blood type and that the distribution of blood types in the sample will be similar to that in the overall population. Knowing that 39% of the population has Type A blood, 9% has Type B, 5% has Type AB, and 47% has Type O,^{1} we can determine what proportion of the patients should be expected to have each blood type under the null hypothesis. For example, 39% of the sample, or (0.39)(85) = 33.15 patients, should have Type A blood. Hypothetical observations and expected values for all four categories are shown in Table 25.2A. By looking at the column labeled (*O* – *E*), we can see that there are marked differences between expected and observed frequencies, with some categories showing fewer cases than expected and others more.
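Multiplying each population proportion by the sample size gives the full set of expected frequencies. A quick sketch using the proportions cited above:

```python
# Expected frequency under H0 = (population proportion) x (sample size).
pop_props = {"A": 0.39, "B": 0.09, "AB": 0.05, "O": 0.47}
n = 85   # patients with thromboembolism

expected = {blood_type: p * n for blood_type, p in pop_props.items()}
for blood_type, e in expected.items():
    print(blood_type, round(e, 2))   # A 33.15, B 7.65, AB 4.25, O 39.95
```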

With a known distribution, we test *χ*^{2} with *k* – 1 degrees of freedom. In this case, *df* = 4 – 1 = 3. The calculated value of *χ*^{2} = 19.94, as shown in Table 25.2A. This value exceeds the critical value _{(.05)}*χ*^{2}_{(3)} = 7.82, and we can reject the null hypothesis. These hypothetical data do not follow the population distribution, and therefore, there is reason to believe that this disorder has some association with blood type.

When the results of a chi-square test are significant, we can examine the results subjectively, to determine which categories demonstrate the greatest discrepancy between observed and expected values. For this purpose we can look at a residual for each cell, which is the difference between the observed and expected frequencies, given in the column labeled *O* − *E*. For the blood type study, for instance, the residual for Type A is −5.15. This means that the observed proportion of Type A blood in this sample was less than expected by chance. These raw values may be difficult to interpret, however, as they are affected by the number of observed counts within each cell; that is, cells with larger counts are likely to have larger residuals. Therefore, **standardized residuals, R,** are often used to demonstrate the relative contribution of each cell to the overall value of chi-square:

*R* = (*O* − *E*)/√*E*

For example, using the data for Type B blood in Table 25.2A, the standardized residual is

Standardized residuals for blood types are listed in the rightmost column in Table 25.2. These residual values can be compared to determine which categories contributed most to the value of *χ*^{2}. Residuals that are close to or greater than 2.00 are generally considered important.^{2} The values for blood type demonstrate that the difference between observed and expected frequencies for patients with Type B blood shows the greatest discrepancy. The positive sign for the residual indicates that the proportion of individuals with thromboembolism who have Type B blood is greater than expected by chance. The small residuals for the other blood types suggest that they do not contribute appreciably to the value of *χ*^{2}. The negative values for Types A and O indicate that those frequencies are actually represented in smaller numbers than would be expected by chance.
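The standardized residual simply divides each raw residual by the square root of its expected frequency. Using the Type A values reported above (*O* − *E* = −5.15, *E* = 33.15, implying *O* = 28), a quick check in Python:

```python
import math

# Standardized residual: R = (O - E) / sqrt(E).
def std_residual(o, e):
    return (o - e) / math.sqrt(e)

# Type A blood: E = 33.15 and O - E = -5.15, so O = 28.
r = std_residual(28, 33.15)
print(round(r, 2))   # -0.89, well below 2.00, so Type A contributes little
```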

The most common application of chi-square in clinical research is in tests of independence. With this approach, researchers examine the association, or lack of association, between two categorical variables. This association is based on the proportion of individuals who fall into each category. These data may be obtained from randomized experiments or from descriptive studies involving classification of subject characteristics.

Many examples of these applications can be found in clinical literature. For example, Frankel et al.^{3} examined outcomes of younger and older patients with traumatic brain injuries. They used *χ*^{2} to demonstrate an age-related difference in the proportion of patients who were discharged home versus an institutional setting. Yu et al.^{4} demonstrated a higher proportion of school children with wheezing and shortness of breath in school districts with greater air pollution. Monset-Couchard et al.^{5} studied differences in frequency of speech problems in twins who were born at normal or small birth weight. Proctor and co-workers^{6} studied patients with work-related musculoskeletal disorders, and looked at the proportion of those who completed or did not complete a functional restoration program in relation to return to work and frequency of surgeries. Epidemiologic studies often use chi-square to evaluate the effect of different exposures among diseased and nondiseased individuals.^{†}

In each of the preceding studies, the research question asks if the proportions of subjects observed in each category are independent of each other. Two variables are considered independent if the distribution of one in no way depends on the distribution of the other. For example, if the presence of speech problems is independent of birth weight, then a child with a low birth weight is no more likely to have such problems than a child who was born at a normal birth weight. The null hypothesis for a test of independence states that two categorical variables are independent of each other. Therefore, when the null hypothesis is rejected following a significant *χ*^{2} test, it indicates that an association between the variables is present.

To test the relationship between two categorical variables, data are arranged in a two-way matrix, called a **contingency table**, with *R* rows and *C* columns. To illustrate, consider the data in Table 25.3A, taken from a study by Armstrong et al.,^{7} who looked at the differential effect of a total contact cast (TCC) or removable cast walker (RCW) on healing of neuropathic diabetic foot ulcers. They studied 50 patients who were randomly assigned to use either the TCC (*n* = 23) or the RCW (*n* = 27). The dependent variable was the assessment of healing over 12 weeks, scored as "healed" or "unhealed." This is a nominal level of measurement, and is appropriately analyzed using *χ*^{2}. The 2 × 2 contingency table shows the observed frequencies as the first entry within each cell (labeled "count").

The null hypothesis states that there is no association between the type of cast and healing; that is, both casts will be equally effective. We begin our analysis by calculating the expected frequencies for each cell in the table. This process is somewhat more complicated when working with a contingency table, because we cannot just evenly divide the total sample among the four cells. We must account for the observed *proportions* within each variable. First we ask, what proportion of the total sample (*N* = 50) had healed or unhealed ulcers? According to the observed data, these proportions are

Healed: 33/50 = 66%

Unhealed: 17/50 = 34%

Therefore, if the null hypothesis is true, and no association exists between healing and type of cast worn, we would expect to see these same proportions in the TCC and RCW groups. This means that within each category of cast, 66% of the patients should have healed and 34% should be unhealed. Therefore, of the 23 patients who wore the TCC, 66% or [(.66)(23) = 15.18] should have healed, and 34% [(.34)(23) = 7.82] should be unhealed. Similarly, 66% of the 27 patients who wore the RCW [(.66)(27) = 17.82] should be healed, and 34% should be unhealed [(.34)(27) = 9.18]. These are the frequencies that would be expected if type of cast and healing are not related. Table 25.3B shows the expected frequencies under the column labeled "E".

We can simplify the process of calculating the expected frequency (*E*) for a given cell in the table using the formula

*E* = (*f _{R}*)(*f _{C}*)/*N*

where *f _{R}* and *f _{C}* represent the frequency totals for the row and column associated with that cell, respectively. Therefore, for those who wore the TCC, expected frequencies are

Healed: (33)(23)/50 = 15.18  Unhealed: (17)(23)/50 = 7.82

And for those who wore the RCW,

Healed: (33)(27)/50 = 17.82  Unhealed: (17)(27)/50 = 9.18

^{†}The *Mantel-Haenszel chi-square* statistic is a variation of the chi-square test for independence, used in case-control and cohort studies, when the association between two variables is considered confounded by a third variable. The data are stratified so that the effect of the confounder is partitioned out. The Mantel-Haenszel statistic essentially adjusts the value of chi-square to account for the differential contribution of each stratum. Formulas for Mantel-Haenszel statistics can be found in most epidemiologic texts. See Chapter 28 for a discussion of confounding in epidemiologic studies.

Table 25.3B shows the calculation of *χ*^{2} using these data. These calculations proceed as in previous examples, with all observed and expected frequencies listed in the table (order is unimportant). The test value, *χ*^{2} = 5.24, is compared with the critical value with (*R* − 1)(*C* − 1) degrees of freedom. In this case, we have two rows and two columns, with (2 − 1)(2 − 1) = 1 degree of freedom. From Appendix Table A.5 we obtain the critical value _{(.05)}*χ*^{2}_{(1)} = 3.84. Therefore, *χ*^{2} is significant and the null hypothesis of independence is rejected. These variables are not independent of each other. There is a significant association between the type of cast worn and healing of foot ulcers.
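The whole computation, from expected frequencies (*f _{R}* × *f _{C}*/*N*) to the summed chi-square, can be sketched in Python using the observed counts from Table 25.3:

```python
# Chi-square test of independence for the 2x2 cast data:
# expected cell frequency E = (row total)(column total) / N.
table = [[19, 14],    # healed:   TCC, RCW
         [4, 13]]     # unhealed: TCC, RCW

n = sum(sum(row) for row in table)
row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]

chi2 = 0.0
for i, row in enumerate(table):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n
        chi2 += (o - e) ** 2 / e

print(round(chi2, 2))   # 5.24, exceeding the critical value 3.84 (df = 1)
```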

We can examine the frequencies within each cell to interpret these findings. The output for this analysis allows us to see how each cell contributes to the overall chi-square. As shown in Table 25.3D, the frequency within each cell is also given as a percentage of the column (% within Healed) and the row (% within Cast). For instance, 19 patients in the TCC group were healed. This represents 82.6% of all those who wore the TCC (the row %) and 57.6% of all those who were healed (the column %). If we examine the standardized residuals for these data, shown as the last entry in each cell, we can see that the two cells representing patients who were unhealed contribute most to the significant outcome. With use of the TCC, the number of patients whose ulcers were unhealed was less than expected by chance (*R* = –1.4). For those who wore the RCW, the number of patients who were unhealed was greater than expected (*R* = 1.3). It is reasonable, then, to conclude that the TCC was more effective.

When data are arranged in a contingency table, the marginal frequencies can be generated in one of two ways. They may be *fixed effects*, in that the totals are predetermined by the experimenter. If the study were to be repeated, the same frequencies would probably be used. The levels of cast can be classified as fixed, in that the subjects were assigned to these groups. The numbers of subjects in each category of treatment were determined by the researchers. Conversely, the number of subjects appearing in each category of healing was not predetermined. This is considered a *random effect*, indicating that the numbers in these categories would probably change with repeated sampling.

A **fixed model** contingency table is created when both variables of interest are assigned. This approach is rare in clinical studies. The more common **random model** is composed of two random variables. For example, we could analyze a class of 60 students and classify them according to sex and age. The totals in each category would be different for every class that was tested. A **mixed model** is composed of one random and one fixed variable. The cast example, in which subjects were assigned to treatment groups and measured on healing, fits this model. Treatment is fixed and healing levels are random. Case-control studies use this approach, choosing a fixed number of cases and control subjects, and then examining how many in each group are exposed to a risk factor. If the study were to be repeated, the same numbers of cases and controls could be chosen, but the exposure data would vary. The significance of analyzing a fixed, random, or mixed model will be discussed shortly when we deal with issues of sample size.

The 2 × 2 contingency table is a commonly used model in the analysis of frequencies. An alternative formula for calculating *χ*^{2} can be applied, which eliminates the need for determining expected frequencies. This formula is illustrated in Table 25.4.

Assumptions related to sample size with contingency tables are based on the expected frequencies. In addition to the requirement that each cell contain an expected frequency of at least 1, no more than 20% of the cells should contain expected frequencies less than 5.^{8} When this occurs, the researcher may choose to collapse the table (if it is larger than 2 × 2) to combine adjacent categories and increase expected cell frequencies.

A statistical correction, known as **Yates' correction for continuity**, is often recommended to adjust *χ*^{2} to account for small expected frequencies. This procedure reduces the size of *χ*^{2} by subtracting 0.5 from the absolute value of *O* – *E* for each category before squaring:

*χ*^{2} = Σ (|*O* – *E*| – 0.5)^{2}/*E*

With 2 × 2 tables, Yates' correction for continuity is given as

*χ*^{2} = *N*(|*AD* – *BC*| – *N*/2)^{2}/[(*A* + *B*)(*C* + *D*)(*A* + *C*)(*B* + *D*)]

A number of statistical sources suggest that Yates' correction for continuity is too conservative and unduly increases the chance of committing a Type II error.^{9,10} It has been suggested that *χ*^{2} can provide a reasonable estimate of Type I error for 2 × 2 tables when random or mixed models are used with *N* ≥ 8.^{11} With expected frequencies less than 5, a related procedure called the **Fisher Exact Test** is recommended for use with 2 × 2 tables.^{12} This test results in the exact probability of the occurrence of the observed frequencies, given the marginal totals. The calculation of Fisher's Exact Test is quite cumbersome and is best generated by computer analysis.
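As a sketch of the 2 × 2 correction formula, the function below applies Yates' adjustment to the cast data from Table 25.3; the corrected value shown is computed here for illustration and is not a figure reported in the text.

```python
# Yates-corrected chi-square for a 2x2 table with cells A, B / C, D:
# chi2 = N * (|AD - BC| - N/2)^2 / [(A+B)(C+D)(A+C)(B+D)]
def yates_chi2(a, b, c, d):
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator

# Cast data: healed (TCC 19, RCW 14), unhealed (TCC 4, RCW 13).
print(round(yates_chi2(19, 14, 4, 13), 2))   # 3.95, smaller than the uncorrected 5.24
```

Note how the correction pulls the statistic toward the critical value, which is why some authors consider it overly conservative.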

One of the basic assumptions required for use of *χ*^{2} is that variables are independent; that is, no one subject is represented in more than one cell. There are many research questions, however, for which this assumption will not hold. For instance, we could look at a sample's responses to a question and see how many subjects answered correctly or incorrectly before and after exposure to specific information. Or we could examine the effects of a particular treatment program by looking at the presence or absence of an outcome variable, such as pain, before and after treatment. These studies use nominal variables, but in a repeated measures design. The *χ*^{2} test is not valid under these conditions.^{13}

The **McNemar test** is a form of the *χ*^{2} statistic used with 2 × 2 tables that involve correlated samples, where subjects act as their own controls or where they are matched. This test is especially useful with pretest-posttest designs when the dependent variable is measured as a dichotomy in an ordinal or nominal scale. To illustrate this approach, Evans et al.^{14} studied the effect of percutaneous vertebroplasty on pain and function in patients with vertebral fractures. Table 25.5A shows representative data for the use of pain medications before and after the procedure. In this situation, the cells are not independent, and each subject is represented twice.

The cells in the correlated design follow the standard notation for a 2 × 2 table. The number of patients who demonstrate a change in the use of pain medications following the vertebroplasty are reflected in shaded cells *B* and *C*. Patients in cell B did not use pain medications prior to the procedure, but did use them afterwards. Those in cell C did use pain medications prior to the procedure, but no longer used them afterwards. Patients in cells A and D did not change their use (or nonuse) of medications.

As *B* and *C* represent the total number of patients who showed a change in their behavior, these are the only cells of interest for this analysis. Under the null hypothesis, half of those who changed should stop using medications after the procedure and half should begin using medications. We test this hypothesis using the formula

*χ*^{2} = (*B* − *C*)^{2}/(*B* + *C*)

which is tested against critical values of *χ*^{2} with one degree of freedom (Appendix Table A. 5). As shown in Table 25.5B, for the preceding example, *χ*^{2} = 10.67. This value is significant (Table 25.5C). We can see that the proportion of patients who stopped using medications after the vertebroplasty is substantially higher than for those who began using medications.
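The McNemar computation is a one-liner. The cell counts below (*B* = 4, *C* = 20) are illustrative values chosen because they reproduce the *χ*^{2} of 10.67 reported for this example; the actual entries appear in Table 25.5.

```python
# McNemar test for correlated 2x2 data: chi2 = (B - C)^2 / (B + C), df = 1.
def mcnemar_chi2(b, c):
    return (b - c) ** 2 / (b + c)

# b = began using medications, c = stopped using medications (illustrative).
print(round(mcnemar_chi2(4, 20), 2))   # 10.67 > 3.84, so H0 is rejected
```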

Sometimes a measure of association, like a correlation coefficient, is desired, as a way of expressing the degree of relationship in a set of categorical data. Chi-square tells us only if the association is significant, not if it is strong or weak.

The **phi coefficient**, Φ, can be used to express the degree of association between two nominal variables in a 2 × 2 table.^{15} Its value can range from −1.00 to +1.00, and can be interpreted as a correlation coefficient. It is based on the *χ*^{2} statistic as follows:

Φ = √(*χ*^{2}/*N*)

For the data in Table 25.3,

Φ = √(5.24/50) = 0.32

This finding indicates a relatively weak association between type of cast worn and incidence of healing. The results of the *χ*^{2} test on these data showed that the two variables were not independent. The phi coefficient indicates the strength of their relationship. This statistic can also be obtained using the Pearson correlation (see Chapter 23).

The **contingency coefficient, C**, is a measure of association that can be used with tables larger than 2 × 2, but with the restriction that the number of rows has to equal the number of columns. This value is given by

*C* = √[*χ*^{2}/(*N* + *χ*^{2})]

Once again, using the data in Table 25.3,

*C* = √[5.24/(50 + 5.24)] = 0.31

As these results show, the phi coefficient and the contingency coefficient will yield similar results with 2 × 2 tables.

The contingency coefficient will range from 0 to a maximum of √[(*q* − 1)/*q*], where *q* represents the number of rows or columns in a symmetrical table. For a 2 × 2 table the upper limit of *C* is √(1/2) = .71. For a 3 × 3 table, this maximum will be √(2/3) = .82. Because of these differences, contingency coefficients are not directly comparable unless they are obtained from tables of equal sizes.

A third measure of association based on *χ*^{2} is **Cramer's V** coefficient, which is an alternative to the contingency coefficient when contingency tables are asymmetrical. This coefficient is designed so that the attainable upper bound is always ±1.00. The formula is

*V* = √[*χ*^{2}/*N*(*q* − 1)]

where *N* is the total number of subjects, and *q* is the number of rows or columns, *whichever is smaller*. For example, suppose we were conducting a survey of 50 participants in a health promotion program. We ask respondents for their age (in 4 categories) and their level of satisfaction with the program (in 3 categories), so that we create a 4 × 3 contingency table. Assume that *χ*^{2} = 24.00. Because we have 4 levels of age and 3 levels of satisfaction, *q* = 3. Therefore,

*V* = √[24.00/50(3 − 1)] = √0.24 = 0.49

which represents a moderate degree of relationship, indicating that there is some association between the participants' age and level of satisfaction with the program.
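All three chi-square-based coefficients can be computed directly from values given in this chapter (*χ*^{2} = 5.24, *N* = 50 for the cast data; *χ*^{2} = 24.00, *N* = 50, *q* = 3 for the survey example). The printed values below are computed here as a check:

```python
import math

# Measures of association based on chi-square:
# phi = sqrt(chi2 / N)              (2x2 tables)
# C   = sqrt(chi2 / (N + chi2))     (square tables)
# V   = sqrt(chi2 / (N * (q - 1)))  (q = smaller of rows/columns)
def phi(chi2, n):
    return math.sqrt(chi2 / n)

def contingency_c(chi2, n):
    return math.sqrt(chi2 / (n + chi2))

def cramers_v(chi2, n, q):
    return math.sqrt(chi2 / (n * (q - 1)))

print(round(phi(5.24, 50), 2))            # 0.32 for the cast data
print(round(contingency_c(5.24, 50), 2))  # 0.31
print(round(cramers_v(24.0, 50, 3), 2))   # 0.49 for the 4x3 survey example
```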

Many computer packages generate a series of coefficients associated with contingency table analyses. These statistics are not based on chi-square.

The **lambda coefficient, λ**, is used to determine how well one can predict membership in one category based on knowledge of another category. Both sets of categories should be at the nominal level. Lambda is reported in asymmetric and symmetric versions. The *asymmetric lambda* is interpreted as the improvement in predicting *Y* once values of *X* are known; that is, one nominal variable is designated as the dependent variable (*Y*), and the other as the independent variable (*X*). For instance, in the study of diabetic ulcers described earlier, we would designate the type of cast as the independent variable and level of healing as the dependent variable. In some analyses, however, the researcher is unable to specify which variable is dependent. For example, we might want to look at the relationship between side of stroke and sex, neither of which could necessarily be seen as a dependent variable. In this case, the *symmetric* version of lambda is used. Lambda ranges from 0, when there is no improvement in prediction, to 1.0, when predictions can be made without error.

**Kendall's tau-b** and **tau-c** are measures of association for ordinal variables that are reported in categories. Tau-b is appropriate with square tables, such as 2 × 2, and tau-c should be used with rectangular tables where the number of rows and columns differ.

**Gamma** is based on the tau statistic, but ignores ties; that is, pairs that have the same classification for *X* and *Y* are eliminated from the analysis. When tables have three or more dimensions (three or more category variables, such as sex, age group and diagnosis), partial gammas can be calculated.

Clinical researchers can find many uses for the chi-square statistic for data analysis and descriptive purposes. It is often useful as a way of establishing group equivalence following random assignment. For instance, once two groups have been assigned, it may be of interest to compare the numbers of males and females in each group to see if they were assigned in equal proportions. Or it may be important to determine if certain age groups are equally represented in each experimental group. Chi-square can be used to make these determinations and confirm the validity of the randomization process.

Chi-square should not be used as an alternative to more precise tests, such as the *t*-test or analysis of variance, when data can be measured on a continuous scale. Any data can be reduced to the nominal level, but this can result in a serious loss of information and is not encouraged for continuous measures. For example, if a survey requested information on an individual's age, and the exact age is given, it may not be useful to reduce the data to age intervals.

Issues of sample size are relevant to discussions of chi-square. The statistic is sensitive to increases in sample size when there is a true difference between observed and expected frequencies. With larger samples, the magnitude of these differences will usually increase, thereby increasing the value of *χ*^{2}. When samples are very small, these differences can be hidden. It is often useful to consider collapsing categories when this does not compromise the research question, and to re-examine data using larger cell frequencies; however, this should be done only when the combinations of categories are theoretically reasonable and meaningful. It may be helpful to think about potential combinations of categories prior to data analysis. It is never appropriate to make such combinations on the basis of the observed data to achieve significant outcomes. See Appendix C for a discussion of power related to chi-square.

*Arch Phys Med Rehabil* 2006;87:57–62. [PubMed: 16401439]

*J Occup Environ Med* 2004;46:946–952. [PubMed: 15354060]

*Arch Dis Child Fetal Neonat Ed* 2004;89:F310–314.

*Arch Phys Med Rehabil* 2005;86:1509–1515. [PubMed: 16084800]

*Diabetes Care* 2005;28:551–554. [PubMed: 15735186]

*J Biopharm Stat* 1995;5:43–70.

*Stat Med* 1990;9:363–367. [PubMed: 2362976]

*Am Statistician* 1975;29:143–145.

*Arch Phys Med Rehabil* 1995;76:678–681. [PubMed: 7605189]

*Radiology* 2003;226:366–372. [PubMed: 12563127]

*Nonparametric Statistics for the Behavioral Sciences* (2nd ed.). New York: McGraw-Hill, 1988.