Many statistical procedures, such as the *t*-test, analysis of variance, and linear regression, are based on assumptions about homogeneity of variance and normality that should be met to ensure the validity of the test. Although most parametric statistical procedures are considered robust to moderate violations of these assumptions, some modification to the analysis is usually necessary with striking departures. When this occurs, the researcher can choose one of two approaches. The analytic procedure can be modified, by using nonparametric statistics or nonlinear regression, or the dependent variable, *X*, can be transformed to a new variable, *X'*, which more closely satisfies the necessary assumptions. The new variable is created by changing the scale of measurement for *X*. In this appendix we introduce five approaches to *data transformation*.

The three most common reasons for using data transformation are to satisfy the assumption of homogeneity of variance, to conform data to a normal distribution, and to create a more linear distribution that will fit the linear regression model. Fortunately, the same transformation will often accomplish more than one of these goals.^{1}

The most commonly used transformations are the square root transformation, the square transformation, the log transformation, the reciprocal transformation, and the arc sine transformation. The choice of which method to use will depend on characteristics of the data. Before we describe the guidelines for using each of these approaches, it may be helpful to illustrate the transformation process using the square root transformation.

The *square root transformation* replaces each score in a distribution with its square root. This method is most appropriate when variances are roughly proportional to group means, that is, when *s*^{2}/X̄ is similar for all samples. The square root transformation will typically have the effect of equalizing variances.

Suppose we were given the two sample distributions shown in the left panel of Table D.1. The variances, *s*^{2}_{A} = 8.5 and *s*^{2}_{B} = 26.5, are obviously quite different from one another. We determine the applicability of the square root transformation by demonstrating that *s*^{2}/X̄ is similar for both distributions: 8.5/4 = 2.125 and 26.5/10 = 2.65.

**Table D.1**

| | Original Data (*X*): A | B | Transformed Data (√*X*): A | B |
|---|---|---|---|---|
| | 1 | 8 | 1.00 | 2.83 |
| | 3 | 7 | 1.73 | 2.65 |
| | 8 | 12 | 2.83 | 3.46 |
| | 6 | 5 | 2.45 | 2.24 |
| | 2 | 18 | 1.41 | 4.24 |
| Σ | 20 | 50 | 9.42 | 15.42 |
| X̄ | 4 | 10 | 1.88 | 3.08 |
| *s*^{2} | 8.5 | 26.5 | .56 | .61 |
| *s*^{2}/X̄ | 2.125 | 2.65 | | |

Each score in both distributions is transformed to its square root on the right in Table D.1. As we can see, the effect of this transformation is a reduction in the discrepancy between the two variances; now *s*^{2}_{A} = .56 and *s*^{2}_{B} = .61. These transformed values can now be used in a statistical analysis.

When data contain many small numbers (equal or close to zero), the square root transformation is more valid using *X'* = √(*X* + .5) as the converted score.
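The square root transformation can be sketched in Python using only the standard library (variable names here are illustrative; sample variances use *n* − 1, matching Table D.1):

```python
import math
from statistics import mean, variance

# Raw scores from Table D.1
A = [1, 3, 8, 6, 2]
B = [8, 7, 12, 5, 18]

# Applicability check: s^2 / mean should be similar across groups
ratio_A = variance(A) / mean(A)    # 8.5 / 4  = 2.125
ratio_B = variance(B) / mean(B)    # 26.5 / 10 = 2.65

# Transform each score to its square root
A_t = [math.sqrt(x) for x in A]
B_t = [math.sqrt(x) for x in B]

# The transformed variances are nearly equal (about .56 and .61)
var_A_t = variance(A_t)
var_B_t = variance(B_t)
```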

The *square transformation* (*X*' = *X*^{2}) is used primarily in regression analysis when the relationship between *X* and *Y* is curvilinear downward; that is, the slope steadily decreases as the value of the independent variable increases.^{1} This transformation will cause the relationship to appear more linear. It will also have the effect of stabilizing variances and will normalize the dependent variable when the residuals are negatively skewed.
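A minimal sketch of the linearizing effect, using made-up data that follow *Y* = √*X* so the slope falls as the independent variable grows:

```python
import math

# Hypothetical curvilinear data: the slope decreases as t increases
t = [1, 2, 3, 4, 5]
y = [math.sqrt(v) for v in t]

# Successive increments of y shrink, confirming the downward curve
dy = [y[i + 1] - y[i] for i in range(len(y) - 1)]

# Square transformation: X' = X^2 applied to the curvilinear variable
y_sq = [v ** 2 for v in y]

# Increments of the squared values are constant, i.e., the relation is now linear
dy_sq = [y_sq[i + 1] - y_sq[i] for i in range(len(y_sq) - 1)]
```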

The *log transformation* (*X'* = log *X*) is most appropriately used when the standard deviations of the original data are proportional to the mean; that is, the ratio *s*/X̄ (the coefficient of variation) will be roughly constant across distributions. In addition to equalizing variances, the log transformation is used most often to normalize a skewed distribution. In regression analyses, the log transformation can also be used to create a more linear relationship between *X* and *Y* when the regression model shows a consistently increasing slope.^{1} When data are numerically small, the transformation should be made on the basis of *X'* = log (*X* + 1).^{2} The effect of log transformation can be easily demonstrated by plotting scores on logarithmic or semilogarithmic graph paper.
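A sketch with hypothetical groups whose standard deviations are proportional to their means (constant coefficient of variation); because B is an exact multiple of A, the log transformation equalizes the variances:

```python
import math
from statistics import mean, stdev, variance

# Hypothetical groups with s proportional to the mean
A = [8, 10, 12]
B = [80, 100, 120]          # B = 10 * A

cv_A = stdev(A) / mean(A)   # coefficient of variation = 0.2
cv_B = stdev(B) / mean(B)   # coefficient of variation = 0.2

# Log transformation: X' = log X (base 10 here)
A_t = [math.log10(x) for x in A]
B_t = [math.log10(x) for x in B]

# log(10 * x) = 1 + log(x): the constant shift leaves the variances equal
```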

The *reciprocal transformation* (*X'* = 1/*X*) is used when the standard deviations of the original data are proportional to the square of the mean (*s* ∝ X̄^{2}).^{3} It is effective for attaining homogeneity of variance or normality. Use of this approach will minimize the skewing effect of large values of *X*, which will be close to zero in their reciprocal form. With numeric data close to zero, this transformation should be obtained by using *X'* = 1/(*X* + 1).
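A sketch with hypothetical groups in which *s* is proportional to the square of the mean (*s*/X̄^{2} = .01 for both); the reciprocal brings the variances back in line:

```python
from statistics import variance

# Hypothetical groups where s is proportional to the square of the mean
A = [9, 10, 11]      # mean 10, s = 1,  s / mean^2 = .01
B = [16, 20, 24]     # mean 20, s = 4,  s / mean^2 = .01

# Before transformation the variances differ by a factor of 16
ratio_before = variance(B) / variance(A)

# Reciprocal transformation: X' = 1 / X
A_t = [1 / x for x in A]
B_t = [1 / x for x in B]

# Afterward the variances are nearly equal
ratio_after = variance(B_t) / variance(A_t)
```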

The *arc sine transformation* (*X'* = arcsin √*X*) is also called the angular transformation. It is used when data are collected in the form of proportions or percentages, such as the proportion of successful responses in a given number of trials. The ratio *s*^{2}/*p̄*(1 − *p̄*) should be roughly constant for all samples. This transformation is based on an angular scale, whereby each proportion, *p*, is replaced by the angle whose sine is √*p*. Angles are usually given in radians. Tables for arc sine transformations are provided in Fisher and Yates^{4} and Snedecor and Cochran.^{5}
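A minimal sketch of the angular scale, using hypothetical proportions of successful responses and radian measure:

```python
import math

# Hypothetical proportions of successful responses
p = [0.10, 0.25, 0.50, 0.90]

# Arc sine (angular) transformation: X' = arcsin(sqrt(p)), in radians
p_t = [math.asin(math.sqrt(v)) for v in p]

# p = .25 maps to pi/6 radians (30 degrees); p = .50 maps to pi/4 (45 degrees)
```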

Selecting the best transformation may be a less than obvious task. Many researchers use trial and error to determine the transformation that is most successful at reorienting the data. Kirk has suggested a method that may be helpful in facilitating this decision.^{3} He uses each transformation to convert the largest and smallest scores in each distribution. The difference between the largest and smallest score, or the range of the distribution, is calculated using the transformed values. The ratio of the larger to the smaller range is then calculated for each transformation. The transformation that produces the smallest ratio is selected. This process is illustrated in Table D.2.

Data are obtained from two treatment groups. For this example, the largest and smallest raw scores in each distribution are transformed using the square root, log, and reciprocal transformations. The differences between the transformed values of the smallest and largest scores are calculated. For example, the difference between the square roots of 18 and 10 (the largest and smallest scores in Distribution 1) is 1.08. The difference between the square roots of 40 and 20 (Distribution 2) is 1.85. For the square root transformation, the ratio of the larger to the smaller range is 1.85/1.08 = 1.71. A similar ratio is calculated for each of the other transformations, as shown in Table D.2. The log transformation would be selected because it results in the smallest ratio.

When more than two distributions are compared, the ratio is calculated using the largest and smallest ranges for each transformation. For instance, suppose we added a third group to the data, and the differences between the square roots of the largest and smallest values were 1.08, 1.85, and 1.13. The ratio for this transformation would be formed using only 1.85 and 1.08, as these are the largest and smallest ranges for this transformation.
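Kirk's range-ratio procedure can be sketched as follows, using the extremes quoted above: (10, 18) for Distribution 1 and (20, 40) for Distribution 2 (the dictionary keys are assumed names, not Kirk's):

```python
import math

# Smallest and largest raw scores in each distribution (two-group example)
extremes = [(10, 18), (20, 40)]

transforms = {
    "square root": math.sqrt,
    "log": math.log10,
    "reciprocal": lambda x: 1.0 / x,
}

ratios = {}
for name, f in transforms.items():
    # Range of each distribution on the transformed scale
    ranges = [abs(f(hi) - f(lo)) for lo, hi in extremes]
    # Ratio of the larger range to the smaller
    ratios[name] = max(ranges) / min(ranges)

# The transformation producing the smallest ratio is selected
best = min(ratios, key=ratios.get)
```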

Once data are analyzed using transformed values, all further interpretation of the data must be made on the transformed scale. For example, epidemiologists have shown that the distribution of incubation periods of communicable diseases tends to be normally distributed on a logarithmic scale.^{6} Therefore, further analyses of these data have used the log incubation period as the unit of measurement.^{7}

There are situations where data will be of sufficient variability that no transformation will be successful at smoothing the data. When this occurs, the researcher may consider choosing a different response measure as the dependent variable, one that would be more evenly distributed. Alternatively, nonparametric statistics can be applied. These tests, discussed in Chapter 22, do not require normality or equal variances.

Tables are provided in many statistics texts to facilitate log, square, and square root transformations.^{4,8} In addition, most computer programs provide a mechanism for data transformation prior to analysis.

Kleinbaum DG, Kupper LL. *Applied Regression Analysis and Other Multivariable Techniques.* North Scituate, MA, Duxbury Press, 1978.

Kirk RE. *Experimental Design: Procedures for the Behavioral Sciences,* ed 2. Belmont, CA, Brooks/Cole, 1982.

Fisher RA, Yates F. *Statistical Tables for Biological, Agricultural and Medical Research,* ed 6. London, Longman, 1963.

Sartwell PE. The distribution of incubation periods of infectious disease. *Am J Hyg* 51:310, 1950. [PubMed: 15413610]