The simplest experimental comparison involves the use of two independent groups created by random assignment. This design allows the researcher to assume that all individual differences are evenly distributed between the groups, so that the groups are equivalent at the start of the experiment. Statistically, the groups are considered random samples of the same population, and therefore, any observed differences among them should be the result of sampling error or chance. After the application of a treatment variable to one group, the researcher wants to determine if the groups are still from the same population, or if their means can be considered significantly different.

Comparisons can also be made using a repeated measures design. A researcher may be interested in looking at the difference between two conditions or performances by the same group of subjects. In this case, the subjects serve as their own control, and the researcher wants to determine if the conditions are significantly different.

The purpose of this chapter is to introduce procedures for evaluating the comparison between two means using the *t*-test and confidence intervals. These procedures can be applied to differences between two independent samples or between scores obtained with repeated measures. These procedures are based on parametric operations and, therefore, are subject to all assumptions underlying parametric statistics.

The concept of statistical significance for comparing means is based on the relationship between two sample characteristics: the mean and the variance. The difference between group means indicates the degree of separation *between* groups (the effect size). Variance measures tell us how variable the scores are *within* each group. Both of these characteristics represent sources of variability that are used to describe the extent of treatment effects.

Suppose we wanted to compare two randomly assigned groups, one experimental and one control, to determine if treatment made a difference in their performance. Theoretically, if the experimental treatment was effective, and all other factors were equal and constant, all subjects within the treatment group would achieve the same score, and all subjects within the control group would also achieve the same score, but scores would be different between groups. As illustrated in Figure 19.1A, everyone in the treatment group performed better than everyone in the control group. Consider all the scores in this sample for both groups combined. If we were asked to *explain* why these scores were different, we would say that all differences were due to the effect of treatment. There is a difference *between the groups*, but no variance *within the groups*.

###### FIGURE 19.1

Four sets of hypothetical distributions with the same means, but different variances. In (**A**) all subjects in each group received the same score, but the groups were different from each other. There is no variance within groups. In (**B**) the subjects' scores were more spread out, but the control and treatment conditions are still clearly different. There is some variance within groups, but the variance between groups is greater. In (**C**) the subjects are much more variable. There is greater variance within groups, and therefore, the groups are not distinctly different. In (**D**) the variances of the two groups are not equal.

Now let's consider the more realistic situation where subjects within a group do not all respond the same way. As shown in Figure 19.1B, the scores in the treatment and control groups are variable, but we still tend to see higher scores among those who received treatment. If we look at the entire set of scores, and we were asked again to explain why scores are different, we would say that some of the differences can be *explained* by the treatment effect; that is, most of the higher scores were in the treatment group. However, the scores are also influenced by personal characteristics, inconsistencies in measurement, and behavioral and environmental factors. These factors create variance within the groups, variance that is *unexplained*, due to all the other unknown factors influencing the response.

This unexplained portion is called **error variance**. The concept of statistical "error" does not mean mistakes or miscalculation. It refers to all sources of variability within a set of data that cannot be explained by the independent variable. Any given score is a composite of the treatment effect and error variance. Random assignment allows us to assume that these error components are unsystematic chance variations, and therefore, are independent of the treatment effect.^{1}

The distribution in Figure 19.1B is represented by a pair of curves. This graphic shows means that are far apart, with few overlapping scores (the gray area) at one extreme of each curve. The curves show that the individuals in the treatment and control groups behaved very differently, whereas subjects within each group performed within a narrow range (error variance is small). In such a comparison, the null hypothesis would probably be rejected, as the treatment effect has clearly differentiated the groups.

Contrast this with the distributions in Figure 19.1C, which show the same means, but greater variability within the groups, as evidenced by the wider spread of the curves. Factors other than the treatment variable are causing subjects to respond very differently from each other. Here we find a great deal of overlap, indicating that many subjects from both groups had the same score, regardless of whether or not they received the experimental treatment. These curves reflect a greater degree of error variance; that is, the treatment does not help to explain most of the differences among scores. In this case, it is less likely that the treatment is differentiating the groups, and the null hypothesis would probably not be rejected. Any group differences observed here are probably due to chance variation.

The subjective judgments we have made about the distributions in Figure 19.1 are not adequate, however, for making research decisions about the effectiveness of treatment. We know groups can look different just by chance. So how do we objectively determine if observed differences between groups are true population differences or only chance differences? In other words, how do we decide if we should reject the null hypothesis? We make this decision on the basis of its *probability of being true*. This is what a test of statistical significance is designed to do.

The significance of the difference between group means is judged by a ratio derived as follows:

(difference between group means) / (variability within groups)

The numerator represents the separation between the groups, which is a function of all sources of variance, including treatment effects and error. The denominator reflects the variability within groups as a result of error alone. Therefore, when *H*_{0} is false, that is, when a treatment effect does exist (*μ*_{1} ≠ *μ*_{2}), the ratio is conceptually represented as

(treatment effect + error) / error

When *H*_{0} is true, that is, when no real treatment effect exists (*μ*_{1} = *μ*_{2}), the ratio reduces to

(0 + error) / error

As the treatment effect increases, the absolute value of this ratio gets larger. As the error variance increases, the ratio gets smaller, approaching 1.0. If we want to demonstrate that two groups are significantly different, this ratio should be as large as possible. Thus, we would want the separation between the group means to be large and the variability within groups to be small. We emphasize the variance *between* and *within* groups as essential elements of significance testing, concepts that will be used repeatedly as we continue our discussion. Most statistical tests are based on this relationship.

The null hypothesis for a two-level design states that the two population means are equal:

*H*

_{0}:

*μ*

_{1}=

*μ*

_{2}

The alternative hypothesis can be stated in a nondirectional format,

*H*

_{1}:

*μ*

_{1}≠

*μ*

_{2}

or a directional format,

*H*

_{1}:

*μ*

_{1}>

*μ*

_{2}or

*H*

_{1}:

*μ*

_{1}<

*μ*

_{2}

Nondirectional hypotheses are tested using a two-tailed test of significance. Directional hypotheses are tested using a one-tailed test. Even though we are actually comparing sample means, our hypotheses are written in terms of population parameters.

Most parametric statistics require the assumption of equal variances among groups, or **homogeneity of variance**. While there is an expectation that error variance will exist within each group, the assumption is that the degree of variance will be roughly equivalent. Look at the scenarios in Figure 19.1 B and C. In one case (B) the variances are small, and in the other (C) they are larger; however, in both cases they are similar across groups. If we consider the spread of scores in Figure 19.1D, we can see that the treatment group is much less variable than the control group. In this situation, the two groups have different variances, and the assumption of homogeneity of variance is not met.

Most statistical procedures that compare means include a test that will determine if the difference in the variance components is significant. We can expect some difference in variances just by chance. With random assignment, larger samples will have a better chance of showing equal variances than small samples. Therefore, the test for homogeneity of variance will clarify if the observed difference in variances is large enough to be meaningful. When variances are significantly different (they are not equal), adjustments can be made in the test for means that will account for these differences.^{∗}

^{∗}This can sometimes be confusing when the test of homogeneity of variance is performed in conjunction with a test for differences between means. Two different tests are actually being done. First the test for homogeneity of variance determines if the variances are significantly different. Then the test for means will determine if the means are significantly different. If the first test shows that variances are not equal, an adjustment will be made in the test for means.

The *t*-test is the statistical procedure used to compare two means.^{†} The **independent** or **unpaired t-test** is used when the means of two independent groups of subjects are compared. Such groups are usually created through random assignment, although samples of convenience or intact groups may be used.

^{‡}Groups are considered independent because each is composed of an independent set of subjects, with no inherent relationship derived from repeated measures or matching.

The test statistic for the unpaired *t*-test is calculated using the formula

*t* = (X̄_{1} – X̄_{2}) / *s*_{X̄_{1}–X̄_{2}}   (19.1)

The numerator of this ratio represents the difference between the independent group means, or the effect size. The term in the denominator is called the **standard error of the difference between the means**,^{§} representing the variability within the two samples. Equation (19.1) can be used in situations where *n*_{1} = *n*_{2}, or when *n*_{1} ≠ *n*_{2} if variances are equal. An alternative formula for *t*, to be described shortly, is used when the assumption of equality of variance is not met.

We estimate *s*_{X̄_{1}–X̄_{2}} using a **pooled variance estimate**, given the symbol *s*^{2}_{p}:

*s*^{2}_{p} = [(*n*_{1} – 1)*s*^{2}_{1} + (*n*_{2} – 1)*s*^{2}_{2}] / (*n*_{1} + *n*_{2} – 2)   (19.2)

where *s*^{2}_{1} and *s*^{2}_{2} are the group variances, and *n*_{1} and *n*_{2} are the respective sample sizes. This estimate provides a *weighted average* of *s*_{1}^{2} and *s*_{2}^{2}.^{∗∗} The pooled variance estimate is based on the assumption that both samples come from the same population and that they have equal variances (any difference between variances is due to chance). Therefore, the pooled variance should estimate the population variance.

The standard error of the difference between the means is then given by^{††}

*s*_{X̄_{1}–X̄_{2}} = √(*s*^{2}_{p}/*n*_{1} + *s*^{2}_{p}/*n*_{2})   (19.3)

The number of degrees of freedom associated with the independent *t*-test is the total of the degrees of freedom for both groups. Therefore, *df* = (*n*_{1} – 1) + (*n*_{2} − 1) = (*n*_{1} + *n*_{2} – 2). This can also be written *df* = *N* − 2, where *N* is the combined sample size.

^{†}Recall from the discussion in Chapter 18 that the *t*-distribution is an analog of the standard normal distribution, developed to represent smaller sampling distributions. The *t*-distribution was originally developed by W.S. Gossett in 1908, who wrote under the pseudonym of "Student." Therefore, the *t*-test is often referred to as Student's *t*-test.

^{‡}When intact groups are used, regression procedures may be the more appropriate form of analysis because groups cannot be randomly assigned to treatment conditions. See Chapters 24 and 29 for a discussion of regression procedures.

^{§}In Chapter 18 we introduced the concept of standard error as an estimate of population variability based on a sampling distribution of means. In this case we are estimating the variability in a sampling distribution of *differences between means*.

^{∗∗}With two samples of equal size, this equation is reduced to *s*^{2}_{p} = (*s*^{2}_{1} + *s*^{2}_{2})/2.

^{††}With equal sample sizes, *s*_{X̄_{1}–X̄_{2}} = √(2*s*^{2}_{p}/*n*), where *n* is the number of subjects in each group.
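The unpaired *t*-ratio, pooled variance estimate, standard error, and degrees of freedom described above can be sketched in Python using only the standard library. The function name and sample data are illustrative, not from the text:

```python
import math
from statistics import mean, variance

def unpaired_t(x1, x2):
    """Unpaired t-test assuming equal variances (pooled estimate)."""
    n1, n2 = len(x1), len(x2)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((n1 - 1) * variance(x1) + (n2 - 1) * variance(x2)) / (n1 + n2 - 2)
    # Standard error of the difference between the means
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    t = (mean(x1) - mean(x2)) / se
    df = n1 + n2 - 2        # (n1 - 1) + (n2 - 1)
    return t, df

# Illustrative data only
t, df = unpaired_t([1, 2, 3], [2, 4, 6])
```

The calculated *t* would then be compared against the critical value from a *t*-table at the appropriate *df*, exactly as the example that follows demonstrates.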

Suppose we are interested in testing the hypothesis that a newly designed splint will improve hand function of patients with rheumatoid arthritis, as measured by pinch strength in pounds (Figure 19.2). We propose a directional alternative hypothesis because we are interested only in documenting an improvement in function with the splint. Results that show no change or a negative change would not be significant.

###### FIGURE 19.2

A pretest-posttest control group design, with two groups of patients with rheumatoid arthritis (RA) formed through random assignment. One group is treated with a hand splint; the other participates in regular activities. The difference between the posttest and pretest pinch strength (change score) is used to compare the two groups with the unpaired *t*-test.

We assemble a random sample of 20 subjects with rheumatoid arthritis, with similar degrees of deformity in the hand and wrist. The subjects are randomly assigned to an experimental group (*n*_{1} = 10) or a control group (*n*_{2} = 10). The experimental subjects wear the splint for 1 week, in addition to participating in their regularly scheduled activities. The control subjects engage in their regular activities with no splint. Pinch strength is measured on day 1 and day 8 for both groups, and the change between the pretest and posttest measurements is used for analysis. Therefore, this study is structured as a pretest-posttest control group design, testing *H*_{0}: *μ*_{1} = *μ*_{2} against *H*_{1}: *μ*_{1} > *μ*_{2}.

Hypothetical data are reported in Table 19.1A. The mean improvement in strength was 10.11 pounds for the splinted group and 5.45 pounds for the control group.

To calculate the *t*-ratio for this comparison, we first determine the value of the denominator, *s*_{X̄_{1}–X̄_{2}} = 1.714, as shown in Table 19.1B. We substitute this value in Equation (19.1) and arrive at a *calculated t-ratio* of 2.718.
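As a quick check, the arithmetic can be reproduced from the reported values (means of 10.11 and 5.45, standard error 1.714 from Table 19.1):

```python
se = 1.714               # standard error of the difference (Table 19.1B)
t = (10.11 - 5.45) / se  # difference between group means / standard error
print(round(t, 2))       # ≈ 2.72; the text's 2.718 reflects rounding of intermediates
```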

Now we must determine if the calculated *t*-ratio is sufficiently large to be considered significant. We do this by comparing the calculated *t* value with a **critical value** at a specified level of significance. The larger the ratio, the more likely the difference is *not* due to chance. Table A.2 in the Appendix is a table of critical values associated with *t*-distributions for samples of various sizes. At the top of the table, levels of significance are identified for one-tailed (*α*_{1}) and two-tailed (*α*_{2}) tests. Because we proposed a directional alternative hypothesis in this example, we will perform a one-tailed test at *α* = .05.

The column along the left side of Table A.2, labeled *df*, identifies the degrees of freedom associated with different-size samples. In this study there are 10 + 10 – 2 = 18 *df*. We look across the row for 18 *df* to the column labeled *α*_{1} = .05 and find the critical value 1.734. We use the summary form

*t*_{(18)} = 1.734

to indicate the critical value of *t* associated with *α*_{1} = .05 and 18 *df*.

Figure 19.3 illustrates the critical value of *t* for 18 *df*, demarcating .05 in the tail of the curve for a one-tailed test. The null hypothesis states that the difference between means will be zero, and therefore, the *t*-ratio will also equal zero. The probability that a calculated *t*-ratio will be as large or larger than 1.734 is 5% or less.

###### FIGURE 19.3

Curve representing a *t* distribution for 18 *df*, showing the critical value of 1.734 for a one-tailed test at .05. The null hypothesis (*H*_{0}) states that the *t*-ratio will equal zero. The calculated value for this example is *t* = 2.718 (from Table 19.1). This value demarcates an area in the tail of the curve of .007. Because this probability is less than .05, we consider the test to be significant.

For a *t*-ratio to represent a significant difference, the absolute value of the calculated ratio must be *greater than or equal to* the critical value. In this example, the calculated value *t* = 2.718 is greater than the critical value 1.734. Therefore, the group means are considered significantly different at *α*_{1} = .05 (see Table 19.1C). We reject *H*_{0}, accept *H*_{1}, and conclude that patients wearing the hand splint improved more than those in the control group.

Note that each column in Appendix Table A.2 represents both a one- and a two-tailed probability. Note, too, that each two-tailed probability is twice its corresponding one-tailed probability. For instance, the critical value of *t* at *α*_{1} = .05 for 18 *df* is 1.734. This is also the critical value for *t* at *α*_{2} = .10. The critical value for *α*_{1} = .01 is the same as that for *α*_{2} = .02. Some statistical tables and computer packages provide values for only a one- or two-tailed test. If this occurs, it is a simple matter to convert the probability values. The probability reported for a one-tailed test is doubled to get a two-tailed test. Conversely, the probability reported for a two-tailed test is halved to get a one-tailed test. Please be clear that it is the probabilities that are doubled or halved, not the critical values.
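The conversion between one- and two-tailed probabilities is a single arithmetic step; a minimal sketch (function names are illustrative):

```python
def to_two_tailed(p_one):
    """Double a one-tailed probability to obtain the two-tailed value."""
    return 2 * p_one

def to_one_tailed(p_two):
    """Halve a two-tailed probability to obtain the one-tailed value
    (valid only when the observed direction matches the prediction)."""
    return p_two / 2

print(to_one_tailed(0.014))  # 0.007
```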

The astute reader will also note that the same calculated value of *t* may be significant for a one-tailed test but not for a two-tailed test; that is, critical values are lower for one-tailed tests at a given alpha level. In other words, the one-tailed test is more powerful. This occurs because one-tailed tests require proof in only one direction, and the full 5% probability can fall in one tail of the curve rather than being split between both sides. This concept is clarified in Chapter 18 (see Figures 18.6 and 18.7). Because of the different critical values associated with one- or two-tailed tests at the same probability level, the type of *t*-test used should always be specified in advance of data analysis and should be stated in a research report.

Critical values of *t* are absolute values, so that negative or positive ratios are tested against the same criteria. The sign of *t* can be ignored when a nondirectional hypothesis has been proposed. The critical region for a two-tailed test is located in both tails of the *t*-distribution, and therefore a positive or negative value can be considered significant. The sign will be an artifact of which group happened to be designated Group 1. If the groups were arbitrarily reversed, the ratio would carry the opposite sign, with no change in outcome.

The sign is of concern, however, when a directional alternative hypothesis is proposed. In a one-tailed test, the researcher is predicting that one specific mean will be larger than the other, and the sign must be in the predicted direction for the alternative hypothesis to be accepted. For the current example, the ratio is positive, because the mean improvement for the experimental group (X̄_{1}) was larger than the mean improvement for the control group (X̄_{2}), as predicted (*H*_{1}: *μ*_{1} > *μ*_{2}).

If the difference is in the opposite direction to that predicted, the researcher cannot reverse the alternative hypothesis, and *H _{0}* cannot be rejected. It is important, therefore, to be sure of direction when performing a one-tailed test.

Table 19.1D shows the output for an unpaired *t*-test for the example of strength and hand splints. There are several pieces of information to consider.

First, the output provides a summary of the descriptive statistics associated with each group. This information is useful as a first pass, to confirm that the correct number of subjects were included, and to see how far apart the means and variances are. Standard deviations are reported, and these can be squared to obtain variance values.

Next, notice that there are actually two lines of output for the independent samples test, labeled according to the assumption of equal variances. We must determine which of these to use. Computer packages automatically run the *t*-test for equal and unequal variances, and the researcher must choose which one should be used for analysis. The columns labeled Levene's Test for Equality of Variances will tell us whether the variances are significantly different. Refer to the probability associated with Levene's test (Table 19.1D➊), which is *p* = .419. This value is greater than .05, and we will conclude that the variances are not significantly different (they are equal). Therefore, we will use the first line of data for "Equal variances assumed."

For the data in Table 19.1, we performed a one-tailed test. The computer output reports only a two-tailed significance (Table 19.1D➋). Therefore, to get the one-tailed probability we divide .014 by 2, giving *p* = .007 for this test. Because this value is less than .05, we reject the null hypothesis.

Recall from Chapter 18 that a confidence interval specifies an interval or range of scores within which the population mean is likely to fall. We can also use this approach to set a confidence interval to estimate the *difference between group means* that exists in the population as follows:

(X̄_{1} – X̄_{2}) ± (*t*)(*s*_{X̄_{1}–X̄_{2}})   (19.4)

where *s*_{X̄_{1}–X̄_{2}} is calculated using Equation (19.3). We will be 95% confident that the true *difference between population means* will fall within this interval.

Consider again the data shown in Table 19.1 for changes in pinch strength, with a difference between means of 4.66. The unpaired *t*-test is significant at *α*_{1} = .05, and *H*_{0} is rejected. Now let us examine how we can use a confidence interval to arrive at the same conclusion.

We create the 95% confidence interval using (*α*_{2} = .05) *t*_{(18)} = 2.101 (from Table A.2). Even though we had proposed a directional hypothesis for this study, by definition, confidence intervals only look at two-tailed test values. The standard error of the difference between means is 1.714, calculated using the pooled variance estimate as shown in Table 19.1B. We substitute these values in Equation (19.4) to determine the 95% confidence limits:

4.66 ± (2.101)(1.714) = 4.66 ± 3.60

1.06 ≤ *μ*_{1} – *μ*_{2} ≤ 8.26

We are 95% confident that the true mean difference, *μ*_{1} – *μ*_{2}, lies between 1.06 and 8.26.
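The interval can be verified directly from the reported difference between means, standard error, and critical value:

```python
diff = 10.11 - 5.45   # difference between group means (4.66)
se = 1.714            # standard error of the difference (Table 19.1B)
t_crit = 2.101        # two-tailed critical t for 18 df at alpha = .05
lower = diff - t_crit * se
upper = diff + t_crit * se
print(round(lower, 2), round(upper, 2))  # 1.06 8.26
```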

**The Null Value**. The null hypothesis states that the difference between two means will be zero. If we look carefully at the 95% confidence interval for these data, we see that the null value, zero, is not contained within it. As we are 95% confident that the true mean difference lies somewhere within this interval, it is unlikely that the true mean difference is zero. Therefore, we can reasonably reject *H*_{0}. This confirms the results of the *t*-test for these same data. Confidence intervals are reported in Table 19.1D➎.

Studies have shown that the validity of the unpaired *t*-test is not seriously compromised by violation of the assumption of equality of variance when *n*_{1} = *n*_{2};^{2} however, when sample sizes are unequal, differences in variance can affect the accuracy of the *t*-ratio. If a test for equality of variance shows that variances are significantly different, the *t*-ratio must be adjusted.

Consider the previous example, in which we examined the effect of a hand splint on pinch strength. Table 19.2 shows alternative data for this comparison, with unequal sample sizes (*n*_{1} = 15, *n*_{2} = 10). The variances in this analysis are significantly different (Levene's test, *p* = .038, Table 19.2D➊). In this instance, we will use the second line of data in the output, "Equal variances not assumed."

When the larger sample also has the larger variance, as in this example (*n*_{1} > *n*_{2} and *s*_{1}^{2} > *s*_{2}^{2}), the *t*-test becomes less powerful; that is, fewer significant differences will be found. Therefore, this issue is moot if significant differences are obtained, but is of concern in cases where *H*_{0} is not rejected.

This problem takes on a different import when the smaller sample has the larger variance (*n*_{1} < *n*_{2} and *s*_{1}^{2} > *s*_{2}^{2}), especially when one variance is more than twice the other. In this case, the probability of a Type I error is increased. This discrepancy increases as the relative sample sizes and variance differences become more disparate.^{3} Obviously, this issue is of concern only when a significant difference is obtained.

When sample size and variances are unequal, the *t*-ratio is modified so that it is no longer based on a pooled variance estimate, but instead uses the *separate variances* of the two groups (see Table 19.2B):

*t* = (X̄_{1} – X̄_{2}) / √(*s*^{2}_{1}/*n*_{1} + *s*^{2}_{2}/*n*_{2})   (19.5)

The degrees of freedom associated with the *t*-test for unequal variances are also adjusted downward, so that the critical value for *t* is also modified. In this example, 20.6 degrees of freedom are used to determine the critical value of *t* (see Table 19.2D➋).^{‡‡} The output shows that the test is significant at *p* = .001 (Table 19.2D➌).

^{‡‡}The adjusted degrees of freedom are determined according to:

*df* = (*s*^{2}_{1}/*n*_{1} + *s*^{2}_{2}/*n*_{2})^{2} / [(*s*^{2}_{1}/*n*_{1})^{2}/(*n*_{1} – 1) + (*s*^{2}_{2}/*n*_{2})^{2}/(*n*_{2} – 1)]
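The separate-variance *t*-ratio and its downward-adjusted, noninteger degrees of freedom can be sketched as follows. The function name and data are illustrative, not from the text:

```python
import math
from statistics import mean, variance

def welch_t(x1, x2):
    """t-ratio using separate variances, with adjusted (noninteger) df."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = variance(x1) / n1, variance(x2) / n2   # s^2/n for each group
    t = (mean(x1) - mean(x2)) / math.sqrt(v1 + v2)
    # Adjusted degrees of freedom (always <= n1 + n2 - 2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Illustrative data with very unequal variances
t, df = welch_t([1, 2, 3, 4], [10, 20, 30])
```

Note that the adjusted *df* here is far smaller than the pooled value of *n*_{1} + *n*_{2} – 2 = 5, which raises the critical value of *t* and protects against an inflated Type I error.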

Researchers often use repeated measures or matched designs to improve the degree of control over extraneous variables in a study. In these designs subjects may be matched on relevant variables, such as age and intelligence, or any other variable that is potentially correlated with the dependent variable. Sometimes twins or siblings are used as matched pairs. More commonly, however, clinical researchers will use subjects as their own controls, exposing each subject to both experimental conditions and then comparing their responses across these conditions.

In these types of studies, data are considered *paired* or correlated, because each measurement has a matched value for each subject. To determine if these values are significantly different from each other, a **paired t-test** is performed. This test analyzes *difference scores* (*d*) within each pair, so that subjects are compared only with themselves or with their match. Statistically, this has the effect of reducing the total error variance in the data because most of the extraneous factors that influence data will be the same across both treatment conditions. Therefore, tests of significance involving paired comparisons tend to be more powerful than unpaired tests.

The test statistic for paired data is based on the ratio

*t* = *d̄* / *s*_{d̄}   (19.6)

where *d̄* is the mean of the difference scores, and *s*_{d̄} represents the **standard error of the difference scores**. This ratio also reflects the relationship of *between-* and *within-group* variance components. The numerator is a measure of the differences between pairs of scores, and the denominator is a measure of the variability within the difference scores.

The paired *t*-test is based on the assumption that samples are randomly drawn from normally distributed populations with equal variances; however, because the number of scores in both treatment conditions must be the same, it is unnecessary to test this assumption with correlated samples.

The total *df* associated with a paired *t*-test are *n* – 1, where *n* is the number of pairs of scores.
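The paired *t*-ratio in Equation (19.6) can be sketched in a few lines of standard-library Python. The function name and data are illustrative, not from the text:

```python
import math
from statistics import mean, stdev

def paired_t(x, y):
    """Paired t-test on difference scores d = x - y."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)                       # number of pairs
    se = stdev(d) / math.sqrt(n)     # standard error of the difference scores
    return mean(d) / se, n - 1       # t-ratio and df = n - 1

# Illustrative paired scores (e.g., two conditions for the same subjects)
t, df = paired_t([5, 6, 9], [3, 4, 5])
```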

Suppose we set up a study to test the effect of using a lumbar support pillow on angular position of the pelvis in relaxed sitting (Figure 19.4). We hypothesize that pelvic tilt will change with use of a support pillow (a nondirectional hypothesis). We test eight subjects, each one sitting relaxed in a straight-back chair with and without the pillow (in random order). The angle of the pelvic tilt is measured using a flexible ruler, with measurements transformed to degrees.

Because each subject is measured under both experimental conditions, this is a repeated measures design, testing the hypothesis *H*_{0}: *μ*_{1} = *μ*_{2} against *H*_{1}: *μ*_{1} ≠ *μ*_{2}, where means represent repeated conditions. These hypotheses may also be expressed in terms of difference scores: *H*_{0}: *d̄* = 0 and *H*_{1}: *d̄* ≠ 0.

Hypothetical data are reported in Table 19.3A. A difference score, *d*, is calculated for each pair of scores. By substituting values in Equation (19.6), we obtain *t* = –1.532 (see Table 19.3B and D).

The absolute value of the calculated ratio is compared with a critical value, in this case for a two-tailed test with *n* – 1 = 7 *df*. Using Table A.2, we find

*t*_{(7)} = 2.365

Because the calculated value is less than the critical value, these conditions are not considered significantly different. The output in Table 19.3D➍ shows that *t* is significant at *p* = .169. Because this is higher than .05, *H*_{0} is not rejected (see Table 19.3C).

For the paired *t*-test, a confidence interval is obtained using the formula:

*d̄* ± (*t*)(*s*_{d̄})

Therefore,

–8.58 ≤ *μ*_{1} – *μ*_{2} ≤ 1.84

We are 95% confident that the true difference in pelvic angle between the pillow and nonpillow conditions is between –8.58 and 1.84 degrees (see Table 19.3D➎). However, because zero is contained within this interval, these means are not significantly different.

The *t*-test is one of the most commonly applied statistical tests. Unfortunately, it is also one of the most misused.^{4} The sole purpose of the *t*-test is to compare two means. Therefore, when more than two means are analyzed within a single sample, the *t*-test is inappropriate. For instance, if we wanted to compare three types of exercise, it would be incorrect to use the *t*-test because the analysis involves three comparisons within one sample.

The problem with using multiple *t*-tests within one set of data is that the more comparisons one makes, the more likely one is to commit a Type I error, that is, to find a significant difference when none exists. Remember that *α* is the probability of committing a Type I error for any single comparison. At *α* = .05, there is a 5% chance we will be in error if we say that group means are different. Although it is true that *α* = .05 for each individual comparison, the potential *cumulative error* in a set of comparisons is actually greater than .05. Consider the interpretation that, for *α* = .05, if we were to repeat a study 100 times when *no difference really existed*, we could expect to find a significant difference five times, as a random event, just by chance. Five percent of our conclusions could be in error. For any one comparison, however, we cannot know if a significant finding represents one of the potentially correct or incorrect decisions. Theoretically, any one test could be in error.

Consider a different random event, such as a strike of lightning.^{5} Suppose you had to cross a wide open field during a lightning storm. There may be only a small risk of getting struck (perhaps 5 percent?)—but would you prefer to cross the field just once, or several times? Would you consider the risk greater if you crossed the field several times? This same logic applies to repeated *t*-tests. The more we repeat comparisons within a sample, the greater are our chances that one or more of those comparisons will result in a random event, a significant difference even when one does not exist.
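The cumulative risk across several independent comparisons can be approximated as 1 – (1 – *α*)^{k}, where *k* is the number of tests; a quick sketch of how fast this grows:

```python
alpha = 0.05
for k in (1, 3, 10):
    # Probability of at least one false positive across k independent tests
    fwer = 1 - (1 - alpha) ** k
    print(k, round(fwer, 3))
# 1 0.05
# 3 0.143
# 10 0.401
```

With just three comparisons, the chance of at least one spurious "significant" finding nearly triples, which is why the analysis of variance and multiple comparison procedures described next are preferred.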

This problem can be avoided by using the more appropriate *analysis of variance (ANOVA)*, which is a logical extension of the *t*-test specifically designed to compare more than two means. As an adjunct to the analysis of variance, *multiple comparison procedures* have been developed that control the Type I error rate, allowing valid interpretations of several comparisons at the desired *α* level. These procedures are discussed in Chapters 20 and 21.

Researchers in many disciplines, epidemiologists and biostatisticians foremost among them, have become disenchanted with the overemphasis placed on reporting *p* values in research literature.^{4} In an effort to make hypothesis testing more meaningful, investigators in these disciplines have relied on the confidence interval as a more practical estimate of a population's characteristics.^{6} As we have shown, the outcomes of hypothesis testing using either confidence intervals or *t*-tests will be the same; however, the confidence interval gives the researcher information not provided by the *t*-test. Rather than just indicating if two means are significantly different, the confidence interval essentially estimates true effect size; that is, it estimates how large a difference can be expected in the population. This information can then be used for evaluating the results of assessments and for framing practice decisions.

Confidence intervals may be more clinically useful than relying on probability values when the magnitude of differences is relevant to clinical decision making and prediction of normal or abnormal responses. Like *p* values, however, confidence intervals do not tell us about the importance of the observed effect. That remains a matter of clinical judgment.

*Design and Analysis: A Researcher's Handbook* (4th ed.). Englewood Cliffs, NJ: Prentice Hall, 2004.

*Rev Educ Res* 1972;42:237–288.