The purpose of an experimental design is to provide a structure for evaluating the cause-and-effect relationship between a set of independent and dependent variables. Within the design, the researcher manipulates the levels of the independent variable and incorporates elements of control, so that the evidence supporting a causal relationship can be interpreted with confidence.

Although experimental designs can take on a wide variety of configurations, the important principles can be illustrated using a few basic structures. The purpose of this chapter is to present these basic designs and to illustrate the types of research situations for which they are most appropriate. For each design, we discuss strengths and weaknesses in terms of experimental control and internal and external validity. In addition, we include a short statement suggesting general statistical procedures for analysis. These suggestions do not represent all statistical options for a particular design, but they do represent the more commonly used techniques. This information demonstrates the intrinsic relationship between analysis and design.

The term **clinical trial** is often used to describe experimental studies that examine the effect of interventions on patient or community populations. Clinical trials are frequently designed on a large scale, involving subjects from a range of geographic areas or from several treatment centers. Clinical trials can be classified as either therapeutic or preventive. *Therapeutic trials* examine the effect of a treatment or intervention on a particular disease. For example, 25 years of clinical trials, begun in the 1970s, have shown that radical mastectomy is not necessary for reducing the risk of recurrence or spread of breast cancer, and that limited resection can be equally effective in terms of recurrence and mortality.^{1} A *preventive trial* evaluates whether a procedure or agent reduces the risk of developing a disease. One of the most famous preventive trials was the field study of poliomyelitis vaccine in 1954, which covered 11 states.^{2} The incidence of poliomyelitis in the vaccinated group was over 50 percent less than among those children who received the placebo, establishing strong evidence of the vaccine's effectiveness. In a more contemporary example, scientists continue to design trials in an effort to develop a vaccine to prevent HIV infection.^{3}

In the investigation of new therapies, including drugs, surgical procedures and electromechanical devices, distinct sequences of clinical trials are typically carried out. The phases of trials are intended to provide different types of information about the treatment in relation to dosage, safety and efficacy, with increasingly greater rigor in demonstrating the intervention's effectiveness and safety (see Box 10.1).

Experimental designs can be described according to several types of design characteristics. A basic distinction among them is the degree of experimental control.^{4,5} In a **true experimental design**, subjects are randomly assigned to at least two comparison groups. An experiment is theoretically able to exert control over most threats to internal validity, providing the strongest evidence for causal relationships. The **randomized controlled trial (RCT)** is considered the gold standard of true experimental designs.

A **quasi-experimental design** does not meet the requirements of a true experiment, lacking random assignment or comparison groups, or both. Even though quasi-experimental designs cannot rule out threats to internal validity with the same confidence as experimental designs, many such designs are appropriate when stronger designs are not feasible. Quasi-experimental designs represent an important contribution to clinical research because they accommodate the limitations of natural settings, where scheduling treatment conditions and random assignment are often difficult, impractical or unethical. These designs will be covered in Chapter 11.

Experimental designs may be differentiated according to how subjects are assigned to groups. In **completely randomized designs**, also referred to as **between-subjects designs**, subjects are assigned to independent groups using a randomization procedure. In a **randomized block design** subjects are first classified according to an attribute variable (a blocking variable) and then randomized to treatment groups. A design in which subjects act as their own control is called a **within-subjects design** or a **repeated measures design**.
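The completely randomized assignment described above can be sketched in code. This is a minimal illustration, not a standard library routine: the helper `randomize_to_groups` is a hypothetical function written for this example.

```python
import random

def randomize_to_groups(subject_ids, n_groups, seed=None):
    """Shuffle subjects, then deal them into independent groups.

    Round-robin dealing keeps group sizes within one of each other.
    """
    rng = random.Random(seed)
    ids = list(subject_ids)
    rng.shuffle(ids)                                # random order
    groups = [ids[i::n_groups] for i in range(n_groups)]
    return groups

# 30 subjects dealt into two independent treatment arms
groups = randomize_to_groups(range(1, 31), 2, seed=7)
print([len(g) for g in groups])  # two groups of 15
```

In a randomized block design, this same shuffle would simply be applied separately within each block of subjects.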

These designs can also be described according to the number of independent variables, or *factors*, within the design. *Single-factor designs* have one independent variable with any number of levels. *Multi-factor designs* contain two or more independent variables.

Once a research question is formulated, the researcher must decide on the most effective design for answering it. Although experimental designs represent the highest standard in scientific inquiry, they are not necessarily the best choice in every situation. When the independent variable cannot be manipulated by the experimenter, or when important extraneous factors cannot be controlled, an observational or exploratory design may be more useful (see Chapter 13).

When an experimental design is deemed appropriate, the choice of a specific design will depend on the answers to six critical questions about how the study is conceptualized:

How many independent variables are being tested?

How many levels does each independent variable have, and are these levels experimental or control conditions?

How many groups of subjects are being tested?

How will subjects be assigned to groups?

How often will observations of responses be made?

What is the temporal sequence of interventions and measurements?

When each of these issues is considered, the range of potential designs will usually be narrowed to one or two appropriate choices. As specific designs are presented, these questions will be addressed within the context of research questions from the literature.

A single-factor design, also called a **one-way design**, is used to structure the investigation of one independent variable. The study may include one or more dependent variables.

The **pretest-posttest control group design** is the basic structure of a randomized controlled trial. It is used to compare two or more groups that are formed by random assignment. One group receives the experimental variable and the other acts as a control. These independent groups are also called **treatment arms** of the study. Both groups are tested prior to and following treatment. The groups differ solely on the basis of what occurs between measurements. Therefore, changes from pretest to posttest that appear in the experimental group but not the control group can be reasonably attributed to the intervention. This design is considered the scientific standard in clinical research for establishing a cause-and-effect relationship.

The pretest-posttest control group design can be configured in several ways. Figure 10.1 illustrates the simplest configuration, with one experimental group and one control group.

Researchers conducted a randomized controlled trial to study the effect of a supervised exercise program for improving venous hemodynamics in patients with chronic venous insufficiency.^{6} They randomly assigned 31 patients to two groups. The experimental group received physical therapy with specific exercises for calf strengthening and joint mobility. The control group received no exercise intervention. Both groups received compression hosiery. Dynamic strength, calf pump function and quality of life were assessed at baseline and after 6 months of exercise.

Measurements for the control group are taken within intervals that match those of the experimental group. The independent variable has two levels, in this case exercise intervention and control. The absence of an experimental intervention in the control group is considered a level of the independent variable. As this example illustrates, a study may have several dependent variables that are measured at pretest and posttest.

The pretest-posttest design can also be used when the comparison group receives a second form of the intervention. The *two-group pretest-posttest design* (see Figure 10.2) incorporates two experimental groups formed by random assignment.

Researchers conducted a randomized controlled trial to study the effect of semantic treatment on verbal communication in patients who experienced aphasia following a stroke.^{7} They randomly assigned 58 patients to two groups. Speech therapists provided semantic treatment to the experimental group. The control group received speech therapy focused on word sounds. Verbal communication was assessed using the Amsterdam Nijmegen Everyday Language Test. Both groups were assessed at the start of the study and following 7 months of treatment.

Researchers use this approach when a control condition is not feasible or ethical, often comparing a "new" treatment with an "old" standard or alternative treatment. Even though there is no traditional control group, this design provides experimental control because we can establish initial equivalence between groups formed by random assignment. In this example, the word sound group acts as a control for the semantic treatment group and vice versa. If one group improves more than the other, we can attribute that difference to the fact that one treatment was more effective. This design is appropriate when the research question specifically addresses interest in a difference between two treatments, but it does not allow the researcher to show that treatment works better than no intervention.

The *multigroup pretest-posttest control group design* (see Figure 10.3) allows researchers to compare several treatment and control conditions.

Researchers wanted to determine the effectiveness of aquatic and on-land exercise programs on functional fitness and activities of daily living (ADLs) in older adults with arthritis.^{8} Participants were 30 volunteers, randomly assigned to aquatic exercise, on-land exercise or a control group. The control group was asked to refrain from any new physical activity for the duration of the study. Outcomes included fitness and strength measures, and functional assessments before and after an 8-week exercise program.

As these examples illustrate, the pretest-posttest control group design can be expanded to accommodate any number of levels of one independent variable, with or without a traditional control group. This design is strong in internal validity. Pretest scores provide a basis for establishing initial equivalence of groups, strengthening the evidence for causal factors. Selection bias is controlled because subjects are randomly assigned to groups. History, maturation, testing, and instrumentation effects should affect all groups equally in both the pretest and posttest. The only threat to internal validity that is not controlled by this design is attrition.

The primary threat to external validity in the pretest-posttest control group design is the potential interaction of treatment and testing. Because subjects are given a pretest, there may be reactive effects, which would not be present in situations where a pretest is not given.

**Analysis of Pretest-Posttest Designs.** Pretest-posttest designs are often analyzed using change scores, which represent the difference between the posttest and pretest.^{*} With interval-ratio data, change scores are usually compared using an unpaired *t*-test (with two groups) or a one-way analysis of variance (with three or more groups). With ordinal data, the Mann-Whitney *U*-test can be used to compare two groups, and the Kruskal-Wallis analysis of variance by ranks is used to compare three or more groups. The analysis of covariance can be used to compare posttest scores, using the pretest score as the covariate. The design can also be analyzed as a two-factor design, using a two-way analysis of variance with one repeated factor, with treatment as one independent variable and time (pretest and posttest) as the second (repeated) factor. Discriminant analysis can also be used to distinguish between groups with multiple outcome measures.

^{*}See discussion about the reliability of change scores in Chapter 6.
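The change-score comparison described above can be sketched as follows. The scores are invented for illustration and are not taken from any of the studies cited.

```python
import numpy as np
from scipy import stats

# Hypothetical pretest and posttest scores for two randomized arms
pre_exp  = np.array([52, 48, 55, 60, 47, 51, 58, 49])
post_exp = np.array([61, 57, 63, 70, 55, 60, 66, 58])
pre_ctl  = np.array([50, 53, 49, 57, 46, 52, 59, 48])
post_ctl = np.array([52, 54, 50, 58, 47, 53, 60, 50])

# Change scores: posttest minus pretest for each subject
change_exp = post_exp - pre_exp
change_ctl = post_ctl - pre_ctl

# Unpaired t-test comparing change scores across the two arms
t, p = stats.ttest_ind(change_exp, change_ctl)
print(f"t = {t:.2f}, p = {p:.4f}")
```

With three or more groups, `stats.f_oneway` would replace the *t*-test; with ordinal data, `stats.mannwhitneyu` would be the analogous comparison.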

The **posttest-only control group design** (see Figure 10.4) is identical to the pretest-posttest control group design, with the obvious exception that a pretest is not administered to either group.

A study was designed to test the hypothesis that high-risk patients undergoing elective hip and knee arthroplasty would incur less total cost and shorter length of stay if inpatient rehabilitation began on postoperative day 3 rather than day 7.^{9} Eighty-six patients who were older than 70 years were randomly assigned to begin rehabilitation on day 3 or day 7. The main outcome measures were total length of stay and cost from orthopedic and rehabilitation admissions.

In this study of hospital cost and length of stay, the dependent variables can only be assessed following the treatment condition. This design is a true experimental design which, like the pretest-posttest design, can be expanded to include multiple levels of the independent variable, with a control, placebo or alternative treatment group.

Because this design involves random assignment and comparison groups, its internal validity is strong, even without a pretest; that is, we can assume groups are equivalent prior to treatment. Because there is no pretest score to document the results of randomization, however, this design is most successful when the number of subjects is large, so that the probability of truly balancing subject characteristics across the groups is increased.

The posttest-only design can also be used when a pretest is either impractical or potentially reactive. For instance, to study the attitudes of health care personnel toward patients with AIDS, we might use a survey instrument that asked questions about attitudes and experience with this population. By using this instrument as a pretest, subjects might be sensitized in a way that would influence their scores on a subsequent posttest. The posttest-only design avoids this form of bias, increasing the external validity of the study.

**Analysis of Posttest-Only Designs.** With two groups, an unpaired *t*-test is used with interval-ratio data, and a Mann-Whitney *U*-test with ordinal data. With more than two groups, a one-way analysis of variance or the Kruskal-Wallis analysis of variance by ranks should be used to compare posttest scores. An analysis of covariance can be used when covariate data on relevant extraneous variables are available. Regression or discriminant analysis procedures can also be applied.
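The multigroup comparisons named above can be sketched with hypothetical posttest scores for three randomized arms; the values below are illustrative only.

```python
from scipy import stats

# Hypothetical posttest scores, one list per randomized arm
arm_a = [23, 25, 21, 27, 24, 26]
arm_b = [30, 28, 33, 31, 29, 32]
arm_c = [22, 24, 23, 21, 25, 22]

# Interval-ratio data: one-way analysis of variance
f, p_anova = stats.f_oneway(arm_a, arm_b, arm_c)

# Ordinal data: Kruskal-Wallis analysis of variance by ranks
h, p_kw = stats.kruskal(arm_a, arm_b, arm_c)

print(f"ANOVA: F = {f:.2f}, p = {p_anova:.4f}")
print(f"Kruskal-Wallis: H = {h:.2f}, p = {p_kw:.4f}")
```

A significant result in either test indicates only that at least one arm differs; pairwise follow-up comparisons would identify which ones.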

The designs presented thus far have involved the testing of one independent variable, with two or more levels. Although easy to develop, these single-factor designs tend to impose an artificial simplicity on most clinical and behavioral phenomena; that is, they do not account for simultaneous and often complex interactions of several variables within clinical situations. Interactions are generally important for developing a theoretical understanding of behavior and for establishing the construct validity of clinical variables. Interactions may reflect the combined influence of several treatments or the effect of several attribute variables on the success of a particular treatment.

A *factorial design* incorporates two or more independent variables, with independent groups of subjects randomly assigned to various combinations of levels of the two variables. Although such designs can theoretically be expanded to include any number of variables, clinical studies usually involve two or three at most. As the number of independent variables increases, so does the number of experimental groups, creating the need for larger and larger samples, which are typically impractical in clinical situations.

Factorial designs are described according to their dimensions or number of factors, so that a *two-way* or two-factor design has two independent variables, a *three-way* or three-factor design has three independent variables, and so on. These designs can also be described by the number of levels within each factor, so that a 3 × 3 design includes two variables, each with three levels, and a 2 × 3 × 4 design includes three variables, with two, three and four levels, respectively.

A factorial design is diagrammed using a matrix notation that indicates how groups are formed relative to levels of each independent variable. Uppercase letters, typically *A*, *B* and *C*, are used to label the independent variables and their levels. For instance, with two independent variables, *A* and *B,* we can designate three levels for the first one (*A*_{1}, *A*_{2} and *A*_{3}) and two levels for the second (*B*_{1}, *B*_{2}).

The number of groups is the product of the digits that define the design. For example, 3 × 3 = 9 groups; 2 × 3 × 4 = 24 groups. Each cell of the matrix represents a unique combination of levels. In this type of diagram there is no indication if measurements within a cell include pretest-posttest scores or posttest scores only. This detail is generally described in words.
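The cell count can be verified by enumerating the crossed levels directly:

```python
from itertools import product

# A 3 x 3 design: two factors, three levels each
cells = list(product(["A1", "A2", "A3"], ["B1", "B2", "B3"]))
print(len(cells))  # 9 groups, one per unique combination of levels

# A 2 x 3 x 4 design: three factors with two, three and four levels
n_groups = len(list(product(range(2), range(3), range(4))))
print(n_groups)  # 24 groups
```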

**Two-Way Factorial Design.** A *two-way factorial design* (see Figure 10.5) incorporates two independent variables, *A* and *B.*

Researchers were interested in studying the effect of intensity and location of exercise programs on the self-efficacy of sedentary women.^{10} Using a 2 × 2 factorial design, subjects were randomly assigned to one of four groups, receiving a combination of moderate or vigorous exercise at home or a community center. The change in their exercise behavior and their self-efficacy in maintaining their exercise program was monitored over 18 months.

In this example, the two independent variables are intensity of exercise (*A*) and location of exercise (*B*), each with two levels (2 × 2). One group (*A*_{1}*B*_{1}) will engage in moderate exercise at home. A second group (*A*_{2}*B*_{1}) will engage in vigorous exercise at home. The third group (*A*_{1}*B*_{2}) will engage in moderate exercise at a community center. And the fourth group (*A*_{2}*B*_{2}) will engage in vigorous exercise at a community center. The two independent variables are *completely crossed* in this design, which means that every level of one factor is represented at every level of the other factor. Each of the four groups represents a unique combination of the levels of these variables, as shown in the individual cells of the diagram in Figure 10.5A. For example, using random assignment with a sample of 60 patients, we would assign 15 subjects to each group.

This design allows us to ask three questions of the data: (1) Is there a differential effect of moderate versus vigorous exercise? (2) Is there a differential effect of exercising at home or a community center? (3) What is the interaction between intensity and location of exercise? The answers to the first two questions are obtained by examining the **main effect** of each independent variable, with scores collapsed across the second independent variable, as shown in Figure 10.5B. This means that we can look at the overall effect of intensity of exercise without taking into account any differential effect of location. Therefore, we would have 30 subjects representing each intensity. The main effect of location is also analyzed without differentiating intensity. Each main effect is essentially a single-factor experiment.

The third question addresses the **interaction effect** between the two independent variables. This question represents the essential difference between single-factor and multifactor experiments. Interaction occurs when the effect of one variable varies at different levels of the second variable. For example, we might find that moderate exercise intensity is more effective in changing exercise behavior, but only when performed at a community center.
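Main effects and the interaction can be computed directly from the four cell means. The values below are invented to illustrate the arithmetic, not results from the study cited.

```python
import numpy as np

# Hypothetical mean change in exercise behavior for each cell:
#   rows    = intensity (moderate, vigorous)
#   columns = location  (home, community center)
means = np.array([[4.0, 9.0],
                  [5.0, 5.5]])

# Main effects: collapse (average) across the other factor
main_intensity = means.mean(axis=1)   # moderate vs vigorous, ignoring location
main_location  = means.mean(axis=0)   # home vs center, ignoring intensity

# Interaction: does the effect of location differ across intensities?
# (difference of differences across the rows of the 2 x 2 table)
interaction = (means[0, 1] - means[0, 0]) - (means[1, 1] - means[1, 0])

print(main_intensity)   # row means
print(main_location)    # column means
print(interaction)      # nonzero -> location effect depends on intensity
```

Here the location effect is large for moderate exercise (9.0 vs 4.0) but small for vigorous exercise (5.5 vs 5.0), so the main effects alone would misrepresent the results; the interaction term captures this difference.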

This example illustrates the major advantage of the factorial approach, which is that it gives the researcher important information that could not be obtained with any one single-factor experiment. The ability to examine interactions greatly enhances the generalizability of results.

**Three-Way Factorial Design.** Factorial designs can be extended to include more than two independent variables. In a *three-way factorial design* (see Figure 10.6), the relationship among variables can be conceptualized in a three-dimensional format. We can also think of it as a two-way design crossed on a third factor.

For example, we could expand the exercise study shown in Figure 10.5 to include a third variable such as frequency of exercise. We would then evaluate the simultaneous effect of intensity, location and frequency of exercise. We could assign subjects to exercise 1 day or 3 days per week. Then we would have a 2 × 2 × 2 design, with subjects assigned to one of 8 independent groups (see Figure 10.6).

A three-way design allows several types of comparisons. First, we can examine the main effect for each of the three independent variables, collapsing data across the other two. We can examine the difference between the two intensities, regardless of the effect of location or frequency. We can test the difference between the two locations, regardless of intensity or frequency. And we can evaluate the effect of frequency of exercise, regardless of intensity or location. Each of the three main effects essentially represents a single-factor study for that variable.

Then we can examine three *double interactions:* intensity × location, intensity × frequency, and location × frequency. For example, the interaction between intensity and location is obtained by collapsing data across the two levels of frequency of exercise. Each double interaction represents a two-way design. Finally, we can examine the *triple interaction* of intensity, location and frequency. This interaction involves analyzing the differences among all 8 cells.

Many clinical questions have the potential for involving more than one independent variable, because response variables can be influenced by a multitude of factors. In this respect, the compelling advantage of multidimensional factorial designs is their closer approximation to the "real world." As more variables are added to the design, we can begin to understand responses within a more realistic, multivariable context, increasing the construct validity of our conclusions. The major disadvantages, however, are that the sample must be extremely large to create individual groups of sufficient size and that data analysis can become cumbersome.

**Analysis of Factorial Designs.** A two-way or three-way analysis of variance is most commonly used to examine the main effects and interaction effects of a factorial design.

When a researcher is concerned that an extraneous factor might influence differences between groups, one way to control for this effect is to build the variable into the design as an independent variable. The **randomized block design** (see Figure 10.7) is used when an attribute variable, or *blocking variable*, is crossed with an active independent variable; that is, homogeneous *blocks* of subjects are randomly assigned to levels of a manipulated treatment variable. In the following example, we have a 2 × 3 randomized block design, with a total of 6 groups.

A study was performed to assess the action of an antiarrhythmic agent in healthy men and women after a single intravenous dose.^{11} Researchers wanted to determine if effects were related to dose and gender. Twenty-four subjects were recruited, 12 men and 12 women. Each gender group was randomly assigned to receive 0.5, 1.5 or 3.0 mg/kg of the drug, infused over 2 minutes. Therefore, 4 men and 4 women received each dose. Through blood tests, volume of distribution of the drug at steady state was assessed before and 72 hours after drug administration. The change in values was analyzed across the six study groups.
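The assignment scheme in this example, randomizing within each gender block, can be sketched as follows. The helper function is hypothetical, written for illustration.

```python
import random

def block_randomize(block, treatments, per_group, seed=None):
    """Randomly assign the subjects in one block across treatments,
    with per_group subjects in each treatment arm."""
    rng = random.Random(seed)
    subjects = list(block)
    rng.shuffle(subjects)
    assignment = {}
    for i, t in enumerate(treatments):
        for s in subjects[i * per_group:(i + 1) * per_group]:
            assignment[s] = t
    return assignment

doses = ["0.5 mg/kg", "1.5 mg/kg", "3.0 mg/kg"]

# Randomize separately within each gender block: 12 men, 12 women
men   = block_randomize([f"M{i}" for i in range(1, 13)], doses, 4, seed=1)
women = block_randomize([f"W{i}" for i in range(1, 13)], doses, 4, seed=2)
# Each dose ends up with exactly 4 men and 4 women.
```

Because the shuffle is done within each block, gender is guaranteed to be balanced across doses rather than balanced only on average.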

In studying the drug's effect, the researchers were concerned that men and women would respond differently. We can account for this potential effect by using gender as an independent variable. We can then assume that responses will not be confounded by gender.

We can think of this randomized block design as two single-factor randomized experiments, with each block representing a different subpopulation. Subjects are grouped by blocks (gender), and then random assignment is made within each block to the treatment conditions. When the design is analyzed, we will be able to examine possible interaction effects between the treatment conditions and blocks. When this interaction is significant, we will know that the effects of treatment do not generalize across the block classifications, in this case across genders. If the interaction is not significant, we have achieved a certain degree of generalizability of the results.

For the randomized block design to be used effectively, the blocking factor must be related to the dependent variable; that is, it must be a factor that affects how subjects will respond to treatment. If the blocking factor is not related to the response, then using it as an independent variable provides no additional control to the design, and can actually yield less precision than simple random assignment would have. Randomized block designs can involve more than two independent variables, with one or more blocking variables.

Generalization of results from a randomized block design will be limited by the definition of blocks. For example, classification variables, such as gender or diagnosis, are often used as blocking variables. The number of levels of such variables is inherent in the classification itself. When the blocking factor is a quantitative variable, however, such as age, two important decisions must be made. First, the researcher must determine the range of ages to be used. Second, the number and distribution of blocks must be determined. Generally, it is best to use equally spaced levels with a relatively equal number of subjects at each level. If the researcher is interested in trends within a quantitative variable, three or more levels should be used to describe a pattern of change. For instance, if four age groups are delineated, we would have a clearer picture of the trends that occur with age than if only two levels were used.

**Analysis of Randomized Block Designs.** Data from a randomized block design can be analyzed using a two-way analysis of variance, multiple regression or discriminant analysis.

To this point, we have described multifactor designs in terms of two or three independent variables that are completely crossed; that is, all levels of variable A have occurred within all levels of variable B. This approach does not fit all multifactor analyses, however, when attribute variables are involved. Sometimes attribute variables cannot be crossed with all levels of other variables. Consider the following example.

An occupational therapist was interested in studying an intervention to facilitate motivational behaviors in individuals with psychiatric illness who had motivational deficits.^{12} The intervention was based on strategies of autonomy support. Patients were randomly assigned to either an experimental or control group, and to 1 of 10 different therapists who carried out the treatments.

To study the effectiveness of the intervention, scores would be compared across 10 "therapists," each providing either the experimental treatment or control condition. If we used a traditional two-way design (10 × 2), all 10 levels of therapists would be crossed with both levels of treatment. This would allow the researcher to look at the main effect of therapists, to determine if differences were due to their application of the treatment. If a significant interaction occurred between therapist and treatment, it would mean that the effectiveness of intervention was dependent on which therapist provided it.

If we wanted to follow up on this interaction, we might suspect that less experienced therapists provided a different quality of intervention than more experienced therapists. To test this, we could divide our sample of therapists into two groups based on their years of experience: "less experienced" and "more experienced." This introduces a third independent variable, experience, with two levels. But these two levels cannot be crossed with the 10 levels of therapists; that is, the same therapist cannot appear in both experience groups. Therefore, "therapists" are *nested* within "experience." All levels of therapist and experience can be crossed with the two treatment conditions in this **nested design** (see Figure 10.8). Although this resembles a three-way randomized block design, it must be analyzed differently because the interactions of therapist × experience and therapist × experience × treatment cannot be assessed.

Most variables in clinical studies can be completely crossed; however, with certain combinations of attribute variables, a nested arrangement is required. Nesting is commonly used in educational studies where classes are nested in schools or schools are nested in cities. For instance, Edmundson and associates studied an educational program to reduce risk factors for cardiovascular disease.^{13} They evaluated the effect of the program on 6,000 students from 96 schools in four states. The schools were nested in states. Within each state the schools were randomly assigned to receive the program or a control condition.

**Analysis of Nested Designs.** An analysis of variance is used to test for main effects and relevant interactions. The dimensions of that analysis depend on how many variables are involved in the study. Nested designs require a complicated approach to analysis of variance, which goes beyond the scope of this book. See Keppel for discussion of analysis of nested designs.^{14 (pp. 550-565)}

All of the experimental designs we have considered so far have involved at least two independent groups, created by random assignment or blocking. There are many research questions, however, for which control can be substantially increased by using a **repeated measures design**, where one group of subjects is tested under all conditions and each subject acts as his or her own control. Conceptually, a repeated measures design can be considered a series of trials, each with a single subject. Therefore, such a design is also called a **within-subjects design**, because treatment effects are associated with differences observed within a subject across treatment conditions, rather than between subjects across randomized groups.

The major advantage of the repeated measures design is the ability to control for the potential influence of individual differences. It is a fairly safe assumption that important subject characteristics, such as age, sex, motivation and intelligence, will remain constant throughout the course of an experiment. Therefore, differences observed among treatment conditions are more likely to reflect treatment effects, and not variability between subjects. Using subjects as their own control provides the most equivalent "comparison group" possible.

One disadvantage of the repeated measures approach is the potential for *practice effects*, or the learning effect that can take place when one individual repeats a task over and over. Another disadvantage is the potential for *carryover effects* when one subject is exposed to multiple-treatment conditions. Carryover can be reduced by allotting sufficient time between successive treatment conditions to allow for complete dissipation of previous effects. For instance, if we study the effect of different forms of heat on intramuscular temperature to relieve pain, we may need to repeat testing on different days to be sure that tissues have returned to resting temperatures. We would also have to be assured that the patient's pain level was constant across these days.

Therefore, repeated measures can be used only when the outcome measure will revert to baseline between interventions, and the patient problem will remain relatively stable throughout the study period. There are many treatments for which carryover cannot be eliminated. For example, if we evaluate the effects of different exercise programs for increasing strength over a 4-week period, the effects of each exercise regimen will probably be long lasting, and rest periods will be ineffective for reversing the effect. With variables that produce permanent or long-term physiological or psychological effects, repeated measures designs are not appropriate.

Because repeated measures designs do not incorporate randomized comparison groups, they may not qualify as true experiments. However, they may be considered experiments when they incorporate randomization in the order of application of repeated conditions, and the comparison of one condition or intervention to another within one subject.

The simplest form of repeated measures design involves a single-factor experiment, where one group of subjects is exposed to all levels of one independent variable (see Figure 10.9).

Researchers were interested in the effect of using a cane on the intramuscular forces on prosthetic hip implants during walking.^{15} They studied 24 subjects with unilateral prosthetic hips under three conditions: walking with a cane on the side contralateral to the prosthesis, on the same side as the prosthesis, and on the contralateral side with instructions to push with "near maximal effort." They monitored electromyographic (EMG) activity of hip abductor muscles and cane force under each condition. The order of testing under the three test conditions was randomly assigned.

For the study of cane use, the researchers wanted to examine EMG activity of the hip abductor muscles, with all subjects exposed to all three cane conditions. It would be possible to use a randomized design to investigate this question, by assigning different groups to each condition, but doing so would sacrifice the control over individual differences that makes the comparison meaningful. By using a repeated measures format we can be assured that differences across conditions are a function of cane use, and not individual physiological differences. In this example, the independent variable does not present a problem of carryover that would preclude each subject's participating at all three levels. This design is commonly referred to as a *one-way repeated measures design*.

**Order Effects.** Because subjects are exposed to multiple treatment conditions in a repeated measures design, there must be some concern about the potentially biasing effect of test sequence; that is, the researcher must determine whether responses might depend on which condition precedes another. Effects such as fatigue, learning or carryover may influence responses if subjects are all tested in the same order.

One solution to the problem of **order effects** is to randomize the order of presentation for each subject, often by the flip of a coin, so that there is no bias involved in choosing the order of testing. In the study of cane use, for example, the researchers were concerned that the subjects' responses could be affected if one condition was always tested first. This approach does theoretically control for order effects; however, there is still a chance that some sequences will be repeated more often than others, especially if the sample size is small. This design is sometimes considered a randomized block design, with the specific sequences serving as blocks.
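As a minimal sketch of this first solution, each subject can simply receive an independently shuffled order of the conditions. The function name and condition labels below are illustrative, not taken from the study:

```python
import random

def randomized_orders(conditions, n_subjects, seed=None):
    """Give every subject an independently shuffled test order."""
    rng = random.Random(seed)
    return [rng.sample(conditions, len(conditions)) for _ in range(n_subjects)]

# Three hypothetical cane conditions; with a small sample, note how
# easily some sequences can repeat while others never occur.
for subject, order in enumerate(randomized_orders(["A", "B", "C"], 5, seed=7), 1):
    print(subject, order)
```

Because each shuffle is independent, nothing guarantees that the sequences are balanced, which is exactly the small-sample limitation described above.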

A second solution utilizes a **Latin Square**, which is a matrix composed of equal numbers of rows and columns, designating random permutations of sequence combinations.^{†} For example, in the cane study, if we had 30 subjects, we could assign 10 subjects to each of three sequences, as shown in Figure 10.10. Using random assignment, we would determine which group would get each sequence, and then assign each testing condition to A, B or C.
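A Latin Square can be sketched in code as well. The construction below starts from the standard cyclic square and then randomly shuffles rows, columns and condition labels; this is one common way to generate such a square, not the specific method of the cane study:

```python
import random

def latin_square(conditions, seed=None):
    """Build a Latin square of sequences: each condition appears exactly
    once in every row (sequence) and every column (test position)."""
    rng = random.Random(seed)
    n = len(conditions)
    # Standard cyclic square: entry (r, c) is (r + c) mod n.
    square = [[(r + c) % n for c in range(n)] for r in range(n)]
    # Randomize rows, columns, and condition labels.
    rng.shuffle(square)
    cols = list(range(n))
    rng.shuffle(cols)
    square = [[row[c] for c in cols] for row in square]
    labels = list(conditions)
    rng.shuffle(labels)
    return [[labels[i] for i in row] for row in square]

# Three cane conditions -> three sequences; with 30 subjects,
# 10 subjects would be randomly assigned to each row.
for sequence in latin_square(["A", "B", "C"], seed=1):
    print(sequence)
```

Each row is one sequence of conditions, and because every condition also appears once in each column, every condition is tested first, second and third equally often across the groups.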

**Analysis of One-Way Repeated Measures Designs.** The one-way analysis of variance for repeated measures is used to test for differences across levels of one repeated factor.
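To make the computation concrete, the sums of squares behind this analysis can be sketched in plain Python. The EMG scores and function name below are hypothetical, invented for illustration:

```python
from statistics import mean

def rm_anova_oneway(data):
    """One-way repeated measures ANOVA.
    `data` is a list of subjects, each a list of scores (one per condition).
    Returns (F, df_treatment, df_error)."""
    n = len(data)      # subjects
    k = len(data[0])   # levels of the repeated factor
    grand = mean(x for row in data for x in row)
    # Between-conditions (treatment) sum of squares
    cond_means = [mean(row[j] for row in data) for j in range(k)]
    ss_treat = n * sum((m - grand) ** 2 for m in cond_means)
    # Between-subjects sum of squares -- removed from the error term,
    # which is the statistical advantage of the repeated measures design.
    subj_means = [mean(row) for row in data]
    ss_subj = k * sum((m - grand) ** 2 for m in subj_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_treat - ss_subj
    df_treat, df_error = k - 1, (n - 1) * (k - 1)
    F = (ss_treat / df_treat) / (ss_error / df_error)
    return F, df_treat, df_error

# Hypothetical EMG scores (% max) for 4 subjects under 3 cane conditions
scores = [[40, 25, 18],
          [38, 28, 15],
          [45, 30, 22],
          [41, 24, 16]]
F, df1, df2 = rm_anova_oneway(scores)
print(f"F({df1},{df2}) = {F:.2f}")  # compare against the critical F value
```

Note how the variability attributable to stable subject differences (`ss_subj`) is partitioned out of the error term, so the F ratio is not inflated by variability between subjects.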

^{†}For examples of Latin Squares of different sizes, see Fisher RA, Yates F. *Statistical Tables for Biological, Agricultural and Medical Research*. Longman Group UK, Ltd, 1974.

When only two levels of an independent variable are repeated, a preferred method to control for order effects is to *counterbalance* the treatment conditions so that their order is systematically varied. This creates a **crossover design** in which half the subjects receive Treatment A followed by B, and half receive B followed by A. Two subgroups are created, one for each sequence, and subjects are randomly assigned to one of the sequences.

Researchers were interested in comparing the effects of prone and supine positions on stress responses in mechanically ventilated preterm infants.^{16} They randomly assigned 28 infants to a supine/prone or prone/supine position sequence. Infants were placed in each position for 2 hours. Stress signs, including startle, tremor, and twitch responses, were measured following each 2-hour period.

A crossover design should only be used in trials where the patient's condition or disease will not change appreciably over time. It is not a reasonable approach in situations where treatment effects are slow, as the treatment periods must be limited. It is similarly impractical where treatment effects are long term and a reversal is not likely. This design is especially useful, however, when treatment conditions are immediately reversible, as in the positioning of infants. When the treatment has some cumulative effect, however, a **washout period** is essential, allowing a common baseline for each treatment condition (see Figure 10.11). The washout period must be long enough to eliminate any prolonged effects of the treatment.

Researchers were interested in the effectiveness of a cranberry supplement for preventing urinary tract infections in persons with neurogenic bladders secondary to spinal cord injury.^{17} They treated 21 individuals, evaluating responses based on urinary bacterial counts and white blood cell counts. Subjects were randomly assigned to standardized 400-mg cranberry tablets or placebo 3 times a day for 4 weeks. After 4 weeks and an additional 1-week "washout period," participants were crossed over to the other group.

In this example, one week was considered sufficient time for the effects of the supplement to clear from the patient's system.

**Analysis of Crossover Designs.** In the analysis of a crossover design, researchers will usually group scores by treatment condition, regardless of the order in which the treatments were given. A paired *t*-test can then be used to compare change scores, or a two-way analysis of variance with two repeated measures can be used to compare pretest and posttest measures across both treatment conditions. The Wilcoxon signed-ranks test should be used to look at change scores when ordinal data are used. In some situations, the researcher may want to see if order did have an effect on responses, and subjects can be separated into independent groups based on sequence of testing. This analysis may include a two-way analysis of variance with one repeated measure, with sequence as an independent factor and treatment condition as a repeated measure.
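The first option, the paired *t*-test, can be computed directly because each subject contributes a score under both conditions. The infant stress scores below are invented for illustration, grouped by position rather than by test order as the text describes:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic: each subject contributes one score
    under each of the two repeated conditions."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    t = mean(d) / (stdev(d) / sqrt(n))
    return t, n - 1  # t statistic and degrees of freedom

# Hypothetical stress scores for 6 infants, grouped by position
supine = [12, 15, 11, 14, 13, 16]
prone  = [ 9, 12, 10, 11, 10, 13]
t, df = paired_t(supine, prone)
print(f"t({df}) = {t:.2f}")  # → t(5) = 8.00
```

The test is computed on the within-subject difference scores, which is what makes it appropriate for repeated (correlated) measures rather than independent groups.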

Repeated measures can also be applied to studies involving more than one independent variable (see Figure 10.12).

The use of back belts in industry is a subject of controversy. A study was designed to investigate the effect of back belts on oxygen consumption during lifting movements.^{18} To study this question, researchers recruited 15 healthy subjects who were fitted with a semi-rigid lumbosacral orthosis. Oxygen consumption was measured while subjects participated in 6-minute submaximal bouts of lifting a 10-kg load. Each subject performed squat and stoop lifting, with and without the orthosis, for a total of four lifting bouts, in random order.

In the study of back belts, researchers created a 2 × 2 design with two repeated measures: type of lift (squat or stoop) and wearing of the orthosis (yes or no). Each subject was exposed to four test conditions. This design can be expanded to include three independent variables.

**Analysis of Two-Way Repeated Measures Designs.** The two-way analysis of variance with two repeated measures is used to analyze differences across main effects and interaction effects.

A **mixed design** (see Figure 10.13) is created when a study incorporates two independent variables, one repeated across all subjects, and the other randomized to independent groups.

A study was designed to evaluate the effectiveness of a treatment program of stabilizing exercises for patients with pelvic girdle pain after pregnancy.^{19} The researchers based their design on the importance of activation of muscles for motor control and stability of the lumbopelvic region. Eighty women with pelvic girdle pain were assigned randomly to two treatment groups for 20 weeks. One group received physical therapy with a focus on specific stabilizing exercises. The other group received individualized physical therapy without specific stabilizing exercises. Assessments were administered by a blinded assessor at baseline, after intervention and at 1 year post partum. Main outcome measures were pain, functional status and quality of life.

In the comparison of the two exercise programs, subjects were randomly assigned to treatment groups. Each subject was tested three times (pretest and two posttests). The variable of exercise program is considered an *independent factor* because its levels have been randomly assigned, creating independent groups. The variable of time is a *repeated factor* because all subjects are exposed to its three levels. Therefore, this design is also called a *two-way design with one repeated measure*, or a 2 × 3 mixed design. This example illustrates a commonly used approach, where researchers want to establish if the effects of intervention are long lasting, and not just present immediately following completion of the program.

Mixed designs are often used with attribute variables. For instance, we could look at differences in pelvic girdle pain across three age groups. This would be a special case of a randomized block design, where subjects within a block act as their own controls. Mixed designs may incorporate more than two independent variables.

**Analysis of Mixed Designs.** A two-way analysis of variance with one repeated measure is used to analyze main effects and interaction effects with a two-way design with one repeated factor.

The **sequential clinical trial** is a special approach to the randomized clinical trial, which allows for continuous analysis of data as they become available, instead of waiting until the end of the experiment to compare groups. Results are accumulated as each subject is tested, so that the experiment can be stopped at any point as soon as the evidence is strong enough to determine a significant difference between treatments. Consequently, it is possible that a decision about treatment effectiveness can be made earlier than in a fixed sample study, leading to a substantial reduction in the total number of subjects needed to obtain valid statistical outcomes and avoiding unnecessary administration of inferior treatments. Sequential trials incorporate specially constructed charts that provide visual confirmation of statistical outcomes, without the use of formal statistical calculations.

The idea of sequential analysis was originally developed during World War II for military and industrial applications, and was for a time considered an official secret.^{20} Soon after, it was recognized as a useful model for medical research, particularly in clinical trials of pharmacological agents. Even though there are a few examples of its application in rehabilitation literature,^{21,22,23} sequential analysis remains a relatively unused technique in rehabilitation research. This is unfortunate because the sequential clinical trial is a convenient design that is applicable to many clinical research questions.

The specific purpose of a sequential trial is to compare two treatments, a "new" or experimental treatment (A) and an "old" or standard treatment (B). Treatment can also be compared with a control or placebo. The design is most often applied to independent samples, but may be used with repeated measures.

The process begins by admitting the first eligible patient into the study. This patient is assigned to either Treatment A or B, using the flip of a coin or some other randomization process. When the next eligible patient is admitted (and this may be days or months later), he or she is assigned to the alternate treatment. These two patients now form a *pair*, the results of which can be considered a "little experiment"; that is, we can determine for these two people whether Treatment A or B was better. The whole experiment is a *sequence* of these "little experiments," with each pair representing a comparison. The comparison between A and B is then assessed as **preference** for A or B. Preferences are based on subjective but clearly defined criteria for saying that one treatment is clinically more effective than the other.

Preference is defined on the basis of clinically meaningful differences between two treatments. The specific criteria for expressing preference for one treatment over another can vary in objectivity. At one extreme, the patient can merely express subjective feelings that one treatment seems to work better or is more comfortable than the other. At the other extreme, outcomes can be totally objective, such as death-survival or cured-not cured. In between are many subjective and objective types of measurements. A clinician might express preference based on a subjective evaluation of function or on the patient's general reaction to treatment. It is necessary, of course, to develop reliable criteria for making such dichotomous judgments.

It is also possible to reduce continuous data to a measure of preference. For instance, if we were measuring the effect of two treatments for increasing range of motion, we could specify that Treatment A would be preferred if it produced at least 20 degrees *more* of an increase in range than Treatment B. In other words, any difference between treatments smaller than 20 degrees would not be clinically meaningful, and both treatments would be considered equally effective. This is a convenient approach, but the researcher must be aware that it results in a loss of information by reducing the data to a dichotomous outcome. Any difference greater than 20 degrees would indicate preference, whether that difference was 25 or 100 degrees. If the analysis were based on the magnitude of differences, the amount of difference would be taken into account. The researcher must determine whether the magnitude of difference is important or whether the comparison between treatments is adequately assessed simply by expressing preference.
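The reduction of continuous data to a preference can be sketched as a simple classification rule. The range-of-motion gains below are hypothetical, and the 20-degree criterion follows the example in the text:

```python
def preference(rom_a, rom_b, criterion=20):
    """Classify one pair: 'A' or 'B' if the ROM gain under one treatment
    exceeds the other's by at least `criterion` degrees, else a tie."""
    diff = rom_a - rom_b
    if diff >= criterion:
        return "A"
    if diff <= -criterion:
        return "B"
    return "tie"  # ties are dropped from the sequential analysis

# ROM gains (degrees) for four hypothetical pairs of subjects
pairs = [(45, 15), (30, 25), (10, 40), (50, 28)]
print([preference(a, b) for a, b in pairs])  # → ['A', 'tie', 'B', 'A']
```

Note the loss of information described above: the pair with a 30-degree advantage and the pair with a 22-degree advantage both simply count as one preference for A.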

When two treatments are compared, there are four possible outcomes for classifying preference, as shown in Figure 10.14. In outcome 1, both treatments are equally successful, in which case we would not be able to specify a preference for A or B. In outcome 2, neither treatment is successful. In either of these two cases, we have no information as to which treatment is superior. These outcomes are considered *ties* and are dropped from the analysis. In outcomes 3 and 4, one treatment is preferred over the other, providing one piece of evidence in favor of either A or B.

The result of each comparison within a pair of subjects is plotted on a *sequential chart.* Two types of charts have been used. The chart developed by Bross^{24} has strong appeal because it has a fixed format (see Figure 10.15). The plot begins in the lower left corner square (a free square). As each comparison is made within a pair, an "x" is placed in the square either above the last occupied square (if A is superior) or to the right (if B is superior). If neither treatment is preferred within a pair, nothing is entered. The path continues until one of the boundaries is crossed. If the path goes upward, Treatment A is superior; if it goes to the right, Treatment B is superior. The middle boundary represents the null hypothesis; that is, if the path moves diagonally, the conclusion is that no difference exists. The longest possible path in this plan is 58 squares (116 patients).^{‡}

###### FIGURE 10.15

Sequential trial grid,^{24} showing preference for low-load prolonged stretch (LLPS) over high-load brief stretch (HLBS) for treatment of knee flexion contractures. (From Light KE, Nuzik S, Personius W, et al. Low-load prolonged stretch vs. high-load brief stretch in treating knee contractures. *Phys Ther* 1984; 64:330–333, Figure 3, p. 332. Reprinted with the permission of the American Physical Therapy Association.)

Researchers studied the differential effect of low-load prolonged stretch (LLPS) versus the more traditional high-load brief stretch (HLBS) for treating knee flexion contractures in elderly patients.^{21} Subjects were admitted to the study based on the presence of bilateral knee flexion contractures of at least 3 months' duration, and at least 30 degrees short of full extension. In addition, subjects had to be unable to walk or pivot transfer without maximal assistance. Subjects' limbs were randomly assigned to receive either LLPS or HLBS. Treatment was performed twice daily, 5 days a week, for 4 weeks. Range of motion was measured before and after 4 weeks, and preference was defined as a difference of at least 10 degrees between limbs.

In this example, the first patient tested demonstrated a preference for HLBS, and so the first "x" was placed just above the starting square. As shown in Figure 10.15, all further testing showed a preference for LLPS.

The second type of sequential chart was developed by Armitage,^{25} and allows for more flexibility in design. Different size charts are drawn, allowing for different expectations of effects. The chart in Figure 10.16 shows results of one study to evaluate varied supplementary doses of opioids in terminally ill cancer patients.^{26} The chart is drawn to detect a significant treatment difference if one drug regimen was better in at least 85% of the pairs.^{§} In this format, the boundaries are drawn above and below a center baseline. Preferences for one treatment or the other are indicated by moving up or down from the last plotted point. The first four patients favored the 50% dose, the next three the 25% dose, the next two the 50% dose, and so on. This example shows an outcome that moves towards the middle boundary, indicating no significant difference.

^{‡}This plan is based on *comparative success rates*, indicating whether Treatment A has an "advantage" over Treatment B. Bross includes a table that statistically defines "important advantage."^{24} For instance, if Treatment B is known to "cure" 25% of the patients, Treatment A would demonstrate an important advantage over B if it could cure 44%. If B cures 50%, treatment A would be important if it could cure 70%; if Treatment B cures 75%, Treatment A should cure 88%. The power of this analysis is approximately 86% when Treatment A offers an important advantage over B; that is, 86% of the time the upper boundary will be correctly crossed.

^{§}See Armitage^{25} for tables and figures for different effect sizes.

After each successive "little experiment" is plotted, the researcher stops to consider the results of all the pairs completed thus far and makes one of three decisions: (1) Stop and make a *terminal decision* to recommend A or B; (2) stop the experiment and make a terminal decision that treatments A and B are not different; or (3) continue to collect data because the cumulated data are not yet sufficient to draw a conclusion. This process of considering cumulative results after each pair of subjects has been tested is called *sequential analysis*. The decision to stop or go on will depend on how strong the evidence is to that point in favor of one treatment. This is the primary benefit of a sequential analysis, in that a trial can be stopped as soon as it is evident that one treatment is superior to the other, or that no difference is going to be found.^{27}

These boundaries represent three **stopping rules**: (1) If the upper boundary is crossed, we can make a terminal decision to recommend A; (2) if the lower boundary is crossed, we can make a terminal decision to recommend B; (3) if the middle boundary is crossed (either above or below the origin), there is no preference. In the stretching study (Fig. 10.15), 11 subjects were required to cross the lower boundary, indicating a significantly greater effect for LLPS. Had the first subject also "preferred" LLPS, only 8 subjects would have been needed to demonstrate significance. For the opioid study (Fig. 10.16), the first few subjects showed a preference for the 50% dose, but the final results showed a marked inconsistency, leading to the middle boundary. In this case, the line did not cross the boundary because of missing data in three of the pairs. The authors decided not to recruit additional subjects and concluded that the data did not support a difference between the two doses.
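The logic of the stopping rules can be sketched as a walk along the cumulative path. The boundary values below are illustrative placeholders only; an actual plan takes its boundaries from the published tables of Bross^{24} or Armitage^{25}:

```python
def sequential_trial(preferences, bound=5, null_bound=4):
    """Walk the sequential path: +1 for each pair preferring A, -1 for B
    (ties are skipped). Boundary values are ILLUSTRATIVE placeholders;
    real plans take them from published tables (Bross, Armitage)."""
    position = 0
    for i, p in enumerate(pref for pref in preferences if pref != "tie"):
        position += 1 if p == "A" else -1
        if position >= bound:           # upper boundary crossed
            return f"stop after pair {i + 1}: recommend A"
        if position <= -bound:          # lower boundary crossed
            return f"stop after pair {i + 1}: recommend B"
        # Middle boundary: enough inconsistency to conclude no difference
        if (i + 1) - abs(position) >= 2 * null_bound:
            return f"stop after pair {i + 1}: no difference"
    return "continue collecting data"

prefs = ["B", "A", "A", "A", "tie", "A", "A", "A"]
print(sequential_trial(prefs))  # → stop after pair 7: recommend A
```

The key feature is that a terminal decision can be reached mid-trial: the loop returns as soon as any boundary is crossed, so no further subjects need to be recruited.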

###### FIGURE 10.16

Sequential chart for analysis of treatment with an expected difference in 85% of the pairs, based on the procedures of Armitage.^{25} Supplementary doses of opioids were given to terminally ill cancer patients at 25% and 50% of their 4-hour standard dose. The outcome measure was reduction in dyspnea. The study showed no final preference for either dose, leading the researchers to conclude that the lower 25% dose would be sufficient. (From Allard P, Lamontagne C, Bernard P, et al. How effective are supplementary doses of opioids for dyspnea in terminally ill cancer patients? A randomized continuous sequential clinical trial. *J Pain Symptom Manage* 1999;17:256–265. Used with permission).

A theoretical issue arises in the consideration of the effect of ties. When the difference between two treatments within a pair does not meet the criterion for demonstrating preference, that pair of subjects is discarded from the sequential analysis. If many ties occur, the final sample that is used for analysis is not a true random sample; that is, it is not a true representation of all tied and untied pairs that were originally chosen.^{28} It is useful to keep a record of ties, as they do provide information about the similarity of treatments. If the researcher finds that too many pairs result in ties, it might be reasonable to end the trial, as very little information will be gained by continuing to collect data. Such a decision is considered a *conditional decision* (as opposed to a terminal decision that occurs when a boundary is crossed). A conditional decision is rendered without crossing a boundary, but is based on practical considerations and observation of the plotted path.

Sequential trials are also somewhat limited by the time frame within which treatment effects can be expected to occur. The response should be observable relatively soon after treatment is begun. The outcome should at least be available within an observation period that is short relative to the total time of the study.^{28} Otherwise, at any point in time, there will be a large number of subjects entered into the trial, but only a small proportion of results will be available. For instance, if a treatment effect is not expected for 1 year, within 6 months many subjects may have started treatment, but hardly any results would have been obtained. Consequently, the sequential rationale for economizing on time and subjects is subverted.

A major advantage of sequential analysis is that it more readily fits a clinical research model, allowing for subjects to enter the study as they are admitted for treatment and providing a structure for administering treatment within a clinical context; that is, the experimental treatments can be applied as they would be during normal practice, without having to create an artificial experimental environment. The sequential trial also provides a useful mechanism for studying qualitative outcomes, using the measure of preferences. In a practical sense, this approach can be quite effective as an adjunct to clinical decision making, because it allows the decision to be based on a variety of empirical criteria.

The randomized controlled trial is generally considered the gold standard for evaluating the effects of treatment. Researchers will often distinguish between efficacy and effectiveness in clinical studies. **Efficacy** is generally defined as the benefit of an intervention as compared to a control or standard program. It provides information about the behavior of clinical variables under controlled, randomized conditions. This lets us examine theory and draw generalizations to large populations. **Effectiveness** refers to the benefits and use of the procedure under "real world" conditions. It is the expectation that when we apply treatments, we do so without being able to control all the circumstances around us, and our results may not be the same as those obtained with a randomized experiment. This distinction is often seen as one reason for the perceived gap between research and practice, based on the mistaken assumption that effectiveness logically follows from a successful efficacy study.^{29}

Studies may be closer to one end of this continuum or the other, depending on many factors in the design of the trial.^{30} Gartlehner et al^{31} have proposed seven criteria to help researchers and clinicians to distinguish between efficacy and effectiveness studies, as shown in Table 10.1. Which patients are eligible, degree of control over the delivery of the intervention, what outcomes are assessed, which patients are included in the final analysis, how missing data are handled, and which statistical procedures are appropriate—all of these influence whether the results of a trial can be considered measures of efficacy or effectiveness. These two types of trials may yield very different results, but both are important to our understanding of patient responses to treatment.

###### TABLE 10.1

| Criterion | Efficacy Study | Effectiveness Study |
| --- | --- | --- |
| Health care setting | Frequently conducted in large tertiary care referral settings. | Conducted in primary care settings available to a diverse population with the condition of interest. |
| Eligibility criteria | Highly selective, stringent eligibility criteria; sample may not be representative of the general population. | Source population reflects the heterogeneity of external populations. Comorbidities are not exclusion criteria. |
| Outcome measures | Common use of objective and subjective outcomes, such as symptom scores, lab values, disease recurrence. | Use of functional capacity, quality of life and other health outcome measures relevant to the condition of interest. |
| Study duration and clinically relevant treatment | Research protocol manipulated. Study duration often based on time needed to demonstrate safety and demonstrate an effect. Compliance must be assessed to determine if intervention works. | Research protocol based on clinical reality. Duration based on minimum length of treatment to allow assessment of health outcomes. Compliance may be unpredictable and should be defined as an outcome measure. |
| Assessment of adverse events | Not typically reported. | Objective scales used to define and measure adverse event rates. |
| Sample size | Large trials with few levels of analysis provide ideal design to detect small but clinically meaningful treatment effects. | Sample size sufficient to detect at least a minimally important difference on a health outcome scale. |
| Intention to treat (ITT) analysis | Research protocol will seek to limit factors that can alter treatment effects; may use completer analysis. | Factors such as compliance, adverse events, drug regimens, comorbidities, other treatments and costs are taken into account using ITT. |

These concepts help us understand the situation where the findings of a controlled trial demonstrate that a treatment works, but clinicians find that it does not have the same effect when used on their individual patients in actual treatment conditions. The efficacious treatment was tested on a defined sample, with inclusion and exclusion criteria, and was applied under controlled and defined conditions. It then becomes imperative to determine if the same result can be obtained when personnel, patients and the environment cannot be manipulated. Factors that potentially limit application across settings, populations, and intervention staff need to be addressed in both types of trials.
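The contrast between a completer analysis and an intention-to-treat analysis (the last row of Table 10.1) can be sketched with hypothetical change scores. Imputing "no change" for a dropout is one simple, conservative choice among many, used here only for illustration:

```python
from statistics import mean

def completer_mean(change_scores):
    """Completer analysis: average only subjects who finished the trial."""
    return mean(c for c in change_scores if c is not None)

def itt_mean(change_scores, imputed=0.0):
    """Intention-to-treat sketch: every randomized subject counts; a
    dropout (None) is imputed as 'no change' for this illustration."""
    return mean(imputed if c is None else c for c in change_scores)

# Hypothetical pain-score improvements; None marks a dropout
changes = [10, 8, None, 12, None, 9]
print(completer_mean(changes))  # → 9.75
print(itt_mean(changes))        # → 6.5
```

The completer analysis looks more favorable because it silently discards the subjects who did not tolerate or comply with the treatment, which is precisely the effectiveness information an ITT analysis preserves.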

The importance of understanding concepts of experimental design cannot be overemphasized in the planning stages of an experimental research project. There is a logic in these designs that must be fitted to the research question and the scope of the project, so that meaningful conclusions can be drawn once data are analyzed. Alternative designs should be considered on the basis of their relative validity, and the strongest designs should be chosen whenever possible. The design itself is not a guarantee of the validity of research findings, however. Process must be controlled within the structure. Attention to measurement issues is especially important to ensure that outcomes will be valid. It is also important to note that the strongest design for a given question need not be the most complicated design. In many cases, a simpler design can facilitate answering the research question, whereas a more complex design may create uninterpretable interactions. The choice of a design should ultimately be based on the intent of the research question:

… the question being asked determines the appropriate research architecture, strategy, and tactics to be used—not tradition, authority, experts, paradigms, or schools of thought.^{32}

The underlying importance of choosing an appropriate research design relates to consequent analysis issues that arise once data are collected. Many beginning researchers have had the unhappy experience of presenting their data to a statistician, only to find out that they did not collect the data appropriately to answer their research question. Fisher^{33} expressed this idea in his classic work, *The Design of Experiments:*

Statistical procedure and experimental design are only two different aspects of the same whole, and that whole comprises all the logical requirements of the complete process of adding to natural knowledge by experimentation.

The relevant point is the need to use a variety of research approaches to answer questions of clinical importance. Although the clinical trial or experiment is considered a gold standard for establishing cause and effect, it is by no means the best or most appropriate approach for many of the questions that are most important for improving practice. The real world does not operate with controls and schedules the way an experiment can. Quasi-experimental and observational studies, using intact groups or nonrandom samples, play an important role in demonstrating effectiveness of interventions.^{31} As we continue our emphasis on evidence-based practice, we must consider many alternatives to the traditional clinical trial in order to discover the most "effective" courses of treatment.^{34}

*N Engl J Med* 2002;347:567–575. [PubMed: 12192016]

*Am J Public Health* 1955;45:1–63.

*Curr Drug Targets* 2004;5:71–88. [PubMed: 14738219]

*Experimental and Quasi-experimental Designs for Research*. Chicago: Rand McNally, 1963.

*Quasi-experimentation: Design and Analysis Issues for Field Settings*. Boston: Houghton Mifflin, 1979.

*J Vasc Surg* 2004;39:79–87.

*Stroke* 2004;35:141–146. [PubMed: 14657447]

*Arch Phys Med Rehabil* 2003;84:1589–1594. [PubMed: 14639556]

*JAMA* 1998;279:847–852. [PubMed: 9515999]

*Br J Health Psychol* 2003;8:477–495. [PubMed: 14614794]

*J Clin Pharmacol* 1997;37:799–809. [PubMed: 9549633]

*Occup Ther J Res* 2001;21:142–167.

*Prev Med* 1996;25:442–454. [PubMed: 8812822]

*Design and Analysis: A Researcher's Handbook* (4th ed.). Englewood Cliffs, NJ: Prentice Hall, 2004.

*Phys Ther* 1998;78:490–501. [PubMed: 9597063]

*J Adv Nurs* 2002;40:161–169. [PubMed: 12366646]

*J Spinal Cord Med* 2004;27:29–34. [PubMed: 15156934]

*Ergonomics* 1998;41:790–797. [PubMed: 9629064]

*Spine* 2004;29:351–359. [PubMed: 15094530]

*Phys Ther* 1984;64:330–333. [PubMed: 6366834]

*Phys Ther* 1985;65:1052–1054. [PubMed: 4011683]

*Arthritis Rheum* 1969;12:34–44. [PubMed: 5765688]

*J Pain Symptom Manage* 1999;17:256–265. [PubMed: 10203878]

*Am J Public Health* 2003;93:1261–1267. [PubMed: 12893608]

*Can J Psychiatry* 2002;47:552–556. [PubMed: 12211883]

*BMJ* 1997;315:1636. [PubMed: 9448521]

*N Engl J Med* 2000;342:1887–1892. [PubMed: 10861325]