++
+
Nowadays people know the price of everything and the value of nothing.
—Oscar Wilde
+
*Mr. Ketterman's Case
I know I need to read the articles about Mr. Ketterman's care, but I just don't know where to start. I often read just the Abstract or the Introduction and the Conclusion because all that information in the Methods and Results just confuses me. But, I know I'm missing a lot of important information! (See Appendix for Mr. Ketterman's health history.)
++
Having found evidence that relates most closely to a clinical question, the next step in the evidence-based practice (EBP) process requires the careful reading and critiquing of the evidence. But this is perhaps the most daunting task facing clinicians who wish to practice in an evidence-based manner: determining whether evidence found in the literature can be of value to their patients. As the sophistication of physical therapy research increases, the task of evaluating any research report for its accuracy and truthfulness becomes more challenging. After all, the evidence can only be of value to your patient if it is derived from a valid study. Validity, used here to mean trustworthiness, is an essential characteristic of any individual study. The more the results of a study can be trusted, the greater its validity in helping to make an informed clinical decision. An understanding of some of the methodological and analytical standards for conducting valid, high-quality clinical research experiments is a good starting point to build one's critical appraisal abilities. In this chapter we will focus on understanding the elements of good research. Chapters 13, 14, 15, and 16 will then apply the principles of good research practice to the various types of evidence, ranging from single studies to systems (Fig. 12-1).^{1}
++
++
It is not the intention of this text to provide direction for researchers, nor to be a full source of all the details of research design and management that a clinician might find interesting. Rather, we have attempted to identify the essential elements that should be considered in assessing the literature. Because there are many such elements and you may wish to focus on them specifically when reviewing articles, we provide an overview of what will be covered in this chapter (Box 12.1).
++
We will compare two large categories of research designs, experimental research designs and observational research designs. The research design is selected by the investigator based on the research question of interest. As you have seen in Chapter 10, questions may relate to selecting the best examinations in order to determine a diagnosis for the patient or may relate to the risks and benefits of a particular intervention. Experimental research designs provide the strongest evidence for causal relationships between an intervention and an outcome of care. They are the designs typically used to demonstrate the efficacy of an intervention. Efficacy is a measure of the capacity of an intervention to actually produce a change. Observational research designs can also provide good evidence for intervention studies and are more frequently used to study diagnostic or prognostic questions. They are designs typically used to demonstrate the effectiveness of an intervention. Effectiveness is a measure of the ability of the intervention to bring about the expected change in the real world.
++
Research designs in biopsychosocial sciences stem from one of two philosophical paradigms, the quantitative paradigm or the qualitative paradigm. The difference between these two approaches as to what is knowable is described in Chapter 8, and the qualitative research paradigm and research designs are discussed there, including criteria with which you might determine the value of a study that uses a qualitative research design. The research designs described in this chapter stem from the positivist philosophy, which purports that truth is independent of the investigator and knowable through direct and carefully controlled observations, interventions, and measurements of subjects. There are many rubrics that categorize research designs in medical science, with discrete definitions and methodological characteristics. In this text we will describe several of the designs most commonly found in literature useful to physical therapists and identify the questions that are useful for determining the value of findings derived from each.
++
Box 12.1 Overview of Topics Covered in This Chapter
Research Design
Determining Value in Studies of Clinical Interventions
Criteria for Evaluating Internal Research Validity
Did the Research Design Maximize the Prevention of Bias?
Did the Selection and Treatment of Subjects Avoid the Introduction of Bias?
Obtaining the Best Sample of Subjects for the Study
Proper Assignment of Subjects to Groups
Tracking Subjects Through the Study
Controlling What Study Subjects Know
Did the Selection and Methods of Interventions and Measurement of Outcomes Avoid the Introduction of Bias?
Evaluating External Validity
Evaluating Statistical Conclusion Validity
Evaluating Outcomes: Effect Size and Number Needed to Treat
Summary of Statistical Conclusion Validity
Determining Value in Studies of Diagnostic/Prognostic Accuracy
Did the Selection of Subjects Avoid the Introduction of Bias?
Did the Selection of Measures and Procedures Avoid the Introduction of Bias?
Were the Statistical Estimates Developed and Presented Without Bias or Error?
Other Statistics Found in Observational Research
+++
Experimental Research Designs
++
Two categories of quantitative research designs are useful for organizing one's thoughts (Fig. 12-2). The first category is experimental research, within which we will discuss the type of experimental research that epitomizes this category, the randomized controlled trial (RCT). An experimental research design is selected when the investigator wishes to examine if a cause-and-effect relationship exists between one or more interventions and the subsequent outcomes. Traditional criteria for true experimental research designs include random selection or allocation of research subjects to groups, the inclusion of a control group for comparison to a treatment group, and purposeful manipulation (dose, frequency, or duration) of the treatment variable. Of these three criteria, randomization of subjects is often considered the single most important element. Randomization of subjects to treatment groups helps ensure that the groups will be equivalent on important measures at the beginning of the study, thus giving the best measure of the relative effects of the two contrasted treatments. These research designs also employ tightly controlled research methods in order to provide the strongest possible statement about observed differences in groups at the end of the experiment.
++
++
In addition to true experimental research designs, some researchers may use a variety of designs that are similar to experimental research but do not quite meet all the rigorous requirements of true experimental research. In the past, these designs might have been referred to as quasi-experimental. Quasi-experimental designs share the purpose of establishing a cause-and-effect relationship between the treatment and the outcome, but fall short of the design standards of true experimental research, perhaps because it is unfeasible to randomly assign subjects or to use a control group.
++
Many aspects of good experimental research design are difficult to accomplish in physical therapy research, and more recently, some have used the term practical clinical trials (PCTs) to indicate a research design that is not quite as rigorous as that found in an RCT.^{2} Califf recommends such designs, commenting in an Institute of Medicine Roundtable on Evidence-based Medicine, "the sheer volume of clinical decisions made in the absence of support from randomized controlled trials requires that we understand the best alternative methods when the classical RCTs are unavailable, impractical, or inapplicable."^{2} Investigators have called for clinical trials that, although perhaps somewhat less rigorous, will better meet the needs of the clinician and the patient by including a comparative or alternate therapy that is relevant to the choices of the patient.^{3,4,5} Whether these PCTs are considered to be experimental or non-experimental research is not as important as making a clear judgment about the value of each study to you and your patient.
+++
Observational Research Designs
++
The second category of research designs is termed observational designs, or nonexperimental designs. While within this category there are many, quite different designs, each of them shares one characteristic: they are conducted by making careful observations of clinical practice as it happens, or happened, rather than planning a controlled test or manipulation of treatment. Patient care goes on as planned, and close observations are made about relationships among variables, treatments, and outcomes. These designs are used when the knowledge to posit testable hypotheses for experimental research is unavailable; when the phenomenon of interest is too complex to be assessed with a clinical trial; or when the purpose of the study is not related to interventions, but to the quality of the tests or outcomes of care. Observational research designs are quite varied; some are most useful in answering questions about interventions, although lacking the strength of the RCT, and some are best suited to questions about diagnostic accuracy of tests or the natural history of an illness.
++
Brief definitions of the major observational research designs follow:
++
Cohort: In this design, the researcher is typically interested in describing the future health outcomes in one cohort of subjects who have an exposure to a risk factor or will be given a certain intervention, and one cohort that is similar, but did not have either the exposure or intervention. Both groups will be followed forward in time and measured frequently to determine the occurrence of the outcome in each group.
Case-Control: In this design, the researcher will identify two groups of subjects: the case subjects who have an outcome of interest, for example, a disease; and a group of subjects similar to them who do not have the outcome. Then the researcher uses medical records or interviews to look back in time to determine when and how each group member was exposed to the suspected causative factor.
Cross-Sectional: With this design the researcher will identify a group of appropriate subjects and will measure their outcomes and their exposure to potential causative factors at the same time, generally just one point in time. This design is also typically used in studies of the diagnostic accuracy of tests, as the subjects are given two or more tests at one point in time.
Methodological: These designs include repeated measures on subjects with the outcome of interest, for example, lax knee ligaments, in order to assess the reliability or validity of tests.
Descriptive: These designs sample specific groups of subjects, often using random sampling, and measure opinions or collect data about the course of an illness. These designs are useful for investigating satisfaction with health care.
Case Study: This research design calls for careful measurement and description of typical clinical practice and is useful for documenting care given to patients with unique circumstances or responses to treatment.
+++
DETERMINING VALUE IN STUDIES OF CLINICAL INTERVENTION
++
Our framework for assessing value has two components: research validity (sometimes termed experimental validity) and statistical conclusion validity. Traditional research methodologists Campbell and Stanley^{5} provided one of the first frameworks for evaluating the quality of experimental research by describing the elements of research necessary to establish a causal inference. They proposed that an experiment designed to demonstrate a cause-and-effect relationship between interventions and outcomes should be judged on a set of criteria that define the research validity, or truthfulness, of the study. Research validity has two components, internal and external. We will focus primarily on internal research validity, which concerns the research procedures that create the greatest trust in the results of the study. For experimental studies this means having trust in the inference of the causal relationship demonstrated in the study. For observational studies, this means having trust that the phenomena being observed are accurately represented. In both designs, research validity helps to determine the extent to which bias has been controlled in the study. Bias, in a research context, refers to the tendency toward systematic errors that can arise from the design, sampling, and measurement used. Therefore, we are interested in criteria that allow us to assess how well the researchers have controlled for alternative explanations for the outcomes of the study, other than the influence of the variables of interest. These criteria are focused on the methods of the study and they will influence our confidence in its outcomes.
++
External research validity concerns how generalizable the findings are to patients who were not in the study. Commonly, external validity is concerned with asking how similar are the people in the study and the circumstances (time and place) of the study to our specific patients and the circumstances of the care we plan to give.
++
Another set of criteria developed by Campbell and Stanley^{6} helps us evaluate the appropriateness of the statistical analysis of the study, or the statistical conclusion validity. Critiquing the internal and external research validity and the statistical conclusion validity will allow us to adequately discuss the value of the research reported in studies that appear to relate to our clinical questions (Fig. 12-3).
++
+++
Criteria for Evaluating Internal Research Validity
++
The information needed to assess internal research validity is generally found in the methods section of a paper. This section contains very detailed descriptions of the procedures, measures, instruments, subjects, setting, timing, and participation required of the subjects. In experimental designs, these aspects of the study design are controlled very specifically so as to have the best chance of identifying a cause-and-effect relationship between the intervention being tested, also called the independent variable, and the outcomes measured, or the dependent variables. As you read the methods section of a randomized controlled trial (RCT), you will be judging whether or not the authors exerted sufficient control to eliminate bias in the procedures, measures, sample selection, or any other important aspect of the study. Keep in mind that most authors find that the assessment of truthfulness in research is a subjective evaluation, one that places the study on a continuum from strong to weak.^{7,8} We will look at three categories of good research practice:
++
selection of the correct research design,
selection and treatment of subjects, and
selection and methods of intervention and measurement of outcomes.
+++
Did the Research Design Maximize the Prevention of Bias?
++
Each of these types of research designs, experimental and observational, brings strengths and weaknesses to the study of best physical therapy practice. Some recent studies of the impact of research design on the evidence for practice have shown no difference in results when observational studies are compared to RCTs,^{9} while others argue that this is not the case. Califf^{2} provides two examples of incorrect conclusions drawn from multiple observational studies in cardiology, identified as incorrect practice only after sufficient RCTs were performed. He comments on the hope that analytical methods might enhance the causal relationships in observational studies, "… no amount of statistical analysis can substitute for randomization in ensuring internal validity when comparing alternative approaches to diagnosis or treatment."^{2} The evidence to support clinical decisions in physical therapy research comprises both experimental and observational research designs, with some designs (cohort and case-control) appearing less frequently than others. Table 12.1 provides a summary of examples of research designs you will find in the literature for questions about interventions, diagnosis, and prognosis in physical therapy.
++
+++
Did the Selection and Treatment of Subjects Avoid the Introduction of Bias?
++
People who consent to participate in health care research studies may not be patients of any practitioners at the time of the study, because they do not require health care. Alternatively, they may be patients under the care of the investigator or another practitioner who has informed them about the study. In both cases, these people are referred to as subjects in terms of their role in a research study. This distinction is very important to the investigators conducting the study and to the clinician reading the study. The investigator must distinguish between procedures given to the patient that are tightly controlled research procedures and others that are customary care. Readers of research are interested in learning sufficient detail about the study subjects so that they can determine if the subjects in the study are sufficiently like their own patients; otherwise, the study might be of less value to their practice.
++
The selection and treatment of the subjects who participate in research studies can be a significant source of bias. A myriad of criteria might be used to determine if just the right study subjects were included and if all the proper procedures were used with them, including inclusion and exclusion criteria that will provide the best sample for the study, the proper allocation of subjects to study groups, the importance of tracking study subjects across time, and controlling what the study subjects know during the study (Fig. 12-4).^{10,11,12}
++
+++
Obtaining the Best Sample of Subjects for the Study
++
One of the hallmarks of experimental research in the social sciences is the selection of a random sample of subjects, but this process is often unavailable to or cost-prohibitive for health care researchers. Instead, much of the research is conducted with samples of available subjects, called samples of convenience. Once a research question and an appropriate research design are selected, the next decision is, who should be in this study? The resources to study everyone to whom the research question applies, often referred to as the population of interest, would be unavailable in most health care research. The method of selecting a smaller group of patients to whom the research question applies is referred to as sampling the population, and this is undertaken with the hope of achieving a manageable group of research subjects who appropriately represent the original population. This group is called the sample, and good research practice requires the clear definition of who can and cannot be in the sample so as to maximize the research design's purpose to discover a causal relationship between intervention and outcome, if one exists, while not making the sample so narrowly focused that it no longer represents the population of interest. When this happens, clinicians find the research to be less useful to their own practice.
++
The use of criteria that a person must have to be eligible for the study, inclusion criteria, and those that they must not have to be eligible for the study, exclusion criteria, allows the researcher to recruit people to become research subjects. Some research designs may require very intricate inclusion and exclusion criteria for subsamples of the study, but we will use the structure of an RCT with an experimental and a comparison group to illustrate good sampling procedures. The first step to critique is the method used to recruit all subjects. Ask yourself if the researcher used sampling criteria that served to eliminate bias in the subjects chosen, for example, by disqualifying subjects who had multiple comorbidities in addition to the condition being studied. Having such subjects in the study could introduce competing explanations for the results. The study should provide a description of objective characteristics of the sample selected so that you might judge how similar or homogeneous a group was identified; for example, were the subjects similar in the severity of their injury or illness, the length of time since onset, and the occurrence of surgery or not? It is important that the overall sample of subjects is homogeneous on important clinical, demographic, and temporal characteristics, not only to decrease sample bias but to help ensure equivalent groups once group allocation is undertaken.
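++
The screening step described above can be pictured as a simple filter. The following is a minimal sketch, assuming entirely hypothetical criteria (an age range, a maximum time since surgery, and a limit on comorbidities) and hypothetical candidate records; a real study would define its own criteria in its methods section.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    candidate_id: str
    age: int
    weeks_since_surgery: int
    comorbidities: int  # conditions other than the one being studied


def is_eligible(c: Candidate) -> bool:
    """Apply hypothetical inclusion and exclusion criteria to one candidate."""
    # Inclusion criteria (assumed for illustration only).
    meets_inclusion = 50 <= c.age <= 80 and c.weeks_since_surgery <= 6
    # Exclusion criterion (assumed): two or more comorbidities.
    is_excluded = c.comorbidities >= 2
    return meets_inclusion and not is_excluded


candidates = [
    Candidate("C01", 64, 3, 0),
    Candidate("C02", 47, 2, 0),  # outside the assumed age range
    Candidate("C03", 71, 5, 3),  # multiple comorbidities
    Candidate("C04", 58, 6, 1),
]

sample = [c.candidate_id for c in candidates if is_eligible(c)]
print(sample)
```

Tightening the criteria produces a more homogeneous sample but, as noted above, one that may represent the population of interest less well.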
+++
Proper Assignment of Subjects to Groups
++
Many of the research designs described in the previous section make use of groups of subjects who receive different treatments or different levels of treatment in the study. If a comparison of X with Y is planned in the RCT, the sample will need to be assigned to one or the other group. Because the sample has generally been created using available subjects, it is crucial that the significant benefits of randomization be applied to research designs at this point. While sampling is generally not done randomly, random assignment, or allocation of subjects to groups, provides a vital element of experimental validity to the research process. Randomization helps distribute subjects with both known and unknown characteristics equally into all study groups, lending confidence that the groups were made as equivalent as possible at the beginning of the study. If the two groups are different before the study starts on some characteristic related to the intervention or the outcome, then bias has been introduced and weakened the ability to identify a cause-and-effect relationship. If one group had a very different prognosis for recovery than the other, imagine how that would influence your thinking about the outcome of the study. Again, the study should provide an objective description of the equivalency of the groups, using tables or charts, and often a statistical test, as evidence of the effectiveness of the randomization procedure.
++
There are several methods to accomplish random assignment of subjects to research groups, most of which depend on whether or not the whole sample is available for randomization at the start of the study, or if randomization must be done consecutively as subjects are enrolled. Consecutive randomization is needed for most studies that use a sample of convenience. In this case, an ordered list of subject identification numbers is randomized and subjects are assigned to groups as they enter the study. Clinical trials often use another assignment technique called matching assignment, which allows the investigator to exert some control over the equivalency of the groups in addition to that provided by randomization. In matching assignment, each subject is categorized on an important and possibly confounding variable, for example sex or leg length, matched with another subject with similar characteristics, and then randomly assigned to a group, such that there are similar numbers of women in each group or equal numbers of subjects with short leg lengths in each group. In this way, the researcher does not leave it to chance (the randomization process) to equally distribute subjects with known attributes that might introduce bias into the study, should they by chance all end up assigned to one group.
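++
The two assignment approaches described above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure of any particular trial: the function names, the group labels "A" and "B", and the subject records are all hypothetical. The first function prepares a shuffled, balanced list of group labels to be used in order as subjects enroll; the second matches subjects on one variable and randomizes within each matched set.

```python
import random


def allocation_list(n_subjects, groups=("A", "B"), seed=None):
    """Pre-generate a balanced, shuffled list of group labels
    to be used consecutively as subjects enter the study."""
    rng = random.Random(seed)
    labels = [groups[i % len(groups)] for i in range(n_subjects)]
    rng.shuffle(labels)
    return labels


def matched_assignment(subjects, key, groups=("A", "B"), seed=None):
    """Match subjects on one variable, then randomly assign the
    members of each matched set to different groups."""
    rng = random.Random(seed)
    strata = {}
    for s in subjects:
        strata.setdefault(s[key], []).append(s)

    assignment = {}
    for members in strata.values():
        members = list(members)
        rng.shuffle(members)  # random order within the matched category
        for start in range(0, len(members), len(groups)):
            matched_set = members[start:start + len(groups)]
            labels = list(groups)
            rng.shuffle(labels)  # random group order for each matched set
            for s, label in zip(matched_set, labels):
                assignment[s["id"]] = label
    return assignment


# Hypothetical sample: four women and four men.
subjects = [{"id": f"S{i:02d}", "sex": "F" if i % 2 else "M"} for i in range(1, 9)]
order = allocation_list(10, seed=0)
assigned = matched_assignment(subjects, key="sex", seed=1)
```

With an even number of subjects in each category, the matched version guarantees equal numbers of women (and of men) in each group, whereas the simple list balances only the group sizes.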
++
We will discuss later in the chapter the importance of keeping clinicians who are involved in treating or measuring research subjects unaware, or blind, to the group assignment of the subjects. However, it is also considered important to know whether or not the person enrolling subjects into the study is aware of the group to which the next subject enrolled will be assigned. Good research practice will keep this consecutive order assignment concealed from the person enrolling subjects to avoid a bias introduced by the clinician who may not want a potential subject to be assigned to the control group. This practice is called allocation concealment.
+++
Tracking Subjects Through the Study
++
Bias can be introduced into a study if the researcher loses track of subjects who do not complete their participation. If the study takes place over days, weeks, or months, it is likely that some study participants will become ill, reinjured, discouraged, or in some way no longer willing or eligible to participate in the study. Other participants may recover sooner than anticipated and no longer wish to participate in the study. Still others stop participating and the researcher never learns why. It is important to account for each subject at the end of the study, and particularly so if these dropouts have happened with differential frequency among the groups. At the end of the study, researchers should remain confident that the randomization performed so carefully at the beginning is not negatively affected by differential rates of attrition in the groups. Standard schematics or flow diagrams are typically provided in a study to document the progress of participants through each stage of a clinical trial (Fig. 12-5).^{13,14}
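++
A reader can check for differential attrition directly from the counts reported in a study's flow diagram. The sketch below assumes hypothetical group names and counts; it only computes the proportion of randomized subjects lost from each group and the gap between those proportions.

```python
def attrition_by_group(randomized, completed):
    """Return the proportion of randomized subjects in each group
    who did not complete the study."""
    return {
        group: (randomized[group] - completed[group]) / randomized[group]
        for group in randomized
    }


# Hypothetical counts, as might be read from a flow diagram.
randomized = {"eccentric": 40, "concentric": 40}
completed = {"eccentric": 36, "concentric": 28}

rates = attrition_by_group(randomized, completed)
difference = abs(rates["eccentric"] - rates["concentric"])
print({group: round(rate, 2) for group, rate in rates.items()})
```

In this made-up example one group loses a noticeably larger share of its subjects than the other, which is the pattern that should prompt questions about whether the groups remained equivalent.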
++
+++
Controlling What Study Subjects Know
++
Controlling who knows what, when, in an experimental study is an important tool for avoiding various types of biases introduced by those involved. Here we discuss controlling what the subjects know specifically about their involvement. Good research practice requires that all potential and enrolled subjects be fully informed as to the procedures that will be used in the study so that they might make the best judgment about their involvement. In some studies, a subject will learn from the informed consent process that they have a one in three chance to be in a group that gets exercise A, exercise B, or no exercise at all. The subject is told in the informed consent document whether or not they will know to which group they have been randomly assigned and how that randomization will take place. In studies of medications, it is easier to keep subjects blind to their group assignment than it is in studies of procedures, surgeries, or physical therapy interventions. Thus the ability to include this good research practice is limited in rehabilitation research. When the subject cannot be blinded to the group assignment, care is taken to use outcome measures that may be less influenced by the subject's knowledge of his or her group assignment, for example, physiological measures of body temperature, wound healing, and range of motion; however, it is not certain that these measures cannot be influenced by the subjects themselves.
+++
Did the Selection and Methods of Interventions and Measurement of Outcomes Avoid the Introduction of Bias?
++
To answer a question about which intervention may have a better effect on health outcomes, the researcher must make many choices for the conduct of the study. Methodological decisions focus on two major components: the study procedures and the study measurements. Good research practice in these areas must blend a keen focus on the research question with an eye to avoiding bias in all decisions about procedures and measures. For example, the most intricately designed study procedures could be wasted if the selected outcome measures are not the best or are performed in a biased manner. We will look first at the decisions to be made about the study procedures regarding interventions.
+++
Selection of Interventions
++
A researcher wishes to investigate whether eccentric or concentric quadriceps strengthening exercises will produce better outcomes for patients who received a total knee arthroplasty (TKA). It is obvious to the researcher that either type of exercise will be better than no exercise, so there is no need (and perhaps an ethical concern) for having a control group that receives no exercise. The researcher knows one of the study variables will be type of exercise, and only two types, eccentric and concentric exercise, will be studied. What about the dose of the exercise for each type? Should subjects perform the intervention exercises daily or less frequently? How many repetitions of each exercise? Should subjects be required to visit an outpatient physical therapy clinic or perform the intervention as a home program? How many days or visits should be expected until an effect can be found? Will exercise equipment be involved? Will therapists be required to deliver the intervention directly? If so, what bias might that introduce? It is very difficult to keep the treating physical therapists in clinical trials unaware of (blind to) the group assignment of the subjects, but good research practice requires that the researcher consider whether this can occur in the study, to avoid the bias introduced by a research team member who might favor one group outcome over another.
+++
Measurement of Outcomes
++
The second set of methodological decisions concerns the measurement of the outcomes of the study. In the study of choice of exercises for patients who have undergone TKA, the investigator must decide what outcomes to choose in order to determine which exercises are better for the patients. Over the years, investigators of physical therapy clinical trials have expanded the types and numbers of outcome measures considerably. Today, a clinical trial is likely to have a mix of outcome variables, often classified as primary and secondary, based on the importance to the investigators and to patients. The measures may be selected to assess physical and social well-being, as well as outcomes that address the processes of care, such as patient satisfaction. A wide range of measures can be selected that represent elements of the WHO International Classification of Functioning, Disability, and Health (ICF) model.^{15,16} Applying this model, described in Chapter 1, to the selection of outcome measures could result in assessing body structures and functions with, for example, ROM or strength about the knee and a 6-minute walk test and assessing activities and participation with, for example, the Knee injury and Osteoarthritis Outcome Score (KOOS).^{17} The Guide to Physical Therapist Practice includes a compendium of over 800 tests and measures from which to select the measures used in a study.^{18}
++
With so many available measures, what criteria might be used to select the best ones for a specific study? Two fundamental principles of measurement are the reliability and validity of the measure.^{19} These characteristics of measurement provide the clinician and researcher with confidence that they can rely on scores derived from these tests to be dependable (reliable) and accurate (valid). It is important to remember that no test or measurement process is ever perfect, and errors of one type or another are associated with every outcome measure. This is particularly true for measures that require the interaction of the therapist with the patient, as compared to tests that the patient completes independently, like a survey. Reliable measures are ones for which repeatable scores would be achieved when the test is administered more than once, over a time period in which it can be reasonably expected that the phenomenon of interest does not change, for example, two measures of isometric quadriceps force production measured on a dynamometer, taken 30 minutes apart. Valid measures are said to be accurate, and by that we mean they measure what we intend for them to measure, and not another phenomenon. If we believe that an isometric quadriceps force production test on a dynamometer is an accurate measure of the concept of quadriceps strength, then the measure can be said to be valid for the purpose of the study. When there are several tests that purport to measure the same concept, studies are performed for the purpose of comparing one test to another and thereby establishing a description of the validity of one test in comparison to another. We will expect the author of the clinical trial to report the reliability and validity of the primary outcome measures in the study as a means of assuring us that the best tests for the study were selected.
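++
The idea of repeatable scores can be made concrete with a small calculation. The sketch below uses made-up dynamometer readings from two sessions 30 minutes apart and computes a Pearson correlation between them as a simple index of agreement. This is an assumption chosen for simplicity: published reliability studies often report other statistics, such as the intraclass correlation coefficient, and the values here are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / sqrt(var_x * var_y)


# Hypothetical isometric quadriceps force readings for five subjects,
# measured twice, 30 minutes apart.
trial_1 = [210.0, 185.0, 240.0, 198.0, 225.0]
trial_2 = [214.0, 181.0, 236.0, 203.0, 229.0]

r = pearson_r(trial_1, trial_2)
print(round(r, 2))
```

A value close to 1 indicates that subjects kept nearly the same rank order and spacing across the two sessions. A correlation alone does not detect a systematic shift between sessions, which is one reason other reliability statistics are commonly reported.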
++
Two other important decisions should be evaluated when considering the methodological aspects of measurement in a study: the timing of the measurements and who takes them. In general, the closer in time measures are taken to the phenomenon of interest, the less error will be represented in the score. In designing the study of eccentric versus concentric quadriceps strengthening exercises over a 12-week period, the researcher must decide when to take the measurements. The two obvious time periods to be selected are before the study starts and when it ends. If the pretest measure is taken too long ahead of time, an unexpected occurrence might cause the subject to become weaker before the study starts, thus introducing error into that subject's score. Good research practice will include measures closely timed to specified points in the study for all subjects. One other aspect to consider in terms of the usability of the study to your practice is whether enough measures were taken, especially follow-up measures, ones that typically follow the conclusion of the study. Follow-up measures allow assessment of the long-term effects of interventions, so that clinicians might learn how to best time interventions over an episode of care for their patients.
++
The most important methodological decision related to outcome measures is whether or not the person taking the measures is blinded to the group assignment of the subjects. If the clinician giving the intervention cannot be blinded, it is very important that a different person take the measures. Even this amount of blinding of the measurer can be difficult in physical therapy studies, since interventions often change the appearance of a subject's physical body. In such studies, the timing of the measurements may need to be adjusted, or subjects instructed not to reveal their group assignment, and other safeguards put in place to maximize the objectivity of the clinician taking the measurements.
+++
Evaluating External Validity
++
One additional aspect of good research practice that you might consider in determining the value of a study to your practice is how generalizable the results are to your practice. This is referred to as the external validity of the study. External validity asks, can we apply the results from a sample of patients in the controlled environment of the experiment to real life? If the controlled aspects of the setting, the subjects, and the timing of the study design are similar to your practice environment and patients, then the study findings will be generalizable to your patients. If the methods of the study bear little resemblance to how you might carry out the intervention in your practice, then you cannot be as confident that you will find the same outcomes as reported in the study. Generally, these are questions a clinician must answer based on his or her best clinical judgment. The desire to improve internal validity can sometimes result in less external validity; the reverse is also true.
+++
Evaluating Statistical Conclusion Validity
++
Statistical conclusion validity refers to the degree to which the analysis performed on the data in the study allows you to make the correct decision regarding the truth or approximate truth of the hypotheses tested. This is different from the process we just reviewed, the goal of which was to determine if we could trust the causal relationship between the intervention and the outcomes of the study, based on the methods of discovery used in the study, that is, its research validity. For most clinicians, research validity is the easier conclusion to draw about a study. For some experimental studies, the procedures described have clearly introduced some type of bias, for example, failing to mask the examiners to the subjects' group assignment. It is therefore both easier and more logical for the reader to consider research validity before reading the data analysis and results section of the paper. With practice, the reader will begin reading the results section with some expectations in mind about the outcomes of the study. For example, you might anticipate small between-group differences because the intensity of exercise given to the experimental group was, in your clinical judgment, insufficient to create a large effect. It is best to be aware of your propensity to accept the results before you begin reading them, for the statistical analysis may indicate that a treatment was beneficial to one group of subjects as compared to another. Such statistical differences cannot be considered truthful unless you are satisfied with the research validity of the work.
++
If you are satisfied with the research validity of the study, the second component to determining the usefulness of this study to your practice is to evaluate its statistical conclusion validity. Domholdt defines this simply as your assessment as to whether or not statistical tools have been used correctly to analyze the data in the study.^{20} In the following section we will present an overview of basic statistical principles and terms and define their use in analyzing the data collected in experimental and observational studies. To understand statistical conclusion validity, one must master a certain amount of measurement and statistical theory.
++
Norman and Streiner suggest that the main purpose for understanding statistics is to be able to distinguish true differences from natural variation.^{21,p2} This is what we wish to know when reading in the results of a study that, for example, a 5-point difference on a pain scale existed at the end of a comparison between a group receiving an experimental treatment and a group receiving a standard treatment. Were these two groups truly different from each other at the end of the study, or does this difference reflect natural variation in subjects' pain response to treatment? We might also say that the difference of 5 points between the groups might have happened by chance, and this is another way to describe natural variation between and within individual subjects. Without statistics to help us make sense of this statement: "there was a 5-point difference in pain scores between groups at the end of the study," we would not be able to have confidence that this was a true difference, caused by the treatment.
++
There are two large families of statistics that you will find in most studies: descriptive statistics and inferential statistics. Descriptive statistics help define the average subject, intervention, and outcomes, as well as the variability in these data. They are important analyses to present to the reader first, in order to describe the subjects, the interventions, and the outcome measures. The subjects selected for a study are referred to as the sample, and the numerical measures of central tendency and dispersion calculated for the sample are called statistics. These calculated statistics are used to project or to infer what might happen should all possible appropriate individuals be studied. The term used for all possible appropriate individuals is the population. Thus, we conduct studies on selected samples of subjects in order to infer what might happen could we study the entire population. The numerical measures of central tendency and the dispersion hypothesized for a population are called parameters. The second large family of statistics is then termed inferential statistics, for they allow inferences from the sample statistics collected to the population parameters, which are generally unavailable. One can think of this as trying to discover a causal relationship in a sample from one study, which can then be recommended with confidence to the entire community.
+++
Descriptive Statistics and Point Estimates
++
Data are collected to describe subjects following rules of measurement that allow consistent communication of the value of what is being measured. Data can be communicated, or measured, as names, numerals, or numbers.^{20,21} It is important to distinguish numerals from numbers, as numerals generally cannot be entered into mathematical formulas because they are labels and do not have quantitative value (Table 12.2). There are four classical categories, or levels, of measurement referred to in the literature: nominal, ordinal, interval, and ratio. The distinguishing elements of these categories are the mathematical operations one can perform upon them, based on the units of the measurement and the presence or absence of an absolute zero point.
++
++
Levels of Measurement
Nominal: words or numerals assigned to distinguishable characteristics with no absolute value nor rank
Ordinal: words or numerals or numbers assigned to distinguishable characteristics with rank, but no equal intervals between levels
Interval: numerals or numbers assigned to distinguishable characteristics with equal intervals, but no absolute zero point representing total absence of the characteristic
Ratio: numbers assigned to distinguishable characteristics with equal intervals, and an absolute zero indicating total absence of the characteristic
++
The three types of measures (names, numerals, and numbers) can also be classified into two broad categories that tell us something about the form they may take: discrete and continuous measures. Discrete measures are those that can only assume a limited set of values (generally, a known set of values). Data measured with names and numerals are generally discrete measures. Data measured with numbers can be either discrete measures (whole numbers only, for example, the number of pregnancies a woman has had) or continuous measures (those that can be represented with decimals, for example, a timed up-and-go score of 13.5 seconds). These distinctions about measurement of data influence later data analysis and may influence your judgment about the precision of measurement used in a study.
++
One of the first tables one might find in an experimental or observational study contains descriptive data about the subjects as a total group and/or divided into their assigned groups. When you look at these tables, you often find the data for several variables provided, with specific statistics for each type of data. When more than 10 subjects are in a study, it is useful to calculate statistics that tell something about the distribution of all the scores, because there is too much information to use otherwise. (For small studies of 10 or fewer subjects you might find a table that lists each subject and their individual scores). These statistics are designed to help picture the distribution of all the scores for each variable. The descriptive statistics typically found in these tables are defined in Table 12.3. An example of such a table is found in Table 12.4.
++
++
++
Measures of central tendency are single values that tell you something about the mid-point in a distribution of numbers. Measures of dispersion are designed to tell you something about how variable the scores are around the measure of central tendency. The age of subjects is often reported in studies as a mean and a standard deviation; for example, in Table 12.4 the mean age of the experimental group subjects was 37 years with a standard deviation of 6.5 years. Instead of the mean, the median or the mode may be used to describe the center of the distribution if the data are discrete or if the continuous data have extreme scores, such that the mean might be misleading to the reader. In Table 12.4, the median and range are used to describe the distribution for the variable "weeks since onset of pain," since at least one person in the experimental group had an extreme score of 15 weeks. Together, the measures of central tendency and dispersion give the reader an idea of how the distribution of all the scores for subjects in that group would look. The measures of dispersion are also selected to fit the type of measure being reported, so understanding what each means can give the reader an understanding of how spread out the scores are for that measure. For example, compare the two means and standard deviations for the variable doses of pain medication taken in the past week in Table 12.4. The two groups have similar means, but the variability in the experimental group is twice that in the control group.
++
These measures of central tendency and dispersion are calculated from one sample, and it is quite possible that another study reported in the literature measuring the same variable may report quite different statistics. Which descriptive data are the most accurate to use in making inferences to the population of interest? If study A reports a 5-point mean decrease on a pain scale following an experimental intervention and study B conducted on the same intervention reports a 7-point mean drop in pain scores, which can you expect to happen with your patients if you use this intervention? There are additional statistics that help enhance predictions of measures of central tendency such as means as well as inferential statistical estimates, discussed later.
++
The standard error of the mean (SEM) is a useful statistic to understand how stable inferences about a population mean are, and this statistic is highly influenced by the sample size. Examining the calculation of the SEM (Table 12.5), one can see that large samples will have a smaller SEM, giving more confidence in the prediction of the population mean, while small samples will have larger errors in estimating the mean of the population. The SEM is a component of the formula for another statistic, the confidence interval (CI), which helps understand the accuracy of point estimates like the mean. This statistic is conveniently named, as it gives a range of scores within which the true value for the mean is expected to be found. The data in Box 12.2 illustrate how useful confidence intervals are to interpreting descriptive data.
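++
The relationship between sample size, the SEM, and the width of a confidence interval can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pain-score data are hypothetical, the SEM is taken to be the sample standard deviation divided by the square root of the sample size, and the interval uses a normal (z) approximation, whereas small samples would ordinarily call for a t-based interval.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def sem(scores):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return stdev(scores) / sqrt(len(scores))


def confidence_interval(scores, level=0.95):
    """Normal-approximation confidence interval around the sample mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for 95%
    half_width = z * sem(scores)
    centre = mean(scores)
    return centre - half_width, centre + half_width


# Hypothetical pain-score reductions (not data from any study in this chapter).
small_sample = [4, 6, 5, 7, 3]
large_sample = small_sample * 5  # the same scores repeated: 25 subjects

print("SEM, n = 5: ", round(sem(small_sample), 2))
print("SEM, n = 25:", round(sem(large_sample), 2))
print("90% CI, n = 25:", confidence_interval(large_sample, level=0.90))
```

Both samples have the same mean, but the larger sample yields a smaller SEM and therefore a narrower interval around that mean.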
++
++
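Box 12.2 Interpreting Confidence Intervals
Study A: n = 25; mean decrease in pain score = 5 points; 90% CI, 4.6 to 5.4
Study B: n = 25; mean decrease in pain score = 7 points; 90% CI, 6 to 8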
++
Assuming that both studies enrolled 25 subjects and that study B had both a higher mean and a larger standard deviation, you can see that the CIs for the two studies differ accordingly. From study A, we are 90% certain that if the true population mean is not 5, as calculated in the study, then it falls between 4.6 and 5.4. From study B, we are 90% certain that if the true population mean is not 7, as calculated in the study, then it falls between 6 and 8. The CI for study A is narrower than that for study B because the standard deviation around the mean found in study A is smaller. So, if you wish to know how much reduction in pain scores you might expect from this intervention, you can make use of both CIs to predict that the change to expect in your patients would be no smaller than 4.6 points and might be as high as 8 points.
+++
Homogeneous and Heterogeneous Data
++
Two useful terms to understand when evaluating either a total sample of a study or the two or more groups in a sample, are homogeneous and heterogeneous. If subjects in a group have very similar scores on the measures taken, you can say the group is homogeneous in, for example, the amount of drugs they are taking during the study. If you examine the means and standard deviation/variance for a control and experimental group and find the standard deviations to be quite different, you can say the group with the larger standard deviation is more heterogeneous than the other group. Homogeneous groups are those that are composed of subjects who score similarly on the measures, and heterogeneous groups have subjects who score quite differently from each other. The statistical analysis that is performed on the measures to look for statistical differences between groups is influenced by the homogeneity of the groups in relation to each other. If the control group is quite similar in scores (for example, most subjects changed less than 5 points from pre-test to post-test) while the experimental group is extremely varied in scores (some subjects changed only 2 points and some subjects changed 15 points and everything in between), then an important assumption for the statistical analysis, homogeneity of variance, is violated and the analysis may be incorrect. Random allocation of subjects to their groups in RCTs is very important to help achieve two or more groups who have homogeneity of variance. Inspecting the descriptive statistics for each group in the study design will allow you to decide if the groups are homogeneous.
++
The normal curve, or bell-shaped curve, is a concept important to understanding the measures of central tendency and dispersion we have just defined. When large numbers of scores are analyzed, a typical bell-shaped distribution for that score emerges that can be defined mathematically in order to predict the probability of any one score occurring. This is very helpful to clinicians in selecting what we consider to be normal values for clinical tests, for example, defining normal systolic blood pressure. The normal distribution provides us with a mean for systolic blood pressure and also an understanding of the range of scores one could expect in patients. Particularly high or low scores, which may not be normal and could require treatment, are easier to identify.
++
In normally distributed data:
++
the mean, median, and mode will hold the same value, so any of these statistics can be used to describe the middle of the distribution;
the probability of any one score falling within the middle of the distribution, defined as the mean minus 1 standard deviation or plus 1 standard deviation, is 68%; thus the scores that cluster closely around the mean are the most frequently occurring scores;
the probability of any one score falling within the area of the distribution, defined as the mean ±2 standard deviations, is 95%;
the probability of any one score falling within the area of the distribution, defined as the mean ±3 standard deviations, is approximately 99.7%, capturing almost all scores. Since only about 0.3% of scores fall more than 3 standard deviations above or below the mean, we can understand these scores to be very rarely occurring scores.
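++
The percentages in this list can be checked directly from the standard normal distribution. The short sketch below uses Python's standard library; the function name `share_within` is introduced here only for illustration.

```python
from statistics import NormalDist


def share_within(k_sd):
    """Proportion of a normal distribution lying within k_sd SDs of the mean."""
    standard = NormalDist(mu=0, sigma=1)
    return standard.cdf(k_sd) - standard.cdf(-k_sd)


for k in (1, 2, 3):
    print(f"within ±{k} SD: {share_within(k):.1%}")
```

The loop prints roughly 68.3%, 95.4%, and 99.7%, which the rounded figures quoted in the list approximate.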
++
Applying these assumptions to any presentation of a mean and standard deviation given in a table allows us to understand how widely dispersed all likely scores are. Scores that are unexpectedly high or low could represent errors of measurement or could be recognized as rare values. Many of the inferential statistical tests that are used to tell us if statistically significant relationships or differences exist between variables or groups have as an assumption that the scores are normally distributed.
+++
Visual Representations of Measures of Central Tendency and Dispersion
++
A graphic figure frequently used to express the measures of central tendency and dispersion for a set of scores is called a box plot. Figure 12-6 shows the conventions used to construct a box plot. The lower and upper margins of the box are the score values occurring at the 25th percentile of the distribution (lower quartile) and the 75th percentile of the distribution (upper quartile), so the box itself illustrates what is called the interquartile range, or the middle 50% of the distribution. If the box demonstrates a wide variation in scores from the upper quartile to the lower quartile, it will be quite wide; a distribution with more closely clustered scores will be narrower. A symbol (* or +) or a line is often used within the box to indicate the median of the scores. If the median falls midway between the upper quartile and lower quartile, we understand that the scores are evenly spread in the distribution, resembling a normal distribution. If the median falls closer to the upper or lower quartile, the scores are not evenly spread: they cluster toward that end of the box, and the distribution is skewed. In some box plots a different symbol may be found inside the box to represent the mean score.
++
++
With the use of lines extending from the box in both directions, called whiskers, the box plot also provides more data about the extreme scores on both the high and low end of the distribution. The whisker may or may not have a crosshatch line at its end. The conventions for what the whiskers represent are less consistent than for the box portion of the plot. Traditionally, the whisker extends to the score occurring closest to, but not farther than, 1.5 times the interquartile range beyond the edge of the box; this distance is called a step, and its boundary the inner fence. (There also might be a whisker that illustrates the outer fence, or 3 times the interquartile range.) Alternatively, the whisker may be drawn to represent the lowest- and highest-occurring scores or the score value at 1 standard deviation above and below the mean. The whisker might also represent a confidence interval around the mean. Symbols occurring beyond the whiskers indicate outliers or extreme outliers, which are score values that are quite high or low. It is important to check the legend of the figure when interpreting a box plot to be certain what measures of dispersion are being plotted by the whiskers. In Figure 12-6 it is easy to see that the variability in change scores for the experimental group was larger than the variability in the control group. The experimental group also has more low scores than high scores, as indicated by the 50th percentile score occurring closer to the 25th percentile than the 75th percentile, so the control group has the more normal distribution of the two.
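++
The values a box plot displays can be computed by hand, which may help when reading a figure legend. The following is a minimal sketch under the traditional convention described above (whiskers reaching to the most extreme scores inside the inner fences). The change scores are hypothetical, and because quartile conventions differ between software packages, the "inclusive" method used here is just one reasonable choice.

```python
from statistics import quantiles


def box_plot_summary(scores):
    """Quartiles, interquartile range, whisker ends, and outliers."""
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr  # inner fence below the box
    upper_fence = q3 + 1.5 * iqr  # inner fence above the box
    inside = [s for s in scores if lower_fence <= s <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": [s for s in scores if s < lower_fence or s > upper_fence],
    }


# Hypothetical change scores for one study group.
change_scores = [1, 2, 2, 3, 3, 4, 5, 6, 15]
summary = box_plot_summary(change_scores)
print(summary)
```

In this example the median (3) sits closer to the lower quartile (2) than to the upper quartile (5), and the score of 15 lies beyond the upper inner fence, so it would be plotted as an outlier.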
+++
Inferential Statistics and Hypothesis Testing
++
Once you understand the distributions of the data on each measurement in the study, you can evaluate how researchers have chosen to test the hypotheses they used to design the study. To understand the hypotheses, you must be able to clearly identify the structure of the study: the variables and their roles in forming hypotheses to be tested.
+++
Variables and Hypotheses
++
As you read the description of the methods of the study, you will be able to identify the variables, or measures that are important to the study design. There are two primary roles that a variable may play in the study design: the independent variable, or factor, and the dependent variable, or outcome. Independent variables are those that categorize the intervention that is being studied; for example, an independent variable might be generally termed "treatment," with one group receiving the experimental treatment and the other group receiving the control or standard treatment. The independent variables are set or manipulated by researchers to occur at the levels they desire, for example, two groups versus three groups. Most RCTs will have an independent variable that functions to study the intervention of interest; however, the number of levels of this independent variable, or number of subject groups, can vary, with a minimum of two groups. Time is another common independent variable found in studies of interventions in physical therapy. When the study design reports measuring the outcome variables before the intervention, after the intervention, and perhaps at one or more follow-up times, then we consider time to be an independent variable that the researcher chooses to manipulate. It is important to identify all the independent variables in the study, for these are the variables that the authors hypothesize may have caused the outcomes of the study.
++
The dependent variables are the tests and measures that the researchers choose to determine the effects of the intervention. If they study the effects of manual therapy versus traction on cervical spine pain for patients with osteoarthritis, then the changes in pain would depend on which type of treatment the patient received. It would be rare to find an RCT designed with only one dependent variable. It is much more likely to find many tests used, often grouped into categories, for example, impairments to bodily functions or limitations to participation and activities in daily life. Researchers may choose their outcome or dependent variables based on previous research in order to make their research comparable to other studies. Identifying the dependent variables is important to your evaluation of the statistical conclusion validity of the study because each dependent variable will be tested by a research hypothesis.
++
A research hypothesis is a statement of what the author believes will happen in the study to a dependent variable. For example, a research hypothesis may predict that subjects in the manual therapy group will have a greater decrease in neck pain following one treatment than will the subjects in the cervical traction group. The author will predict an outcome for each dependent variable in the study, for example, predicting that manual therapy will cause a greater decrease in pain for the experimental group as well as an increase in cervical range of motion in flexion, extension, right and left rotation, and an improvement on a functional test for the upper limbs. This goal would require six research hypotheses to be tested, with the possibility that manual therapy is favored over cervical traction on all six dependent variables or on fewer than six.
++
The study results will present the means and standard deviations measured for each group in the study on each dependent variable. Table 12.6 shows sample data. In these data, it appears that the manual therapy group showed better changes after treatment on three of the dependent variables (pain and AROM in right and left rotation), while the traction group had better changes in AROM in cervical flexion and extension. Some differences between groups are large and some are small. Because we can never know for certain whether these changes observed in the data between the two study groups happened by chance or were influenced by some aspect of the study design, we utilize a statistical hypothesis test to estimate how confident we are of the differences we see in Table 12.6. In other words, how likely is it that the differences between the two groups are attributable to the different treatments they received? If good research study methods, as outlined in the first part of this chapter, were followed, then large differences between groups are unlikely to have happened by chance alone and are more likely to have been caused by the intervention. Therefore, each of the five variables listed in Table 12.6 will undergo a statistical hypothesis test.
++
++
There are many families of statistical tests that may be used to examine different hypotheses. Two large categories are tests of differences between means and tests of relationships between variables. Statistical hypotheses that examine differences in means between two or more groups utilize a form of hypothesis called the null hypothesis. The null hypothesis concerning differences in means would propose that there is no difference between the two group means, thus no true change has happened, or what might be called a null event. For example, in Table 12.6, the change score for pain for the manual therapy group was 4 points and the change score for the traction group was 2. The null hypothesis to test this variable would state that although these two change scores are not the same, their difference is so small that it could have happened by chance. An appropriate statistic, in this instance an independent (unpaired) t-test because two separate groups of subjects are being compared, would be reported by the researcher to test this null hypothesis of no difference between groups. If the results of the hypothesis test are such that the null hypothesis is not likely to be true, then it is rejected and an alternative hypothesis is accepted. In this case, the difference in pain scores between the manual therapy group and the traction group is found to favor a better outcome for the manual therapy group. This would be reported as a statistically significant difference between group means. However, if the statistical test found that the size of the difference was too small to rule out chance variation as its cause, then the outcome of the statistical test is said to be nonsignificant.
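++
To make the between-group comparison concrete, the sketch below computes a t statistic for two independent groups using a pooled variance. The individual change scores are hypothetical: they were chosen only so that the group means match the 4-point and 2-point changes mentioned above, and they are not the data behind Table 12.6. The function name `independent_t` is likewise introduced here for illustration.

```python
from math import sqrt
from statistics import mean, variance


def independent_t(group_a, group_b):
    """Pooled-variance t statistic and degrees of freedom for two independent groups."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    se_difference = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / se_difference, df


# Hypothetical change scores in pain (points of improvement).
manual_change = [5, 3, 4, 6, 2, 4, 5, 3]    # mean change of 4
traction_change = [2, 1, 3, 2, 4, 1, 2, 1]  # mean change of 2

t_value, df = independent_t(manual_change, traction_change)
print(f"t = {t_value:.2f} with {df} degrees of freedom")
```

With these made-up scores the statistic is about 3.35 on 14 degrees of freedom, which exceeds the tabled two-tailed critical value of 2.145 at the .05 level, so the null hypothesis of no difference would be rejected. A statistics package would normally report the exact p value as well.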
++
Hypothesis testing allows us to calculate how certain we are about the comparisons we wish to make. The reported results of statistical tests often include one or more of the following statistics: the observed value of the specific statistic used, the probability statistic (p) that estimates how likely it is that a result this large would occur by chance alone if the intervention had no effect, or a confidence interval surrounding the difference score. A p value lower than .05 is often considered a reasonable cutoff for attributing the outcomes of the study to the intervention, rather than to chance. A p value of .05 is interpreted as a likelihood of 5 in 100 of observing a difference between groups this large by chance alone, a rare occurrence. Here is another way to interpret a p value of .05: if there were truly no difference between the treatments and this study were repeated 100 times, with 100 different subject groups, a difference between groups at least this large would be expected to appear by chance in only about 5 of those studies. A mean difference between groups may be presented with a confidence interval, for example, 2 (1.5–2.5). If the 99% CI around a mean difference score does not include the null value 0, this increases our confidence that the group differences are large enough to be attributable to the study interventions, for it is quite unlikely that the true difference score is zero.
+++
Evaluating Outcomes: Effect Size and Number Needed to Treat
++
In evaluating the outcomes of the study, we can ask the question, were the groups similar at the end of the clinical trial? You will recall that several good research methods are used to ensure that the groups are similar at the beginning of the study, for example, random assignment of subjects to groups. The two groups illustrated in Table 12.6 appear to be quite similar on the dependent variables of interest to the researcher before the treatment was given. A comparison of the groups' pre-test scores is often performed, with a statistical test, to assure the reader that the randomization process worked and that the two groups are in fact similar before the intervention is given. From such comparisons of the equivalency of the groups prior to intervention, the researcher hopes to fail to reject the null hypothesis of no difference between groups. In doing so, one can conclude that the differences between groups on the dependent variables are not large enough to represent a statistical difference. When this is the case, any group differences found to be statistically significant at the end of the intervention can be more easily understood.
++
If the groups are statistically different at the end of the study on one or more dependent variables, our next question is, how different were they? This is a conclusion that readers can develop based on their clinical experience. Is a decrease of 4 points on an 11-point pain scale, from before to after receiving manual therapy, a big effect or a small effect? The statistical analysis will tell you whether the mean differences or change scores between groups are likely to have happened by chance or are due to the intervention. This is rather like a pass/fail decision; the hypothesis test either shows statistical significance or it does not. If it does show statistical significance, the reader can also benefit from interpreting a statistic that gives a sense of how large the differences, or effect of the treatment, were found to be.
++
The effect size (ES) is a statistic that nicely communicates exactly what its intent is: to tell you the relative effect of the tested intervention. The effect size statistic is calculated by dividing the difference between the two group means, or the difference between pre- and post-intervention means, by a pooled standard deviation of the two groups.^{20} For example, in Table 12.6 we find that the after-treatment mean pain score for the traction group was 6 and for the manual therapy group it was 4. The standard deviation for both groups was approximately 2. The ES statistic for pain scores would be calculated as (6 – 4)/2 = 1. An ES of 1 tells us that the relative change caused by the intervention was approximately equal to 1 standard deviation in pain scores. Since the standard deviation tells something about the variability in the data for pain scores, evaluating the magnitude of the change score against the typical variability in the group's pain scores helps us understand how big an impact, or effect, the manual therapy treatment had. If we examine the effect of both treatments, comparing pre-tests to post-tests, we see that the effect within the traction group was 1 while the effect within the manual therapy group was twice as large, with an ES of 2.
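++
The effect size arithmetic in this paragraph can be reproduced as follows. This is a sketch only: the means, change scores, and standard deviation of 2 come from the example in the text, while the group sizes of 20 are an assumption needed to pool the standard deviations and do not come from Table 12.6.

```python
from math import sqrt


def pooled_sd(sd_1, n_1, sd_2, n_2):
    """Pooled standard deviation of two groups."""
    return sqrt(((n_1 - 1) * sd_1**2 + (n_2 - 1) * sd_2**2) / (n_1 + n_2 - 2))


def effect_size(mean_difference, sd):
    """Effect size: a mean difference expressed in standard deviation units."""
    return mean_difference / sd


sd = pooled_sd(2, 20, 2, 20)  # n = 20 per group is hypothetical

between_groups = effect_size(6 - 4, sd)  # traction vs manual therapy after treatment
within_traction = effect_size(2, sd)     # 2-point pre-to-post change
within_manual = effect_size(4, sd)       # 4-point pre-to-post change

print(between_groups, within_traction, within_manual)
```

Note that the subtraction is carried out before the division, which is why the between-group result is 1 rather than 4.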
++
Another statistic, the number needed to treat (NNT), can be useful in interpreting the magnitude of a comparison of treatments in an RCT.^{22,23} This statistic is calculated by identifying the percentage of successful outcomes in the control group versus the experimental group, calculating the difference, and dividing that difference into 1 (this formula is described later in the chapter). The resulting number provides an estimate of the number of patients who would need to be treated in order to have one more successful case than if you had used the control or comparison treatment. For the NNT analysis, the data must be dichotomous, for example, successful outcome versus failure. Any continuous data can be translated into this dichotomous format by selecting a cutoff point to define success and failure. An NNT of 1 would indicate that every patient treated with the experimental treatment will have a successful outcome, while an NNT of 7 would indicate that you could expect to treat 7 people before you increased your success rate by 1 person with the new treatment compared to the old treatment. An NNT of –1 would indicate that in every case, the control treatment will produce a better outcome. The NNT statistic is helpful in comparing the relative effectiveness of competing treatments, such that the treatment with the lowest NNT should be considered preferable, given similar costs and side effects of treatment. This statistic is not yet found with regularity in the physical therapy literature, but it is thought to be a clear and intuitive statistic for clinicians to use to compare clinical practice alternatives.
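++
A short sketch may make the NNT arithmetic concrete. The change scores, the 3-point cutoff used to define success, and the function names below are all hypothetical; they are invented only to show the dichotomizing step and the division into 1.

```python
def proportion_successful(change_scores, cutoff):
    """Dichotomize continuous change scores: success means change >= cutoff."""
    successes = sum(1 for score in change_scores if score >= cutoff)
    return successes / len(change_scores)

def number_needed_to_treat(success_experimental, success_control):
    """NNT = 1 / (difference in the proportion of successful outcomes)."""
    difference = success_experimental - success_control
    if difference == 0:
        raise ValueError("No difference between groups; NNT is undefined.")
    return 1.0 / difference

# Hypothetical pain-reduction scores (points of improvement) for 10 patients per group.
experimental = [4, 3, 1, 5, 2, 0, 3, 4, 1, 2]
control = [1, 2, 3, 0, 1, 4, 2, 1, 0, 3]

p_experimental = proportion_successful(experimental, cutoff=3)  # 5 of 10 succeed
p_control = proportion_successful(control, cutoff=3)            # 3 of 10 succeed
nnt = number_needed_to_treat(p_experimental, p_control)
print(f"Success: {p_experimental:.0%} vs {p_control:.0%}, NNT = {nnt:.1f}")
```

Note that the result depends on the cutoff chosen; a different definition of success would change both proportions and therefore the NNT.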
++
As you read the results section of a paper, identify each hypothesis test that is reported and find in either table or text the result of the test. Your focus should be on identifying the statistically significant findings, as those are the ones on which the author will focus. If not all dependent variables measured in the study are found to have statistically significant changes in the desired direction, the author will discuss this finding in the paper, perhaps identifying methodological issues that precluded the anticipated results. One of the most common reasons identified for failure to find statistically significant results has to do with the power of the study. The power of the study is an expression of the capability of the study to find statistically significant results of a certain desired size when in truth there is a difference.^{19} Researchers can use simple calculations to determine the power of their study to find differences between treatments. The calculation requires knowledge of the typical variability in the primary dependent variable, an estimate of the desired effect size of interest to the researcher, and the sample size. By adjusting these elements the researcher can plan for sufficient subjects to ensure the statistical conclusion validity of the study. If the researcher does not allow for a customary number of dropouts in the methods and the study is plagued by a large attrition rate, the power of the study could be compromised. It is not uncommon for a study to be underpowered for examination of all important variables.
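++
The relationship among effect size, sample size, and power can be sketched as follows. This is a simplified illustration, not the method of any particular study: it assumes a two-group comparison of means, a two-sided alpha of 0.05, and a normal approximation (an exact t-based calculation gives slightly larger sample sizes). The function names are hypothetical.

```python
import math
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects needed per group to detect a given effect size."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STANDARD_NORMAL.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

def approximate_power(effect_size, n_per_group, alpha=0.05):
    """Approximate power achieved with a given number of subjects per group."""
    z_alpha = STANDARD_NORMAL.inv_cdf(1 - alpha / 2)
    return STANDARD_NORMAL.cdf(effect_size * math.sqrt(n_per_group / 2) - z_alpha)

planned_n = sample_size_per_group(effect_size=0.5)  # 63 per group under this approximation
power_as_planned = approximate_power(0.5, planned_n)
power_after_attrition = approximate_power(0.5, 40)  # hypothetical: 23 dropouts per group

print(f"Planned n per group: {planned_n}")
print(f"Power as planned: {power_as_planned:.2f}")
print(f"Power after attrition: {power_after_attrition:.2f}")
```

The second power value shows why attrition matters: with the same effect size and fewer subjects, the chance of detecting a true difference drops below the conventional 80% target.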
++
As we evaluate the power of a study, we use one or more criteria that indicate how small a change, or difference between groups, we wish to be able to find in the study. Two terms with distinct definitions may be provided by the author: the minimal detectable change (MDC) or the minimal clinically important difference (MCID). The MDC is defined as the smallest change in score that can be statistically detected beyond random error.^{24} The MCID is often identified by the researcher as a change score between groups that is large enough to justify a change in practice. Both of these values can be calculated from data found in previous studies of the intervention of interest (for example, by using the mean change score or by multiplying the desired effect size by the standard deviation) using one of several formulas. For instance, if a study of manual therapy compared to cervical traction to reduce cervical spine pain found a mean change score in the experimental group of 3 points, an investigator may set 3 points as the MCID for a study of the effects of manual therapy in patients with lumbar spine pain. These statistics are useful to understand in reading studies, as they alert the clinician to the likely magnitude of changes that might be expected from the intervention, and they show that the author designed the study with sufficient power to find those differences, if, in fact, they existed.
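++
Two of the "several formulas" can be sketched as follows. The first simply multiplies a desired effect size by the standard deviation, as described above. The second is one commonly used expression for the MDC at the 95% confidence level, based on the standard error of measurement; it is not given in this chapter and is included here as an assumption, as are the standard deviation of 2 points, the effect size of 1.5, and the test-retest reliability of 0.90.

```python
import math

def change_threshold_from_effect_size(effect_size, sd):
    """Target change score = desired effect size x standard deviation."""
    return effect_size * sd

def mdc95(sd, reliability):
    """One common MDC formula: 1.96 x sqrt(2) x SEM, with SEM = SD x sqrt(1 - r)."""
    sem = sd * math.sqrt(1 - reliability)
    return 1.96 * math.sqrt(2) * sem

# Assumed illustrative values: SD of 2 points on the pain scale,
# desired effect size of 1.5, test-retest reliability of 0.90.
target_change = change_threshold_from_effect_size(effect_size=1.5, sd=2.0)
detectable_change = mdc95(sd=2.0, reliability=0.90)

print(f"Target change score: {target_change:.1f} points")
print(f"MDC95: {detectable_change:.2f} points")
```

Under these assumed values the 3-point target exceeds the MDC, so a change of that size would be larger than what measurement error alone is likely to produce.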
+++
Summary of Statistical Conclusion Validity
++
We have covered some of the basic concepts that will allow the reader of a randomized clinical trial to comprehend the basic statistical approach to the results section of the study. We focused on the principles of data analysis that will permit you to examine the data presented in the study, read the tables and figures, and confirm what the author provides in the text. An understanding of concepts such as variability within the data in one group or between two or more groups helps you to accept the statistical validity of the study. If you wish to master additional statistical concepts, you will find some recommended resources at the end of the chapter.
+++
DETERMINING VALUE IN STUDIES OF DIAGNOSTIC/PROGNOSTIC ACCURACY
++
As clinicians we require research evidence to allow us to select the best interventions for our patients; for example, evidence that will support the choice of manual techniques, the dose of exercise, the timing of treatment, or the decision not to treat at all. But these intervention choices are only useful in the context of our clinical judgment about the patient's diagnosis and prognosis, important first steps in the patient management model. So we must also seek good research evidence to support our diagnostic and prognostic decisions. Diagnostic accuracy studies focus on evaluating clinical tests against the best available gold standard tests to determine how accurate our clinical tests are at finding patients that have and do not have a certain diagnosis.^{4} Studies of patient prognoses are designed to learn the likely course of recovery for patients in similar groups, with or without certain characteristics, for example, those patients who are likely to reinjure a joint if they return to sport with or without a protective brace. In physical therapy literature, we have experienced a significant increase in diagnostic accuracy studies, but there are fewer studies that help us understand a patient's prognosis, especially the long-term outcomes of our care. The criteria for good research practice are similar for both these types of research questions, so we will focus on diagnostic accuracy studies in this discussion. As with the criteria for determining value in the previous section on intervention studies, the goal is to avoid the influence of bias in the design and analysis of these studies. We will discuss three categories of good research practice: the selection of research subjects, the selection of measures and procedures, and the statistical estimates.
+++
Did the Selection of Subjects Avoid the Introduction of Bias?
++
The methods used in observational studies to assess the diagnostic accuracy of clinical tests can have a large impact on the truthfulness of the statistical estimates for the examination. The research design options for observational studies are primarily defined by the selection of subjects: cohort studies identify subjects before any exposure has happened and follow the subjects forward; case-control studies identify subjects with the disease and those without the disease and retrospectively evaluate their exposure to the variable of interest; and cross-sectional designs select subjects and perform tests at the same point in time. Our first set of quality criteria will therefore focus on the research design selected, that is, on how the subjects were identified.
++
The selection of appropriate research subjects is as important in ensuring quality in diagnostic accuracy studies as it is in clinical trials. The investigator will carefully identify the target population of patients and establish inclusion and exclusion criteria that provide a group of research subjects who are likely to need the diagnostic test under study. Subjects who have no injury at all or are so clearly impaired as to make a diagnosis obvious should be excluded from diagnostic accuracy studies.^{25} If these subjects are not excluded, the results of the study will be biased, usually overestimating the accuracy of the test. If the study fails to have a heterogeneous sample of subjects with a range of severity of the suspected impairment, spectrum bias may be introduced.^{26,27} Spectrum bias occurs when the sample selected for the study contains subjects at the furthest extremes of the diagnostic spectrum, while ignoring subjects in the middle of the spectrum. Spectrum bias in diagnostic accuracy studies has been linked to the selection of a case-control research design. Boyer et al^{28} performed a systematic review of diagnostic studies of clinical examinations for diagnosing carpal tunnel syndrome, finding spectrum bias in 65% of the 23 highest-quality studies, all of which used a case-control research design and were determined to have overestimated the accuracy of the tests.
++
To judge the adequacy of the sampling and research design, the author must provide a detailed description of the sampling methods and of the subjects' participation in the study so that you know the number of dropouts in the study. A flowchart similar to what you would expect in a RCT would be helpful.
+++
Did the Selection of Measures and Procedures Avoid the Introduction of Bias?
++
The research design preferred for diagnostic accuracy studies is cross-sectional, requiring one group of subjects who receive two or more tests in a closely determined time frame. Enrollment of subjects into these studies is usually sequential and prospective, but some retrospective studies of diagnostic accuracy are reported. These studies are of less value to the clinician because so many of the following procedural aspects of the study cannot be controlled. A diagnostic accuracy study will be performed to assess a new or existing test, the index test, against a reference test or gold standard. In physical therapy research, the reference test is assumed to be the most accurate test or group of tests available to diagnose an impairment in body structures or functions or in participation in desired social activity. Any differences between the index test and the reference test are assumed to be attributable to errors in the index test.^{29,30,31} It is important that the index test not be considered a part of the reference test, for example, choosing to compare a subtest score on the SF-36 to the total score on the SF-36. (The SF-36 is a standardized measure of health-related quality of life with subsets of items that provide scores in eight separate health domains as well as the composite score.) When this is done, it is likely the results will over-estimate the accuracy of the index test.
++
The procedures used to administer the tests in the study are also very important to the value of the study. The first aspect to consider is whether all subjects received both the index test and the reference test. If the reference test is expensive or could cause the subject discomfort or other risks considered inappropriate for that subject, the result is that a nonrandom group of study subjects receive both tests, and this could introduce a selection bias into the findings. This is less of a concern for diagnostic accuracy studies in physical therapy, but if not all subjects can receive the reference standard, the best research practice would require a random sample of study subjects to be selected for the reference test. This practice will also control for additional selection bias if only subjects who receive a certain score on the index test are given the gold standard test.
++
The timing of the two tests is also a criterion on which to evaluate the study. If the measures are not taken in close proximity, there is a chance of finding disease progression bias. This can occur when the subject's status changes, for better or worse, in the intervening time between measures. This must be evaluated for each study, taking into consideration how likely it is that the subjects' status might have changed. If the measures are assessing physiological properties like heart rate or blood pressure, the index and reference measures should be taken at the same time or very close in time. Other diagnostic clinical tests use imaging results as a reference standard and may be taken days apart. The author should carefully describe the decisions for the timing of the measures in these types of studies, and you will find value when you believe that the phenomenon being measured has not changed between index and reference testing.
++
Diagnostic accuracy studies should contain detailed descriptions of how each measure was performed, including figures or photos if helpful. This is important both for replicating the study findings or using the tests in your clinical practice and for your assessment of the appropriateness of the test. For tests of physical performance or assessment of joint structures, knowledge of the tester's position, the subject's position, the direction and magnitude of force application, and many other aspects of the test will be crucial to your assessment of the value of the test to your practice. It is also important to evaluate who performed each test, to assure that the testers were blind to the outcomes of the reference test if they were taking the index tests, and vice versa. The tester who takes the first measure will not be at risk for knowing the second test results, but even if the order of testing is randomized, the author should state the procedures taken to ensure that each tester had no knowledge of the other test results for subjects.
+++
Were the Statistical Estimates Developed and Presented Without Bias or Error?
++
The next elements to evaluate in a paper reporting the diagnostic accuracy of a test are the results and estimates provided for each test. If the previously discussed methods of procuring the test results from the correct subjects have been followed, the next step is to interpret the results of the statistical analysis used by the authors. There are a variety of statistics used to analyze the data on the index test. We will define several of the most common and discuss the interpretation of each. Descriptive data (for example, frequency counts) for the subject's scores on the two tests will provide the first level of analysis of the accuracy of the index test.
++
The simplest way to examine diagnostic accuracy is to dichotomize the results of each test as either positive or negative. This will occasionally require collapsing data from a test that has more than two categories of outcome, for example four grades of tendon lesions might be collapsed by denoting the highest three grades of injuries as positive tests and the lowest grade as a negative test. When the test scores are not categorical but continuous data, a cutoff point must be selected to indicate when a positive test should be recorded. Table 12.7 provides a description of a typical 2 × 2 cross tabulation table that provides this first level of analysis for the results of diagnostic accuracy studies and the basic framework that allows the construction of diagnostic accuracy estimates. Table 12.8 gives the calculations for each of the statistical estimates and a brief description of how to interpret each one.
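++
The dichotomizing step can be sketched as a small function that applies a cutoff to continuous index test scores and tallies the four cells of the 2 × 2 table. The scores, the reference test results, the cutoff of 5, and the cell labels (TP, FP, FN, TN) are hypothetical choices made for illustration; Table 12.7 may label its cells differently.

```python
def cross_tabulate(index_scores, has_diagnosis, cutoff):
    """Tally the 2 x 2 table: index test (score >= cutoff) vs. reference test."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for score, diagnosed in zip(index_scores, has_diagnosis):
        index_positive = score >= cutoff
        if index_positive and diagnosed:
            counts["TP"] += 1   # positive index test, diagnosis present
        elif index_positive:
            counts["FP"] += 1   # positive index test, diagnosis absent
        elif diagnosed:
            counts["FN"] += 1   # negative index test, diagnosis present
        else:
            counts["TN"] += 1   # negative index test, diagnosis absent
    return counts

# Hypothetical data for eight subjects.
scores = [7, 3, 8, 5, 2, 9, 4, 6]
reference = [True, False, True, False, False, True, True, False]

table = cross_tabulate(scores, reference, cutoff=5)
print(table)
```

Moving the cutoff shifts subjects between cells, so the choice of cutoff directly affects every accuracy estimate built from the table.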
++
++
++
The basic diagnostic accuracy statistics allow the clinician to select tests and measures with the best statistical conclusion validity. We would desire tests that have both high sensitivity and high specificity, so that incorrect classification of patients is minimized, but few tests fall into this category. The sensitivity of a test is perhaps the easiest statistic to understand: it expresses whether the test is sensitive enough to identify the individuals who actually have the diagnosis, that is, the proportion of those with the diagnosis who have a positive test (the true positives). It is useful to have screening tools with high sensitivity, for if a patient tests negative on a highly sensitive screening tool, you are more confident in ruling out that diagnosis. A mnemonic for this situation is SNOUT, i.e., highly Sensitive test, Negative test result, rule the diagnosis OUT. The specificity of a test describes its ability to identify the individuals who do not have the diagnosis, that is, the proportion of those without the diagnosis who have a negative test (the true negatives). Highly specific tests are good at finding individuals who do not have the diagnosis, so a positive test result on a highly specific test tells the clinician to pay attention to that result, for it may rule in the diagnosis. The mnemonic for this situation is SPIN, i.e., highly SPecific test, Positive test result, rule the diagnosis IN.^{26,32}
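++
Sensitivity and specificity can be computed directly from the four cells of a 2 × 2 table, as in the following sketch. The counts are hypothetical, and the cell names (true positives, false negatives, and so on) are conventional labels assumed here rather than copied from Table 12.8.

```python
def sensitivity(true_positives, false_negatives):
    """Proportion of subjects WITH the diagnosis who test positive."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Proportion of subjects WITHOUT the diagnosis who test negative."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical study: 50 subjects with the diagnosis, 50 without.
tp, fn = 45, 5    # results among those with the diagnosis
tn, fp = 30, 20   # results among those without the diagnosis

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"Sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```

With these counts the test misses few people who have the diagnosis, so a negative result is informative (SNOUT), but its many false positives mean a positive result is less convincing.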
++
The positive and negative predictive values (PPV and NPV) are proportions that can be calculated from the 2 × 2 table by working the table "horizontally" rather than "vertically," as is done to calculate sensitivity and specificity.^{20} As with the sensitivity and specificity statistical estimates, we also wish for high PPV and NPV. However, these two statistical estimates are influenced by the prevalence rate of the diagnosis for all subjects in the study. The prevalence expresses the percentage of the population of interest who have the diagnosis at any given point in time.^{20} If the prevalence of the diagnosis is high in the study sample, this increases the predictive value of a positive test and decreases the predictive value of a negative test, and the converse is true if the prevalence of the diagnosis is very low in the study sample.
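++
The influence of prevalence on the predictive values can be seen in a brief sketch. The functions below are hypothetical helpers; the sensitivity (0.90), specificity (0.80), and the two prevalence values are assumed for illustration only.

```python
def positive_predictive_value(sens, spec, prevalence):
    """Proportion of positive tests that are true positives."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

def negative_predictive_value(sens, spec, prevalence):
    """Proportion of negative tests that are true negatives."""
    true_neg = spec * (1 - prevalence)
    false_neg = (1 - sens) * prevalence
    return true_neg / (true_neg + false_neg)

# Same test (assumed sensitivity 0.90, specificity 0.80) in two samples.
ppv_low = positive_predictive_value(0.90, 0.80, prevalence=0.10)
npv_low = negative_predictive_value(0.90, 0.80, prevalence=0.10)
ppv_high = positive_predictive_value(0.90, 0.80, prevalence=0.50)
npv_high = negative_predictive_value(0.90, 0.80, prevalence=0.50)

print(f"Prevalence 10%: PPV = {ppv_low:.2f}, NPV = {npv_low:.2f}")
print(f"Prevalence 50%: PPV = {ppv_high:.2f}, NPV = {npv_high:.2f}")
```

The test itself is unchanged between the two samples; only the prevalence differs, yet the PPV rises and the NPV falls as the diagnosis becomes more common.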
++
Likelihood ratios compare how often a given test result (either positive or negative) occurs in subjects who really have the diagnosis with how often it occurs in those who do not. For example, the likelihood ratio of a positive index test places the proportion of subjects with the diagnosis who tested positive in the numerator and the proportion of subjects without the diagnosis who tested positive in the denominator. You can think of this as a ratio of the true positive rate to the false positive rate, that is, sensitivity divided by (1 – specificity). It is clear, then, that we wish a positive likelihood ratio statistic to be high for an index test, and a negative likelihood ratio to be low. Positive likelihood ratios greater than 10 or negative likelihood ratios less than 0.1 are considered to be characteristic of tests that are quite strong.
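++
Likelihood ratios are usually computed from sensitivity and specificity. The sketch below uses the conventional formulas, stated here as an assumption because Table 12.8 is not reproduced in this section; the sensitivity of 0.90 and specificity of 0.95 are hypothetical values.

```python
def positive_likelihood_ratio(sens, spec):
    """LR+ = true positive rate / false positive rate."""
    return sens / (1 - spec)

def negative_likelihood_ratio(sens, spec):
    """LR- = false negative rate / true negative rate."""
    return (1 - sens) / spec

# Hypothetical index test with sensitivity 0.90 and specificity 0.95.
lr_positive = positive_likelihood_ratio(0.90, 0.95)
lr_negative = negative_likelihood_ratio(0.90, 0.95)

print(f"LR+ = {lr_positive:.1f}, LR- = {lr_negative:.2f}")
```

Judged against the thresholds of 10 and 0.1 mentioned above, this hypothetical test would be strong for ruling a diagnosis in and close to strong for ruling it out.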
+++
Other Statistics Found in Observational Research
++
We will conclude this section with a discussion of additional statistics terms which may be found in observational studies that address either diagnostic accuracy or the impact of an intervention. Some simple statistics calculated from frequency data that are helpful to understand include ratios, proportions, and rates.^{20}
++
Ratios are created by dividing one frequency by another and are expressed using a colon (:). Proportions are calculated by dividing a subset frequency by the total frequency. Rates are calculated by expressing a proportion over a period of time. We will illustrate these statistics with hypothetical data for mechanisms of glenohumeral joint dislocation.
++
Your outpatient physical therapy clinic provides coverage for the sports teams at three area high schools, enrolling 800 student athletes. In the past year, you treated 40 student athletes with a diagnosis of glenohumeral joint dislocation, 38 of which were anterior in direction and 2 of which were posterior dislocations.
++
The ratio of anterior to posterior dislocations is 38/2 = 19:1. There were 19 anterior dislocations for every posterior dislocation.
The proportion of injured student athletes who experienced an anterior dislocation is calculated as 38/40 = 0.95, or 95%.
++
The rate of occurrence of anterior glenohumeral dislocations is calculated as 38/800 = 0.0475; and the rate of occurrence of posterior glenohumeral dislocations is calculated as 2/800 = 0.0025. These small decimal values are often multiplied by a constant to express the rate in easier-to-understand terms; for example, if we multiply the rate of anterior dislocations by a constant of 1000, the reported rate of anterior dislocations would be 47.5 student athletes in 1000 in the past year.
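++
The three calculations above can be reproduced in a few lines. The counts are the hypothetical clinic data from this example; the variable names are chosen here for illustration.

```python
from math import gcd

anterior = 38    # anterior dislocations treated in the past year
posterior = 2    # posterior dislocations treated in the past year
athletes = 800   # student athletes covered by the clinic

# Ratio: one frequency divided by another, reduced to lowest terms.
divisor = gcd(anterior, posterior)
ratio = f"{anterior // divisor}:{posterior // divisor}"

# Proportion: a subset frequency divided by the total frequency.
proportion_anterior = anterior / (anterior + posterior)

# Rate: a proportion over a period of time, scaled by a constant of 1000.
rate_anterior_per_1000 = anterior * 1000 / athletes
rate_posterior_per_1000 = posterior * 1000 / athletes

print(f"Ratio of anterior to posterior dislocations: {ratio}")
print(f"Proportion anterior: {proportion_anterior:.0%}")
print(f"Anterior rate: {rate_anterior_per_1000} per 1000 athletes per year")
print(f"Posterior rate: {rate_posterior_per_1000} per 1000 athletes per year")
```

Notice that the proportion uses the 40 injured athletes as its denominator, while the rate uses all 800 athletes at risk during the year.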
++
If the study examines how proportions may differ for groups of subjects, the calculation of risks and odds may be used. Suppose we chose to follow the 40 athletes in our practice who experienced shoulder dislocation because we wish to understand the influence of their rehabilitation on the occurrence of redislocation of the shoulder. Buteau et al have described a high prevalence of reoccurrence of glenohumeral dislocation in athletes in their case report on the use of a new piece of exercise equipment, the body blade, which oscillates to provide resistance in upper-limb strengthening programs.^{33} Looking retrospectively at the experiences of these injured athletes, we find the data reported in Table 12.9. Surprisingly, one-half of the injured athletes were not referred for rehabilitation, but were only immobilized in a sling; and the other half were referred for physical therapy, including the use of the body blade! The frequency data in this 2 × 2 table can be translated into several statistical estimates: risk ratios, absolute risk reduction, number needed to treat, and odds ratios. It is not common to find a 2 × 2 table presented in a research report; however, these statistical estimates are often used to describe how the groups differed.
++
++
One can quickly see that the risk of reinjury to the athlete's shoulder was higher for the immobilization-only group than for the experimental/body blade group. A calculation of the absolute risk reduction allows the subsequent calculation of the number needed to treat; in this instance, it is quite low at 2.22, indicating that you would need to treat only two to three athletes with shoulder dislocation with the body blade to prevent one additional recurrence of a dislocated shoulder. The odds ratio is constructed by dividing the odds of a second dislocation in the immobilization group by the odds of a second dislocation in the body blade group; in this case the odds of the control subjects having a second dislocation are much higher than the odds that they did not have a second dislocation. The odds ratio of 8.5 indicates that the odds of a second dislocation were 8.5 times higher in the immobilization group than in the body blade group. Odds ratios are often presented with a confidence interval, so you can observe the upper and lower bounds for this estimate. Odds ratios are also used to indicate the magnitude of the relationship between more than two groups and an outcome variable. In this type of analysis, one of the groups is used as the reference standard, and odds ratios are calculated for each of the other groups in reference to that group. For example, if we seek to understand the impact of age on balance, and we group our subjects into four age groups, odds ratios may be presented to express the likelihood that the groups have different balance performance.
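++
These estimates can be sketched from a 2 × 2 table of counts. Table 12.9 is not reproduced in this section, so the counts below (17 of 20 redislocations with immobilization, 8 of 20 with the body blade) are assumptions chosen because they reproduce the NNT of 2.22 and the odds ratio of 8.5 quoted above. The function names and the log-based (Woolf) confidence interval are likewise illustrative choices, not the authors' stated method.

```python
import math

def risk_estimates(events_control, n_control, events_treated, n_treated):
    """Risk ratio, absolute risk reduction, NNT, and odds ratio from counts."""
    risk_control = events_control / n_control
    risk_treated = events_treated / n_treated
    arr = risk_control - risk_treated
    odds_control = events_control / (n_control - events_control)
    odds_treated = events_treated / (n_treated - events_treated)
    return {
        "risk_ratio": risk_control / risk_treated,
        "absolute_risk_reduction": arr,
        "number_needed_to_treat": 1 / arr,
        "odds_ratio": odds_control / odds_treated,
    }

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Approximate 95% CI for an odds ratio (a*d)/(b*c) using the log method."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return math.exp(log_or - z * se), math.exp(log_or + z * se)

# Assumed counts: a, b = immobilization (redislocated, not); c, d = body blade.
a, b, c, d = 17, 3, 8, 12
estimates = risk_estimates(a, a + b, c, c + d)
ci_lower, ci_upper = odds_ratio_ci(a, b, c, d)

for name, value in estimates.items():
    print(f"{name}: {value:.2f}")
print(f"odds ratio 95% CI: {ci_lower:.2f} to {ci_upper:.2f}")
```

With only 20 athletes per group the confidence interval around the odds ratio is wide, which is a reminder to read the interval and not just the point estimate.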
++
This chapter has provided an overview of important elements that should be considered when assessing the value of experimental and observational research as it is used to help determine the accuracy of diagnostic and prognostic measures and the worth of interventions for patients. The primary focus in discussing each of these elements has been to help assure that the researchers have done what can be done to reduce the risk of bias in conducting their research and in presenting the results. It is only through such efforts that you can determine the level of confidence you should have in applying the results of the study to patient care. These elements were discussed in the context of an individual study, but they apply to all of the sources of evidence that we will discuss in subsequent chapters.
++
Find an article that supports an intervention provided to Mr. Ketterman. Identify each of these features in the article:
Design
Methods intended to reduce bias in selection and management of subjects
Methods designed to reduce bias related to selection of interventions and measurement of outcomes
Using the same article, identify the information that would determine external validity.
Using the same article, identify the descriptive and inferential statistics. Do they support the conclusions reached by the authors?
Find an article that supports a diagnostic/prognostic test used in Mr. Ketterman's care. Identify each of these features in the article:
Methods intended to reduce bias in selection and management of subjects
Methods intended to reduce bias in the selection of measures and procedures
Select a diagnostic/prognostic article that presents ratios, proportions, and rates. How can you use these data as you make decisions about Mr. Ketterman?
++
++
DiFabio R. Essentials of Rehabilitation Research: A Statistical Guide to Clinical Practice. Philadelphia: FA Davis; 2013.
++
Domholdt E. Rehabilitation Research: Principles and Applications, 3rd ed. St. Louis: Elsevier Saunders; 2005.
++
Fetters L, Tilson J. Evidence-Based Physical Therapy. Philadelphia: FA Davis; 2012.
++
Jewell DV. Guide to Evidence-Based Physical Therapist Practice, 2nd ed. Ontario, Canada: Jones and Bartlett Learning LC; 2011.
++
Portney L, Watkins M. Foundations of Clinical Research, 3rd ed. Pearson Prentice Hall; 2009.
++
Select and become familiar with one of these texts so you can use it to explore questions you have as you read the evidence.
++
1. DiCenso A, Bayley L, Haynes RB. Accessing pre-appraised evidence: Fine-tuning the 5S model into a 6S model. Evid Based Nurs. 2009;12:99–101. [PubMed: 19779069]
2. Califf RM. Evolving Methods: Alternatives to Large Randomized Controlled Trials. In: The Learning Health Care System: Workshop Summary. Roundtable on Evidence Based Medicine. Institute of Medicine of the National Academies. Washington, DC: The National Academies Press; 2007:84–92.
3. Tunis S. Practical Clinical Trials. In: The Learning Health Care System: Workshop Summary. Roundtable on Evidence Based Medicine. Institute of Medicine of the National Academies. Washington, DC: The National Academies Press; 2007:57–60.
4. Olsen L, Aisner D, McGinnis JM. Roundtable on Evidence-Based Medicine. In: The Learning Healthcare System. Washington DC: The National Academies Press; 2007.
5. Campbell DT, Stanley JC. Experimental and Quasi-Experimental Designs for Research. Boston: Houghton Mifflin Co.; 1963.
6. Campbell DT, Stanley JC. Experimental and Quasi-Experimental Designs for Research. Boston: Houghton Mifflin Co.; 1966.
7. McKibbon A, et al. Finding the evidence. In: Guyatt G, Rennie D (eds). Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice. Chicago: JAMA and Archives; AMA Press; 2002.
8. Guyatt C, Devereaux M, Straus. User's Guides to the Medical Literature: Essentials of Evidence-Based Clinical Practice. 2000.
9. Concato J, Shah N, Horwitz RI. Randomized, controlled trials, observational studies, and the hierarchy of research designs. NEJM. 2000.
10. Guyatt G, Rennie D, eds. 1B1 Therapy. JAMA and Archives. Chicago: AMA Press; 2002.
11. Heneghan C, Badenoch D. Evidence-Based Medicine Toolkit, 2nd ed. BMJ Books Blackwell Publishing; 2006.
12. Greenhalgh T. How to Read a Paper: The Basics of Evidence-Based Medicine, 3rd ed. BMJ Books, Blackwell Publishing; 2006.
13. Schulz KF, Altman DG, Moher D for the CONSORT Group. CONSORT 2010 Statement: Updated guidelines for reporting parallel group randomized trials. BMJ. 2010;340:c332. [PubMed: 20332509]
14. Altman DG, et al. The revised Consort statement for reporting randomized trials: Explanation and elaboration. AIM. 2001;134(April):8.
15. Eden J, Wheatley B, McNeil B, Sax H (eds). Knowing What Works in Health Care: A Roadmap for the National Institute of Medicine. Washington DC: National Academies Press; 2008.
16. Steiner WA, et al. Use of the ICF Model as a clinical problem-solving tool in physical therapy and rehabilitation medicine. Phys Ther. 2002;82(11):1098–1107. [PubMed: 12405874]
17. Roos EM, Lohmander LS. The Knee Injury and Osteoarthritis Outcome Score (KOOS): From joint injury to osteoarthritis. Health Qual Life Outcomes. 2003;1:64. [PubMed: 14613558]
18. APTA. Guide to Physical Therapist Practice, 2nd ed. Alexandria, VA: American Physical Therapy Association; 2001.
19. Jewell DV. Guide to Evidence-Based Physical Therapist Practice, 2nd ed. Ontario Canada: Jones and Bartlett Learning LC; 2011.
20. Domholdt E. Rehabilitation Research: Principles and Applications, 3rd ed. St. Louis: Elsevier Saunders; 2005.
21. Norman GR, Streiner DL. Biostatistics: The Bare Essentials, 3rd ed. Shelton, CT: People's Medical Publishing House; 2008.
22. Dalton GW, Keating JL. Number needed to treat: A statistic relevant for physical therapists. Phys Ther. 2000;80:1214–1219. [PubMed: 11087308]
24. Turner D, et al. The minimal detectable change cannot reliably replace the minimal important difference. J Clin Epidem. 2010;63:28–36.
25. Jaeschke R, et al. Diagnostic tests. In: Guyatt G, Rennie D, eds. Users' Guides to the Medical Literature: A Manual for Evidence-Based Clinical Practice. Chicago: JAMA & Archives, AMA Press; 2002.
26. Cook C, Cleland J, Huijbregts P. Creation and critique of studies of diagnostic accuracy: Use of the STARD and QUADAS methodological quality assessment tools. J Man Manip Ther. 2007;15(2):93–102. [PubMed: 19066649]
27. Lijmer JG, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA. 1999;282:1061–1066. [PubMed: 10493205]
28. Boyer K, et al. Effects of bias on the results of diagnostic studies of carpal tunnel syndrome. J Hand Surg Am. 2009;Jul–Aug;34(6):1006–1013.
29. Whiting PF, et al. Sources of variation and bias in studies of diagnostic accuracy: A systematic review. Ann Intern Med. 2004;140:189–202. [PubMed: 14757617]
30. Whiting PF, et al. The development of QUADAS: A tool for the quality assessment of studies of diagnostic accuracy included in systematic reviews. BMC Med Res Method. 2003;3:25.
31. Whiting PF, et al. Evaluation of QUADAS: A tool for the quality assessment of diagnostic accuracy studies. BMC Med Res Method. 2006;6:9.
32. Pewsner D, Battaglia M, Minder C, Marx A, Bucher HC, Egger M. Ruling a diagnosis in or out with "SpPIn" and "SnNOut": A note of caution. BMJ. 2004;329:209–13. [PubMed: 15271832]
33. Buteau JL, Eriksrud MS, Hasson SM. Rehabilitation of a glenohumeral instability utilizing the body blade. Physiother Theory and Prac. 2007;23(6):333–349.