Measurement validity is an essential component of evidence-based practice, ensuring that our assessment tools provide accurate information for decision making. Although clinicians constantly face uncertainty in patient management, many decision-making strategies can be applied to reduce this uncertainty.

Concepts of measurement validity were introduced in Chapter 6. In this chapter we present statistical procedures related to the accuracy of diagnostic tools, choosing cutoff scores, the application of clinical prediction rules, and methods for measuring clinically meaningful change.

Many measuring instruments are specifically designed as screening or diagnostic tools. In a traditional medical framework, a diagnostic test is used to determine the presence or absence of a disease or abnormal condition. A screening test is usually done on individuals who are asymptomatic, to identify those at risk for certain disorders, and to classify patients who are likely to benefit from specific intervention strategies. Because these procedures involve allocation of resources, present potential risks to patients and are used for clinical decision making, it is important to verify their validity.

The results of a diagnostic or screening procedure may be dichotomous, categorical or continuous. The simplest tests will have only a dichotomous outcome: positive or negative, such as pregnancy or HIV status. A categorical test would involve ratings on an ordinal scale, such as +++, ++, +, − to reflect degree of sensation or reflexes. A continuous scale provides the most information regarding the outcome, such as a test measuring degrees of range of motion or hearing decibel level. Ordinal and continuous scales are often converted to dichotomous outcomes using cutoff scores to indicate a "normal" or "abnormal" response.

The ideal diagnostic test, of course, would always be accurate in discriminating between those with and without the disease or condition; it would always have a positive result for someone with the condition, whether a mild or severe case, and a negative result in everyone else. But we know that such tests are not perfect. They may miss abnormalities in those with a particular disorder, or they may identify abnormalities in those without the disorder.

We determine how good a test is by comparing the test result with known diagnostic findings obtained by a **reference standard**.^{∗} The reference standard will reflect the patient's true status, either the presence or absence of the condition. The assumption is made that the individual performing the test is blind to the true condition, eliminating possible bias. In some situations, the reference standard will be a concurrent test, such as an X-ray or blood test. In other situations, it will be obtained at a future time, as with a long-term outcome or autopsy. Sometimes there is no clear standard, and one must be defined or created. For instance, studies related to falls often use the patient's report of a fall within the past 6 months or year as the standard for being a "faller" or "nonfaller."^{1} Studies of delirium in hospitalized patients have used expert opinion as the reference standard to validate measures of confusion.^{2} When objective definitive standards are not available, the reference must be adequately described so that others can determine its applicability.

^{∗}We use the designation *reference standard* in place of "gold standard," as many tests do not have a true gold standard. The reference standard is defined as the basis for determining the patient's true diagnostic status. This may or may not reflect a true gold standard measure. The researcher must operationalize the reference standard.

The validity of a diagnostic test is evaluated in terms of its ability to accurately assess the presence and absence of the target condition. A diagnostic test can have four possible outcomes, summarized in the 2 × 2 arrangement shown in Table 27.1. Classification is assigned according to the true presence or absence of disease (Dx+ or Dx−) versus positive or negative test results. In Table 27.1 the cells labeled *a* and *d* represent **true positives** and **true negatives**, respectively, that is, individuals who are correctly classified by the test as having or not having the target condition. Cell *b* reflects those who are incorrectly identified as having the condition, or **false positives**, and cell *c* represents those who are incorrectly identified as not having the condition, or **false negatives**.

**Sensitivity** is the test's ability to obtain a positive test when the target condition is really present, or the true positive rate. Using the notation presented in Table 27.1,

Sensitivity = a / (a + c)

This value is the proportion of individuals who test positive for the condition out of all those who actually have it, or the probability of obtaining a correct positive test in patients who have the target condition. The sensitivity of a test increases as the number of persons with the condition who are correctly classified increases; that is, fewer persons with the disorder are missed.

**Specificity** is the test's ability to obtain a negative test when the condition is really absent, or the true negative rate. As shown in Table 27.1,

Specificity = d / (b + d)

This value is the proportion of individuals who test negative for the condition out of all those who are truly normal, or the probability of a correct negative test in those who do not have the target condition. A highly specific instrument will rarely test positive when a person does not have the disease.

The complement of sensitivity (1 – sensitivity) is the **false negative rate,** or the probability of obtaining an incorrect negative test in patients who do have the target disorder. The complement of specificity (1 − specificity) is the **false positive rate**, sometimes called the "false alarm" rate.^{3} This is the probability of an incorrect positive test in those who do not have the target condition.

To illustrate the application of these measures, let's consider a study of the validity of the Functional Reach Test (FRT) to identify elders with Parkinson's disease who are at risk for falls.^{4} The FRT is designed to assess anterior-posterior stability by measuring the maximum distance an individual can reach while leaning forward over a fixed base of support.^{5} Based on previous research, a cutoff score of 10 in. (25.4 cm) was used to classify subjects as "at risk" or "not at risk." Screening results were compared with a known history of falls (the reference standard), as shown in Table 27.2A.

The sensitivity of the test for this population was low, at 30%. Of the 30 patients identified as having a history of falls, only 9 tested positive using the FRT. The specificity of the test, however, was 92%. Of the 13 patients who did not have a history of falls, 12 tested negative. Therefore, although almost all of those not at risk were correctly identified (true negatives), a large percentage of patients who were at risk were missed (false negatives). The graphic in Table 27.2A illustrates these proportions.
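These proportions can be computed directly from the 2 × 2 cell counts. Below is a minimal sketch in Python; the cell counts (a = 9, b = 1, c = 21, d = 12) are inferred from the totals reported above, and the function name is illustrative:

```python
def diagnostic_rates(a, b, c, d):
    """Basic accuracy measures from a 2 x 2 table (Table 27.1 notation).

    a: true positives, b: false positives,
    c: false negatives, d: true negatives.
    """
    sensitivity = a / (a + c)   # true positive rate
    specificity = d / (b + d)   # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_negative_rate": 1 - sensitivity,   # c / (a + c)
        "false_positive_rate": 1 - specificity,   # b / (b + d)
    }

# FRT example: 9 of 30 fallers tested positive, 12 of 13 nonfallers negative.
rates = diagnostic_rates(a=9, b=1, c=21, d=12)
print(round(rates["sensitivity"], 2))   # 0.3
print(round(rates["specificity"], 2))   # 0.92
```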

In addition to sensitivity and specificity, the usefulness of a clinical screening tool can be judged by its yield: a test must be an efficient use of time and resources and must produce a sufficient number of accurate responses to be clinically useful. This characteristic is assessed by the test's predictive value. A **positive predictive value (PV+)** estimates the likelihood that a person who tests positive actually has the disease. Using the notation given in Table 27.1,

PV+ = a / (a + b)

which represents the proportion of those who tested positive who were true positives. Therefore, a test with a high positive predictive value will provide a strong estimate of the actual number of patients who have the target condition. Similarly, a **negative predictive value (PV–)** indicates the probability that a person who tests negative is actually disease free. Therefore,

PV– = d / (c + d)

which is the proportion of all those who tested negative who were true negatives. A test with a high negative predictive value will provide a strong estimate of the number of people who do not have the target condition.

For the FRT study (see Table 27.2), the positive predictive value (PV+) of 90% tells us that almost all of those who tested positive actually had a history of falls. Only one patient who tested positive was not at risk. The negative predictive value (PV–) was lower, at 36%. Therefore, only about one-third of patients who tested negative were actually not at risk.
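The predictive values reported here follow from the same 2 × 2 counts. A sketch (cell counts inferred from the FRT totals, names illustrative):

```python
def predictive_values(a, b, c, d):
    """Predictive values from 2 x 2 cell counts (Table 27.1 notation)."""
    pv_pos = a / (a + b)   # P(condition present | positive test)
    pv_neg = d / (c + d)   # P(condition absent | negative test)
    return pv_pos, pv_neg

# FRT example: a=9 true positives, b=1 false positive,
# c=21 false negatives, d=12 true negatives.
pv_pos, pv_neg = predictive_values(a=9, b=1, c=21, d=12)
print(round(pv_pos, 2), round(pv_neg, 2))   # 0.9 0.36
```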

Predictive value may be of greatest importance in deciding whether or not to implement a screening program. When the positive predictive value is low, only a small portion of those who test positive actually have the target condition. Therefore, considerable resources will probably be needed to evaluate these people further to separate false positives, or unnecessary treatments will be applied. Policy decisions are often based on a balance between the use of available resources and the potential harmful effects resulting from not identifying those with the target condition.^{6}

**Prevalence** refers to the number of cases of a condition existing in a given population at any one time. Predictive values are strongly influenced by the prevalence of the target condition in the population; sensitivity and specificity, by contrast, are properties of the test itself. For a test with a given sensitivity and specificity, the likelihood of identifying cases with the condition is increased when prevalence is high (the condition is common). Therefore, when prevalence is high, a test will tend to have a higher positive predictive value. This is illustrated in Table 27.2 for our example of fall risk. The prevalence of a history of falls is 30 out of 43 patients, or 70%. Therefore, a large proportion of patients in this sample had a history of falls, and we could expect a high PV+, which was 90%. When prevalence is low (the condition is rare), one can expect many more false positives, just by chance, and the positive predictive value will drop. A positive predictive value can be increased either by increasing the specificity of the test (changing the criterion) or by targeting a subgroup of the population that is at high risk for the target condition.
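The effect of prevalence on PV+ can be sketched with Bayes' theorem, holding sensitivity and specificity fixed at the FRT values (30% and 92%); the function name is illustrative:

```python
def pv_positive(sens, spec, prevalence):
    """PV+ for a given prevalence, via Bayes' theorem."""
    true_pos = sens * prevalence                # P(test+ and condition present)
    false_pos = (1 - spec) * (1 - prevalence)   # P(test+ and condition absent)
    return true_pos / (true_pos + false_pos)

# Same test, different prevalences:
print(round(pv_positive(0.30, 0.92, 0.70), 2))  # 0.9  (common condition)
print(round(pv_positive(0.30, 0.92, 0.05), 2))  # 0.16 (rare condition)
```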

When we consider the diagnostic accuracy of a test, high values of sensitivity and specificity provide a certain level of confidence in interpretation. If a test has high sensitivity, it will properly identify most of those who have the disorder. If the test has high specificity, it will properly identify most of those without the condition. But how do these definitions relate to confidence in diagnostic decisions? Consider these two questions:

If a patient has a positive test, can we be confident in ruling IN the diagnosis?

If a patient has a negative test, can we be confident in ruling OUT the diagnosis?

Sensitivity and specificity help us answer these questions, but probably not the way you would expect. When a test has *high specificity*, a positive test rules *in* the diagnosis. When a test has *high sensitivity*, a negative test rules *out* the diagnosis. Straus and colleagues^{7} offer two mnemonics to remember these relationships: **SpPin** (when a test is highly **Sp**ecific, a **P**ositive result rules the diagnosis **in**) and **SnNout** (when a test is highly **S**e**n**sitive, a **N**egative result rules the diagnosis **out**).

Think of it this way: A highly specific test will properly identify most of the patients who do *not* have the disorder. If the test is so good at finding those who are normal, we can be pretty sure that someone with a positive test *does* have the disorder (ruling IN the diagnosis) because if he didn't have the disorder, the test would have correctly identified him as normal! Conversely, a highly sensitive test will find most of those who *do* have the disorder. Therefore, we can be pretty sure that someone with a negative test does *not* have the disorder (ruling OUT the diagnosis) because if he did have the disorder, the test would have correctly diagnosed him!

These concepts are also related to predictive value. With a more specific test, negative cases are identified more readily. Therefore, it is less likely that an individual with a positive test will actually be normal. This results in a high positive predictive value. With a more sensitive test, positive cases are identified more readily; that is, we will not miss many true cases. Therefore, it is less likely that an individual with a negative test will have the disease. This leads to a high negative predictive value.

If we use the example of the Functional Reach Test (Table 27.2), with specificity of 92% (and a PV+ of 90%), we can be confident that someone with a positive test is at risk for falls. However, with sensitivity of only 30% (and a PV– of 36%), if someone has a negative test, we cannot be sure that person is really not at risk. Because the test is not good at finding those who are at risk, having a negative test does not help us safely rule out this risk.

The ultimate purpose of a diagnostic test is to help the clinician make a decision about the presence or absence of a disorder for an individual patient. The validity of a test is based on how strongly it can support a decision to rule the disorder in or out. Therefore, a test is considered a good one if it can help to increase our certainty about a patient's diagnosis.

When we begin to evaluate a patient by taking a history and using screening or other subjective procedures, we begin to rule in and rule out certain conditions and eventually generate a hypothesis about the likely diagnosis. This hypothesis can be translated into a measure of probability or confidence, indicating the clinician's estimate of how likely a particular disorder is present. This has been termed the **pretest probability** (or prior probability) of the disorder—or what we think might be the problem before we perform any formal testing.^{8}

**Finding the Pretest Probability.** The process for determining a pretest probability is not an obvious one. Conceptually, it represents a "best guess" or clinical impression based on experience and clinical judgment. Clinicians may have sufficient experience with certain types of patients to estimate the probability of a diagnosis based on initial examination findings,^{9,10} although such estimates are not always reliable.^{11,12} Using the functional reach scenario, a clinician might have sufficient experience with patients with Parkinson's disease to generate an initial hypothesis about the patient's likelihood to fall.

Information from the literature can also be used to help with this estimate by referring to the prevalence of a disorder.^{13} For instance, studies have shown that the prevalence of idiopathic scoliosis in children aged 10–16 years is 2–4%;^{14} 26% of patients with orthopedic trauma have been found to experience depression;^{15} 34% of children who have been enrolled in special education classes have been diagnosed with asthma;^{16} the presence of postoperative delirium following hip fracture repair is estimated at 36%;^{2} the prevalence of mortality 1 month following a stroke in patients with prestroke dementia is 44%.^{17} These values, reported in the literature, allow the clinician to estimate the likelihood that any individual patient could have these disorders.

Suppose you are working with a patient with Parkinson's disease, and you believe she may be at risk for falling. You think it may be useful to perform a test to determine if such a risk is present. Consider the study of the Functional Reach Test once again (see Table 27.2). This study demonstrated a 70% prevalence of falls in its sample of patients with Parkinson's disease.^{4} Knowing this, before you have done any further testing, your best estimate is that the pretest probability of your patient being at risk for falls is 70%.

**Decision-Making Thresholds**. Being able to estimate a pretest probability is central to deciding if a condition is present and if testing or treatment is warranted. Based on the initial hypothesis and pretest probability of a condition, the clinician must decide if a diagnostic test is necessary or useful to confirm the actual diagnosis. Straus et al^{7} suggest that two thresholds should be considered, as shown in Figure 27.1A. With a very low pretest probability, the diagnosis is so unlikely that testing is not useful; that is, even with a positive test, results are likely to be false positives. Therefore, treatment is not initiated and other diagnoses need to be considered.

###### FIGURE 27.1

Thresholds for deciding to test or treat. (**A**) Thresholds based on pretest probabilities; (**B**) thresholds based on posttest probabilities. (Adapted from Straus SE, Richardson WS, Glasziou P, et al. *Evidence-Based Medicine: How to Practice and Teach EBM* (3rd ed.). Edinburgh, Churchill Livingstone, 2005, Figure 3.3, p. 85.)

With a very high pretest probability, the likelihood of the diagnosis is so strong that testing may be unnecessary, and treatment should just be initiated. Even with a negative test, results are likely to be false negatives. A strong pretest probability means that the results of a test are unlikely to offer any additional useful information. When the pretest probability is not definitive, however, with more intermediate values, testing is necessary to pursue the diagnosis, and treatment decisions will then be based on those results.

This approach must also take into consideration the relative severity of the disorder; the threshold for testing may vary for different conditions. For example, a patient may exhibit symptoms that lead a clinician to suspect the presence of a deep venous thrombosis (DVT), which is potentially life threatening. Even if the symptoms are minimal and the pretest probability is low, the clinician may feel compelled to test for the condition to safely rule it out before continuing with other interventions. At the same time, the clinician must also be able to justify that the benefits of performing the test outweigh any potential risks. A test that includes potentially harmful procedures may not be worthwhile if the condition has little consequence. Effective treatment should also be available, should the test be positive. The effort of a test is not reasonable if the results have no chance of leading to successful intervention.

A diagnostic test allows a clinician to revise the pretest probability estimate of the disorder.^{18} Once we have the data from a test, we expect to be more confident in the diagnosis; that is, we hope to improve our certainty. The revised likelihood of the diagnosis based on the outcome of a test is the **posttest probability** (or posterior probability)—what we think the problem is (or is not) now that we know the test result. A good test will allow us to have a very high posttest probability confirming the diagnosis, or a very low posttest probability causing us to abandon it (see Figure 27.1B). When the posttest probability is not definitive, further testing may be necessary.

Once we have established a hypothesis that the patient may have a particular diagnosis, we want to determine if a test can make us more confident in that diagnosis. A measure called the **likelihood ratio** helps us in this effort. The likelihood ratio tells us how much more likely it is that a person has the diagnosis after the test is done; that is, it will help us determine the posttest probability. It indicates the value of the test for increasing certainty about a diagnosis,^{19} or its "confirming power."^{20} Likelihood ratios are being reported more often in the medical literature as an important standard for evidence-based practice. The likelihood ratio has an advantage over predictive values because it is independent of disease prevalence, and therefore can be applied across settings and patients.

We can determine a likelihood ratio for a positive or negative test. To understand this statistic, let's assume that a patient tests positive on the FRT. If this were a perfect test, then we would be certain that the patient is at risk for falls (true positive). But we hesitate to draw this conclusion definitively because we know that some patients who are not at risk will also test positive (false positive). Therefore, to determine if this test improves our diagnostic conclusion we must correct the true positive rate by the false positive rate. This is our **positive likelihood ratio (LR+):**

LR+ = sensitivity / (1 − specificity)

The LR+ will tell us how many times more likely a positive test will be seen in those with the disorder than in those without the disorder. A good test will have a high positive likelihood ratio.

Now let's assume the patient has a negative test. With a perfect test we would be sure this patient was not at risk for falls. But we are still concerned about the possibility of a false negative. Therefore, to determine if a negative test improves our diagnostic conclusion, we look at the ratio of the false negative rate to the true negative rate. This is our **negative likelihood ratio (LR–):**

LR– = (1 − sensitivity) / specificity

The LR– will tell us how many times more likely a negative test will be seen in those with the disorder than in those without the disorder. A good test will have a very low negative likelihood ratio.
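Both ratios follow directly from sensitivity and specificity. A sketch, using the FRT values (specificity rounded to 92%, as in the text; the function name is illustrative):

```python
def likelihood_ratios(sensitivity, specificity):
    """Positive and negative likelihood ratios."""
    lr_pos = sensitivity / (1 - specificity)   # TP rate / FP rate
    lr_neg = (1 - sensitivity) / specificity   # FN rate / TN rate
    return lr_pos, lr_neg

lr_pos, lr_neg = likelihood_ratios(0.30, 0.92)
print(round(lr_pos, 2), round(lr_neg, 2))   # 3.75 0.76
```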

**It is important to note that likelihood ratios always refer to the likelihood of the disorder being present. ^{21}** That's why we would like to see a high LR+, to indicate that the disorder is likely to be present with a positive test. A very low LR− means that the disorder has a small probability of being present with a negative test.

The value of the likelihood ratio is somewhat intuitive, in that a larger LR+ indicates a greater likelihood of the disease, and a smaller LR– indicates a smaller likelihood of the disease. These values have been interpreted according to the following scale:^{18}

- An LR+ above 5 or an LR– below 0.2 represents a relatively important effect.
- An LR+ between 2 and 5 or an LR– between 0.2 and 0.5 may be important, depending on the nature of the diagnosis being studied.
- Values close to 1.0 represent unimportant effects.

A likelihood ratio of 1.0 essentially means the test is useless; that is, the true positive and false positive (or true negative and false negative) rates are the same.

Let's apply this measure to the functional reach data. As shown in Table 27.2B, the LR+ = 3.75. Therefore, with a positive test, the likelihood of a patient being at risk for falls is increased by almost 4 times. This represents a potentially important value. The LR– = 0.76. This represents a small and unimportant value, close to 1.0. Therefore, based on these data, the FRT may help to improve our confidence with a positive test, but does not add important information with a negative test. Going back to the concepts of SpPin and SnNout, a large LR+ tells us that a positive test is good at ruling the disorder IN. A very low LR− tells us that the negative test is good at ruling the disorder OUT. We can confirm this by looking at the posttest probabilities.

A nomogram, shown in Figure 27.2, has been developed to determine posttest probabilities based on pretest probabilities and likelihood ratios.^{22} To use the nomogram, we begin on the left by marking the pretest probability. The center line identifies the likelihood ratio. If we draw a line connecting these two points and extend it to the right margin, we find the posttest probability associated with the test.

For our example, Figure 27.3 shows a mark for 70% pretest probability based on prevalence data. Therefore, if we obtain a positive test (LR+ = 3.75), our posttest probability would approach 90%. With a positive test we have improved our confidence in this patient being at risk for falls by almost 20 percentage points. If we obtained a negative test (LR– = 0.76), our posttest probability would be approximately 60%. The patient still has a roughly 60% chance of being at risk for falls; we have not improved our diagnostic certainty very much. Therefore, with a negative test other assessments may be necessary to accurately identify if the patient is truly not at risk.

It is important to realize that the posttest probability will be dependent on the sensitivity and specificity of the diagnostic test (translated to a likelihood ratio) as well as the clinician's estimate of the pretest probability for an individual patient. For example, with a positive FRT, if we start with a pretest probability of 20%, we would get a posttest probability of 50% for a positive test. With a pretest probability of 5%, this test would increase our posttest certainty to only 15%. Where we start will influence the degree to which our certainty can be improved by the test.

When the nomogram is not handy, posttest probabilities can be obtained by converting the pretest probability to an odds value as follows:

**1**. Convert the pretest probability (prevalence) to pretest odds:

Pretest odds = pretest probability / (1 − pretest probability)

**2**. Multiply the pretest odds by the likelihood ratio to get the posttest odds:

Posttest odds = pretest odds × LR

**3**. Convert the posttest odds to the posttest probability:

Posttest probability = posttest odds / (posttest odds + 1)

Once again, we show that with a 70% pretest probability and a LR+ of 3.75, our posttest probability has risen to 90%. This form of calculation will usually be more precise than using the nomogram. Several Internet programs are also available for calculation of posttest probabilities.^{23,24}
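The three-step conversion can be sketched as a short function; the FRT values reproduce the results above:

```python
def posttest_probability(pretest_prob, lr):
    """Convert a pretest probability to a posttest probability
    for a test with the given likelihood ratio."""
    pretest_odds = pretest_prob / (1 - pretest_prob)   # step 1
    posttest_odds = pretest_odds * lr                  # step 2
    return posttest_odds / (posttest_odds + 1)         # step 3

# FRT example: 70% pretest probability
print(round(posttest_probability(0.70, 3.75), 2))  # 0.9  (positive test)
print(round(posttest_probability(0.70, 0.76), 2))  # 0.64 (negative test)
```

Note that the calculated posttest probability for a negative test (about 64%) is slightly higher than the roughly 60% read from the nomogram, consistent with the point that direct calculation is more precise.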

Applying likelihood ratios to clinical practice will necessitate a strong understanding of diagnostic principles. The threshold for making a decision about a diagnosis may not be reached until several tests have been completed. When tests are performed serially, the posttest probability for one test can be used to estimate the pretest probability for the subsequent test. This is appropriate only when the tests are independent of each other. Straus et al^{7} recommend "chaining" likelihood ratios for this purpose. The posttest odds of the first test become the pretest odds for the second test. Therefore, by multiplying the new pretest odds by the likelihood ratio for the second test, we obtain the posttest odds for the second test. This can then be converted to posttest probability.
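Chaining can be sketched by carrying the odds forward through each independent test. The pretest probability and the two LR values below are hypothetical, chosen only to illustrate the arithmetic:

```python
def chain_tests(pretest_prob, lrs):
    """Posttest probability after a series of independent tests,
    multiplying the odds by each test's likelihood ratio in turn."""
    odds = pretest_prob / (1 - pretest_prob)
    for lr in lrs:
        odds *= lr   # posttest odds of this test become the
                     # pretest odds of the next test
    return odds / (odds + 1)

# Hypothetical: 30% pretest probability, two positive tests
# with LR+ values of 4.0 and 2.5.
print(round(chain_tests(0.30, [4.0, 2.5]), 2))  # 0.81
```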

Sensitivity, specificity and likelihood ratios can also be expressed in terms of confidence intervals (see Table 27.2C). Although these values are often not reported, they are important to understanding the true nature of these estimates.^{25} Given a sample of scores, the confidence interval will indicate the range within which we can be sure the true population value will fall. Although not interpreted in terms of significance testing, confidence intervals for these measures of diagnostic accuracy will indicate the relative stability of the test's results; that is, with a wide confidence interval we would be less likely to consider the value a good estimate.^{26} Calculators for confidence intervals are available on the Internet.^{27,28}
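As a sketch of how such an interval might be computed, the Wilson score method (one common choice for proportions such as sensitivity; not necessarily the method used in Table 27.2C) can be applied to the FRT sensitivity of 9/30:

```python
import math

def wilson_ci(successes, n, z=1.96):
    """Wilson score confidence interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

# 95% CI for the FRT sensitivity (9 of 30 fallers tested positive):
low, high = wilson_ci(9, 30)
print(round(low, 2), round(high, 2))  # 0.17 0.48
```

The wide interval (roughly 17% to 48%) reflects the small sample, illustrating why a wide confidence interval makes the point estimate a less trustworthy summary of the test's accuracy.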

In 2000, a consensus meeting of international researchers and journal editors resulted in a recommendation for quality reporting of diagnostic studies. This group developed the Standards for Reporting of Diagnostic Accuracy (STARD) statement, consisting of a checklist of 25 items that would allow authors to ensure they were including all relevant information in an article, including essential elements of the design and conduct of their study, the execution of tests, and their results.^{29} The checklist will also allow readers to determine the potential for bias in a study, and to judge the generalizability of the results.

The STARD checklist is shown in Table 27.3. The statement has been published in several journals, and the reader is encouraged to refer to any one of these references for detailed descriptions of item criteria.^{30,31,32,33} A flow diagram is also recommended to illustrate the number of participants at each stage of the study and to communicate the key elements of the design (see Figure 27.4).

###### FIGURE 27.4

Study profile flow diagram of patients with suspected appendicitis evaluated in an emergency department during a 6-month study period. At each point, the accuracy of a positive or negative test is documented. (From Garcia Pena, BM et al. Ultrasonography and limited computed tomography in the diagnosis and management of appendicitis in children. *JAMA* 1999;282:1041–1046, Figure 2, p. 1044. Used with permission of the American Medical Association.)

| Section and topic | Item # | Description | On page # |
|---|---|---|---|
| TITLE/ABSTRACT/KEYWORDS | 1 | Identify the article as a study on diagnostic accuracy (recommend MeSH heading 'sensitivity and specificity'). | |
| INTRODUCTION | 2 | State the research questions or study aims, such as estimating diagnostic accuracy or comparing accuracy between tests or across participant groups. | |
| METHODS | | | |
| Participants | 3 | Describe the study population: the inclusion and exclusion criteria, setting and location(s) where the data were collected. | |
| | 4 | Describe participant recruitment: was recruitment based on presenting symptoms, results from previous tests, or the fact that the participants had received the index test(s) or the reference standard? | |
| | 5 | Describe participant sampling: was the study population a consecutive series of participants defined by the selection criteria in items (3) and (4)? If not, specify how patients were further selected. | |
| | 6 | Describe data collection: was data collection planned before the index test and reference standard were performed (prospective study) or after (retrospective study)? | |
| Test methods | 7 | Describe the reference standard and its rationale. | |
| | 8 | Describe technical specifications of material and methods involved, including how and when measurements were taken, and/or cite references for index tests and reference standard. | |
| | 9 | Describe definition of and rationale for the units, cutoffs and/or categories of the results of the index test(s) and the reference standard. | |
| | 10 | Describe the number, training and expertise of the persons executing and reading the index tests and the reference standard. | |
| | 11 | Describe whether or not the readers of the index tests and reference standard were blind (masked) to the results of the other test and describe any other clinical information available to the readers. | |
| Statistical methods | 12 | Describe methods for calculating or comparing measures of diagnostic accuracy, and the statistical methods used to quantify uncertainty (e.g. 95% confidence intervals). | |
| | 13 | Describe methods for calculating test reproducibility, if done. | |
| RESULTS | | | |
| Participants | 14 | Report when the study was done, including beginning and ending dates of recruitment. | |
| | 15 | Report clinical and demographic characteristics of the study population (e.g. age, sex, spectrum of presenting symptoms, comorbidity, current treatments, recruitment centers). | |
| | 16 | Report the number of participants satisfying the criteria for inclusion that did or did not undergo the index tests and/or the reference standard; describe why participants failed to receive either test (a flow diagram is strongly recommended). | |
| Test results | 17 | Report time interval from the index tests to the reference standard and any treatment administered between. | |
| | 18 | Report distribution of severity of disease (define criteria) in those with the target condition; describe other diagnoses in participants without the target condition. | |
| | 19 | Report a cross tabulation of the results of the index tests (including indeterminate and missing results) by the results of the reference standard; for continuous results, the distribution of the test results by the results of the reference standard. | |
| | 20 | Report adverse events from performing the index tests or the reference standard. | |
| Estimates | 21 | Report estimates of diagnostic accuracy and measures of statistical uncertainty (e.g. 95% confidence intervals). | |
| | 22 | Report how indeterminate results, missing responses and outliers of the index tests were handled. | |
| | 23 | Report estimates of variability of diagnostic accuracy between subgroups of participants, readers or centers, if done. | |
| | 24 | Report measures of test reproducibility, if done. | |
| DISCUSSION | 25 | Discuss the clinical applicability of the study findings. | |

Available at: <http://www.stard-statement.org/website%20stard/> Accessed April 12, 2007.

Although continuous scales are considered preferable for screening because they are more precise, they are often converted to a dichotomous outcome for diagnostic purposes; that is, a **cutoff score** is established to demarcate a positive or negative test. For example, a specific level of blood pressure (a continuous scale) is used to determine if a patient should or should not be placed on a therapeutic regimen for hypertension. In the previous example of functional reach, a test score of less than 10 in. was considered indicative of fall risk. However, if a cutoff score of 12 in. was used, the sensitivity and specificity would be different. The problem, then, is to determine what cutoff score should be used. This decision point must be based on the relative importance of sensitivity and specificity, or the cost of incorrect outcomes versus the benefits of correct outcomes. Consider this analogy: Although there may be costs associated with unnecessary preparations in predicting a storm that does not occur (false positive), these costs would probably be considered minor relative to the danger of failing to predict a storm that does occur (false negative).

Suppose we use the Functional Reach Test to predict if an elderly individual is at risk for falling, and the individual with a low score is referred to a balance exercise program. If the individual is not truly at risk (false positive), the outcome may be considered low cost, compared to the situation where an individual who is at risk is not correctly diagnosed (false negative), not referred for treatment, and injures herself in a fall. Therefore, it might be reasonable to set the cutoff score low, to avoid false negatives, thereby increasing sensitivity. Conversely, consider a scenario where a test is used to determine the presence of a condition that requires potentially life-threatening surgery. A physician would want to avoid the procedure for a patient who does not truly have the condition. For this situation, the threshold might be set high to avoid false positives, increasing specificity. We would not want to perform this procedure unless we knew for certain that it was warranted.

Obviously, it is usually desirable for a screening test to be both sensitive and specific. Unfortunately, there is often a compromise between these two characteristics. One way to evaluate this decision point would be to look at several cutoff points, to determine the sensitivity and specificity at each point. We could then consider the relative trade-off to determine the most appropriate cutoff score. Those who use a screening tool must decide what levels of sensitivity and specificity are acceptable, based on the consequences of false negatives versus false positives. It is often necessary to combine the results of several screening tests to minimize the trade-off between specificity and sensitivity. We will discuss this approach shortly in relation to clinical prediction rules.

The balance between sensitivity and specificity can be examined using a graphic representation called a **receiver operating characteristic (ROC) curve**. This procedure actually evolved from radar and sonar detection strategies developed during World War II to improve signal-to-noise ratios. Suppose we were listening to a radio station that has a weak signal. We turn the volume up so we can hear better, but as we do so, we not only pick up the desired signal, but background noise as well. At lower volume settings, we will hear the signal more than the noise. But there will come a point, as we increase volume, that the noise will grow faster than the signal; that is, the signal has reached its full capacity, but the noise continues to increase. If we set the volume to its maximum, we may claim that the signal is strong, but the noise will be so great that the signal will be indecipherable. Therefore, the optimal setting will be where we detect the largest ratio of signal to noise.

This is essentially what we are trying to do with a diagnostic test. We want to detect the "signal" (the presence or absence of the disease—the true positive and true negative) with the least amount of interference possible (incorrect diagnoses—false positive and false negative). The ROC curve diagrams this relationship. It allows us to answer the question: How well can a test discriminate between signal and noise—can it discriminate between the presence or absence of disease?^{34}

The process of constructing an ROC curve involves setting several cutoff points for a test and calculating sensitivity and specificity at each one. The curve is then created by plotting a point for each cutoff score that represents the proportion of patients correctly identified as having the condition (true positives) on the *Y* axis against the proportion of patients incorrectly identified as having the condition (false positives) on the *X* axis. The *Y* axis represents sensitivity, and the *X* axis represents one minus specificity (1 – specificity).^{†}

To illustrate this process, consider again the example of Functional Reach in patients with Parkinson's disease. We have created a hypothetical dataset for the sample of 43 patients who, based on their 6-month history of falls, have been identified as "at risk" (they have fallen at least once, *n* = 30) or "not at risk" (they have not fallen, *n* = 13). Table 27.4A shows the distribution of scores for the patients in each risk group, converting the continuous scores to 1-inch increments.

Table 27.4B shows the distribution of scores at 5 cutoff points. It is generally recommended that at least 5 to 6 points should be used to plot an ROC curve. We calculate the sensitivity and specificity of the test at each cutoff point. For this example, higher scores indicate better balance, and therefore, less likelihood to fall. Lower scores will result in a diagnosis of "at risk" for falls. Table 27.4B shows the number of true positive and false positive scores at each cutoff point, and the corresponding values for sensitivity and 1 – specificity. For example, if we use a cutoff score of 10 in., then all those with a score of 10 inches or less will be diagnosed "at risk." Those who obtained a score greater than 10 inches will be considered "not at risk." With this cutoff score, 9 individuals have been correctly identified "at risk," and 1 has been incorrectly diagnosed. This leads to a corresponding sensitivity of .30 and specificity of .92, which results in 1 – specificity of .08. Similarly, with a cutoff score of 12 inches, all those who obtained a score of 12 inches or less will be considered "at risk." Those with scores above 12 inches will be diagnosed "not at risk." When this cutoff score is used, 25 individuals are correctly diagnosed and 7 are incorrectly diagnosed. This leads to a corresponding sensitivity of .83 and 1 − specificity of .54. These values are then plotted to create the ROC curve (see Figure 27.5).
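The calculations above can be sketched in a few lines of Python. The score lists below are hypothetical, grouped into coarse values chosen only so that the counts match those reported for the 10-in. and 12-in. cutoffs; they are not the actual study data.

```python
# Hypothetical Functional Reach scores (inches), grouped so counts match
# Table 27.4B at the 10-in. and 12-in. cutoffs -- NOT the actual study data.
fallers = [8] * 9 + [11.5] * 16 + [13] * 5      # n = 30 "at risk" (have fallen)
nonfallers = [9] * 1 + [11.5] * 6 + [13] * 6    # n = 13 "not at risk"

def roc_point(positives, negatives, cutoff):
    """Return (sensitivity, 1 - specificity) when a score <= cutoff
    is treated as a positive ("at risk") test."""
    sens = sum(s <= cutoff for s in positives) / len(positives)
    fpr = sum(s <= cutoff for s in negatives) / len(negatives)
    return sens, fpr

for cutoff in (10, 12):
    sens, fpr = roc_point(fallers, nonfallers, cutoff)
    print(f"cutoff {cutoff} in.: sensitivity = {sens:.2f}, 1 - specificity = {fpr:.2f}")
```

Repeating `roc_point` over five or six cutoffs, and plotting sensitivity against 1 − specificity, produces the ROC curve.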

The curve is completed at the origin and the upper right-hand corner, reflecting cutoff points above and below the highest and lowest scores. For example, with a cutoff score at 14, *all subjects* will be diagnosed "at risk." Therefore, all those truly "at risk" are correctly diagnosed (true positive rate is 100%), and all those "not at risk" are incorrectly diagnosed (false positive rate is 100%). Similarly, with a cutoff score of 8, *all subjects* will be diagnosed "not at risk." Therefore, all those truly "at risk" will be incorrectly diagnosed (true positive rate is zero), and all those "not at risk" will be correctly diagnosed (false positive rate is zero).

^{†}Among all those who truly do not have the condition [(*b* + *d*) in Table 27.4], taken as the total (100%), the proportion who test negative (true negatives) equals specificity. Therefore, the remainder, the proportion who test positive (false positives), equals 1.00 − specificity.

The ROC curve is plotted on a square with values of 1.0 for sensitivity and 1 − specificity at the upper left and lower right corners, respectively. A perfect test instrument will have a true positive rate of 1.0 and a false positive rate of zero, resulting in a curve that essentially fills the square; that is, it will go from the origin to the upper left corner to the upper right hand corner. A noninformative curve occurs when the true positive and false positive rates are equal, which means that the test provides no better information than 50:50 chance. This curve starts at the origin and moves diagonally to the upper right hand corner (shown as the broken line in Figure 27.5).

If we wanted to compare two tests to determine which was a better diagnostic tool, we could compare ROC curves to see which curve more closely approximates the perfect curve. This provides only a visual basis of comparison, however, and a quantitative standard is more definitive. The best index for this purpose is a measure of the **area under the curve (AUC).** This value equals the probability of correctly choosing between normal and abnormal signals. This means that, given a test with an ROC curve area of .76, as in our current example, and presented with a randomly chosen pair of patients, one with the disorder and one without, the clinician would choose the correct diagnosis 76% of the time. Therefore, the area represents the ability of the test to discriminate between those at risk and those not at risk. A perfect test has an area of 1.00; using such a test would allow one to always identify the patient with disease.

Table 27.4C shows output for the area under the curve as well as a test of significance and confidence intervals. We are 95% confident that the true area under the curve will fall between 61% and 91%.
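The AUC can also be computed directly as a concordance probability: the proportion of at-risk/not-at-risk pairs in which the at-risk patient has the lower (worse) reach score, with ties counted as half. A minimal sketch, reusing the hypothetical grouped scores from the earlier example (because those scores are collapsed to only three values, this estimate will not reproduce the .76 obtained from the full distribution in Table 27.4C):

```python
# Hypothetical grouped scores matching Table 27.4B counts -- not actual study data.
fallers = [8] * 9 + [11.5] * 16 + [13] * 5      # "at risk" (lower score = worse)
nonfallers = [9] * 1 + [11.5] * 6 + [13] * 6    # "not at risk"

def auc_concordance(positives, negatives):
    """AUC as the probability that a randomly chosen at-risk patient scores
    lower than a randomly chosen not-at-risk patient; ties count half."""
    wins = sum((p < n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

print(f"AUC = {auc_concordance(fallers, nonfallers):.2f}")
```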

In addition to making comparisons or describing the relative effectiveness of a test for identifying a disorder, we can also use the ROC curve to decide which cutoff point would be most useful. Most ROC curves have a steep initial section, which reflects a large increase in sensitivity with little change in the false positive rate. A relatively flat region across the top is also typical. Neither of these sections of the curve makes sense for choosing a cutoff point, as each represents little change in one component of the curve. Usually, the best cutoff point will be at the point where the curve turns.^{34} In Figure 27.5, a marked turn occurs at the cutoff point of 11 in., suggesting that this cutoff would provide the best balance between sensitivity and specificity for this test. At that point we would miss diagnosing risk for 9 out of the 30 individuals who have fallen, and we would incorrectly target 3 out of the 13 nonfallers. The final choice of a cutoff, however, must be based on how the clinician and patient see the impact of an incorrect identification. The ROC curve should only act as a guide for that decision.

In the previous sections we have described how sensitivity, specificity and related concepts can be used to support a diagnostic or prognostic classification based on a particular test score. In clinical practice, however, the complexity of patient conditions may require that a combination of predictors be used to support an outcome classification. Although clinical experience will often provide an intuitive sense of which findings from the history and physical examination are important for an accurate assessment, our focus on evidence-based practice demands that we strive for greater certainty in our diagnostic and prognostic assessments.^{35}

**Clinical prediction rules (CPR)**^{‡} are tools that quantify the contributions of different variables to the diagnosis, prognosis or likely response to treatment for an individual patient.^{36} The objective of CPRs is to reduce uncertainty by demonstrating how specific clusters of clinical findings can be used to predict outcomes.^{37}

Perhaps the most obvious application of a CPR is to assist in the diagnosis of a disorder based on clinical signs. An excellent example of this application is found in the work of Stiell et al^{38} who developed clinical prediction rules for the use of radiography with acute ankle injuries. They noted that many patients with ankle injuries did not have a fracture, and yet the typical response in emergency care was to order an X-ray. Estimates had shown, however, that the prevalence of fractures with ankle injuries was less than 15%. The rules thus addressed efficiency and cost savings as well as diagnostic accuracy. The prediction rules that were developed through this process have come to be known as the Ottawa Ankle Rules (based on Stiell's affiliation with Ottawa Civic Hospital), which include rules for both ankle and midfoot injuries. The indicators for ruling out a fracture are based on a lack of tenderness in specific areas of the foot or ankle, and the patient's ability to bear weight on the affected limb, even with a limp. Table 27.5 shows these guidelines, which have been validated in different countries^{39} and in different populations.^{40}

Ankle:
• Bone tenderness along the distal 6 cm of the posterior edge of the fibula or tip of the lateral malleolus, OR
• Bone tenderness along the distal 6 cm of the posterior edge of the tibia or tip of the medial malleolus, OR
• Inability to bear weight for four steps at the time of injury and when examined.

Midfoot:
• Bone tenderness at the base of the 5th metatarsal, OR
• Inability to bear weight for four steps at the time of injury and when examined.

From Stiell I, Wells G, Laupacis A, Brison R, Verbeek R, Vandemheen K, Naylor D. A multicentre trial to introduce clinical decision rules for the use of radiography in acute ankle injuries. *BMJ* 1995; 311:594–597. (Figure from Google Images. <http://www.images.google.com> Accessed March 27, 2006. Reprinted with permission.)

Systematic review of the Ottawa Ankle Rules has shown that they are 95–100% sensitive, with a negative likelihood ratio of 0.08.^{41} If we apply this likelihood ratio to a pretest probability of 15% (based on prevalence estimates), we can see that there is less than a 1.5% probability of actual fracture in those with a negative test (see Figure 27.6). Using the logic of SnNout, with a test that is highly sensitive, a negative test will effectively rule out the disorder. Therefore, a negative result using these guidelines will consistently and accurately rule out fractures after ankle or foot injury, making an X-ray unnecessary. Specificity tends to be closer to 50%, so a positive test does not necessarily mean a fracture is present, and an X-ray is still needed to confirm or rule out a fracture. Therefore, although some X-rays will still be taken that do not show a fracture, the rules will effectively reduce the number of unnecessary radiographs.^{42}

###### FIGURE 27.6

Nomogram showing determination of posttest probability with use of the Ottawa ankle rules. Based on 15% prevalence of ankle fractures with ankle injury, we estimate the pretest probability at 15%. With a negative likelihood ratio of .08, we obtain a posttest probability of less than 1.5%. This indicates that with a negative test, the probability of an ankle fracture is almost nil.
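The nomogram performs, graphically, a simple conversion through odds: convert pretest probability to odds, multiply by the likelihood ratio, and convert back. A minimal sketch of the equivalent arithmetic:

```python
def posttest_probability(pretest_p, likelihood_ratio):
    """Posttest probability via odds: odds = p/(1-p), multiplied by the LR,
    then converted back to a probability."""
    pretest_odds = pretest_p / (1 - pretest_p)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# Negative Ottawa Ankle Rules: pretest probability 15%, LR- = 0.08
p = posttest_probability(0.15, 0.08)
print(f"posttest probability = {p:.3f}")   # ~0.014, i.e. less than 1.5%
```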

Other examples of diagnostic prediction rules include guidelines for detecting deep venous thrombosis,^{43} pulmonary embolism,^{44} dementia,^{45} and to identify premenopausal women with low peak bone mass.^{46} Guidelines have also been developed for ordering X-rays for knee injuries, called the Ottawa Knee Rules,^{47,48} and cervical spine injuries, called the Canadian C-Spine Rule.^{49}

^{‡}You may also see these referred to as clinical decision rules or clinical decision guidelines.

Clinical prediction rules can also be established to determine the degree to which individuals are at risk for certain outcomes. For example, Kanaya and colleagues^{50} developed a CPR to identify older adults who were at risk for type 2 diabetes. They initially derived the rule on a cross-sectional cohort, and then validated it using a prospective cohort of community-dwelling men and women. Of the nine variables that were initially entered into their analysis, only two demographic and two laboratory variables were significantly associated with incident diabetes. These were age ≥70 years, being female, having a fasting plasma glucose ≥95 mg/dl and triglycerides ≥150 mg/dl. They assigned points to these four risk factors and determined a total score for each participant, with scores ranging from 0 to 7 points. With a score of 4 or higher, the sensitivity of the rule was 46%, specificity 82% and LR+ = 1.9. Figure 27.7 shows the ROC curve for this analysis. Based on these results, the authors suggest that individuals who meet this threshold should receive appropriate lifestyle or pharmacologic therapies to prevent the onset of type 2 diabetes.

###### FIGURE 27.7

ROC curve for the validation of a clinical prediction rule for type 2 diabetes. The curve represents results using 0 to 7 points, based on presence of one to four risk factors. The curve turns at the point representing 4 points (arrow), with sensitivity of 46% and 1 − specificity of 17%. (Adapted from Kanaya AM, Fyr CLW, de Rekeneire N, et al. Predicting the development of diabetes in older adults. *Diabetes Care* 2005; 28:404–408. Based on data from Table 3, p. 406.)

Other examples of prognostic clinical prediction rules include identifying risk for functional decline in older community-dwelling women,^{51} identifying patients at risk of complications following cardiac surgery,^{52} identifying factors that lead to hospitalization with asthma,^{53} and identifying workers with nonspecific back pain who are likely or not likely to return to work in good health.^{54}

Clinical prediction rules have also been developed to determine the likelihood that a patient will respond positively to a specific intervention. For example, Hicks et al^{55} designed a prospective cohort study to predict whether patients with nonradicular low back pain are likely to benefit from a program of stabilization exercises. They examined patients before and after an 8-week program, and assessed success based on change in the Oswestry Disability Questionnaire score. The best rule for predicting success was the presence of at least 3 of 4 variables: positive prone instability test, aberrant movements present, average straight leg raise greater than 91°, and age greater than 40 years old. This combination had a sensitivity of 56%, specificity of 86% and a LR+ of 4.0. A separate model was developed to predict failure of treatment.
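The likelihood ratios reported in studies such as this follow directly from sensitivity and specificity. A small sketch of both formulas; applying the positive form to the Hicks et al figures (sensitivity 56%, specificity 86%) reproduces the reported LR+ of 4.0:

```python
def lr_positive(sensitivity, specificity):
    """Positive likelihood ratio: how much a positive test raises the odds
    of the outcome. LR+ = sensitivity / (1 - specificity)."""
    return sensitivity / (1 - specificity)

def lr_negative(sensitivity, specificity):
    """Negative likelihood ratio: how much a negative test lowers the odds.
    LR- = (1 - sensitivity) / specificity."""
    return (1 - sensitivity) / specificity

print(round(lr_positive(0.56, 0.86), 1))   # LR+ for the stabilization rule: 4.0
```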

Other examples of CPRs for intervention response include identifying patients who will benefit from cervical manipulation for neck pain,^{56} from spinal manipulation for low back pain,^{57} from the use of compression bandages for treatment of venous leg ulcers,^{58} from patellar taping for anterior knee pain,^{59} and those who are likely or not likely to benefit from nonarthroplasty knee surgery.^{60}

The development of a clinical prediction rule is a three-step process.^{35} First, the factors that potentially contribute to prediction of the outcome are identified in a cohort of patients. This allows for the *derivation* of the rule, establishing which variables are most predictive. The study by Hicks et al^{55} on the effectiveness of stabilization exercises is an example of this first step. Further study needs to be done to apply these results to other samples.

The second step requires *validation* of the rule in several cohorts in different settings. The study by Kanaya et al^{50} looking at variables related to onset of diabetes illustrates this step. They validated the prediction rule in a sample of over 2,000 white and African-American men and women in two major cities over 5 years.

Finally, an **impact analysis** will demonstrate if the rule has changed clinician behavior and resulted in beneficial outcomes. The Ottawa Ankle Rules, for example, have been studied in many countries and settings,^{61} and have been reported to significantly decrease the number of unnecessary ankle radiographs.^{62} Even with the widespread acceptance of this CPR, however, researchers and clinicians continue to test its validity, with various degrees of success.^{63,64,65} An excellent review of this process is available.^{66}

A hierarchy of evidence has been proposed to judge the applicability of a clinical decision rule, based on its having gone through the full process of validation (Figure 27.8).^{35} Widespread use of a CPR is not recommended until it has been validated in at least one prospective study in a variety of settings and an impact analysis has demonstrated its clinical utility.

So much of our clinical decision making rests on the intent to promote change or progress in a patient's or client's condition or behavior. We need to document change in a way that will be meaningful to the patient, the clinician and third-party payers. We use words like "better," "improved," "worse" or "declined" to indicate when someone's condition has changed, but these descriptors are clearly not sufficient to make reliable and valid judgments.

In Chapter 6 we introduced the concept of **responsiveness**, which is the ability of an instrument to measure true clinical change.^{67} Generally, we can think of responsiveness as a ratio of signal (true change) to noise (variability or error). At this time we will consider various statistical approaches for measuring change, and the implications of these statistics for interpreting clinical data.

As we search for useful ways to evaluate change, the good news is that there is extensive literature on the concept. The bad news is that there is little agreement on the best way to express or measure change in statistical terms. We will present several alternative methods for evaluating responsiveness, but we are unable to address the full scope of this topic. The reader is urged to refer to the literature for informative debate and discussion.

When we think about measuring a difference in response from one time to another, we can conceptualize the amount of change along a continuum, as shown in Figure 27.9. We start with the **minimum potentially detectable difference**, which will depend on the precision of the measurement tool being used. If we are using a goniometer, can we detect changes in range of motion (ROM) of less than one degree or a half degree? If we are using a survey tool, such as the Functional Independence Measure (FIM), we are restricted to differences of at least one point; that is, no fractions of a point can be counted.

Beyond the precision of the instrument, however, we are concerned with its reliability. When we address the issue of change, we must be confident that observed differences from before to after treatment reflect true change, and not simply random measurement error. The standard error of measurement (SEM), described in Chapter 26, is the most common statistic used to determine the **minimal detectable difference (MDD)**.^{§} This is the smallest amount of change that can be considered above the threshold of error expected in the measurement. Theoretically, this value can be interpreted as a property of a measurement, remaining constant across samples,^{68} although it can vary depending on the reliability estimate used in its calculation. Stratford et al^{69} have also shown that the SEM will vary when calculated across different ranges of initial and follow-up scores, with a smaller SEM resulting at both extremes of a scale.

The MDD is calculated using the following formula:^{70}

MDD = *z* × SEM × √2

This estimate is most often based on the 90% (*z* = 1.65) or 95% (*z* = 1.96) confidence interval. MDD_{95} means that 95% of stable patients demonstrate a random variation of less than this amount when tested on multiple occasions.^{71}

For example, Kennedy et al^{72} studied patients with osteoarthritis to determine measurement stability in outcomes following total hip and knee arthroplasty. For the 6-minute walk test (6MWT), they calculated a SEM of 26.29 meters. They obtained MDD_{90}:

MDD_{90} = 1.65 × 26.29 × √2 = 61.34 meters

This says we can expect 90% of stable patients (those who have not changed) in this population to demonstrate random variation of less than 61 meters in repeated trials of the 6MWT. Therefore, if we take measurements of the 6MWT before and after intervention, a change of 61 meters or greater would be considered true change.
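A minimal sketch of this calculation:

```python
import math

def minimal_detectable_difference(sem, z=1.65):
    """MDD = z * SEM * sqrt(2); z = 1.65 for MDD90, z = 1.96 for MDD95."""
    return z * sem * math.sqrt(2)

# SEM of 26.29 meters for the 6MWT, from Kennedy et al
mdd90 = minimal_detectable_difference(26.29)
print(f"MDD90 = {mdd90:.1f} meters")   # ~61.3 meters
```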

Norman et al^{73} have offered the interpretation that minimal differences are consistently close to 0.5 standard deviation for discriminating the threshold of change using quality of life instruments. They support their findings with psychophysiological evidence that people have a limit to their ability to discriminate tasks (such as saltiness of taste, loudness of sounds),^{74} and that this limit is almost always close to 0.5 standard deviation. Therefore, they suggest that this criterion is potentially appropriate to identify the minimal detectable difference.

The MDD can be considered a conservative estimate of a patient's progress, identifying the smallest amount of change that could be interpreted as *any* improvement or decline. Therefore, using the MDD as a criterion for improvement may be thought of as having high specificity (avoiding false positives) but low sensitivity (finding many false negatives).^{75}

^{§}This measure is also called the minimal detectable change (MDC), the smallest detectable change (SDC) or smallest real change (SRC); it has also been called the reliable change index.

The MDD may be considered a starting point to define change, but it is typically too small to represent a meaningful difference in the patient's response. Along the continuum of change, we are concerned with identifying how much of a change will be important. For example, if we measure ROM of knee flexion following knee arthroplasty, is a 5° change important? Does it indicate a meaningful difference in the patient's condition? While this may be an obvious example, consider a change of 5 mm in a visual analog scale for pain. Or a change of 5 points on the SF-36 measure of quality of life. Or a change of 5 mm in the measurement of a leg length discrepancy. Do these represent meaningful change? Would our threshold for important change vary across different groups of patients, conditions, levels of severity or cultural groups?^{76}

The most common threshold for meaningful change has been called the **minimal clinically important difference (MCID)** (see Figure 27.9).^{∗∗} This has been defined as the smallest change in an outcome measure that is perceived as beneficial by the patient, and that would lead to a change in the patient's medical management, assuming an absence of excessive side effects and costs.^{77} The criterion for just how much change is considered important is the crux of the dilemma in this process.

So how is the definition of "meaningful change" to be decided? This definition inherently reflects an element of judgment, and several perspectives must be considered.^{76} For the patient, it may mean change that results in noticeable improvement in function or a reduction in symptoms. We may find, however, that patients place different values on degrees of improvement. To what extent does quality of life impact this perception? What amount of change in a score will correspond to trivial, small but important, moderate or large improvement or deterioration?^{78} From the perspective of the clinician, it may mean enough change to warrant a revision in treatment or the patient's prognosis. At the institutional level, change may be viewed as important when it is sufficient to influence health care policy.^{79}

Two approaches have been used to define important change. Distribution-based methods are related to a distribution of scores, with a focus on the differences in group means as well as the variance within the distribution. Anchor-based methods use an external criterion to define clinical importance.

Researchers are often interested in assessing the degree of change in a group of patients, to determine the effectiveness of an intervention and to generalize results to others. For this purpose, meaningful change is determined using a **distribution-based** approach. Several indices have been used for this purpose (see Table 27.6).

One approach to evaluating an instrument's responsiveness has been to analyze change scores using a pretest-posttest design. Repeated measures *t*-tests or analyses of variance (ANOVA) are used to establish significant differences from time 1 to time 2.^{84} Measurements may be taken once before and after intervention, or there may be multiple measures as individuals are followed over time.^{85} This approach may involve only one group of subjects, or it may incorporate two or more groups. The assumption is made that change will occur due to treatment. Therefore, the instrument should be able to demonstrate such change from before to after treatment, or between groups that were treated and those that were not. A statistically significant difference, then, would demonstrate that the instrument was responsive to change. A confidence interval can provide an estimate of the range of change that can be expected.

The interpretation of meaningful change can be quite different, however, if one is focused on what that change means to an individual, versus decisions based on group differences.^{86} We must distinguish between the clinical significance of a particular change score for an individual patient and the statistical significance of a mean change of the same magnitude for a group of patients.^{87} Guyatt et al^{88} offer the example of a mean change in blood pressure of 2 mm Hg in a clinical trial, which may translate into a reduction in the number of strokes in a population. But this amount of change in an individual would probably be considered trivial, within the range of error of measurement.

^{∗∗}This measure is also called the minimal clinically important change (MCIC) and the minimally important change (MIC).

Looking at **effect size** is generally considered more appropriate to determine if meaningful change has occurred, because it does take group variability into account. Effect size is a standardized measure of change from baseline to final measurement. Three forms of effect size have been used.

**Effect Size Index (ES).** The **effect size index** is a ratio of the mean change score divided by the standard deviation of the baseline scores (see Table 27.6). Therefore, a measure that has high variability in initial scores will have a smaller effect size. Cohen^{81} has suggested that an effect size of .20 or less represents a small change; .50 represents moderate change; and .80 represents a large change. These values are interpreted relative to baseline variability. For instance, a moderate effect size reflects a change of at least one-half the baseline standard deviation.

**Standardized Response Mean (SRM).** The **standardized response mean** is another form of effect size index,^{82} sometimes referred to as the efficiency index.^{89} The SRM is a ratio of change from pretest to posttest divided by the standard deviation of the change scores (see Table 27.6). Therefore, a distribution that has high variability in the degree of change will have a small SRM. Cohen's criteria for small, moderate and large effect sizes are used for this index as well.

**Guyatt's Responsiveness Index (GRI).** A third form of effect size was proposed by Guyatt et al,^{90} called the **responsiveness index.** The GRI uses an anchor-based MCID for a particular measure, or the smallest difference between baseline and posttest that would represent a meaningful benefit in a group of patients. We will discuss various methods to determine an anchor-based MCID shortly. When the MCID is not known, the difference between baseline and posttest can be used.^{67} The denominator for this index is obtained from an ANOVA of repeated observations in a group of subjects who are clinically stable, which is a measure of test-retest reliability (see Table 27.6). Therefore, the denominator reflects the intrinsic variability of the instrument.^{67} A disadvantage of this index is that data on stable subjects may not always be available. Once again, the values of .20, .50 and .80 are used to represent small, moderate and large effects.^{91}
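The ES and SRM can be sketched directly from their definitions. The pre/post scores below are hypothetical and for illustration only; the GRI is omitted because it requires retest data from a separate group of clinically stable subjects.

```python
from statistics import mean, stdev

def effect_size_index(baseline, followup):
    """ES: mean change score divided by the SD of the baseline scores."""
    change = [f - b for b, f in zip(baseline, followup)]
    return mean(change) / stdev(baseline)

def standardized_response_mean(baseline, followup):
    """SRM: mean change score divided by the SD of the change scores."""
    change = [f - b for b, f in zip(baseline, followup)]
    return mean(change) / stdev(change)

# Hypothetical pre/post outcome scores for six patients
pre = [40, 35, 50, 45, 38, 42]
post = [55, 48, 70, 58, 50, 60]
print(f"ES = {effect_size_index(pre, post):.2f}, "
      f"SRM = {standardized_response_mean(pre, post):.2f}")
```

Note that the two indices can differ substantially for the same data, because one scales the mean change by baseline variability and the other by the variability of the change itself.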

To illustrate the application of these measures, Quintana et al^{71} studied the responsiveness of the Western Ontario and McMaster Universities Osteoarthritis Index (WOMAC) in a group of patients following total hip replacement. They calculated ES, SRM and GRI at 6 months and 2 years, as shown in Table 27.7. Of interest are the consistently higher values derived from the ES and the lower values derived from the GRI; because each index uses a different denominator, each yields a different estimate. All values are considered quite large, however, indicating that the WOMAC is a responsive instrument that is capable of reflecting important change over time.

Subscale | ES (6 months) | SRM (6 months) | GRI (6 months) | ES (2 years) | SRM (2 years) | GRI (2 years) |
---|---|---|---|---|---|---|
Pain | 2.10 | 1.86 | 1.10 | 2.24 | 1.98 | 2.18 |
Function | 2.34 | 1.80 | 1.45 | 2.58 | 1.97 | 1.79 |
Stiffness | 1.61 | 1.39 | 0.81 | 1.81 | 1.53 | 1.12 |

Another way to look at responsiveness is to consider it a means of discriminating between those who have changed and those who have not. In this framework, we can treat change as a "diagnosis": the determination, against an external standard, of whether a clinically important change has occurred.^{92} In this context, responsiveness is described in terms of sensitivity and specificity. Sensitivity reflects the probability that someone who has truly changed will be identified as having changed. Specificity is the probability that someone who has not changed will be correctly identified as unchanged. These values are then used to plot an ROC curve, as described earlier in this chapter.
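As a sketch of this logic, sensitivity and specificity at a single cutoff can be computed as below; varying the cutoff generates the points of the ROC curve. The change scores and external-standard labels here are hypothetical.

```python
def sens_spec(change_scores, truly_changed, cutoff):
    """Classify a patient as 'changed' when the change score meets the
    cutoff, then tally agreement with the external standard."""
    tp = fp = tn = fn = 0
    for score, changed in zip(change_scores, truly_changed):
        test_positive = score >= cutoff
        if test_positive and changed:
            tp += 1          # truly changed, identified as changed
        elif test_positive and not changed:
            fp += 1          # unchanged, falsely identified as changed
        elif changed:
            fn += 1          # truly changed, but missed
        else:
            tn += 1          # unchanged, correctly identified
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity

# Hypothetical change scores with anchor-based "truly changed" labels
scores = [2, 8, 5, 1, 9, 7, 3, 6]
changed = [False, True, True, False, True, True, False, False]
sens, spec = sens_spec(scores, changed, cutoff=5)
```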

For example, Stratford et al^{85} looked at four questionnaires for assessing pain and function in patients with low back pain. They set different cutoff scores to represent change for each test and constructed four ROC curves. They then compared the area under the curves to determine which would be preferred for detecting change over time in this population (see Figure 27.10).

###### FIGURE 27.10

Receiver operating characteristic (ROC) curves for four low back pain questionnaires: Roland-Morris (RM), Jan van Breeman (JVB) pain scale and function scale, and the Oswestry (OSW). Areas under the curve were as follows: RM = 0.79; JVB pain = 0.79; JVB function = 0.66; and OSW = 0.78. A significant difference was found between the area for the JVB function questionnaire and the other three questionnaires. (From Stratford PW, Binkley J, Solomon P, et al. Assessing change over time in patients with low back pain. *Phys Ther* 1994; 74:528–533, Figure p. 531. Reprinted with permission of the American Physical Therapy Association.)

In an **anchor-based** approach the magnitude of a change score is interpreted according to some clinical criterion or "anchor" that is assumed to have inherent meaning. A common anchor is the patient's ordinal rating of improvement or decline—"I feel a little better," "a lot better," "a little worse," or "a lot worse." A clinician may apply an anchor that relates to a minimal change in function or passing a threshold of impairment that points to a change in treatment or goals. The disadvantage of anchor-based methods is that they do not take into account the variability or potential measurement error in an instrument. Therefore, it is important to establish the reliability of an instrument when using it to estimate important change. Recall bias may also affect a patient's accurate estimate of improvement or decline.

The construct of important difference has most often been evaluated using an ordinal scale, based on the patient's or clinician's subjective rating of change. Scales generally range from "a great deal worse" to "a great deal better," with as few as 5 points^{71} and as many as 15 points,^{77,85,93} with zero indicating no change (see Table 27.8).

Rating | Description of Change |
---|---|
7 | A very great deal better |
6 | A great deal better |
5 | Quite a bit better |
4 | Moderately better |
3 | Somewhat better |
2 | A little bit better |
1 | Tiny bit better, almost the same |
0 | No change |
−1 | Tiny bit worse, almost the same |
−2 | A little bit worse |
−3 | Somewhat worse |
−4 | Moderately worse |
−5 | Quite a bit worse |
−6 | A great deal worse |
−7 | A very great deal worse |

For example, Beninato et al^{94} used this approach to assess clinical change in function following stroke for patients who were discharged from a rehabilitation hospital. They used physician ratings on a 15-point scale to determine the MCID for the Functional Independence Measure (FIM). Based on a cutoff score of 3 (somewhat better) to distinguish those who had achieved a MCID from those who had not, they identified a change of 22 points in the total FIM score as a meaningful difference in function from admission to discharge. Using a cutoff of 5 (quite a bit better), the MCID was 27 points, which they defined as a "moderate important clinical change."
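One common way to derive such a cutoff (the exact procedure used by Beninato et al may differ) is to choose the change score that maximizes Youden's index, sensitivity + specificity − 1, against the anchor-based classification. A minimal sketch with hypothetical FIM change scores:

```python
def mcid_by_youden(change_scores, improved):
    """Choose the cutoff maximizing Youden's J = sensitivity + specificity - 1,
    judged against an anchor-based improved/not-improved classification."""
    n_improved = sum(improved)
    n_stable = len(improved) - n_improved
    best_cut, best_j = None, -1.0
    for cut in sorted(set(change_scores)):
        # True positives: improved patients whose change meets the cutoff
        tp = sum(c >= cut and i for c, i in zip(change_scores, improved))
        # True negatives: stable patients whose change falls below it
        tn = sum(c < cut and not i for c, i in zip(change_scores, improved))
        j = tp / n_improved + tn / n_stable - 1
        if j > best_j:
            best_cut, best_j = cut, j
    return best_cut

# Hypothetical FIM change scores; anchor ratings dichotomized at "somewhat better"
fim_change = [5, 30, 22, 10, 28, 25, 8, 18, 35, 12]
improved = [False, True, True, False, True, True, False, False, True, False]
mcid = mcid_by_youden(fim_change, improved)
```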

Wolfe et al^{95} have suggested that while MCID is an important minimum, a "really important difference" represents a clinically important goal. They used several outcome measures with patients with rheumatoid arthritis to reflect change at this level based on satisfaction with health, independence and disability level.

When assessing change, anchor-based methods are generally preferred over distribution-based methods because they reflect a definition of what is considered important.^{96} A combined strategy, however, using both approaches, may provide a stronger foundation for understanding meaningful change.^{97,98,99}

Group values must be interpreted with reference to their sample size and variability. We know that statistical significance is greatly influenced by the number of subjects in a sample. Therefore, with a large sample, small differences may turn out to be significant even when they are meaningless. We also recognize that a mean is a measure of central tendency, and that individuals in the sample do not all experience that amount of change—some will have achieved more and some less. Therefore, any conclusions about an individual patient's response based on a mean may be seriously flawed.

Consider, for example, a randomized trial of 1,000 patients who receive physical therapy to increase knee ROM following knee arthroplasty. Assume the results show a significant mean difference of 5° ROM. While statistically significant because of high power, this is probably an unimportant difference. On the other hand, consider the same situation with a smaller sample, in which a mean difference of 5° ROM is not significant. In this case, the researcher would probably conclude that therapy is not effective. This conclusion, however, ignores the possibility that treatment could have had a heterogeneous effect,^{100} and some patients may have had much larger changes in ROM. Let's assume that a minimum increase of 15° is considered important. We would need to look at the data to see how many subjects actually achieved an increase of at least 15° in order to judge whether the treatment really was effective (see discussion of *Number Needed to Treat* in Chapter 28). Without examining variability within the sample, we may miss important information.

When studying a dichotomous variable, decisions about improvement or decline are straightforward: the patient has either gotten better or not. When dealing with a continuous measure, however, it is necessary to determine how much change is meaningful. Therefore, the proportion of individuals in a group who achieve minimal change can be considered another important benchmark for evaluating an intervention's effectiveness. The **MDD proportion** or **MCID proportion** is the percentage of patients who exceed the minimal standard of change, based either on the minimal detectable change or the minimal clinically important change. These values can be especially useful for examining group data for program evaluation and quality assurance.^{97}
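A minimal sketch of this calculation, assuming hypothetical change scores and threshold:

```python
def change_proportion(change_scores, threshold):
    """Percentage of patients whose change meets or exceeds a minimal
    standard of change (an MDD or an MCID)."""
    n_meeting = sum(1 for c in change_scores if c >= threshold)
    return 100.0 * n_meeting / len(change_scores)

# Hypothetical change scores; the threshold could be an MDD or an MCID
changes = [3, 12, 7, 15, 2, 9, 11, 6, 14, 8]
pct = change_proportion(changes, threshold=8)  # MCID proportion here
```

The same function yields either benchmark, depending on whether the threshold supplied is a detectable-change or a clinically-meaningful-change value.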

In their study of responsiveness for the WOMAC, described earlier, Quintana et al^{71} determined that the MDD proportion was greater than 80% for the pain and function subscales, and the MCID proportion was between 70% and 80%, 2 years after hip joint replacement. These values demonstrate that most patients in this population do consider themselves "better." In another example, in a study of the effectiveness of a fitness intervention for children with disabilities, Fragala-Pinkham et al^{101} found that 59% of those with developmental disabilities exceeded the MDD, as compared to only 29% of those with neuromuscular disabilities. They suggest that these data are more informative for evaluating the effects of treatment than overall mean changes.

Understanding the statistical bases for measurement validity is essential as we strive to make informed evidence-based decisions. Some validity estimates, such as those used to measure diagnostic accuracy, are readily recognized as appropriate methods for assessing validity. Others, such as methods for evaluating change, are still evolving. The variety of indices used to assess change can be daunting, and, more importantly, there is often confusion in terminology. Many authors have used "minimal detectable difference" and "minimal clinically important difference" as synonymous terms, although the two have been clearly defined and distinguished. These benchmarks must remain distinct if we are to truly understand our measurements.

Clinical judgments regarding validity of measurements must be based on some criterion that is relevant for a particular patient. Any given study will present data from a sample that has specific properties and that has been studied in a specific context over a given time period. Clinicians must appraise that information to determine if it is appropriately applied to their patients. Published values allow us to predict our own patients' responses and give us a foundation for decision making. It is essential, however, that we remain cognizant of the limits of statistics as we apply them to our own situations.

As our understanding of validity grows, we will continue to struggle with the definitions of clinical significance. The evidence-based practitioner will benefit from more complete reporting of likelihood ratios, effect sizes and minimal change values in clinical studies. Whenever possible, confidence intervals should be used to reflect population values. Estimates are needed for different settings, age groups, disease durations and baseline conditions. Clinicians, patients and health policy analysts all want to appreciate just how much better is "better."

*Disabil Rehabil* 2003;25:45–50. [PubMed: 12554391]

*Anesth Analg* 2005;101:1215–1220. [PubMed: 16192548]

*Arch Phys Med Rehabil* 2002;83:538–542. [PubMed: 11932858]

*J Gerontol* 1990;45:M192–197. [PubMed: 2229941]

*Evidence-Based Medicine: How to Practice and Teach EBM* (3rd ed.). Edinburgh: Churchill Livingstone, 2005.

*Aust J Physiother* 2002;48:227–232. [PubMed: 12217073]

*Acad Emerg Med* 2005;12:587–593. [PubMed: 15995088]

*BMJ* 2002;324:729–732. [PubMed: 11909793]

*J Clin Epidemiol* 2005;58:1211–1216. [PubMed: 16223666]

*Acad Emerg Med* 2004;11:692–694. [PubMed: 15175211]

*J Gen Intern Med* 2003;18:203–208. [PubMed: 12648252]

*Am Fam Physician* 2001;64:111–116.

*J Bone Joint Surg Am* 2006;88:1927–1933. [PubMed: 16951107]

*Am J Public Health* 2006;96:1593–1598. [PubMed: 16873740]

*Cerebrovasc Dis* 2005;19:323–327. [PubMed: 15795507]

*Evidence-Based Clinical Practice: Concepts and Approaches.* Boston: Butterworth Heinemann, 2000.

*Lancet* 2005;366:548. [PubMed: 16099289]

*Aust Prescr* 2003;26(3):111–113.

*BMJ* 1999;318:1322–1323. [PubMed: 10323817]

*p*-values or narrow confidence intervals: Which are more durable? *Epidemiology* 2001;12:291–294. [PubMed: 11337599]

*Clin Radiol* 2003;58(8):575–580. [PubMed: 12887949]

*Am J Clin Pathol* 2003;119(1):18–22. [PubMed: 12520693]

*Ann Intern Med* 2003;138(1):40–44. [PubMed: 12513043]

*BMJ* 2003;326(7379):41–44. [PubMed: 12511463]

*Med Decis Making* 1991;11:102–106. [PubMed: 1865776]

*JAMA* 2000;284:79–84. [PubMed: 10872017]

*JAMA* 1997;277:488–494. [PubMed: 9020274]

*N Engl J Med* 1985;313:793–799. [PubMed: 3897864]

*JAMA* 1994;271:827–832. [PubMed: 8114236]

*Ann Emerg Med* 2001;38:364–368. [PubMed: 11574791]

*Ann Emerg Med* 2003;42:48–55. [PubMed: 12827123]

*BMJ* 2003;326:417–423. [PubMed: 12595378]

*Int J Clin Pract* 2003;57:625–627. [PubMed: 14529066]

*Ann Intern Med* 2005;143:129–139. [PubMed: 16027455]

*Curr Opin Pulm Med* 2004;10:345–349. [PubMed: 15316430]

*Arch Intern Med* 2000;160:2855–2862. [PubMed: 11025796]

*Osteoporos Int* 2002;13:400–406.

*JAMA* 1996;275:611–615. [PubMed: 8594242]

*JAMA* 1997;278:2075–2079. [PubMed: 9403421]

*JAMA* 2001;286:1841–1848. [PubMed: 11597285]

*Diabetes Care* 2005;28:404–408. [PubMed: 15677800]

*J Am Geriatr Soc* 2000;48:170–178. [PubMed: 10682946]

*Med Care* 2000;38:820–835. [PubMed: 10929994]

*Am J Manag Care* 2003;9:538–547. [PubMed: 12921231]

*CMAJ* 2005;172:1559–1567. [PubMed: 15939915]

*Arch Phys Med Rehabil* 2005;86:1753–1762. [PubMed: 16181938]

*Man Ther* 2005;11:306–315. [PubMed: 16380287]

*BMC Fam Pract* 2005;6:29. [PubMed: 16018809]

*Am J Med* 2000;109:15–19. [PubMed: 10936473]

*J Orthop Sports Phys Ther* 2006;36(11):854–866. [PubMed: 17154139]

*Arch Intern Med* 2004;164:509–513. [PubMed: 15006827]

*Ann Emerg Med* 2001;37:259–266. [PubMed: 11223761]

*J Fam Pract* 2004;53:785–788. [PubMed: 15469773]

*Acad Emerg Med* 2005;12:948–956. [PubMed: 16166599]

*Am J Emerg Med* 2005;23:725–729. [PubMed: 16182978]

*CMAJ* 1999;160:1165–1168. [PubMed: 10234347]

*Phys Ther* 2006;86:122–131. [PubMed: 16386067]

*J Clin Epidemiol* 1997;50:239–246. [PubMed: 9120522]

*J Clin Epidemiol* 1999;52:861–873.

*Phys Ther* 1996;76:359–365; discussion 366–368. [PubMed: 8606899]

*J Rheumatol* 2001;28:400–405. [PubMed: 11246687]

*Osteoarthritis Cartilage* 2005;13:1076–1083. [PubMed: 16154777]

*BMC Musculoskelet Disord* 2005;6:3. [PubMed: 15679884]

*Med Care* 2003;41:582–592. [PubMed: 12719681]

*Psychol Rev* 1956;63:81–97. [PubMed: 13310704]

*J Rheumatol* 2001;28:914–917. [PubMed: 11327276]

*J Clin Epidemiol* 2003;56:395–407. [PubMed: 12812812]

*Control Clin Trials* 1989;10:407–415. [PubMed: 2691207]

*Ann Intern Med* 1993;118:622–629. [PubMed: 8452328]

*J Clin Oncol* 1998;16:139–144. [PubMed: 9440735]

*Med Care* 1989;27(3):S178–S189. [PubMed: 2646488]

*Statistical Power Analysis for the Behavioral Sciences* (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum, 1988.

*Med Care* 1990;28:632–642. [PubMed: 2366602]

*CMAJ* 1986;134:889–895. [PubMed: 3955482]

*Phys Ther* 1996;76:1109–1123. [PubMed: 8863764]

*Phys Ther* 1994;74:528–533. [PubMed: 8197239]

*Med Care* 2000;38(9 Suppl):II166–174. [PubMed: 10982103]

*Qual Life Res* 1993;2:221–226. [PubMed: 8401458]

*Mayo Clin Proc* 2002;77:371–383. [PubMed: 11936935]

*J Rheumatol* 1993;20:535–537. [PubMed: 8478866]

*J Chronic Dis* 1987;40:171–178. [PubMed: 3818871]

*J Clin Epidemiol* 1997;50:869–879. [PubMed: 9291871]

*J Chronic Dis* 1986;39:897–906. [PubMed: 2947907]

*Phys Ther* 1998;78:1186–1196. [PubMed: 9806623]

*Arch Phys Med Rehabil* 2006;87:32–39. [PubMed: 16401435]

*J Rheumatol* 2005;32:583–589. [PubMed: 15801011]

*Health Qual Life Outcomes* 2006;4:54. [PubMed: 16925807]

*Phys Ther* 2006;86:735–743. [PubMed: 16649896]

*J Clin Epidemiol* 2004;57:898–910. [PubMed: 15504633]

*J Pain Symptom Manage* 2002;24:547–561. [PubMed: 12551804]

*BMJ* 1998;316:690–693. [PubMed: 9522799]

*Pediatr Phys Ther* 2006;18:159–167. [PubMed: 16735864]