Many measuring instruments are specifically designed as screening or diagnostic tools. In a traditional medical framework, a diagnostic test is used to determine the presence or absence of a disease or abnormal condition. A screening test is usually done on individuals who are asymptomatic, to identify those at risk for certain disorders, and to classify patients who are likely to benefit from specific intervention strategies. Because these procedures involve allocation of resources, present potential risks to patients and are used for clinical decision making, it is important to verify their validity.
The results of a diagnostic or screening procedure may be dichotomous, categorical or continuous. The simplest tests will have only a dichotomous outcome: positive or negative, such as pregnancy or HIV status. A categorical test would involve ratings on an ordinal scale, such as +++, ++, +, − to reflect degree of sensation or reflexes. A continuous scale provides the most information regarding the outcome, such as a test measuring degrees of range of motion or hearing decibel level. Ordinal and continuous scales are often converted to dichotomous outcomes using cutoff scores to indicate a "normal" or "abnormal" response.
The ideal diagnostic test, of course, would always be accurate in discriminating between those with and without the disease or condition; it would always have a positive result for someone with the condition, whether a mild or severe case, and a negative result in everyone else. But we know that such tests are not perfect. They may miss abnormalities in those with a particular disorder, or they may identify abnormalities in those without the disorder.
We determine how good a test is by comparing the test result with known diagnostic findings obtained by a reference standard.∗ The reference standard will reflect the patient's true status, either the presence or absence of the condition. The assumption is made that the individual performing the test is blind to the true condition, eliminating possible bias. In some situations, the reference standard will be a concurrent test, such as an X-ray or blood test. In other situations, it will be obtained at a future time, as with a long-term outcome or autopsy. Sometimes there is no clear standard, and one must be defined or created. For instance, studies related to falls often use the patient's report of a fall within the past 6 months or year as the standard for being a "faller" or "nonfaller."1 Studies of delirium in hospitalized patients have used expert opinion as the reference standard to validate measures of confusion.2 When objective definitive standards are not available, the reference must be adequately described so that others can determine its applicability.
Sensitivity and Specificity
The validity of a diagnostic test is evaluated in terms of its ability to accurately assess the presence and absence of the target condition. A diagnostic test can have four possible outcomes, summarized in the 2 × 2 arrangement shown in Table 27.1. Classification is assigned according to the true presence or absence of disease (Dx+ or Dx−) versus positive or negative test results. In Table 27.1 the cells labeled a and d represent true positives and true negatives, respectively, that is, individuals who are correctly classified by the test as having or not having the target condition. Cell b reflects those who are incorrectly identified as having the condition, or false positives, and cell c represents those who are incorrectly identified as not having the condition, or false negatives.
TABLE 27.1 SUMMARY OF ANALYSIS FOR DIAGNOSTIC TEST RESULTS
Sensitivity is the test's ability to obtain a positive test when the target condition is really present, or the true positive rate. Using the notation presented in Table 27.1,

Sensitivity = a / (a + c)
This value is the proportion of individuals who test positive for the condition out of all those who actually have it, or the probability of obtaining a correct positive test in patients who have the target condition. The sensitivity of a test increases as the number of persons with the condition who are correctly classified increases; that is, fewer persons with the disorder are missed.
Specificity is the test's ability to obtain a negative test when the condition is really absent, or the true negative rate. As shown in Table 27.1,

Specificity = d / (b + d)
This value is the proportion of individuals who test negative for the condition out of all those who are truly normal, or the probability of a correct negative test in those who do not have the target condition. A highly specific instrument will rarely test positive when a person does not have the disease.
The complement of sensitivity (1 – sensitivity) is the false negative rate, or the probability of obtaining an incorrect negative test in patients who do have the target disorder. The complement of specificity (1 − specificity) is the false positive rate, sometimes called the "false alarm" rate.3 This is the probability of an incorrect positive test in those who do not have the target condition.
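A minimal Python sketch of these four rates, using the a–d cell notation of Table 27.1 (the counts here are illustrative, not data from the chapter):

```python
def sensitivity(a, c):
    """True positive rate: positives correctly detected, a / (a + c)."""
    return a / (a + c)

def specificity(b, d):
    """True negative rate: negatives correctly detected, d / (b + d)."""
    return d / (b + d)

# Illustrative cell counts: a = true positives, b = false positives,
# c = false negatives, d = true negatives.
a, b, c, d = 80, 10, 20, 90

sens = sensitivity(a, c)           # 80 / 100 = 0.80
spec = specificity(b, d)           # 90 / 100 = 0.90
false_negative_rate = 1 - sens     # 0.20 (complement of sensitivity)
false_positive_rate = 1 - spec     # 0.10 (complement of specificity)
```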
To illustrate the application of these measures, let's consider a study of the validity of the Functional Reach Test (FRT) to identify elders with Parkinson's disease who are at risk for falls.4 The FRT is designed to assess anterior-posterior stability by measuring the maximum distance an individual can reach while leaning forward over a fixed base of support.5 Based on previous research, a cutoff score of 10 in. (25.4 cm) was used to classify subjects as "at risk" or "not at risk." Screening results were compared with a known history of falls (the reference standard), as shown in Table 27.2A.
TABLE 27.2 SUMMARY OF ANALYSIS OF SCREENING TEST RESULTS FOR FUNCTIONAL REACH TEST (FRT) IN PERSONS WITH PARKINSON'S DISEASEa
The sensitivity of the test for this population was low, at 30%. Of the 30 patients identified as having a history of falls, only 9 tested positive using the FRT. The specificity of the test, however, was 92%. Of the 13 patients who did not have a history of falls, 12 tested negative. Therefore, although almost all of those not at risk were correctly identified (true negatives), a large percentage of patients who were at risk were missed (false negatives). The graphic in Table 27.2A illustrates these proportions.
In addition to sensitivity and specificity, the usefulness of a clinical screening tool can be assessed by its feasibility. A test must demonstrate that it is an efficient use of time and resources and that it yields a sufficient number of accurate responses to be clinically useful. This characteristic is assessed by the test's predictive value. A positive predictive value (PV+) estimates the likelihood that a person who tests positive actually has the disease. Using the notation given in Table 27.1,

PV+ = a / (a + b)
which represents the proportion of those who tested positive who were true positives. Therefore, a test with a high positive predictive value will provide a strong estimate of the actual number of patients who have the target condition. Similarly, a negative predictive value (PV−) indicates the probability that a person who tests negative is actually disease free. Therefore,

PV− = d / (c + d)
which is the proportion of all those who tested negative who were true negatives. A test with a high negative predictive value will provide a strong estimate of the number of people who do not have the target condition.
For the FRT study (see Table 27.2), the positive predictive value (PV+) of 90% tells us that almost all of those who tested positive actually had a history of falls. Only one patient who tested positive was not at risk. The negative predictive value (PV−) was lower, at 36%. Therefore, only about one-third of patients who tested negative were actually not at risk.
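All four indices can be recovered directly from the cell counts in Table 27.2A (a = 9, b = 1, c = 21, d = 12), reproducing the values quoted in the text:

```python
# FRT example, Table 27.2A: 30 patients with a fall history (9 tested positive),
# 13 without a fall history (12 tested negative).
a, b, c, d = 9, 1, 21, 12

sens = a / (a + c)     # 9/30  = 0.30 (sensitivity)
spec = d / (b + d)     # 12/13 ≈ 0.92 (specificity)
pv_pos = a / (a + b)   # 9/10  = 0.90 (PV+)
pv_neg = d / (c + d)   # 12/33 ≈ 0.36 (PV-)
```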
Predictive value may be of greatest importance in deciding whether or not to implement a screening program. When the positive predictive value is low, only a small portion of those who test positive actually have the target condition. Therefore, considerable resources will probably be needed to evaluate these people further to separate false positives, or unnecessary treatments will be applied. Policy decisions are often based on a balance between the use of available resources and the potential harmful effects resulting from not identifying those with the target condition.6
Sensitivity, specificity and predictive value are influenced by the prevalence of the target condition in the population. Prevalence refers to the number of cases of a condition existing in a given population at any one time. For a test with a given sensitivity and specificity, the likelihood of identifying cases with the condition is increased when prevalence is high (the condition is common). Therefore, when prevalence is high, a test will tend to have a higher positive predictive value. This is illustrated in Table 27.2 for our example of fall risk. The prevalence of a history of falls is 30 out of 43 patients, or 70%. Therefore, a large proportion of patients in this sample had a history of falls, and we could expect a high PV+, which was 90%. When prevalence is low (the condition is rare), one can expect many more false positives, just by chance. A positive predictive value can be increased either by increasing the specificity of the test (changing the criterion) or by targeting a subgroup of the population that is at high risk for the target condition.
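The link between prevalence and positive predictive value follows from Bayes' theorem: PV+ = (sensitivity × prevalence) / (sensitivity × prevalence + (1 − specificity) × (1 − prevalence)). A short Python sketch holds the FRT's rounded sensitivity and specificity fixed while prevalence varies; the lower prevalence values are hypothetical, not from the study:

```python
def ppv(sens, spec, prevalence):
    """Positive predictive value from sensitivity, specificity and prevalence (Bayes)."""
    true_pos = sens * prevalence          # probability of a true positive result
    false_pos = (1 - spec) * (1 - prevalence)  # probability of a false positive result
    return true_pos / (true_pos + false_pos)

for prev in (0.70, 0.10, 0.01):
    print(f"prevalence {prev:.0%}: PV+ = {ppv(0.30, 0.92, prev):.2f}")
# At 70% prevalence PV+ is about 0.90, matching the FRT example; at 1% prevalence
# it falls below 0.04 even though sensitivity and specificity are unchanged.
```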
When we consider the diagnostic accuracy of a test, high values of sensitivity and specificity provide a certain level of confidence in interpretation. If a test has high sensitivity, it will properly identify most of those who have the disorder. If the test has high specificity, it will properly identify most of those without the condition. But how do these definitions relate to confidence in diagnostic decisions? Consider these two questions:
If a patient has a positive test, can we be confident in ruling IN the diagnosis?
If a patient has a negative test, can we be confident in ruling OUT the diagnosis?
Sensitivity and specificity help us answer these questions, but probably not the way you would expect. When a test has high specificity, a positive test rules in the diagnosis. When a test has high sensitivity, a negative test rules out the diagnosis. Straus and colleagues7 offer two mnemonics to remember these relationships: SpPin (a Specific test, when Positive, rules the diagnosis in) and SnNout (a Sensitive test, when Negative, rules the diagnosis out).
Think of it this way: A highly specific test will properly identify most of the patients who do not have the disorder. If the test is so good at finding those who are normal, we can be pretty sure that someone with a positive test does have the disorder (ruling IN the diagnosis) because if he didn't have the disorder, the test would have correctly identified him as normal! Conversely, a highly sensitive test will find most of those who do have the disorder. Therefore, we can be pretty sure that someone with a negative test does not have the disorder (ruling OUT the diagnosis) because if he did have the disorder, the test would have correctly diagnosed him!
These concepts are also related to predictive value. With a more specific test, negative cases are identified more readily. Therefore, it is less likely that an individual with a positive test will actually be normal. This results in a high positive predictive value. With a more sensitive test, positive cases are identified more readily; that is, we will not miss many true cases. Therefore, it is less likely that an individual with a negative test will have the disease. This leads to a high negative predictive value.
If we use the example of the Functional Reach Test (Table 27.2), with specificity of 92% (and a PV+ of 90%), we can be confident that someone with a positive test is at risk for falls. However, with sensitivity of only 30% (and a PV− of 36%), if someone has a negative test, we cannot be sure that person is really not at risk. Because the test is not good at finding those who are at risk, having a negative test does not help us safely rule out this risk.
Pretest and Posttest Probabilities
The ultimate purpose of a diagnostic test is to help the clinician make a decision about the presence or absence of a disorder for an individual patient. The validity of a test is based on how strongly it can support a decision to rule the disorder in or out. Therefore, a test is considered a good one if it can help to increase our certainty about a patient's diagnosis.
When we begin to evaluate a patient by taking a history and using screening or other subjective procedures, we begin to rule in and rule out certain conditions and eventually generate a hypothesis about the likely diagnosis. This hypothesis can be translated into a measure of probability or confidence, indicating the clinician's estimate of how likely a particular disorder is present. This has been termed the pretest probability (or prior probability) of the disorder—or what we think might be the problem before we perform any formal testing.8
Finding the Pretest Probability. The process for determining a pretest probability is not an obvious one. Conceptually, it represents a "best guess" or clinical impression based on experience and clinical judgment. Clinicians may have sufficient experience with certain types of patients to estimate the probability of a diagnosis based on initial examination findings,9,10 although such estimates are not always reliable.11,12 Using the functional reach scenario, a clinician might have sufficient experience with patients with Parkinson's disease to generate an initial hypothesis about the patient's likelihood to fall.
Information from the literature can also be used to help with this estimate by referring to the prevalence of a disorder.13 For instance, studies have shown that the prevalence of idiopathic scoliosis in children aged 10–16 years is 2–4%;14 26% of patients with orthopedic trauma have been found to experience depression;15 34% of children who have been enrolled in special education classes have been diagnosed with asthma;16 the presence of postoperative delirium following hip fracture repair is estimated at 36%;2 and the prevalence of mortality 1 month after a stroke in patients with prestroke dementia is 44%.17 These values, reported in the literature, allow the clinician to estimate the likelihood that any individual patient could have these disorders.
Suppose you are working with a patient with Parkinson's disease, and you believe she may be at risk for falling. You think it may be useful to perform a test to determine if such a risk is present. Consider the study of the Functional Reach Test once again (see Table 27.2). This study demonstrated a 70% prevalence of falls in its sample of patients with Parkinson's disease.4 Knowing this, before you have done any further testing, your best estimate is that the pretest probability of your patient being at risk for falls is 70%.
Decision-Making Thresholds. Being able to estimate a pretest probability is central to deciding if a condition is present and if testing or treatment is warranted. Based on the initial hypothesis and pretest probability of a condition, the clinician must decide if a diagnostic test is necessary or useful to confirm the actual diagnosis. Straus et al7 suggest that two thresholds should be considered, as shown in Figure 27.1A. With a very low pretest probability, the diagnosis is so unlikely that testing is not useful; that is, even with a positive test, results are likely to be false positives. Therefore, treatment is not initiated and other diagnoses need to be considered.
Thresholds for deciding to test or treat. (A) Thresholds based on pretest probabilities; (B) thresholds based on posttest probabilities. (Adapted from Straus SE, Richardson WS, Glasziou P, et al. Evidence-Based Medicine: How to Practice and Teach EBM (3rd ed.). Edinburgh, Churchill Livingstone, 2005, Figure 3.3, p. 85.)
With a very high pretest probability, the likelihood of the diagnosis is so strong that testing may be unnecessary, and treatment should just be initiated. Even with a negative test, results are likely to be false negatives. A strong pretest probability means that the results of a test are unlikely to offer any additional useful information. When the pretest probability is not definitive, however, with more intermediate values, testing is necessary to pursue the diagnosis, and treatment decisions will then be based on those results.
This approach must also take into consideration the relative severity of the disorder; the threshold for testing may vary for different conditions. For example, a patient may exhibit symptoms that lead a clinician to suspect the presence of a deep venous thrombosis (DVT), which is potentially life threatening. Even if the symptoms are minimal and the pretest probability is low, the clinician may feel compelled to test for the condition to safely rule it out before continuing with other interventions. At the same time, the clinician must also be able to justify that the benefits of performing the test outweigh any potential risks. A test that includes potentially harmful procedures may not be worthwhile if the condition has little consequence. Effective treatment should also be available, should the test be positive. The effort of a test is not reasonable if the results have no chance of leading to successful intervention.
A diagnostic test allows a clinician to revise the pretest probability estimate of the disorder.18 Once we have the data from a test, we expect to be more confident in the diagnosis; that is, we hope to improve our certainty. The revised likelihood of the diagnosis based on the outcome of a test is the posttest probability (or posterior probability)—what we think the problem is (or is not) now that we know the test result. A good test will allow us to have a very high posttest probability confirming the diagnosis, or a very low posttest probability causing us to abandon it (see Figure 27.1B). When the posttest probability is not definitive, further testing may be necessary.
Once we have established a hypothesis that the patient may have a particular diagnosis, we want to determine if a test can make us more confident in that diagnosis. A measure called the likelihood ratio helps us in this effort. The likelihood ratio tells us how much more likely it is that a person has the diagnosis after the test is done; that is, it will help us determine the posttest probability. It indicates the value of the test for increasing certainty about a diagnosis,19 or its "confirming power."20 Likelihood ratios are being reported more often in the medical literature as an important standard for evidence-based practice. The likelihood ratio has an advantage over sensitivity, specificity and predictive values because it is independent of disease prevalence, and therefore can be applied across settings and patients.
We can determine a likelihood ratio for a positive or negative test. To understand this statistic, let's assume that a patient tests positive on the FRT. If this were a perfect test, then we would be certain that the patient is at risk for falls (true positive). But we hesitate to draw this conclusion definitively because we know that some patients who are not at risk will also test positive (false positive). Therefore, to determine if this test improves our diagnostic conclusion we must correct the true positive rate by the false positive rate. This is our positive likelihood ratio (LR+):

LR+ = sensitivity / (1 − specificity)
The LR+ will tell us how many times more likely a positive test will be seen in those with the disorder than in those without the disorder. A good test will have a high positive likelihood ratio.
Now let's assume the patient has a negative test. With a perfect test we would be sure this patient was not at risk for falls. But we are still concerned about the possibility of a false negative. Therefore, to determine if a negative test improves our diagnostic conclusion, we look at the ratio of the false negative rate to the true negative rate. This is our negative likelihood ratio (LR−):

LR− = (1 − sensitivity) / specificity
The LR– will tell us how many times more likely a negative test will be seen in those with the disorder than in those without the disorder. A good test will have a very low negative likelihood ratio.
It is important to note that likelihood ratios always refer to the likelihood of the disorder being present.21 That's why we would like to see a high LR+, to indicate that the disorder is likely to be present with a positive test. A very low LR− means that the disorder has a small probability of being present with a negative test.
Interpreting Likelihood Ratios
The value of the likelihood ratio is somewhat intuitive, in that a larger LR+ indicates a greater likelihood of the disease, and a smaller LR– indicates a smaller likelihood of the disease. These values have been interpreted according to the following scale:18
An LR+ over 5 and an LR− lower than 0.2 represent relatively important effects. Likelihood ratios between 0.2 and 0.5 and between 2 and 5 may be important, depending on the nature of the diagnosis being studied. Values close to 1.0 represent unimportant effects. A likelihood ratio of 1.0 essentially means the test is useless; that is, the true positive and false positive (or true negative and false negative) rates are the same.
Let's apply this measure to the functional reach data. As shown in Table 27.2B, the LR+ = 3.75. Therefore, with a positive test, the likelihood of a patient being at risk for falls is increased by almost 4 times. This represents a potentially important value. The LR– = 0.76. This represents a small and unimportant value, close to 1.0. Therefore, based on these data, the FRT may help to improve our confidence with a positive test, but does not add important information with a negative test. Going back to the concepts of SpPin and SnNout, a large LR+ tells us that a positive test is good at ruling the disorder IN. A very low LR− tells us that the negative test is good at ruling the disorder OUT. We can confirm this by looking at the posttest probabilities.
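Both ratios follow directly from the rounded sensitivity and specificity in Table 27.2, as this minimal Python check shows:

```python
sens, spec = 0.30, 0.92   # FRT values (rounded) from Table 27.2

lr_pos = sens / (1 - spec)       # true positive rate / false positive rate = 3.75
lr_neg = (1 - sens) / spec       # false negative rate / true negative rate ≈ 0.76
```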
Using a Nomogram to Determine Posttest Probabilities
A nomogram, shown in Figure 27.2, has been developed to determine posttest probabilities based on pretest probabilities and likelihood ratios.22 To use the nomogram, we begin on the left by marking the pretest probability. The center line identifies the likelihood ratio. If we draw a line connecting these two points and extend it to the right margin, we find the posttest probability associated with the test.
Nomogram to determine posttest probabilities using likelihood ratios.
For our example, Figure 27.3 shows a mark for 70% pretest probability based on prevalence data. Therefore, if we obtain a positive test (LR+ = 3.75), our posttest probability would approach 90%. With a positive test we have improved our confidence in this patient being at risk for falls by almost 20 percentage points. If we obtained a negative test (LR– = 0.76), our posttest probability would be approximately 60%. The patient still has a 60% chance of being at risk for falls—we have not improved our diagnostic certainty very much. Therefore, with a negative test other assessments may be necessary to accurately identify if the patient is truly not at risk.
Use of nomogram to show posttest probability based on 70% pretest probability with LR+ = 3.75 and LR− = 0.76.
It is important to realize that the posttest probability will be dependent on the sensitivity and specificity of the diagnostic test (translated to a likelihood ratio) as well as the clinician's estimate of the pretest probability for an individual patient. For example, with a positive FRT, if we start with a pretest probability of 20%, we would get a posttest probability of 50% for a positive test. With a pretest probability of 5%, this test would increase our posttest certainty to only 15%. Where we start will influence the degree to which our certainty can be improved by the test.
Calculating Posttest Probabilities
When the nomogram is not handy, posttest probabilities can be obtained by converting the pretest probability to an odds value as follows:
1. Convert the pretest probability (prevalence) to pretest odds:

  Pretest odds = pretest probability / (1 − pretest probability)
2. Multiply the pretest odds by the likelihood ratio to get the posttest odds:

  Posttest odds = pretest odds × LR
3. Convert the posttest odds to the posttest probability:

  Posttest probability = posttest odds / (posttest odds + 1)
Once again, we show that with a 70% pretest probability and a LR+ of 3.75, our posttest probability has risen to 90%. This form of calculation will usually be more precise than using the nomogram. Several Internet programs are also available for calculation of posttest probabilities.23,24
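The three steps can be collected into one small Python function, checked here against the FRT example:

```python
def posttest_probability(pretest_prob, lr):
    """Probability -> odds (step 1), odds x LR (step 2), odds -> probability (step 3)."""
    pretest_odds = pretest_prob / (1 - pretest_prob)   # step 1
    posttest_odds = pretest_odds * lr                  # step 2
    return posttest_odds / (posttest_odds + 1)         # step 3

p_pos = posttest_probability(0.70, 3.75)  # ≈ 0.90 with a positive FRT
p_neg = posttest_probability(0.70, 0.76)  # ≈ 0.64 with a negative FRT
```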
When Several Tests Are Needed
Applying likelihood ratios to clinical practice will necessitate a strong understanding of diagnostic principles. The threshold for making a decision about a diagnosis may not be reached until several tests have been completed. When tests are performed serially, the posttest probability for one test can be used to estimate the pretest probability for the subsequent test. This is appropriate only when the tests are independent of each other. Straus et al7 recommend "chaining" likelihood ratios for this purpose. The posttest odds of the first test become the pretest odds for the second test. Therefore, by multiplying the new pretest odds by the likelihood ratio for the second test, we obtain the posttest odds for the second test. This can then be converted to posttest probability.
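Under the independence assumption, this chaining takes only a few lines of Python; the likelihood ratios (4.0 and 2.5) and the 30% pretest probability below are hypothetical values chosen for illustration:

```python
def prob_to_odds(p):
    return p / (1 - p)

def odds_to_prob(o):
    return o / (1 + o)

odds = prob_to_odds(0.30)        # pretest odds before test 1
odds *= 4.0                      # posttest odds after test 1 become pretest odds for test 2
odds *= 2.5                      # posttest odds after test 2
final_prob = odds_to_prob(odds)  # ≈ 0.81 posttest probability after both tests
```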
Sensitivity, specificity and likelihood ratios can also be expressed in terms of confidence intervals (see Table 27.2C). Although these values are often not reported, they are important to understanding the true nature of these estimates.25 Given a sample of scores, the confidence interval will indicate the range within which we can be sure the true population value will fall. Although not interpreted in terms of significance testing, confidence intervals for these measures of diagnostic accuracy will indicate the relative stability of the test's results; that is, with a wide confidence interval we would be less likely to consider the value a good estimate.26 Calculators for confidence intervals are available on the Internet.27,28
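The chapter does not specify how these intervals are computed; one common choice for proportions such as sensitivity and specificity is the Wilson score interval, sketched here in Python (an assumption, not necessarily the method behind the cited calculators):

```python
from math import sqrt

def wilson_ci(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

sens_ci = wilson_ci(9, 30)    # sensitivity 30%: roughly (0.17, 0.48)
spec_ci = wilson_ci(12, 13)   # specificity 92%: roughly (0.67, 0.99)
```

The wide interval around the 30% sensitivity estimate reflects the small sample of 30 fallers, illustrating why a value with a wide confidence interval is a less trustworthy estimate.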
Reporting Diagnostic Studies: The STARD Statement
In 2000, a consensus meeting of international researchers and journal editors resulted in a recommendation for quality reporting of diagnostic studies. This group developed the Standards for Reporting of Diagnostic Accuracy (STARD) statement, consisting of a checklist of 25 items that would allow authors to ensure they were including all relevant information in an article, including essential elements of the design and conduct of their study, the execution of tests, and their results.29 The checklist will also allow readers to determine the potential for bias in a study, and to judge the generalizability of the results.
The STARD checklist is shown in Table 27.3. The statement has been published in several journals, and the reader is encouraged to refer to any one of these references for detailed descriptions of item criteria.30,31,32,33 A flow diagram is also recommended to illustrate the number of participants at each stage of the study and to communicate the key elements of the design (see Figure 27.4).
Study profile flow diagram of patients with suspected appendicitis evaluated in an emergency department during a 6-month study period. At each point, the accuracy of a positive or negative test is documented. (From Garcia Pena, BM et al. Ultrasonography and limited computed tomography in the diagnosis and management of appendicitis in children. JAMA 1999;282:1041–1046, Figure 2, p. 1044. Used with permission of the American Medical Association.)
TABLE 27.3 STARD CHECKLIST OF ITEMS TO IMPROVE THE REPORTING OF STUDIES ON DIAGNOSTIC ACCURACY
| Section and topic | Item # | Description | On page # |
| --- | --- | --- | --- |
| TITLE/ABSTRACT/KEYWORDS | 1 | Identify the article as a study on diagnostic accuracy (recommend MeSH heading 'sensitivity and specificity'). | |
| INTRODUCTION | 2 | State the research questions or study aims, such as estimating diagnostic accuracy or comparing accuracy between tests or across participant groups. | |
| METHODS | | | |
| Participants | 3 | Describe the study population: the inclusion and exclusion criteria, setting and location(s) where the data were collected. | |
| | 4 | Describe participant recruitment: was recruitment based on presenting symptoms, results from previous tests, or the fact that the participants had received the index test(s) or the reference standard? | |
| | 5 | Describe participant sampling: was the study population a consecutive series of participants defined by the selection criteria in items (3) and (4)? If not, specify how patients were further selected. | |
| | 6 | Describe data collection: was data collection planned before the index test and reference standard were performed (prospective study) or after (retrospective study)? | |
| Test methods | 7 | Describe the reference standard and its rationale. | |
| | 8 | Describe technical specifications of material and methods involved, including how and when measurements were taken, and/or cite references for index tests and reference standard. | |
| | 9 | Describe definition of and rationale for the units, cutoffs and/or categories of the results of the index test(s) and the reference standard. | |
| | 10 | Describe the number, training and expertise of the persons executing and reading the index tests and the reference standard. | |
| | 11 | Describe whether or not the readers of the index tests and reference standard were blind (masked) to the results of the other test and describe any other clinical information available to the readers. | |
| Statistical methods | 12 | Describe methods for calculating or comparing measures of diagnostic accuracy, and the statistical methods used to quantify uncertainty (e.g., 95% confidence intervals). | |
| | 13 | Describe methods for calculating test reproducibility, if done. | |
| RESULTS | | | |
| Participants | 14 | Report when study was done, including beginning and ending dates of recruitment. | |
| | 15 | Report clinical and demographic characteristics of the study population (e.g., age, sex, spectrum of presenting symptoms, comorbidity, current treatments, recruitment centers). | |
| | 16 | Report the number of participants satisfying the criteria for inclusion that did or did not undergo the index tests and/or the reference standard; describe why participants failed to receive either test (a flow diagram is strongly recommended). | |
| Test results | 17 | Report time interval from the index tests to the reference standard and any treatment administered between. | |
| | 18 | Report distribution of severity of disease (define criteria) in those with the target condition; describe other diagnoses in participants without the target condition. | |
| | 19 | Report a cross-tabulation of the results of the index tests (including indeterminate and missing results) by the results of the reference standard; for continuous results, the distribution of the test results by the results of the reference standard. | |
| | 20 | Report adverse events from performing the index tests or the reference standard. | |
| Estimates | 21 | Report estimates of diagnostic accuracy and measures of statistical uncertainty (e.g., 95% confidence intervals). | |
| | 22 | Report how indeterminate results, missing responses and outliers of the index tests were handled. | |
| | 23 | Report estimates of variability of diagnostic accuracy between subgroups of participants, readers or centers, if done. | |
| | 24 | Report measures of test reproducibility, if done. | |
| DISCUSSION | 25 | Discuss the clinical applicability of the study findings. | |