Once the selection process is complete, the studies are ready to be critically reviewed. As a rule, a minimum of two primary reviewers will independently assess content and rate the quality and applicability of each selected paper or information source. The reviewers will therefore need to discuss the details of the process, including how the papers will be assessed. It is often useful for reviewers to evaluate a few sample papers as a training exercise. When disagreements occur, they should be resolved by consensus or by referral to a third party.
Reviewers will go through each study to describe its parameters, and record information on a data extraction form. This process allows the reviewers to gather the same information on each study, so that comparisons can be readily made. These forms generally record elements that are important for assessing quality of design and data analysis. A sample data extraction form is shown in Figure 16.2.
Because studies typically differ in their design, quality and validity, results of some trials may be more meaningful than others. Therefore, it is important to consider study quality.27 Four types of bias related to internal validity have been identified that can have an influence on the outcome of a systematic review.5 Selection bias is one of the most important factors that can distort treatment effects because of the way comparison groups are formed.28 Random allocation and concealment of allocation are essential elements of a clinical trial to assure that bias is not introduced. Performance bias refers to differences in the provision of care to experimental and control groups in a study. The most effective way to prevent this bias is through blinding of those who receive and give care. Attrition bias is related to the differential loss of subjects across comparison groups. This becomes especially relevant for studies with follow-up periods, and is addressed through intention to treat analysis. The fourth type is detection bias, which occurs if outcome assessment differs across comparison groups.
The use of quality assessment tools is an essential component of systematic reviews to determine if these forms of bias affect the value of included papers for answering the research question. Even with randomized trials, articles often neglect to provide important information on specific elements of the trial. Assessment of study validity may be used first as a threshold for deciding which studies to include in the review. As part of the review, the quality assessment will provide possible explanations for differences in study results.
Several rating scales have been developed. The criteria used and resultant ratings must be described in a systematic review so that readers can evaluate the validity of the reviewer's conclusions. Published checklists have varied numbers of items, with some based on "yes/no" answers for each item, while others allow for "unclear" responses. Most scales result in a total score based on summing the "yes" responses for each study. There is no gold standard for this process, and most scoring systems have not been validated. Nonetheless, these various scales generally focus on similar concepts as relevant to assessing the quality of a study.
We will describe three commonly used scales, although several others can be found in the literature.29,30,31,32,33,34 The CONSORT statement34 and the STARD statement,35 which are checklists of items that should be included in the report of a randomized trial or diagnostic study, can also be used for this purpose (see Chapters 9 and 27).
One of the original tools for evaluating quality is the Jadad scale, the Instrument to Measure the Likelihood of Bias, which is composed of three questions (see Table 16.2).36 This scale focuses on randomization, blinding and attrition to determine quality of a study. The maximum total score is 5. While simple and quick, the Jadad scale is limited in its scope and does not consider many important design issues.
TABLE 16.2 JADAD SCALE: INSTRUMENT TO MEASURE THE LIKELIHOOD OF BIAS
1. Was the study described as randomized? Yes (1 pt) / No (0 pt)
2. Was the study described as double blind? Yes (1 pt) / No (0 pt)
3. Was there a description of withdrawals and dropouts? Yes (1 pt) / No (0 pt)
Add 1 point:
For question 1 if the randomization method was described and it was appropriate.
For question 2 if the method of double blinding was described and it was appropriate.
Deduct 1 point:
For question 1 if the method of randomization was not appropriate.
For question 2 if the method of double blinding was not appropriate.
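The scoring rules above can be sketched in code. The following is a minimal illustration only; the function name and the "appropriate"/"inappropriate"/"not described" labels are assumptions for this sketch, not part of the published scale:

```python
def jadad_score(randomized, double_blind, withdrawals_described,
                randomization_method, blinding_method):
    """Illustrative Jadad tally (range 0-5).

    The two *_method arguments take one of "appropriate",
    "inappropriate", or "not described" (hypothetical labels).
    """
    # One point for each "yes" on the three base questions.
    score = int(randomized) + int(double_blind) + int(withdrawals_described)
    # Add a point if the randomization method was described and appropriate;
    # deduct one if it was described but inappropriate.
    if randomized:
        if randomization_method == "appropriate":
            score += 1
        elif randomization_method == "inappropriate":
            score -= 1
    # Same adjustment for the method of double blinding.
    if double_blind:
        if blinding_method == "appropriate":
            score += 1
        elif blinding_method == "inappropriate":
            score -= 1
    return score
```

Because a point can only be deducted for a question that already earned its base point, the total cannot fall below zero.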
The Physiotherapy Evidence Database (PEDro) scale has become widely used in rehabilitation and medical literature. Developed by physiotherapists at the University of Sydney, it is based on a description of the study's structure.37 In addition to items related to randomization, blinding and attrition, the scale also includes analysis of design and statistics (see Table 16.3). Each criterion is graded 1 for "yes" and 0 for "no" or "unclear," with a maximum total score of 10. The PEDro scale has reasonable reliability,38,39 and has been shown to be a more comprehensive measure of methodological quality than the Jadad scale.40
TABLE 16.3 PEDro SCALE
1. Eligibility criteria were specified. (no/yes)
2. Subjects were randomly allocated to groups (in a crossover study, subjects were randomly allocated an order in which treatments were received). (no/yes)
3. Allocation was concealed. (no/yes)
4. The groups were similar at baseline regarding the most important prognostic indicators. (no/yes)
5. There was blinding of all subjects. (no/yes)
6. There was blinding of all therapists who administered the therapy. (no/yes)
7. There was blinding of all assessors who measured at least one key outcome. (no/yes)
8. Measures of at least one key outcome were obtained from more than 85 percent of the subjects initially allocated to groups. (no/yes)
9. All subjects for whom outcome measures were available received the treatment or control condition as allocated or, where this was not the case, data for at least one key outcome was analyzed by "intention to treat." (no/yes)
10. The results of between-group statistical comparisons are reported for at least one key outcome. (no/yes)
11. The study provides both point measures and measures of variability for at least one key outcome. (no/yes)
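Each item is scored 1 for "yes" and 0 for "no" or "unclear." In the published scoring convention, item 1 (an external validity item) is not counted toward the total, which is why eleven items yield a maximum score of 10. A minimal sketch of the tally (the function name and dictionary layout are assumptions for illustration):

```python
def pedro_score(ratings):
    """Illustrative PEDro tally.

    `ratings` maps item number (1-11) to "yes", "no", or "unclear".
    Only "yes" responses on items 2-11 earn a point; item 1 is
    recorded but not counted, and "unclear" scores 0.
    """
    return sum(1 for item, response in ratings.items()
               if item >= 2 and response == "yes")
```

For example, a study rated "yes" on every item would score 10, not 11, because item 1 is excluded from the total.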
Whiting et al41 have validated a scale to review studies of diagnostic test accuracy, shown in Table 16.4. The Quality Assessment of Diagnostic Accuracy Studies (QUADAS) is a 14-item scale, which has been shown to have good rater reliability.42 Items are rated "yes," "no," or "unclear," and the total score is expressed as a percentage of items that are given a "yes" rating.
TABLE 16.4 QUALITY ASSESSMENT OF DIAGNOSTIC ACCURACY STUDIES: QUADAS
Was the spectrum of patients representative of the patients who will receive the test in practice?a
Were selection criteria clearly described?
Is the reference standard likely to correctly classify the target condition?
Is the time period between reference standard and index test short enough to be reasonably sure that the target condition did not change between the two tests?
Did the whole sample, or a random selection of the sample, receive verification using a reference standard of diagnosis?
Did patients receive the same reference standard regardless of the index test result?
Was the reference standard independent of the index test (i.e. the index test did not form part of the reference standard)?
Was the execution of the index test described in sufficient detail to permit replication of the test?
Was the execution of the reference standard described in sufficient detail to permit its replication?
Were the index test results interpreted without knowledge of the results of the reference standard?
Were the reference standard results interpreted without knowledge of the results of the index test?
Were the same clinical data available when test results were interpreted as would be available when the test is used in practice?
Were uninterpretable/intermediate test results reported?
Were withdrawals from the study explained?
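Because the QUADAS total is expressed as the percentage of items rated "yes," the computation is a simple proportion. A minimal sketch (the function name is hypothetical):

```python
def quadas_percent_yes(ratings):
    """Illustrative QUADAS summary score.

    `ratings` is a list of responses for the 14 items, each
    "yes", "no", or "unclear"; returns the percentage rated "yes".
    """
    return 100.0 * sum(r == "yes" for r in ratings) / len(ratings)
```

A study rated "yes" on 7 of the 14 items would therefore receive a score of 50%.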
Presentation of Methodologic Quality
Systematic reviews will usually include tabular results of the quality assessment as a consensus score between the two reviewers. Table 16.5 is an example of such a table using the PEDro score for a systematic review of hand splinting for adults following stroke.43 Each study in the review is identified and scores are shown for each criterion as well as a total score. Some reviewers choose a cutoff score to delineate a high versus low quality study. Others may include the level of evidence that each study achieves. This type of presentation allows the reader to quickly see the overall quality of the studies included in the review.
TABLE 16.5 METHODOLOGICAL RATING OF RANDOMIZED CONTROLLED TRIALS
PEDro Criterion Score

| Study | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | Total | Quality | Level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| McPherson et al., 1982 | Y | Y | N | N | N | N | N | Y | N | Y | Y | 4 | LOW | 1b |
| Rose et al., 1987 | Y | Y | N | N | N | N | N | N | N | Y | N | 2 | LOW | 1b |
| Poole et al., 1990 | N | Y | N | Y | N | N | Y | Y | N | Y | Y | 6 | HIGH | 2b |
| Langlois et al., 1991 | Y | Y | N | N | N | N | N | N | N | Y | Y | 3 | LOW | 1b |
| Lannin et al., 2003 | Y | Y | Y | Y | N | N | Y | Y | Y | Y | Y | 8 | HIGH | 1b |
Once the articles of interest have been critically reviewed, the researcher must then determine if and how the results of the studies can be synthesized. The reviewers will determine the degree of heterogeneity or homogeneity in the included studies. Heterogeneity refers to dissimilarity in specific aspects of the studies:44
Composition of treatment groups, including different inclusion and exclusion criteria, different baseline levels, or differences in timing or dose of intervention
Design of the study, including length of follow-up and proportion of subjects who dropped out
Management of patients, including how treatments are regulated and the presence of complications or co-morbid conditions
If papers have published conflicting or inconclusive findings, it will be difficult to interpret the results of the systematic review. We know that studies with small sample sizes or small effect sizes may show no significant effect of the intervention, leaving the possibility of a Type II error (see Chapter 18). We also know that the choice of measurement scale or tool may affect the sensitivity or responsiveness of measurement. Other study characteristics, such as criteria for subject selection or operational definitions, may have an impact on the ability to generalize findings.
Once again, a tabular presentation is helpful to understand the variations across studies and their overall findings. Table 16.6 illustrates one possible format for a systematic review of outcomes of cardiovascular exercise programs for people with Down Syndrome.45 Notice that the table includes information on sample size and characteristics of subjects, as well as a description of the intervention and outcome measures. This table also shows the PEDro score for each study, which allows the reader to analyze the findings of the study in relation to its quality.
TABLE 16.6 SUMMARY OF FINDINGS FROM FOUR STUDIES IN A SYSTEMATIC REVIEW
| Author | PEDro Score | n | Mean Age ± SD (y) | Sex | Severity of Intellectual Disability | Previous Exercise Participation | Program Details | Training Intensity | Body Structure/Function Outcomes |
|---|---|---|---|---|---|---|---|---|---|
| Rimmer et al | 6 | 52 | 39.4 ± 6.4 | 29 W, 23 M | Mild to moderate | Sedentary for at least 1 y prior to the program | 30-min aerobic machine-based (eg, treadmill, stationary bicycle) exercise program, 15-min PRE; 3/wk for 12 wk | 50%–70% Vo2peak | Vo2peak; time to exhaustion; bench press and leg press 1-RM; grip strength; body weight and BMI |
| Tsimaras et al | 5 | 25 | 24.6 ± 3.3 | 25 M | IQ 45–60 | Not reported | 10-min warm-up, 30-min jog/walking program; 3/wk for 12 wk | 65%–75% max HR assessed at start of program | Vo2peak, Vepeak, time to exhaustion |
| Varela et al | 6 | 16 | 21.4 ± 3.0 | 16 M | Mean IQ 38.8 | Not reported | 10-min warm-up, 25-min rowing program, 10-min cool-down; 3/wk for 16 wk | 55%–70% Vo2peak | Vo2peak, Vepeak, time to exhaustion, distance traveled, work level reached; body weight, body fat percentage |
| Millar et al | 6 | 14 | 17.7 ± 2.9 | 3 W, 11 M | IQ 30–70 | Not reported | 10-min warm-up, 30-min brisk walking/jogging, 10-min cool-down program; 3/wk for 10 wk | 65%–75% max HR | Vo2peak, Vepeak, time to exhaustion |
Discussion and Conclusions
The final section of the systematic review will be a discussion of findings and the reviewer's overall conclusions based on the quality of evidence that was obtained. This can be a complex process if studies have varying methods and results, as is often the case. The reviewer has the responsibility to integrate the findings to clarify the state of knowledge in a clinical context. By comparing studies in terms of their quality and procedures, the discussion will put the results in context. Suggestions for future research should be proposed.