The historical approach to testing reliability involved the use of correlation coefficients. In Chapter 5 we discussed the problems with this approach, in that it does not provide a measure of agreement, but only covariance (see Figure 5.1 in Chapter 5). Correlations are also limited as reliability coefficients because they are bivariate; that is, only two ratings or raters can be correlated at one time. It is not possible to assess the simultaneous reliability of more than two raters or the relationships among different aspects of reliability, such as raters, test forms, and testing occasions. As these are often important elements in reliability testing, correlation does not provide an efficient mechanism for evaluating the full scope of reliability.
Another objection to the use of correlation as a measure of reliability is based on the statistical definition of reliability; that is, correlation cannot separate out variance components due to error or true differences in a data set. Therefore, the correlation coefficient is not a true reliability coefficient. It is actually more accurate to use the square of the correlation coefficient (the coefficient of determination) for this purpose, because r² reflects how much variance in one measurement is accounted for by the variance in a second measurement (see Chapter 24). This is analogous to asking how much of the total variance in a set of data is shared by two measurements (the "true" variance) and how much is not shared (the error variance). If we could correlate true scores with observed scores in a set of data, the square of the correlation coefficient would be the reliability coefficient. We can confirm this interpretation using the data from Table 26.1A. For the correlation between observed and true scores, r = .66. Therefore, r² = .43.
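As a quick computational illustration of this idea, the following sketch computes r and r² for a small set of observed and "true" scores. The numbers are made up for the example (Table 26.1A is not reproduced here), so the result will not match the values quoted above.

```python
import numpy as np

# Hypothetical observed and "true" scores for six subjects
# (illustrative values only; not the data from Table 26.1A)
observed = np.array([12.0, 15.0, 11.0, 18.0, 14.0, 16.0])
true     = np.array([13.0, 14.0, 12.0, 17.0, 15.0, 15.0])

r = np.corrcoef(observed, true)[0, 1]   # Pearson product-moment correlation
r_squared = r ** 2                      # coefficient of determination

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
```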
To overcome the limitations of correlation as a measure of reliability, some researchers have used more than one reliability index within a single study. For instance, in a test-retest situation or a rater reliability study, both correlation and a t-test can be performed to assess consistency and average agreement between the data sets. This strategy does address the interpretation of agreement, but it is not useful in that it does not provide a single index to describe reliability. The scores may be correlated but significantly different (as in Table 26.1B), or they may be poorly correlated but not significantly different. How should these results be interpreted? It is much more desirable to use one index that can answer this question.
The intraclass correlation coefficient (ICC) is such an index. Like other reliability coefficients, the ICC ranges from 0.00 to 1.00. It is calculated using variance estimates obtained through an analysis of variance. Therefore, it reflects both degree of correspondence and agreement among ratings.
Statistically the ICC has several advantages. First, it can be used to assess reliability among two or more ratings, giving it broad clinical applicability. Second, the ICC does not require the same number of raters for each subject, allowing for flexibility in clinical studies.1 Third, although it is designed primarily for use with interval/ratio data, the ICC can be applied without distortion to data on the ordinal scale when intervals between such measurements are assumed to be equivalent.2 In addition, with data that are rated as a dichotomy (the presence or absence of a trait), the ICC has been shown to be equivalent to measures of nominal agreement, simplifying computation in cases where more than two raters are involved.1,3 Therefore, the ICC provides a useful index in a variety of analysis situations.
Another major advantage of the ICC is that it supports the generalizability model proposed by Cronbach as a comprehensive estimate of reliability.4,5 The concept of generalizability theory, introduced in Chapter 5, is based on the idea that differences between observed scores are due to a variety of factors, not just true score variance and random error. Differences occur because of variations in the measurement system, such as the characteristics of raters or subjects, testing conditions, alternate forms of a test, administrations of a test on different occasions, and so on. These factors are called facets of generalizability.
The essence of generalizability theory is that facets contribute to measurement error as separate components of variance, distinguishable from random error. In classical reliability theory, error variance is undifferentiated, incorporating all sources of measurement error. In generalizability theory, however, the error variance is multivariate; that is, it is further partitioned to account for the influence of specific facets on measurement error. Therefore, the generalizability coefficient (the ICC) is an extension of the reliability coefficient:
ICC = s²T / (s²T + s²F + s²E)     (26.4)
where s²T and s²E are the variances in true scores and error components, and s²F is the variance attributable to the facets of interest.6 The specific facets included in the denominator will vary, depending on whether raters, occasions, or some other facet is the variable of interest in the reliability study. For example, if we include rater as a facet, then the total observed variance would be composed of the true variance between subjects, the variance between raters, and the remaining unexplained error variance.
Equation (26.4) represents a conceptual definition of generalizability. Actual calculations require the use of variance estimates that are obtained from an analysis of variance, which, of course, does not include direct estimates of true variance (as this is unknown). Theoretically, however, we can estimate true score variance by looking at the difference between observed variance among subjects and error variance (s²T = s²X − s²E). These estimates can be derived from an analysis of variance.
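Written out, this substitution is a purely symbolic rearrangement of Equation (26.4). It is shown here only as a sketch of the logic, since the working formulas in Tables 26.3 and 26.4 are expressed in terms of ANOVA mean squares rather than these variance components:

```latex
\[
s^2_T = s^2_X - s^2_E
\qquad\Longrightarrow\qquad
\mathrm{ICC} \;=\; \frac{s^2_T}{s^2_T + s^2_F + s^2_E}
\;=\; \frac{s^2_X - s^2_E}{s^2_X + s^2_F}
\]
```

The unknown true score variance drops out, leaving a ratio built entirely from quantities that can be estimated in an analysis of variance.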
Classification of the ICC
There are actually six different equations for calculating the ICC, differentiated by the purpose of the reliability study, the design of the study, and the type of measurements taken. It is necessary to distinguish among these approaches, as under some conditions the results can be decidedly different. To facilitate explanations, we will proceed with this discussion in the context of a reliability study with rater as the facet of interest; however, we emphasize that these applications are equally valid for studying other facets.
Models of the ICC: Random and Fixed Effects
Shrout and Fleiss describe three models of the ICC.7 They distinguish these models according to how the raters are chosen and assigned to subjects.
Model 1. In model 1, each subject is assessed by a different set of k raters. The raters are considered randomly chosen from a larger population of raters; that is, rater is a random effect. However, the raters for one subject are not necessarily the same raters that take measurements on another subject. Therefore, in this design there is no way to associate a particular rater with the variables being measured.8 The only variance that can actually be assessed is the difference among subjects. Other sources of error variance, including rater or measurement error, cannot be separated out.
Model 2. Model 2 is the most commonly applied model of the ICC for assessing inter-rater reliability. In this design, each subject is assessed by the same set of raters. The raters are randomly chosen; that is, they are expected to represent the population of raters from which they were drawn, and results can be generalized to other raters with similar characteristics. Subjects are also considered to be randomly chosen from the population of individuals who would receive the measurement. Therefore, subject and rater are both random effects. This randomness may be only theoretical in practice; that is, we choose subjects and raters who we believe represent the populations of interest, as we do not have access to the entire population. But the intent of the study is to demonstrate that the measurement reliability can be applied to others.
Model 3. In model 3, each subject is assessed by the same set of raters, but the raters represent the only raters of interest. In this case, there is no intention to generalize findings beyond the raters involved. In this design, rater is considered a fixed effect because the raters have been purposely (not randomly) selected. Subjects are still considered a random effect. Therefore, model 3 is a mixed model. This model is used when a researcher wants to establish that specific investigators are reliable in their data collection, but the reliability of others is not relevant. Model 3 is also the appropriate statistic to measure intrarater reliability, as the measurements of a single rater cannot be generalized to other raters.7
Forms of the ICC: Single and Average Ratings
Each of the ICC models can be expressed in two forms, depending on whether the scores are single ratings or mean ratings. Most often, reliability studies are based on comparison of scores from individual raters. There are times, however, when the mean of several raters or ratings may be used as the unit of reliability. For instance, when measurements are unstable, it may be necessary to use the mean of several measurements as the individual's score to obtain satisfactory reliability. Using mean scores has the effect of increasing reliability estimates, as means are considered better estimates of true scores, theoretically reducing error variance.
The six types of ICC are classified using two numbers in parentheses. The first number designates the model (1, 2, or 3), and the second number signifies the form, using either a single measurement (1) or the mean of several measurements (k)∗ as the unit of analysis. For example, when using single measurements in a generalizability study, we would specify use of ICC(2,1). The type of ICC used should always be indicated.
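As a supplementary note not stated in the passage above, the average-measures form of each model can be obtained from the corresponding single-measure form through the Spearman–Brown prophecy formula. For k ratings:

```latex
\[
\mathrm{ICC}(\cdot,k) \;=\; \frac{k \,\mathrm{ICC}(\cdot,1)}{1 + (k-1)\,\mathrm{ICC}(\cdot,1)}
\]
```

For example, a single-measure ICC of .60 averaged over k = 3 ratings corresponds to (3)(.60)/[1 + (2)(.60)] = .82, which illustrates why the average-measures form is always the larger of the two.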
The ICC is based on measures of variance obtained from an ANOVA. For an interrater reliability study, rater is the independent variable; for an intrarater study, trial is the independent variable. Table 26.2 shows the arrangement of hypothetical data with raters as columns and subjects as rows. For an intrarater study, the columns would represent trials.
Model 1: One-Way ANOVA
For model 1, a one-way analysis of variance is run, with "subjects" as the independent variable. This ANOVA partitions the total variance into two parts—the variation between subjects and error, as shown in Table 26.3A. The between-subjects effect tells us whether the subjects' scores differ from each other, which we expect. The error component represents the variation within a subject across raters. Some of this error will be due to true scores changing from trial to trial, some will be due to rater error, and some will be unexplained. This ANOVA does not differentiate among these sources of error. Calculations for this model are shown in Table 26.3B using data from Table 26.2.
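The hand calculation can be sketched as follows, using made-up ratings (the Table 26.2 data are not reproduced here) and the Shrout and Fleiss single-measure and average-measures formulas for model 1, which are based on the one-way ANOVA mean squares:

```python
import numpy as np

# Hypothetical ratings: rows = subjects, columns = raters
# (illustrative values only; not the data from Table 26.2)
X = np.array([
    [9.0, 10.0,  6.0,  9.0],
    [7.0,  8.0,  5.0,  8.0],
    [4.0,  5.0,  8.0,  5.0],
    [6.0,  6.0,  7.0,  6.0],
    [8.0,  9.0,  4.0,  9.0],
])
n, k = X.shape

grand_mean = X.mean()
subject_means = X.mean(axis=1)

# One-way ANOVA with subjects as the grouping factor
ss_between = k * np.sum((subject_means - grand_mean) ** 2)
ss_within  = np.sum((X - subject_means[:, None]) ** 2)

bms = ss_between / (n - 1)        # between-subjects mean square
wms = ss_within / (n * (k - 1))   # within-subjects (error) mean square

icc_1_1 = (bms - wms) / (bms + (k - 1) * wms)   # single-measure form, ICC(1,1)
icc_1_k = (bms - wms) / bms                     # average-measures form, ICC(1,k)

print(f"ICC(1,1) = {icc_1_1:.3f}, ICC(1,{k}) = {icc_1_k:.3f}")
```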
Models 2 and 3: Repeated Measures ANOVA
For model 2, the ANOVA is performed as a two-way random effects model, in which both subjects and raters are considered to be randomly chosen from a larger population.† Therefore, the results of the study can be generalized to other raters and other subjects. For model 3, a two-way mixed model is run, with rater as a fixed effect (not randomly chosen) and subjects as a random effect. The numerical results of the analysis will actually be the same for both random and mixed types of ANOVA. The only difference will lie in the interpretation of the data. The results of a repeated measures analysis of variance are shown in Table 26.4.
The repeated measures ANOVA partitions the variance into effects due to differences between subjects, differences between raters and error variance. The F-ratio associated with the rater effect reflects the difference among raters, or the extent of agreement or disagreement among them. This effect is significant when the variance due to raters is large, indicating that the raters' scores are different from each other and not reliable. In this example, the rater effect is not significant (p = .130). Table 26.4 shows the calculation of both forms for models 2 and 3, using data from Table 26.2.
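A comparable hand-calculation sketch for models 2 and 3, again with made-up ratings rather than the Table 26.2 data, uses the Shrout and Fleiss formulas based on the two-way ANOVA mean squares:

```python
import numpy as np

# Hypothetical ratings: rows = subjects, columns = raters (illustrative only)
X = np.array([
    [9.0, 10.0,  6.0,  9.0],
    [7.0,  8.0,  5.0,  8.0],
    [4.0,  5.0,  8.0,  5.0],
    [6.0,  6.0,  7.0,  6.0],
    [8.0,  9.0,  4.0,  9.0],
])
n, k = X.shape
grand_mean = X.mean()
subject_means = X.mean(axis=1)
rater_means = X.mean(axis=0)

# Two-way (subjects x raters) partition of the total sum of squares
ss_subjects = k * np.sum((subject_means - grand_mean) ** 2)
ss_raters   = n * np.sum((rater_means - grand_mean) ** 2)
ss_total    = np.sum((X - grand_mean) ** 2)
ss_error    = ss_total - ss_subjects - ss_raters

bms = ss_subjects / (n - 1)            # between-subjects mean square
rms = ss_raters / (k - 1)              # between-raters mean square
ems = ss_error / ((n - 1) * (k - 1))   # residual (error) mean square

# Model 2: raters treated as a random effect
icc_2_1 = (bms - ems) / (bms + (k - 1) * ems + k * (rms - ems) / n)
icc_2_k = (bms - ems) / (bms + (rms - ems) / n)

# Model 3: raters treated as a fixed effect
icc_3_1 = (bms - ems) / (bms + (k - 1) * ems)
icc_3_k = (bms - ems) / bms

print(f"ICC(2,1) = {icc_2_1:.3f}, ICC(2,{k}) = {icc_2_k:.3f}")
print(f"ICC(3,1) = {icc_3_1:.3f}, ICC(3,{k}) = {icc_3_k:.3f}")
```

As noted in the text, the ANOVA itself is numerically identical for the two models; only the choice of formula and the interpretation differ.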
SPSS,‡ a commonly used software package, will generate the various forms of the ICC as part of its Reliability Analysis (under SCALE).9,10 SAS,§ another commonly used program, does not provide direct calculations, but a programming macro has been developed.11 Online calculators can also be found to provide ICC values based on raw data.12,13 Calculations by hand are straightforward once the analysis of variance is performed.
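For readers who work in Python rather than SPSS or SAS, the third-party pingouin package provides an intraclass_corr function that reports all six forms (with confidence intervals) from long-format data. This is an assumption about available tooling rather than something cited in the text; a minimal sketch:

```python
import pandas as pd
import pingouin as pg  # third-party statistics package (not referenced in the text)

# Long-format data: one row per subject-rater pair (illustrative values only)
df = pd.DataFrame({
    "subject": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4],
    "rater":   ["A", "B", "C"] * 4,
    "score":   [9, 10, 6, 7, 8, 5, 4, 5, 8, 6, 6, 7],
})

# Returns a table covering the six forms: ICC1, ICC2, ICC3 and their
# average-measures counterparts ICC1k, ICC2k, ICC3k
icc_table = pg.intraclass_corr(data=df, targets="subject",
                               raters="rater", ratings="score")
print(icc_table)
```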
Table 26.3B shows the SPSS output∗∗ for model 1, and Table 26.5 shows the output for models 2 and 3. Each model is generated in two forms, for single measures and average measures. Confidence intervals are also provided. The researcher must decide which value to use, based on the design of the study.
Interpretation of the ICC
As with other reliability coefficients, there are no standard values for acceptable reliability using the ICC. The ICC ranges between 0.00 and 1.00, with values closer to 1.00 representing stronger reliability. But because reliability is a characteristic of measurement obtained to varying degrees (although rarely to perfection), the researcher must determine "how much" reliability is needed to justify the use of a particular tool. The nature of the measured variable will be a factor, in terms of its stability and the precision required to make sound clinical judgments about it. As a general guideline, we suggest that values above .75 are indicative of good reliability, and those below .75 are indicative of poor to moderate reliability. For many clinical measurements, reliability should exceed .90 to ensure reasonable validity. These are only guidelines, however, and should not be used as absolute standards. Researchers and clinicians must defend their judgments within the context of the specific scores being assessed and the degree of acceptable precision in the measurement.
When the ICC is high, it is easy to say that reliability is good, and to express confidence in the obtained measurements. When reliability is less than satisfactory, however, the researcher is obliged to sort through alternative explanations to determine the contributing sources of error. There are two major reasons for finding low ICC values.
The first explanation is fairly obvious: The raters (or ratings) do not agree. This is not a straightforward interpretation, however, when more than two raters are analyzed. Because the ICC is an average based on variance across all raters, nonagreement may involve all raters, some raters, or only one rater. The ICC can be considered an average correlation across raters and, therefore, does not represent the reliability of any individual rater. For instance, a critical look back at the data in Table 26.2 reveals that rater 3 seems to be the most out of line with the other raters. In fact, if we obtain the product-moment correlations for all possible pairs of ratings, we find that raters 1, 2 and 4 demonstrate correlations between .96 and .98, whereas the correlations of rater 3 with the other three raters are all negative and small, between −.06 and −.19 (Figure 26.1). The ICC is brought down by the "unreliable" responses of rater 3.
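One way to carry out this kind of inspection is to compute the product-moment correlation for every pair of raters and look for the rater whose correlations stand apart. The sketch below uses made-up ratings rather than the actual Table 26.2 data, so the resulting values will not match those quoted above:

```python
import numpy as np

# Hypothetical ratings: rows = subjects, columns = raters 1-4 (illustrative only)
X = np.array([
    [9.0, 10.0,  6.0,  9.0],
    [7.0,  8.0,  5.0,  8.0],
    [4.0,  5.0,  8.0,  5.0],
    [6.0,  6.0,  7.0,  6.0],
    [8.0,  9.0,  4.0,  9.0],
])

# Correlation matrix across raters; rowvar=False treats each column as a variable
r_matrix = np.corrcoef(X, rowvar=False)

# Print every pairwise correlation so an out-of-line rater is easy to spot
for i in range(r_matrix.shape[0]):
    for j in range(i + 1, r_matrix.shape[1]):
        print(f"raters {i + 1} and {j + 1}: r = {r_matrix[i, j]:.2f}")
```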
It is often useful, therefore, to examine the data, to determine if there is an interaction between raters and subjects; that is, are the scores dependent on what "level" of rater is doing the measuring? This type of interaction is reflected in the error variance of the repeated measures ANOVA.
When raters are reliable, there should be no interaction between raters and subjects; that is, the error variance should be small. It may be helpful to graph the results, as shown in Figure 26.1. The ratings obtained by raters 1, 2 and 4 are close and fairly parallel. The scores obtained by rater 3 are clearly incongruent. By examining both the intercorrelations and graphic evidence, we can determine that there is an interaction between rater and subject. It would be important, then, to review the circumstances of the third rater's tests, to determine why that person's ratings were not consistent with the others.
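A plot of this kind can be produced with a few lines of code (again using made-up ratings; Figure 26.1 itself is not reproduced here). Roughly parallel lines suggest agreement, whereas crossing lines suggest a rater-by-subject interaction:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical ratings: rows = subjects, columns = raters (illustrative only)
X = np.array([
    [9.0, 10.0,  6.0,  9.0],
    [7.0,  8.0,  5.0,  8.0],
    [4.0,  5.0,  8.0,  5.0],
    [6.0,  6.0,  7.0,  6.0],
    [8.0,  9.0,  4.0,  9.0],
])
subjects = np.arange(1, X.shape[0] + 1)

# One line per rater, plotted across subjects
for rater in range(X.shape[1]):
    plt.plot(subjects, X[:, rater], marker="o", label=f"Rater {rater + 1}")

plt.xlabel("Subject")
plt.ylabel("Rating")
plt.xticks(subjects)
plt.legend()
plt.show()
```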
A second reason for a low ICC is one that has been discussed before in relation to the reliability coefficient; that is, the variability among subjects' scores must be large to demonstrate reliability. A lack of variability can occur when samples are homogeneous, when raters are all very lenient or strict in their scoring, or when the rating system falls within a restricted range. This effect can be checked by looking for significance of the between-subjects variance in the analysis of variance (Table 26.5➊). If subjects' scores are homogeneous, this source of variance will not be significant. It has been shown that when the between-subjects variance is not significant, the actual limits of the ICC do not match the theoretical limits of 0.00 and 1.00.14 In fact, it is possible for ratios to range from negative to positive infinity. When a negative ICC is obtained, the value cannot be considered valid. Therefore, it is imperative that researchers be aware of the extent to which scores will naturally vary, and try to obtain heterogeneous samples whenever possible.
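The effect of a homogeneous sample can be seen directly in the formulas: when subjects barely differ, the between-subjects mean square can fall below the error mean square, and the computed ratio turns negative. A small sketch with made-up, nearly identical subject scores (using the model 1 single-measure formula for simplicity) illustrates the point:

```python
import numpy as np

# Hypothetical ratings for a very homogeneous sample (illustrative only):
# subjects differ little, so rating "noise" dominates the variance
X = np.array([
    [5.0, 5.1, 4.9],
    [5.0, 4.8, 5.2],
    [5.1, 5.0, 4.8],
    [4.9, 5.2, 5.0],
])
n, k = X.shape
grand_mean = X.mean()
subject_means = X.mean(axis=1)

bms = k * np.sum((subject_means - grand_mean) ** 2) / (n - 1)  # between-subjects MS
wms = np.sum((X - subject_means[:, None]) ** 2) / (n * (k - 1))  # error MS

# BMS < WMS here, so the ratio comes out negative and is not a valid reliability value
icc_1_1 = (bms - wms) / (bms + (k - 1) * wms)
print(f"BMS = {bms:.4f}, WMS = {wms:.4f}, ICC(1,1) = {icc_1_1:.3f}")
```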
Although we have presented multiple values of the ICC for our example, it should be clear that only one type will be appropriate for any one study. The selection of one version should be made before data are collected, based on appropriate design considerations. In most instances, model 2 or 3 will be the appropriate choice. In some research situations, the investigator is interested in establishing the intrarater or interrater reliability of a group of clinicians for one specific data collection experience, fitting model 3. In that situation, it is of no interest if anyone else can perform the measurements with equal reliability. If, however, it is important to demonstrate that a particular measuring tool can be used with confidence by all equally trained clinicians, then model 2 should be used. This approach is appropriate for clinical studies and methodological research, to document that a measuring tool has broad application.
Model 1 is applicable in only limited circumstances. For example, Maher et al15 performed a study to determine the interrater reliability of 25 raters who assessed the quality of published randomized controlled trials (RCTs) using the PEDro scale (see Chapter 16). The study involved a total of 120 articles, but each of the 25 raters rated from 1 to 56 RCTs. This fits the design for model 1, where the subjects (here, the rated articles) are not all assessed by the same raters. When all raters assess all subjects, model 1 is not appropriate. Some authors have expressed a preference for using model 1 because it provides a more conservative estimate of reliability than the other models;16 however, the conservative or liberal nature of a statistic is not an adequate rationale for its use if the model is unsuitable for the design.7,17
Generally, for the same set of data, model 1 will yield smaller values than model 2, and model 2 will yield smaller values than model 3. Likewise, within each model, the ICC based on single ratings will yield a lower correlation than one based on mean ratings (see Tables 26.3 and 26.5). Because of these potential differences, the type of ICC used in a particular study should always be reported.17