In Chapter 5 we introduced basic concepts of reliability and described how different forms of reliability can be addressed in the planning of research protocols. The purpose of this chapter is to expand on these concepts by presenting the statistical bases for estimates of reliability, including measures of correlation, agreement, internal consistency, response stability and method comparison for alternate forms. We have waited until this point in the book to present these procedures because they require application of statistical concepts that have been covered in the preceding chapters.

Recall from Chapter 5 that classical reliability theory partitions an observed measurement or score, *X*, into two components: a *true component, T*, which represents the real value under ideal and infallible conditions, and an *error component, E*, which includes all other sources of variance that influence the outcome of measurement. This theoretical relationship is expressed in the equation

*X* = *T* + *E*  (26.1)
We can also examine the statistical nature of this relationship by restating it in terms of *variance* (*s*^{2}). The total variance within a set of observed scores (*s*^{2}_{X}) is a function of both the **true variance** between scores (*s*^{2}_{T}) and the variance in the errors of measurement, or **error variance** (*s*^{2}_{E}):

*s*^{2}_{X} = *s*^{2}_{T} + *s*^{2}_{E}  (26.2)

Although it is an unknown quantity, we assume that *s*^{2}_{T} is fixed, because true scores will theoretically remain constant. Therefore, in a set of perfectly reliable scores, all observed differences between individual scores should be attributable to true differences between scores; that is, there is no error variance. Conversely, if we look at a set of repeated measurements from one person, and assume that the true response has not changed, then all observed variance should be the result of error. The essence of reliability, then, is based on the amount of error that is present in a set of scores. A measurement is considered more reliable if a greater proportion of the total observed variance is represented by the true score variance. Thus, reliability is defined by the ratio:

reliability = true variance / (true variance + error variance)

In statistical terminology, this relationship can be expressed as

*r*_{XX} = *s*^{2}_{T} / *s*^{2}_{X}  (26.3)

where *r*_{XX} is the symbol for a **reliability coefficient.**

The coefficient of reliability can take values from 0.00 to 1.00. Zero reliability indicates that all measurement variation is attributed to error. Reliability of 1.00 means that the measurement has no error, or *s*^{2}_{E} = 0. As the coefficient nears 1.00, we are more confident that the observed score is representative of the true score.

To illustrate this application, consider the set of hypothetical data presented in Table 26.1A. These values represent ratings for six patients on a subjective pain scale, rated from 0 to 20. The first column, labeled *X*, lists the observed scores and their variance, *s*^{2}_{X} = 5.60; the second column, *T*, shows the true scores (although in reality these are not known) and their variance, *s*^{2}_{T} = 2.40; the last column, labeled *E*, shows the error component (the difference between the observed and true scores) and the error variance, *s*^{2}_{E} = 3.20. We can verify that the observed variance is composed of true variance and error variance: 5.60 = 2.40 + 3.20. These values can be used to calculate the reliability coefficient as follows:

*r*_{XX} = *s*^{2}_{T} / *s*^{2}_{X} = 2.40 / 5.60 = .43

Conceptually, this means that 43% of the variation in the observed scores can be attributed to variation in the true score, and the rest, 57%, is attributable to measurement error.

Of course, this approach is completely theoretical, as we can never actually know the true score or error component within a set of data. Therefore, it is necessary to use observed scores to estimate reliability. Although the procedures for obtaining these estimates will vary, the theory underlying the reliability coefficient is universally applicable; that is, reliability is a function of the amount of error variance in a set of data.

As reflected in the definition of reliability, statistical variance is the basis for reliability estimates. We can demonstrate that as the true variance in a set of scores decreases, the reliability coefficient will also decrease. If we look at the differences among the pain scores in Table 26.1A, we can see that the patients did not vary greatly from one another. True scores were in a narrow range from 8 to 14. Consequently, the variance within the observed scores is small. We also find that the differences between the observed and true scores (errors) are minimal across the six patients. Based on these observations, we might reason that these measurements should be highly reliable; however, we obtain a reliability coefficient of only .43, much lower than might be expected.

Now let us look at a similar set of hypothetical data for the same variable, shown in Table 26.1B. Note that the error components for these scores are identical to those in the first data set. This time, however, the true scores are much more variable (*s*^{2}_{T} = 16.00), with values ranging from 4 to 15. Therefore, the observed scores also exhibit a much higher variance (*s*^{2}_{X} = 19.20). Using these values, we can calculate a second reliability coefficient:

*r*_{XX} = 16.00 / 19.20 = .83

These data demonstrate a much stronger degree of statistical reliability than the first data set, even though the degree of error in the scores is the same! Why does this occur? Recall that reliability is based on the *proportion of the total observed variance that is attributable to error.* Therefore, for a given amount of error variance, it follows that reliability will improve as the total variance increases; that is, as the total variance gets larger, the error component will account for a smaller proportion of it.
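Nothing beyond the variance components is needed to reproduce these two coefficients. The following is a minimal Python sketch (not part of the original text; the function name is ours):

```python
def reliability(s2_true, s2_error):
    """Classical reliability: true variance as a proportion of total variance."""
    return s2_true / (s2_true + s2_error)

# Table 26.1A: s2_T = 2.40, s2_E = 3.20, so s2_X = 5.60
r_a = reliability(2.40, 3.20)

# Table 26.1B: same error variance, but s2_T = 16.00, so s2_X = 19.20
r_b = reliability(16.00, 3.20)

print(round(r_a, 2), round(r_b, 2))  # 0.43 0.83
```

For a fixed amount of error variance, the coefficient rises with the spread of true scores, which is exactly the pattern shown by the two tables.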

This concept is crucial in the interpretation of reliability coefficients and in the design of reliability studies. Suppose we were interested in establishing the reliability of a new device for measuring range of back extension. We gather a large sample of "normal" individuals, all with measurements between 20 and 25 degrees of extension. Even if we are fairly consistent over successive trials, the reliability coefficient will probably be low because the total variance is so small. A low reliability coefficient can be misleading under such conditions. The solution to this problem, of course, is to include subjects that have a wider range of scores in a reliability study. We should be studying normal individuals as well as patients with hypermobility and hypomobility in back extension. Researchers should always consider the range of scores used for estimating reliability in the interpretation of reliability coefficients.

The historical approach to testing reliability involved the use of correlation coefficients. In Chapter 5 we discussed the problems with this approach, in that it does not provide a measure of agreement, but only covariance (see Figure 5.1 in Chapter 5). Correlations are also limited as reliability coefficients because they are bivariate; that is, only two ratings or raters can be correlated at one time. It is not possible to assess the simultaneous reliability of more than two raters or the relationships among different aspects of reliability, such as raters, test forms, and testing occasions. As these are often important elements in reliability testing, correlation does not provide an efficient mechanism for evaluating the full scope of reliability.

Another objection to the use of correlation as a measure of reliability is based on the statistical definition of reliability; that is, correlation cannot separate out variance components due to error or true differences in a data set. Therefore, the correlation coefficient is not a true reliability coefficient. It is actually more accurate to use the square of the correlation coefficient (the coefficient of determination) for this purpose, because *r ^{2}* reflects how much variance in one measurement is accounted for by the variance in a second measurement (see Chapter 24). This is analogous to asking how much of the total variance in a set of data is shared by two measurements (the "true" variance) and how much is not shared (the error variance). If we could correlate true scores with observed scores in a set of data, the square of the correlation coefficient would be the reliability coefficient. We can confirm this interpretation using the data from Table 26.1A. For the correlation between observed and true scores,

*r* = .66. Therefore, *r*^{2} = .43.
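The claim that the squared correlation between observed and true scores equals the reliability coefficient can also be checked by simulation. The sketch below is our own illustration with arbitrary variances: it draws independent true scores and errors, so *r*^{2} and *s*^{2}_{T}/*s*^{2}_{X} should agree up to sampling noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true scores and independent measurement errors
T = rng.normal(10.0, 2.0, size=100_000)  # s2_T = 4.00
E = rng.normal(0.0, 1.5, size=100_000)   # s2_E = 2.25
X = T + E                                # observed scores, s2_X = 6.25

r = np.corrcoef(X, T)[0, 1]   # correlation of observed with true scores
rel = T.var() / X.var()       # reliability: s2_T / s2_X

# Both should be near the theoretical value 4.00 / 6.25 = .64
print(round(r**2, 3), round(rel, 3))
```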

To overcome the limitations of correlation as a measure of reliability, some researchers have used more than one reliability index within a single study. For instance, in a test-retest situation or a rater reliability study, both correlation and a *t*-test can be performed to assess consistency and average agreement between the data sets. This strategy does address the interpretation of agreement, but its usefulness is limited in that it does not provide a single index to describe reliability. The scores may be correlated but significantly different (as in Table 26.1B), or they may be poorly correlated but not significantly different. How should these results be interpreted? It is much more desirable to use one index that can answer this question.

The **intraclass correlation coefficient (ICC)** is such an index. Like other reliability coefficients, the ICC ranges from 0.00 to 1.00. It is calculated using variance estimates obtained through an analysis of variance. Therefore, it reflects both degree of correspondence and agreement among ratings.

Statistically the ICC has several advantages. First, it can be used to assess reliability among two or more ratings, giving it broad clinical applicability. Second, the ICC does not require the same number of raters for each subject, allowing for flexibility in clinical studies.^{1} Third, although it is designed primarily for use with interval/ratio data, the ICC can be applied without distortion to data on the ordinal scale when intervals between such measurements are assumed to be equivalent.^{2} In addition, with data that are rated as a dichotomy (the presence or absence of a trait), the ICC has been shown to be equivalent to measures of nominal agreement, simplifying computation in cases where more than two raters are involved.^{1,3} Therefore, the ICC provides a useful index in a variety of analysis situations.

Another major advantage of the ICC is that it supports the **generalizability** model proposed by Cronbach as a comprehensive estimate of reliability.^{4,5} The concept of generalizability theory, introduced in Chapter 5, is based on the idea that differences between observed scores are due to a variety of factors, not just true score variance and random error. Differences occur because of variations in the measurement system, such as the characteristics of raters or subjects, testing conditions, alternate forms of a test, administrations of a test on different occasions and so on. These factors are called **facets** of generalizability.

The essence of generalizability theory is that facets contribute to measurement error as separate components of variance, distinguishable from random error. In classical reliability theory, error variance is undifferentiated, incorporating all sources of measurement error. In generalizability theory, however, the error variance is multivariate; that is, it is further partitioned to account for the influence of specific facets on measurement error. Therefore, the **generalizability coefficient** (the ICC) is an extension of the reliability coefficient:

ICC = *s*^{2}_{T} / (*s*^{2}_{T} + *s*^{2}_{F} + *s*^{2}_{E})  (26.4)

where *s*^{2}_{T} and *s*^{2}_{E} are the variances in true scores and error components, and *s*^{2}_{F} is the variance attributable to the facets of interest.^{6} The specific facets included in the denominator will vary, depending on whether rater, occasions or some other facet is the variable of interest in the reliability study. For example, if we include rater as a facet, then the total observed variance would be composed of the true variance between subjects, the variance between raters, and the remaining unexplained error variance.

Equation (26.4) represents a conceptual definition of generalizability. Actual calculations require the use of variance estimates that are obtained from an analysis of variance, which, of course, does not include direct estimates of true variance (as this is unknown). Theoretically, however, we can estimate true score variance by looking at the difference between observed variance among subjects and error variance (*s*^{2}_{T} = *s*^{2}_{X} − *s*^{2}_{E}). These estimates can be derived from an analysis of variance.
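With the Table 26.1A variances, this estimate recovers the true score variance exactly (a simple arithmetic check, sketched in Python):

```python
# Variances from Table 26.1A
s2_X = 5.60  # observed variance
s2_E = 3.20  # error variance

s2_T = s2_X - s2_E   # estimated true score variance
r_xx = s2_T / s2_X   # reliability coefficient

print(round(s2_T, 2), round(r_xx, 2))  # 2.4 0.43
```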

There are actually six different equations for calculating the ICC, differentiated by purpose of the reliability study, the design of the study, and the type of measurements taken. It is necessary to distinguish among these approaches, as under some conditions the results can be decidedly different. To facilitate explanations, we will proceed with this discussion in the context of a reliability study with rater as the facet of interest; however, we emphasize that these applications are equally valid to study other facets.

Shrout and Fleiss describe three *models* of the ICC.^{7} They distinguish these models according to how the raters are chosen and assigned to subjects.

**Model 1.** In model 1, each subject is assessed by a different set of *k* raters. The raters are considered randomly chosen from a larger population of raters; that is, rater is a **random effect**. However, the raters for one subject are not necessarily the same raters that take measurements on another subject. Therefore, in this design there is no way to associate a particular rater with the variables being measured.^{8} The only variance that can actually be assessed is the difference among subjects. Other sources of error variance, including rater or measurement error, cannot be separated out.

**Model 2.** Model 2 is the most commonly applied model of the ICC for assessing inter-rater reliability. In this design, each subject is assessed by the same set of raters. The raters are randomly chosen; that is, they are expected to represent the population of raters from which they were drawn, and results can be generalized to other raters with similar characteristics. Subjects are also considered to be randomly chosen from the population of individuals who would receive the measurement. Therefore, subject and rater are both **random effects**. This randomness may be only theoretical in practice; that is, we choose subjects and raters who we believe represent the populations of interest, as we do not have access to the entire population. But the intent of the study is to demonstrate that the measurement reliability can be applied to others.

**Model 3.** In model 3, each subject is assessed by the same set of raters, but the raters represent the only raters of interest. In this case, there is no intention to generalize findings beyond the raters involved. In this design, rater is considered a **fixed effect** because the raters have been purposely (not randomly) selected. Subjects are still considered a random effect. Therefore, model 3 is a **mixed model**. This model is used when a researcher wants to establish that specific investigators are reliable in their data collection, but the reliability of others is not relevant. Model 3 is also the appropriate statistic to measure intrarater reliability, as the measurements of a single rater cannot be generalized to other raters.^{7}

Each of the ICC models can be expressed in two *forms*, depending on whether the scores are single ratings or mean ratings. Most often, reliability studies are based on comparison of scores from individual raters. There are times, however, when the mean of several raters or ratings may be used as the unit of reliability. For instance, when measurements are unstable, it may be necessary to use the mean of several measurements as the individual's score to obtain satisfactory reliability. Using mean scores has the effect of increasing reliability estimates, as means are considered better estimates of true scores, theoretically reducing error variance.
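The gain from averaging can be quantified with the classical Spearman–Brown prophecy formula, which predicts the reliability of a mean of *k* ratings from the reliability of a single rating. Below is a sketch with illustrative values (the starting reliability of .50 is arbitrary, chosen only to show the trend):

```python
def spearman_brown(r_single, k):
    """Predicted reliability of the mean of k ratings."""
    return k * r_single / (1 + (k - 1) * r_single)

# A single-rating reliability of .50 improves steadily as more
# ratings are averaged into each subject's score:
for k in (2, 4, 8):
    print(k, round(spearman_brown(0.50, k), 2))  # 2 0.67, 4 0.8, 8 0.89
```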

The six types of ICC are classified using two numbers in parentheses. The first number designates the *model* (1, 2, or 3), and the second number signifies the *form*, using either a single measurement (1) or the mean of several measurements (*k*)^{∗} as the unit of analysis. For example, when using single measurements in a generalizability study, we would specify use of ICC(2,1). The type of ICC used should always be indicated.

^{∗}The designation of *k* equals the number of scores used to obtain the mean.

The ICC is based on measures of variance obtained from an ANOVA. For an interrater reliability study, rater is the independent variable; for an intrarater study, trial is the independent variable. Table 26.2 shows the arrangement of hypothetical data with rater as columns, and subjects as rows. For an intrarater study, the columns would represent trials.

Subject | Rater 1 | Rater 2 | Rater 3 | Rater 4
---|---|---|---|---
1 | 7 | 8 | 3 | 5
2 | 2 | 4 | 4 | 1
3 | 1 | 2 | 6 | 1
4 | 5 | 5 | 7 | 2
5 | 8 | 9 | 5 | 6
6 | 9 | 10 | 6 | 7

For model 1, a one-way analysis of variance is run, with "subjects" as the independent variable. This ANOVA partitions the total variance into two parts: the variation between subjects and error, as shown in Table 26.3A. The between-subjects effect tells us if the subjects' scores are different from each other, which we expect. The error component represents the variation within a subject across raters. Some of this error will be due to true scores changing from trial to trial, some from rater error, and some will be unexplained. This ANOVA does not differentiate among these sources of error. Calculations for this model are shown in Table 26.3B using data from Table 26.2.
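For readers who want to verify the arithmetic, here is a sketch of the model 1 computation in Python, using the Shrout and Fleiss single-rating formula ICC(1,1) = (MSB − MSW)/[MSB + (k − 1)MSW] and the scores in Table 26.2:

```python
import numpy as np

# Ratings from Table 26.2: rows = subjects, columns = raters
scores = np.array([
    [7, 8, 3, 5],
    [2, 4, 4, 1],
    [1, 2, 6, 1],
    [5, 5, 7, 2],
    [8, 9, 5, 6],
    [9, 10, 6, 7],
], dtype=float)

n, k = scores.shape
grand = scores.mean()

# One-way ANOVA with subjects as the grouping variable
ss_between = k * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_within = ((scores - scores.mean(axis=1, keepdims=True)) ** 2).sum()
ms_b = ss_between / (n - 1)        # between-subjects mean square
ms_w = ss_within / (n * (k - 1))   # within-subjects (error) mean square

icc_1_1 = (ms_b - ms_w) / (ms_b + (k - 1) * ms_w)
print(round(icc_1_1, 2))  # 0.5
```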

For model 2, the ANOVA is performed as a two-way random effects model, in which both subjects and raters are considered to be randomly chosen from a larger population.^{†} Therefore, the results of the study can be generalized to other raters and other subjects. For model 3, a two-way mixed model is run, with rater as a fixed effect (not randomly chosen) and subjects as a random effect. The numerical results of the analysis will actually be the same for both random and mixed types of ANOVA. The only difference will lie in the interpretation of the data. The results of a repeated measures analysis of variance are shown in Table 26.4.

The repeated measures ANOVA partitions the variance into effects due to differences between subjects, differences between raters and error variance. The *F*-ratio associated with the rater effect reflects the difference among raters, or the extent of agreement or disagreement among them. This effect is significant when the variance due to raters is large, indicating that the raters' scores are different from each other and not reliable. In this example, the rater effect is not significant *(p* = .130). Table 26.4 shows the calculation of both forms for models 2 and 3, using data from Table 26.2.

^{†}Recall that in a repeated measures ANOVA, "subjects" is considered one of the variables, so that even with only one independent variable (in this case rater), the analysis is designated as "two-way."
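The corresponding single-rating forms for models 2 and 3 can be computed from the two-way ANOVA mean squares, again following the Shrout and Fleiss formulas applied to the Table 26.2 scores (a sketch, not the SPSS output itself):

```python
import numpy as np

# Ratings from Table 26.2: rows = subjects, columns = raters
scores = np.array([
    [7, 8, 3, 5],
    [2, 4, 4, 1],
    [1, 2, 6, 1],
    [5, 5, 7, 2],
    [8, 9, 5, 6],
    [9, 10, 6, 7],
], dtype=float)

n, k = scores.shape
grand = scores.mean()

# Two-way (repeated measures) ANOVA sums of squares
ss_subjects = k * ((scores.mean(axis=1) - grand) ** 2).sum()
ss_raters = n * ((scores.mean(axis=0) - grand) ** 2).sum()
ss_error = ((scores - grand) ** 2).sum() - ss_subjects - ss_raters

ms_r = ss_subjects / (n - 1)           # between subjects
ms_c = ss_raters / (k - 1)             # between raters
ms_e = ss_error / ((n - 1) * (k - 1))  # residual error

# Model 2 (random raters) and model 3 (fixed raters), single-rating forms
icc_2_1 = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)
icc_3_1 = (ms_r - ms_e) / (ms_r + (k - 1) * ms_e)
print(round(icc_2_1, 2), round(icc_3_1, 2))  # 0.51 0.56
```

Note that ICC(2,1) is slightly smaller than ICC(3,1) because the model 2 denominator also carries the between-rater variance.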

SPSS,^{‡} a commonly used software package, will generate the various forms of the ICC as part of its Reliability Analysis (under SCALE).^{9,10} SAS,^{§} another commonly used program, does not provide direct calculations, but a programming macro has been developed.^{11} Online calculators can also be found to provide ICC values based on raw data.^{12,13} Calculations by hand are straightforward once the analysis of variance is performed.

Table 26.3B shows the SPSS output^{∗∗} for model 1, and Table 26.5 shows the output for models 2 and 3. Each model is generated in two forms, for single measures and average measures. Confidence intervals are also provided. The researcher must decide which value to use, based on the design of the study.

^{‡}Statistical Package for the Social Sciences, SPSS Inc., 233 S. Wacker Drive, Chicago, Illinois 60606.

^{§}SAS Institute Inc., 100 SAS Campus Drive, Cary, NC 27513.

^{∗∗}To generate ICC values in SPSS, go to SCALE > RELIABILITY ANALYSIS. Include all levels of the independent variable (raters or ratings) in the "Items" box. Click on "Statistics," and choose "Intraclass Correlation Coefficient." Choose a model from the dropdown menu (One-Way Analysis of Variance for Model 1, Two-Way Random for Model 2, or Two-Way Mixed for Model 3). Choose a Type from the dropdown menu, Absolute Agreement for Model 2 or Consistency for Model 3. An analysis of variance can be generated by checking "F test" under ANOVA. These instructions are based on versions 10.0 to 14.2 of SPSS.

Like other forms of reliability, there are no standard values for acceptable reliability using the ICC. The ICC ranges between 0.00 and 1.00, with values closer to 1.00 representing stronger reliability. But because reliability is a characteristic of measurement obtained to varying degrees (although rarely to perfection), the researcher must determine "how much" reliability is needed to justify the use of a particular tool. The nature of the measured variable will be a factor, in terms of its stability and the precision required to make sound clinical judgments about it. As a general guideline, we suggest that values above .75 are indicative of good reliability, and those below .75 poor to moderate reliability. For many clinical measurements, reliability should exceed .90 to ensure reasonable validity. **These are only guidelines, however, and should not be used as absolute standards. Researchers and clinicians must defend their judgments within the context of the specific scores being assessed and the degree of acceptable precision in the measurement**.

When the ICC is high, it is easy to say that reliability is good, and to express confidence in the obtained measurements. When reliability is less than satisfactory, however, the researcher is obliged to sort through alternative explanations to determine the contributing sources of error. There are two major reasons for finding low ICC values.

The first explanation is fairly obvious: The raters (or ratings) do not agree. This is not a straightforward interpretation, however, when more than two raters are analyzed. Because the ICC is an average based on variance across all raters, nonagreement may involve all raters, some raters, or only one rater. The ICC can be considered an *average correlation* across raters and, therefore, does not represent the reliability of any individual rater. For instance, a critical look back at the data in Table 26.2 reveals that rater 3 seems to be the most out of line with the other raters. In fact, if we obtain the product-moment correlations for all possible pairs of ratings, we find that raters 1, 2 and 4 demonstrate correlations between .96 and .98, whereas the correlations of rater 3 with the other three raters are all negative and small, between −.06 and −.19 (Figure 26.1). The ICC is brought down by the "unreliable" responses of rater 3.
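These pairwise correlations are easy to reproduce from the Table 26.2 scores (a Python sketch; `np.corrcoef` returns the full matrix of product-moment correlations):

```python
import numpy as np

# Ratings from Table 26.2: one row per subject, one column per rater
scores = np.array([
    [7, 8, 3, 5],
    [2, 4, 4, 1],
    [1, 2, 6, 1],
    [5, 5, 7, 2],
    [8, 9, 5, 6],
    [9, 10, 6, 7],
], dtype=float)

r = np.corrcoef(scores.T)  # 4 x 4 matrix of pairwise rater correlations

for i in range(4):
    for j in range(i + 1, 4):
        print(f"rater {i + 1} vs rater {j + 1}: {r[i, j]:+.2f}")
```

Raters 1, 2, and 4 correlate in the high .90s with one another, while every correlation involving rater 3 is small and negative.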

It is often useful, therefore, to examine the data, to determine if there is an interaction between raters and subjects; that is, are the scores dependent on what "level" of rater is doing the measuring? This type of interaction is reflected in the error variance of the repeated measures ANOVA.

When raters are reliable, there should be no interaction between raters and subjects; that is, the error variance should be small. It may be helpful to graph the results, as shown in Figure 26.1. The ratings obtained by raters 1, 2 and 4 are close and fairly parallel. The scores obtained by rater 3 are clearly incongruent. By examining both the intercorrelations and graphic evidence, we can determine that there is an interaction between rater and subject. It would be important, then, to review the circumstances of the third rater's tests, to determine why that person's ratings were not consistent with the others.

A second reason for a low ICC is one that has been discussed before in relation to the reliability coefficient; that is, the variability among subjects' scores must be large to demonstrate reliability. A lack of variability can occur when samples are homogeneous, when raters are all very lenient or strict in their scoring, or when the rating system falls within a restricted range. This effect can be checked by looking for significance of the between-subjects variance in the analysis of variance (Table 26.5➊). If subjects' scores are homogeneous, this source of variance will not be significant. It has been shown that when the between-subjects variance is not significant, the actual limits of the ICC do not match the theoretical limits of 0.00 and 1.00.^{14} In fact, it is possible for ratios to range from negative to positive infinity. When a negative ICC is obtained, the value cannot be considered valid. Therefore, it is imperative that researchers be aware of the extent to which scores will naturally vary, and try to obtain heterogeneous samples whenever possible.

Although we have presented multiple values of the ICC for our example, it should be clear that only one type will be appropriate for any one study. The selection of one version should be made before data are collected, based on appropriate design considerations. In most instances, model 2 or 3 will be the appropriate choice. In some research situations, the investigator is interested in establishing the intrarater or interrater reliability of a group of clinicians for one specific data collection experience, fitting model 3. In that situation, it is of no interest if anyone else can perform the measurements with equal reliability. If, however, it is important to demonstrate that a particular measuring tool can be used with confidence by all equally trained clinicians, then model 2 should be used. This approach is appropriate for clinical studies and methodological research, to document that a measuring tool has broad application.

Model 1 is applicable in only limited circumstances. For example, Maher et al^{15} performed a study to determine the interrater reliability of 25 raters who assessed the quality of published randomized controlled trials (RCTs) using the PEDro scale (see Chapter 16). The study involved a total of 120 articles, but each of the 25 raters rated from 1 to 56 RCTs. This fits the design for model 1, where subjects (in this example, the articles) are not all assessed by the same raters. When all raters assess all subjects, model 1 is not appropriate. Some authors have expressed a preference for using model 1 because it provides a more conservative estimate of reliability than the other models;^{16} however, the conservative or liberal nature of a statistic is not an adequate rationale for its use if the model is unsuitable for the design.^{7,17}

Generally, for the same set of data, model 1 will yield smaller values than model 2, and model 2 will yield smaller values than model 3. Likewise, within each model, the ICC based on single ratings will yield a lower correlation than one based on mean ratings (see Tables 26.3 and 26.5). Because of these potential differences, the type of ICC used in a particular study should always be reported.^{17}

When the unit of measurement is on a categorical scale, reliability is appropriately assessed as a measure of agreement. The simplest index of agreement is **percent agreement**. This is a measure of how often raters agree on scores given to individual subjects (or how often test-retest scores agree). The *coefficient of agreement* represents the total proportion of observations (*P*_{O}) on which there is agreement, or

*P*_{O} = Σ*f*_{O} / *N*

where Σ*f*_{O} is the sum of the *frequencies of observed agreements*, and *N* is the number of pairs of scores that were obtained.

For example, suppose two clinicians wanted to establish their interrater reliability for evaluating level of function for self-care on a 3-point scale. They evaluate 100 patients to determine if they are independent (IND), need some assistance (ASST) or are dependent (DEP). We can summarize these data to show agreements by arranging them in an *agreement matrix*, or frequency table, as shown in Table 26.6A. The quantities along the diagonal represent the number of times both raters agreed on their ratings (*f*_{O}). (Ignore values in parentheses for now.) All values off the diagonal represent disagreements. For instance, both raters agreed on ratings of IND for 25 subjects, they agreed on ratings of ASST for 24 subjects, and they agreed on ratings of DEP for 17 subjects. They did not agree on 34 subjects. Of 100 possible agreements, 66 were achieved. Therefore, *P*_{O} = 66/100 = .66. The two clinicians agreed on their ratings 66% of the time. This value is fair, relative to potential perfect agreement of 100%.
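Percent agreement requires only the diagonal frequencies and the number of rated subjects (a trivial but error-prone hand calculation, sketched in Python):

```python
# Observed agreements from the diagonal of Table 26.6A
f_observed = {"IND": 25, "ASST": 24, "DEP": 17}
n = 100  # number of pairs of ratings

p_o = sum(f_observed.values()) / n
print(p_o)  # 0.66
```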

There is a limitation to this interpretation, however. To determine the true reliability of categorical assignment, we must consider the possibility that some portion of the results could have occurred by chance; that is, if two raters were to assign subjects to categories completely at random, some degree of agreement would still be expected. Because of this tendency, percent agreement will often be an overestimate of true reliability. Therefore, a measure is needed that will discount the proportion of agreement that is potentially due to chance alone.

The **kappa** statistic, κ, is a *chance-corrected* measure of agreement. In addition to looking at the proportion of observed agreements (*P*_{O}), kappa also considers the proportion of agreements expected by chance (*P*_{C}):

*P*_{C} = Σ*f*_{C} / *N*

where Σ*f*_{C} is the sum of the frequencies of agreement expected by chance.

We can illustrate this application using the frequency data for functional assessment shown in Table 26.6. The number of expected chance agreements for each cell along the diagonal is calculated by multiplying the corresponding row and column margin totals, and dividing by the total number of possible agreements, or *N*.^{††} These values are shown in parentheses. For example, for agreements on IND, the row total is 37, and the column total is 42. With *N* = 100, we determine that chance agreements on IND can be expected (37 × 42)/100 = 15.54 times. Similarly, we expect both raters to come up with ratings of ASST (34 × 30)/100 = 10.20 times by chance. The expected frequency for DEP is (29 × 28)/100 = 8.12 times. Therefore, the total number of expected chance frequencies, Σ*f*c, is 33.86. The proportion of agreement expected by chance for the entire sample is 33.86/100 = 0.34. This tells us that even if these two raters had no common grading criteria, we could expect agreement between them 34% of the time.

Thus, the proportion of observations that can be attributed to reliable measurement is defined by *P*_{O} − *P*_{C}, the proportion of observed agreements less the contribution of chance. The maximum possible nonchance agreements would be 1 − *P*_{C}, or 100% less the contribution of chance. Kappa represents percent agreement based on these correction factors,

κ = (*P*_{O} − *P*_{C}) / (1 − *P*_{C})

which is a ratio of the proportion of observed nonchance agreements to the proportion of possible nonchance agreements.

For the functional assessment data, we know the proportion of observed agreement, *P*_{O} = .66. When we calculate chance observations we note that *P*_{C} = .3386, or approximately 34%. Therefore, to account for the fact that 34% of the agreement could have occurred by chance, we correct our original estimate using the formula for kappa:

κ = (.66 − .3386) / (1 − .3386) = .49

This indicates a lower level of agreement than the 66% obtained using percent agreement. With the effects of chance eliminated, agreement is rated at 49%. This corrected percentage is a more meaningful interpretation of reliability estimates for categorical assignments.
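The full chance correction needs only the marginal totals of the agreement matrix in addition to the diagonal (a sketch of the steps above, with values from Table 26.6A):

```python
# Table 26.6A margins: rows for one rater, columns for the other
row_totals = [37, 34, 29]   # IND, ASST, DEP
col_totals = [42, 30, 28]
diagonal = [25, 24, 17]     # observed agreements
n = 100

p_o = sum(diagonal) / n
# Expected chance agreement per category: (row total x column total) / n
f_c = [r * c / n for r, c in zip(row_totals, col_totals)]
p_c = sum(f_c) / n

kappa = (p_o - p_c) / (1 - p_c)
print(round(p_c, 4), round(kappa, 2))  # 0.3386 0.49
```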

As shown in Table 26.6B, kappa can also be expressed in terms of frequencies to facilitate computation:

*κ* = (Σ*f*_{O} − Σ*f*_{C})/(*N* − Σ*f*_{C})

For all practical purposes, the lower and upper limits of kappa are 0.00 and +1.00.^{18} Kappa will be zero if *P*_{O} = *P*_{C} where agreement equals chance, and positive if *P*_{O} > *P*_{C}, where agreement is better than chance. With perfect agreement, all cells off the diagonal will equal zero; therefore, *P*_{O} = 1.00 and *κ* = 1.00. Kappa can be negative if agreement is worse than chance (*P*_{O} < *P*_{C}), although this is not a likely outcome in clinical reliability studies.
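The chance-correction arithmetic above can be sketched in Python. Note that the 3 × 3 frequency matrix below is hypothetical: it matches the margins (rows 37, 34, 29; columns 42, 30, 28) and the 66 total agreements described in the text, but the exact cell-by-cell breakdown of Table 26.6 is not reproduced here.

```python
# Cohen's kappa from a square frequency table (rows: rater 1; columns: rater 2).
# Hypothetical cell counts consistent with the margins and diagonal in the text.
table = [
    [25, 4, 8],    # rater 1: IND
    [7, 24, 3],    # rater 1: ASST
    [10, 2, 17],   # rater 1: DEP
]                  # rater 2 columns: IND, ASST, DEP

def cohens_kappa(table):
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # observed agreement: proportion of counts on the diagonal
    p_o = sum(table[i][i] for i in range(len(table))) / n
    # chance agreement: sum of (row total x column total) / N^2
    p_c = sum(r * c for r, c in zip(row_totals, col_totals)) / n**2
    return (p_o - p_c) / (1 - p_c), p_o, p_c

kappa, p_o, p_c = cohens_kappa(table)
```

With these frequencies, *P*_{O} = .66 and *P*_{C} = .34, and kappa works out to approximately .49, matching the values computed in the text.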

^{††}This procedure is identical to calculation of expected frequencies for the *χ*^{2} test. See Chapter 25 for a fuller discussion of this procedure.

For some applications, kappa is limited in that it does not differentiate among disagreements. Because it is calculated using only the frequencies along the agreement diagonal, kappa assumes that all disagreements (off the diagonal) are of equal seriousness. There may be instances, however, when a researcher wants to assign greater weight to some disagreements than others, to account for differential risks. For example, Jarvik et al^{19} looked at the reliability of classifying disk herniations in patients with lumbar disk disease. They hypothesized that some misclassifications would be more serious than others, if misjudgments were made for those with protruded or extruded disks. In another clinical study, Cooperman et al^{20} examined the reliability of a test for ligamentous stability at the knee, graded 0, +1, +2 or +3, with higher grades indicating less stability. They considered a disagreement between ratings of 0 and +3 to be more serious than a disagreement between 0 and +1 for diagnostic purposes and subsequent treatment decisions. When disagreements can be differentiated in this way, a modified version of the kappa statistic, called **weighted kappa**, κ_{w}, can be used to estimate reliability.^{21}

Weighted kappa allows the researcher to specify differential weights for disagreement cells in the agreement matrix. Kappa is actually a special case of weighted kappa, in which all cells along the agreement diagonal are given weights of zero and all disagreements are weighted equally. By assigning different weights to the off-diagonal cells, weighted kappa essentially counts some disagreements more heavily than others.

We can illustrate this procedure once again using the functional assessment data. These data showed 66% observed agreement. In terms of clinical implications, however, we might suggest that disagreements among these grades are not all of the same importance and that weighting them would provide a more practical estimate of reliability.

Cohen^{21} suggests that the assignment of weights is essentially a judgmental process. Therefore, there is no one set of weights that can be applied universally, and the value of κ_{W} will be sensitive to the choice of weights.^{22} Weights should conform to a hypothesis that defines the relative seriousness of the disagreements.

**Incremental Weights.** One approach is to look at a scale as an ordinal continuum with equal intervals, that is, an *incremental scale*.^{3} For example, using the functional evaluation scale described in Table 26.6, we might hypothesize that a disagreement between IND and DEP is twice as serious as a disagreement between ASST and DEP, with IND = 3, ASST = 2 and DEP = 1. If this hypothesis is reasonable, then weights for incremental disagreements can be determined using the formula

*w* = (*r*_{1} − *r*_{2})^{2}

where *w* is the assigned weight, and *r*_{1} and *r*_{2} are the scores assigned by rater 1 and rater 2 to that cell. Therefore, *r*_{1} − *r*_{2} represents the *deviation from agreement* for each cell in the agreement matrix. This type of weighting system is shown in Table 26.7A. For instance, a disagreement between IND (3) and ASST (2) would receive a weight of (3 − 2)^{2} = 1. The same weight would be assigned to disagreements between ASST and DEP; however, a disagreement between IND (3) and DEP (1) would receive a weight of (3 − 1)^{2} = 4. Weights of zero would automatically be assigned to all the agreement cells on the diagonal, indicating no disagreement.
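As a small illustration, an incremental weight matrix can be generated mechanically from the category scores (IND = 3, ASST = 2, DEP = 1); the function name here is my own.

```python
# Incremental disagreement weights, w = (r1 - r2)^2, for an ordinal scale
# scored IND = 3, ASST = 2, DEP = 1 (as in Table 26.7A).
scores = {"IND": 3, "ASST": 2, "DEP": 1}

def incremental_weights(scores):
    cats = list(scores)
    # weight for each (rater 1, rater 2) category pair
    return {(a, b): (scores[a] - scores[b]) ** 2 for a in cats for b in cats}

w = incremental_weights(scores)
```

This reproduces the pattern in the text: adjacent disagreements (IND-ASST, ASST-DEP) get weight 1, the two-step IND-DEP disagreement gets weight 4, and all agreement cells get weight 0.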

**Asymmetrical Weights.** In many situations, the evaluation of disagreements does not fit a uniform pattern. For instance, we might hypothesize that a disagreement between IND and DEP is more severe than a disagreement between ASST and DEP. We might also suggest that the *direction* of the disagreement is important; that is, assigning a grade of IND to a patient who needs assistance is a more serious error than assigning DEP to an independent patient. If a patient who is dependent is graded IND, he might be unsafe, left alone without adequate supervision. This is more serious as an evaluation error than unnecessarily supervising a strong patient. Suppose we test the validity of a clinician's assessment of function by comparing her ratings (rater 1) with those of an "expert" (rater 2), who acts as the criterion or reference standard. We would want to assign the highest weight to an error where rater 1 says IND and rater 2 says DEP. The next highest weight might go to the converse DEP-IND disagreement. Errors between IND and ASST might be the next most serious, followed by ASST-IND. Errors between ASST and DEP might be perceived as relatively unimportant, as with either rating the patient will receive some supervision. This creates an *asymmetrical pattern* of weights, varying with the direction of disagreement.

For this type of subjective judgment, Cohen^{21} suggests first choosing a weight to represent maximum disagreement and then setting the other weights accordingly. For example, as shown in Table 26.7B, we might choose weights of 6 for IND-DEP, 4 for DEP-IND, 3 for IND-ASST and 2 for ASST-IND disagreements. DEP-ASST errors would be considered least important, with a weight of 1. For convenience, weights of zero are still assigned to all agreements.

**Symmetrical Weights.** A third pattern of weights can be established when the direction of disagreement is unimportant. For instance, we might argue that any disagreement between IND and DEP is twice as serious as a disagreement between IND and ASST, and that a disagreement between ASST and DEP is only minimally important. To designate a set of weights that reflect this hypothesis, we might choose a weight of 6 to represent any disagreement between IND and DEP, a weight of 3 to represent a disagreement between IND and ASST, and a weight of 1 to represent the less important disagreement between ASST and DEP. These *symmetrical weights (w)* are shown in the center of each cell in Table 26.8A.

The weights that are assigned to each cell in the agreement matrix are used in the calculation of weighted kappa. An obvious criticism of this procedure is based on the fact that the arbitrary assignment of weights can make the consequent value of κ_{W} arbitrary as well.^{23} This points out the need for the researcher to operate on the basis of a hypothesis that defines the relationship among the rating categories. For instance, each of the preceding weighting systems was based on a different theoretical rationale. The rationale used to define these weights then becomes an integral part of the hypothesis being tested.^{2} For this reason, the weights used in calculating κ_{W} and the rationale for choosing them should be stated in a research report.

We will demonstrate the calculation of κ_{w} using the functional assessment data with symmetrical weights shown in Table 26.8A. Each cell in the table contains the observed frequency (*f*_{O}), the expected chance frequency (*f*_{C}, shown in parentheses), and the cell weight (*w*). The first step is to find the *weighted frequencies* of observed disagreement (*wf*_{O}) and chance disagreement (*wf*_{C}) for each cell in the matrix by multiplying the observed and chance frequencies by the cell weight. For example, for the first cell in the matrix, *wf*_{O} = 0(25) and *wf*_{C} = 0(15.54). Note that we are concerned with the *frequencies of disagreements*, not agreements as we were with kappa. Because the cells along the agreement diagonal all have weights of zero, they are effectively eliminated from the calculations.

Next we determine the sum of these terms to find the total weighted observed frequencies, Σ*wf*_{O}, and the total weighted chance frequencies, Σ*wf*_{C}. Weighted kappa is given by

*κ*_{w} = 1 − (Σ*wf*_{O}/Σ*wf*_{C})

As shown in Table 26.8B, κ_{w} = .36. This value is somewhat lower than the value obtained for kappa (κ = .49).
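A sketch of this computation in Python, using the disagreement-weight form κ_{w} = 1 − Σ*wf*_{O}/Σ*wf*_{C}. As before, the observed matrix is hypothetical but consistent with the disagreement totals in the text (18 IND-DEP, 11 IND-ASST, 5 ASST-DEP), and the weights are the symmetric set of Table 26.8A.

```python
# Weighted kappa: kw = 1 - (sum of weighted observed frequencies /
#                           sum of weighted chance frequencies).
observed = [            # hypothetical counts consistent with the text's totals
    [25, 4, 8],
    [7, 24, 3],
    [10, 2, 17],
]
weights = [             # symmetric weights; 0 on the agreement diagonal
    [0, 3, 6],          # IND vs IND, ASST, DEP
    [3, 0, 1],
    [6, 1, 0],
]

def weighted_kappa(observed, weights):
    n = sum(sum(row) for row in observed)
    rows = [sum(r) for r in observed]
    cols = [sum(c) for c in zip(*observed)]
    k = len(observed)
    # weighted observed and weighted chance frequencies, summed over all cells
    wf_o = sum(weights[i][j] * observed[i][j] for i in range(k) for j in range(k))
    wf_c = sum(weights[i][j] * rows[i] * cols[j] / n for i in range(k) for j in range(k))
    return 1 - wf_o / wf_c

kw = weighted_kappa(observed, weights)
```

Here Σ*wf*_{O} = 146 and Σ*wf*_{C} = 229.6, giving κ_{w} ≈ .36, the value reported in the text.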

Let us consider the implications of weighting these data. According to the frequency data, we find exact agreement in 66 of 100 tests. Kappa reduces this estimate to 49% by correcting for chance, but does not account for any differentiation in the seriousness of the 34 disagreements. Five of these disagreements were between ASST and DEP, which we consider minimally important. Of the 29 more serious disagreements, 18 were between IND and DEP (the most serious) and 11 were between IND and ASST. These serious disagreements account for more than one-quarter of the tests and 85% of all the disagreements. By accounting for these serious discrepancies, weighted kappa brings the level of agreement down further to 36%. This gives us a more meaningful estimate of the degree of reliability between these raters than kappa alone, and suggests that these raters demonstrate serious discrepancies too often.

Landis and Koch^{24} have suggested that values of kappa above 80% represent excellent agreement; above 60% substantial levels of agreement; from 40% to 60% moderate agreement; and below 40% poor to fair agreement. For this example, then, we have achieved only a moderate degree of reliability. **The interpretation of this outcome, like any other reliability coefficient, must depend on how the data will be used and the degree of precision required for making rational clinical decisions**.

Several factors must be considered in the application of kappa or weighted kappa. First, it is important to recognize that kappa represents an *average rate of agreement* for an entire set of scores. It will not indicate if most of the disagreement is accounted for by one specific category or rater. Therefore, in an effort to improve reliability, it is useful to subjectively examine the data when discussing results, to see where the major discrepancies lie.

A second issue, which we continue to stress for all reliability indices, is that of variance among subjects. In measures of agreement, variance is necessary to allow reasonable interpretation of reliability. In a group of subjects with homogeneous characteristics, the percentage of agreements will necessarily be high. Therefore, the reliability analysis is not really showing whether the measurement is capable of differentiating among subjects on that characteristic.

Because kappa is based on proportions, the use of very small samples can provide misleading results. For example, if two raters agree on two observations, the reliability estimate will be 100%. If they disagree on one of those observations, the rating drops to 50%. Such a variation does not accurately reflect reliability when compared with estimates of the same behavior tested many more times.

Kappa is also influenced by the number of categories used. As the number of categories increases, the extent of agreement will generally decrease. This is logical, as with more possibilities of assignment, there is room for greater discrepancy between raters. Therefore, if values of kappa are to be compared, the samples used should contain the same number of categories.

Probably the strongest limitation of kappa is that it is an analysis of exact agreement; that is, it treats agreement as an all-or-none phenomenon with no room for "close" agreement. Therefore, it is appropriate for use with nominal or ordinal data, which require that each subject be placed in an exclusive category. By definition, there can be no doubt as to whether raters achieved the same "score" for each subject. Kappa is less useful for dealing with continuous data on the interval or ratio scales, as there is no credit given for scores that remain close over several trials.

Kappa can be used with more than two raters,^{1,25} although the overall rating is less informative than if separate kappas are computed for pairs of raters.^{23} One advantage of using separate analyses is that it is then possible to use different rationales for setting weights for each comparison. A calculation has been derived for using kappa with multiple ratings per subject.^{26} It is also possible to use the intraclass correlation coefficient, ICC, as an equivalent of weighted kappa when incremental weights are scaled according to squared disagreements (*w* = (*r*_{1} – *r*_{2})^{2}).^{3}

Measuring instruments are often designed as scales, composed of many items that in total should reflect the characteristic being measured. For instance, the quantitative portion of the Graduate Record Examination (QGRE) includes many items to test a student's mathematical ability. Functional scales are designed to include items related to different functional tasks. In both of these examples, the scales are actually only a sample of the possible items that could be included, although we want to draw a conclusion about an individual's performance based on the total score. If these scales are reliable, we would expect the subject to receive the same score even if we varied the items.

One assumption that is inherent in the use of such scales is the **homogeneity** of the items or their **internal consistency**. A good scale is one that assesses different aspects of the same attribute; that is, the items are homogeneous.^{27} Therefore, the QGRE will not include items to assess verbal ability. A scale of physical function will reflect physical performance but not emotional function. Statistically, if the items on the scale are truly measuring the same attribute, they should be moderately correlated with each other and with the total score.^{‡‡} These correlations are measures of internal consistency (see Chapter 5 for further discussion of item-total correlations).

The most commonly applied statistical index for internal consistency is **Cronbach's alpha (α)**.^{28} It can be used for scales with items that are dichotomous (yes/no) or when there are more than two response choices (such as an ordinal scale). To illustrate the application of Cronbach's *α*, we will use hypothetical data from a sample of 14 patients in a rehabilitation hospital who have been assessed for function using six items: walking, climbing stairs, carrying 5 pounds, reaching for a phone, dressing (putting on a shirt), and getting in and out of a car. Each item is scored on an ordinal scale from 1 to 5, with 5 reflecting complete independence. The maximum total score, then, is 30.

Internal consistency is a reflection of the correlation among these six items and the correlation of each individual item with the total score. Cronbach's *α* for these data is .894, as shown in Table 26.9A. As with other correlation statistics, this index ranges from 0.00 to 1.00. Therefore, a value that approaches .90 is high, and the scale can be considered reliable.

Alpha can also be used to examine individual items to determine how well they fit the overall scale. In Table 26.9A, the means and standard deviations for each item and the total score are displayed. We can see that walking had the highest mean functional score and car transfer the lowest. In Table 26.9B we find the inter-item correlations for all six items. All item-pairs have correlations above .60 except for car transfer, which has consistently low correlations with all other items (.354 and lower). Perhaps this one variable should not be part of the scale, representing a different component of function than the other items.

To investigate this possibility, the advantage of *α* is that it can be computed repeatedly, each time eliminating one item from the analysis. In Table 26.9C, we see what happens to the total score when each item is deleted. In the first two columns, the mean and variance of the total score are higher when car transfer is deleted, whereas these values remain fairly stable when any other item is deleted. The third column in this panel shows the correlation of each item with the sum of the remaining items, or the **item-to-total correlation**. Only car transfer has a low correlation of .19, suggesting that this variable is not related to the other items. Each of the other five items has a correlation of approximately .80 or higher with the total. Finally, we find that alpha increases to .932 when car transfer is not included, indicating that the scale is more homogeneous when this item is omitted. These statistics suggest that car transfer should be removed from the scale, as it appears to reflect a different dimension of function than the other items.
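Cronbach's α is straightforward to compute from raw item scores using the standard formula α = (*k*/(*k* − 1))(1 − Σ of item variances / variance of total scores). The data below are illustrative, not the Table 26.9 values: one set of perfectly consistent items and one noisier set.

```python
# Cronbach's alpha from a subjects-by-items score matrix.
from statistics import variance

def cronbach_alpha(items):
    """items: list of subjects, each a list of k item scores."""
    k = len(items[0])
    # variance of each item across subjects
    item_vars = [variance([subj[i] for subj in items]) for i in range(k)]
    # variance of the total (summed) score across subjects
    total_var = variance([sum(subj) for subj in items])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

perfect = [[1, 1, 1], [2, 2, 2], [3, 3, 3], [4, 4, 4]]   # items rise together
noisy = [[1, 2, 1], [2, 1, 3], [3, 4, 2], [4, 3, 4]]     # weaker consistency
```

For the perfectly consistent items α = 1.0; for the noisier set α drops to about .72, reflecting the weaker inter-item correlation.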

Interestingly, several sources suggest that a scale with strong internal consistency should only show a moderate correlation among the items, between .70 and .90.^{27,29,30,31} If items have too low a correlation, they are possibly measuring different traits. If the items have too high a correlation, they are probably redundant, and the content validity of the scale might be limited.

^{‡‡}The concept of internal consistency should not be confused with content or construct validity. Internal consistency is a measure of reliability, not validity. Even if the items in a scale are correlated (a reliability issue), the scale may not be measuring what it is intended to measure (a validity issue). Internal consistency is an important characteristic of a valid scale, however.

In addition to measuring the reliability of instruments and raters, clinical scientists are often interested in assessing the consistency or stability of repeated responses over time. **Response stability** is basic to establishing all other types of reliability, because if the response variable varies from measurement to measurement, it will not be possible to separate out errors due to the rater or instrument. Three statistical methods are commonly used to express response stability: standard error of measurement, coefficient of variation and method error.

Like other forms of reliability, the concept of response stability is related to measurement error. If we were to administer a test under constant conditions to one individual an infinite number of times, we can assume that the responses would vary somewhat from trial to trial. These differences would be a function of random measurement error. Theoretically, if we could plot these responses, the distribution would resemble a normal curve, with the mean equal to the true score and errors falling above and below the mean. This distribution of measurement errors is a theoretical distribution that represents the population of all possible measurement errors that could occur for that variable. With a more reliable measurement, errors will be smaller and this distribution will be less variable. Therefore, the standard deviation of the measurement errors reflects the reliability of the response. This value is called the **standard error of measurement (SEM)**.

We can use our knowledge of the normal curve to estimate the variability within repeated measurements in one individual. For example, suppose we record a series of 25 measurements of grip strength for one subject using a hand dynamometer. Let us assume that fatigue does not occur, so that the true value does not change. We can expect that this individual will produce varied scores from trial to trial because of slightly different efforts or repositioning of his hand on the dynamometer, that is, random errors of measurement. Suppose the mean score for all trials is 23 pounds with a standard deviation of 6. We can estimate, then, that there is approximately a 68% chance that this individual's true score falls within ±1 standard deviation (between 17 and 29 pounds) and a 95% chance that it falls within ±2 standard deviations (between 11 and 35 pounds). In a subsequent test, if this subject's response is 28 pounds, we would consider the score to be within the range of measurement error; that is, no true difference has occurred.

When the estimate of measurement error is based on repeated measurements from a single individual, as in this example, its value will obviously be different for each subject. Therefore, the amount of error, or reliability, associated with a particular measurement will not be a constant estimate. Most often, however, it is not feasible to collect a large enough sample of repeated measurements on every subject. Therefore, we have to estimate the SEM for a set of scores obtained from a larger sample of subjects as follows:

SEM = *s*_{X}√(1 − *r*_{XX})

where *s*_{X} is the standard deviation of the set of observed test scores on a group of subjects, and *r*_{XX} is the reliability coefficient for that measurement (typically obtained from previous research). For example, suppose we administer grip strength tests to a sample of 300 patients, each measured once. Assume the standard deviation of these scores is 12, and the reliability coefficient for this measurement, established by previous test-retest studies, is known to be .85. Therefore,

SEM = 12√(1 − .85) = 12√.15 = 4.65

This value can now be used as an estimate for the entire group, based on a confidence interval.

Therefore, we can estimate that 95% of the time, the errors of measurement using this test will fall within ±1.96 SEM, or ±9.11 pounds. If the group mean is 30 pounds, then there is a 95% chance that the group's true mean score lies between 20.89 and 39.11 pounds. This will also provide a benchmark for evaluating individual patient performance over time.
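The SEM arithmetic for the grip-strength example can be verified directly; the function names are my own.

```python
# SEM = s_x * sqrt(1 - r_xx), using the grip-strength example from the text:
# s_x = 12, r_xx = .85, group mean = 30 pounds.
from math import sqrt

def sem(s_x, r_xx):
    # standard error of measurement from a group SD and a reliability coefficient
    return s_x * sqrt(1 - r_xx)

def true_score_interval(mean, s_x, r_xx, z=1.96):
    # 95% confidence interval for the true score around an observed mean
    e = z * sem(s_x, r_xx)
    return mean - e, mean + e

error = sem(12, 0.85)
low, high = true_score_interval(30, 12, 0.85)
```

This reproduces the values in the text: SEM ≈ 4.65 pounds, with a 95% interval of roughly 20.89 to 39.11 pounds around a mean of 30.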

The interpretation of the standard error of measurement depends on the type of reliability coefficient used in its computation. If the estimate is based on test-retest reliability, then the SEM is indicative of the range of scores that can be expected on retesting. If the ICC is used as an indicator of rater reliability, the SEM reflects the extent of expected error in different raters' scores. The choice of reliability coefficient for calculating the SEM must be based on the ultimate purpose of predicting reliability.

In Chapter 5 we introduced the concept of **minimal detectable difference (MDD)**, which is used to define the amount of change in a variable that must be achieved to reflect a true difference. Statistics of response stability will provide estimates of this threshold. The SEM is used most often to determine if a patient's performance has truly changed from trial to trial. Values below this threshold will be considered measurement error. The more reliable an instrument, the more precise this smallest measure can be. The MDD will be discussed again in the next chapter, when we consider validity of measuring change.

We can also assess response stability across repeated trials by looking at the standard deviation of the responses (for one individual or a group). Variability within the responses should reflect the degree of measurement error. The standard deviation will obviously increase as the repeated scores become more disparate.

The limitation to this approach is that the standard deviation must be interpreted in relation to the size and units of the mean. For example, suppose a distribution of strength scores (in pounds) has a standard deviation of 40 lbs. If the mean of the distribution is 110, reliability will be viewed differently than if the mean is 55. In the first instance the scores are actually less variable relative to the mean. Therefore, on the basis of standard deviation alone, we cannot accurately assess the extent of error in the measurements.

To account for the relationship between the mean and standard deviation, the variability across distributions can be compared using the **coefficient of variation, CV:**

CV = (*s*/*X̄*) × 100%

This ratio expresses the standard deviation^{§§} as a *proportion of the mean*. Because both the mean and standard deviation are in the same units, this statistic will be unit free, allowing comparisons across different quantities or different studies. See Chapter 17 for a more complete discussion and sample calculations of the coefficient of variation.

^{§§}Because scores used for reliability testing are generally not intended as estimates of population parameters, the standard deviation can be calculated using *N* in the denominator, rather than *N* − 1.^{32}
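A minimal sketch of the CV computation, using the strength example from the text (a standard deviation of 40 lbs with means of 110 vs. 55 lbs):

```python
# Coefficient of variation: CV = (s / mean) * 100.
def cv(sd, mean):
    return sd / mean * 100

cv_high_mean = cv(40, 110)   # less variability relative to the mean
cv_low_mean = cv(40, 55)     # same sd, twice the relative variability
```

The same standard deviation yields a CV of about 36.4% around a mean of 110, but about 72.7% around a mean of 55, which is why the standard deviation alone cannot be used to judge measurement error.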

Response stability, or test-retest reliability, can also be expressed in terms of the percentage variation from trial to trial, by analyzing **method error, ME.** Method error is a measure of the discrepancy between two sets of repeated scores, or their difference scores. Larger difference scores reflect greater measurement error.

Method error is calculated using the standard deviation of the difference scores (*s*_{d}) between test and retest:

ME = *s*_{d}/√2

This value reflects the amount of variation in the difference scores; however, just like any other standard deviation, it must be interpreted relative to the magnitude of the measured scores. Therefore, it is converted to a percentage using the coefficient of variation:

CV_{ME} = (ME/*X̄*) × 100%

where *X̄* is the grand mean of all test and retest scores.

Calculation of ME and its associated coefficient of variation is illustrated in Table 26.10 for hypothetical range of motion measurements. The variation in measurement from test 1 to test 2 was 6%. The interpretation of this value will depend on the amount of error deemed acceptable by those who must use the information.

Method error is often used as an adjunct to test-retest correlation statistics, as it reflects the percentage of variation from trial to trial, which the correlation coefficient does not. In addition, unlike the correlation coefficient, method error is not affected by a lack of variation in raw scores. For instance, for the data in Table 26.10, *r* = .58. This is low, especially considering how close the two sets of scores are. But we can also see that there is very little variability within these scores, which we know will tend to decrease the correlation coefficient or any reliability coefficient. Method error will not be affected by a restriction in range, because it looks only at the difference scores. Therefore, in situations like this example, where reliability coefficients may be misleading, method error provides a useful alternative.

Because method error is based on the variability within difference scores, it will not account for systematic variation between test 1 and test 2. Therefore, the researcher may want to check for systematic bias by performing a paired *t*-test between the test and retest scores.^{33} The *t*-ratio can be obtained directly by dividing the mean of the difference scores, *d̄*, by the standard error of the difference scores, *s*_{d̄}. This computation is illustrated in Table 26.10B. With *n* − 1 degrees of freedom, this value demonstrates no significant difference between test 1 and test 2.
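Method error, its coefficient of variation, and the paired *t*-ratio can all be computed together. The test-retest scores below are hypothetical, not the Table 26.10 data.

```python
# Method error ME = s_d / sqrt(2), expressed as a percentage of the grand
# mean, plus the paired t-ratio t = mean(d) / (s_d / sqrt(n)).
from math import sqrt
from statistics import mean, stdev

test1 = [100, 102, 98, 101, 99]    # hypothetical test scores
test2 = [102, 100, 99, 103, 97]    # hypothetical retest scores

d = [a - b for a, b in zip(test1, test2)]   # difference scores
s_d = stdev(d)
me = s_d / sqrt(2)
cv_me = me / mean(test1 + test2) * 100      # ME as a % of the grand mean
t = mean(d) / (s_d / sqrt(len(d)))          # paired t, with n - 1 df
```

For these scores the method error is small (about 1.4% of the grand mean), and the *t*-ratio is far from significance, suggesting no systematic bias between trials.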

Reliability is an essential property when measurements are taken with alternate forms of an instrument. For example, clinical researchers have looked at outcomes of measuring joint range of motion with different types of goniometers, inclinometers, electrogoniometers and radiographs. Even though each instrument is different, they are all intended to result in an accurate recording of joint angles in degrees. We might want to compare different designs of dynamometers for measuring strength, different types of spirometers for assessing pulmonary function or different types of thermometers for measuring temperature. In each of these examples, we would expect these methods to record similar values. The analysis of reliability in this situation focuses on the agreement between alternative methods. We can consider two methods in agreement when the difference between measurements on one subject is small enough for the methods to be considered interchangeable.^{34} This property is an important practical concern as we strive for effective and efficient clinical measurement,^{35} as well as a concern for generalization of research findings.

Two analysis procedures have traditionally been applied for method comparisons. The correlation coefficient, *r*, has been used to demonstrate covariance among methods; however, we know this is a poor estimate of reliability, as it does not necessarily reflect the extent of agreement in the data. The second procedure is the paired *t*-test (or repeated measures ANOVA), which is used to show that mean scores for two (or more) methods are not significantly different. This approach is also problematic, however, as two distributions may show no statistical difference, but still be composed of pairs with no agreement.

An interesting alternative for examining agreement across methods is an index called **limits of agreement**.^{34,35} To understand this approach, consider the hypothetical distribution of 10 measurements of range of motion of straight leg raising shown in Table 26.11 for two instruments, a regular goniometer and an inclinometer. The difference between each method for each subject is calculated by subtracting the inclinometer score from the goniometer score (this direction is consistent but arbitrary). Therefore, positive difference scores reflect a higher reading for the goniometer. The mean of the difference scores is −0.1 degrees. On average, then, the difference between the methods is quite small, and certainly within an acceptable range of clinical error. We would be happy to find that the two instruments differed by less than one degree. On further examination, however, we can see that the amount of error varied across subjects, from zero to as much as 10 degrees. Therefore, a more complete estimate of reliability would determine the range of error that can be expected for any individual subject.

A visual analysis can help to clarify this relationship. For example, the scatterplot of these scores is shown in Figure 26.2 (*r* = .86). If we draw a *line of identity* from the origin, representing agreement of scores, we can see that most of the scores are close, but not in perfect agreement. A further understanding of this relationship can be achieved by looking at the difference between methods plotted against the mean score for each subject, as shown in Figure 26.3. These plots are often called **Bland-Altman plots**, recognizing those who developed this strategy.^{35} The spread of scores around the zero point helps us decide whether the observed error is acceptable if we substitute one measurement method for the other. In Figure 26.3A, for example, the error appears unbiased, as differences are spread evenly and randomly above and below the zero point. Other possible patterns are shown in Figures 26.3B through D. Figure 26.3B shows a pattern with no error, where all differences are zero. In Figure 26.3C we see a biased pattern, where the goniometer has consistently resulted in higher scores, resulting in positive difference scores. Figure 26.3D shows another biased pattern, where error is influenced by the size of the measurement; that is, smaller angles are measured higher by the inclinometer (resulting in a negative difference score) and larger angles are measured higher by the goniometer (resulting in a positive difference score). With a biased pattern, the instruments could not be considered interchangeable.

###### FIGURE 26.2

Example of a method comparison plot, showing the relationship between two different methods for measuring range of motion of straight leg raise. The line of identity emerges from the origin, showing how closely the two methods agree. In this example, only one score falls directly on this line.

###### FIGURE 26.3

Plots of difference scores for straight leg raise measurements across mean scores for each subject. The center line represents zero difference. (**A**) Data from Table 26.11; (**B**) perfect agreement between two methods; (**C**) pattern with systematic bias in measurement error, in this case with the goniometer consistently producing higher scores than the inclinometer; (**D**) plot showing bias related to magnitude of the subjects' scores.

We can examine the agreement between the two methods by looking at the spread of the difference scores. A larger variability would indicate larger errors. Statistically, this spread is reflected by the standard deviation. Assuming the errors are normally distributed,^{∗∗∗} we would expect that approximately 95% of the difference scores would fall within two standard deviations above and below the mean of the difference scores.^{34} This range is considered the **95% limits of agreement**. As shown in Table 26.11B, for the straight leg raise data the mean difference score is –0.1 degrees with a standard deviation of 5.09 degrees. Two standard deviations equal 10.18 degrees. Therefore, the difference between these two methods of measurement of straight leg raise can be expected to vary between –0.1 ± 10.18, or between –10.28 degrees and 10.08 degrees, a range of approximately 20 degrees (see Figure 26.4).
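The limits-of-agreement calculation described above is straightforward: the mean of the difference scores plus or minus two standard deviations. A short Python sketch, again using hypothetical difference scores rather than the Table 26.11 data:

```python
from statistics import mean, stdev

# Hypothetical difference scores (goniometer minus inclinometer, degrees);
# illustrative values only, not the Table 26.11 data.
diffs = [-2, 2, -3, 2, -2]

d_bar = mean(diffs)   # mean difference (systematic bias between methods)
sd = stdev(diffs)     # sample standard deviation of the differences

# 95% limits of agreement: mean difference +/- 2 standard deviations
lower = d_bar - 2 * sd
upper = d_bar + 2 * sd
```

With the chapter's values (mean difference −0.1 degrees, standard deviation 5.09 degrees), the same formula yields the limits of −10.28 and 10.08 degrees reported in the text.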

###### FIGURE 26.4

Difference scores between goniometer and inclinometer, plotted against mean scores for each subject (data from Table 26.11). Dashed line shows the mean difference score (−0.1 degrees). The 95% upper and lower limits of agreement represent 2 standard deviations above and below the mean difference score (−0.1 ± 10.18).

The question, then, is whether we would be comfortable using either instrument, knowing that a measurement could be as much as 10 degrees higher or lower depending on which method was used. This decision should be based on a clinical criterion and the intended application of the measurements. We might argue that a potential difference of 20 degrees does not suggest interchangeable methods. We are, of course, assuming that each method is itself reliable. These considerations have important implications for clinical analyses as well as for comparison of research studies.

Because reliability issues are so important to the validity of clinical science, the statistical bases for interpreting reliability must be understood by those who do the research and those who read research reports. What we learn from looking through professional literature is that preferred methods for analyzing reliability seem to vary with different researchers and within different disciplines. Even though statisticians have been addressing reliability issues for a long time, there is no consensus on how reliability data are handled.

Choosing a particular approach to reliability testing should be based on an understanding of the nature of the response variable, what types of interpretations are desired, and what measurement issues are of greatest concern. Consideration should be given to the scale of measurement, the amount of variability that can be expected within sample scores, and what units of measurement are used. We should be aware of the intended application of the data and the degree of precision needed to make safe and meaningful clinical decisions. These details are often overlooked, and we allow ourselves to fall into the trap of using specific standards for reliability just because they have been used by others—or stated in a textbook like this one! Guidelines are just that—not gold standards. Researchers and clinicians are obligated to justify their interpretation of acceptable reliability.

Researchers should address each of these relevant issues in their reports, so that others can interpret their work properly. Many articles are published with no such discussion, leaving the reader to guess why a particular statistic was used or standard applied. Because reliability statistics can be applied in so many ways, it is important to maintain an exchange of ideas that promotes such accountability. By having to justify our choices, we are forced to consider what a statistic can really tell us about a variable and what conclusions are warranted.

It is also important to justify a measure of reliability as a foundation for validity of a test. Reliability by itself is never enough to support the use of a particular measure. Statistics such as sensitivity, specificity and likelihood ratios must be applied to determine the clinical meaningfulness of a test (see Chapter 27). Reliability is necessary but not sufficient to assure validity; however, sometimes a test will have only moderate reliability, but may still provide strong diagnostic information because of the nature of the disorder being evaluated. All measurements will have some degree of error. Our concern with the consequences of this error will depend on how we will use the measurement in our decision making. There are several examples of measurements that have only moderate kappa or ICC values, but have strong sensitivity or likelihood ratios, suggesting that the tests are useful for identifying those who have a certain disorder.^{16} Reliability is one measurement property that must be considered—an important one to be sure, but not the only one.

^{∗∗∗}Because the difference scores represent measurement error, Bland and Altman^{35} suggest that they should follow a normal distribution, even if the actual measurements do not. This distribution can be checked by graphing a histogram of the difference scores.^{34}

*J Nerv Ment Dis* 1976;163:307–317. [PubMed: 978187]

*J Counseling Psychol* 1975;22:358–376.

*Educ Psychol Meas* 1973;33:613–619.

*The Dependability of Behavioral Measurements: Theory of Generalizability for Scores and Profiles.* New York: Wiley, 1972.

*Psychol Bull* 1979;36:376–390.

*Am J Ment Defic* 1979;83:460–472. [PubMed: 426006]

*Psychol Bull* 1979;86:420–428. [PubMed: 18839484]

*Using SPSS for Windows and Macintosh: Analyzing and Understanding Data* (4th ed.). Upper Saddle River, NJ: Prentice Hall, 2004.

*Psychol Bull* 1983;93:586–595.

*Phys Ther* 2003;83:713–721. [PubMed: 12882612]

*Phys Ther* 1989;69:192–194.

*Acad Radiol* 1996;3:537–544. [PubMed: 8796714]

*Phys Ther* 1990;70:225–233. [PubMed: 2315385]

*Psychol Bull* 1968;70:213–220. [PubMed: 19673146]

*J Clin Epidemiol* 1993;46:1055–1062. [PubMed: 8263578]

*Am J Epidemiol* 1987;126:161–169. [PubMed: 3300279]

*Biometrics* 1977;33:159–174. [PubMed: 843571]

*Phys Ther* 1989;69:970–974. [PubMed: 2813523]

*Health Measurement Scales: A Practical Guide to Their Development and Use* (3rd ed.). New York: Oxford University Press, 2003.

*Psychometrika* 1951;16:297–334.

*Personality Individ Differences* 1991;12:291–294.

*Appl Psychol Meas* 1985;9:1139–1164.

*An Introduction to Medical Statistics* (2nd ed.). New York: Oxford University Press, 1995.

*Am J Phys Med Rehabil* 1993;72:266–271. [PubMed: 8398016]

*Lancet* 1986;327:307–310.

*J Orthop Sports Phys Ther* 2003;33:488–490. [PubMed: 14524507]