Estimates of reliability vary depending on the type of reliability being analyzed. We discuss four general approaches to reliability testing: test-retest reliability, rater reliability, alternate forms reliability, and internal consistency. For each approach we will identify the most commonly used reliability coefficients. These statistical indices are described in detail in Chapter 26.
One basic premise of reliability is the stability of the measuring instrument; that is, a reliable instrument will obtain the same results with repeated administrations of the test. Test-retest reliability assessment is used to establish that an instrument is capable of measuring a variable with consistency. In a test-retest study, one sample of individuals is subjected to the identical test on two separate occasions, keeping all testing conditions as constant as possible. The coefficient derived from this type of analysis is called a test-retest reliability coefficient. This estimate can be obtained for a variety of testing tools, and is generally indicative of reliability in situations where raters are not involved, such as self-report survey instruments and physical and physiological measures with mechanical or digital readouts. If the test is reliable, the subject's score should be similar on multiple trials. In terms of reliability theory, the extent to which the scores vary is interpreted as measurement error.
Because variation in measurement must be considered within the context of the total measurement system, errors may actually arise from many sources. Therefore, to attribute observed error to the instrument itself, the researcher must be able to assume stability in the response variable. Unfortunately, many variables do change over time. For example, a patient's self-assessment of pain may change between two testing sessions. We must also consider the natural fluctuation of many clinical variables over time. When responses are labile, test-retest reliability may be impossible to assess.
Because the stability of a response variable is such a significant factor, the time interval between tests must be considered carefully. Intervals should be far enough apart to avoid fatigue, learning, or memory effects, but close enough to avoid genuine changes in the measured variable. The primary criteria for choosing an appropriate interval are the stability of the response variable and the test's intended purpose. For example, if we were interested in the reproducibility of electromyographic measurements, it might be reasonable to test the patient on two occasions within one week. Range of motion measurements can often be repeated within one day or even within a single session. Measures of infant development might need to be taken over a short period, to avoid the natural changes that rapidly occur at early ages. If, however, we are interested in establishing the ability of an IQ test to provide a stable assessment of intelligence over time, it might be more meaningful to test a child using intervals of one year. The researcher must be able to justify the stability of the response variable to interpret test-retest comparisons.
Carryover and Testing Effects
With two or more measures, reliability can be influenced by the effect of the first test on the outcome of the second test. For example, practice or carryover effects can occur with repeated measurements, changing performance on subsequent trials. A test of dexterity may improve because of motor learning. Strength measurements can improve following warm-up trials. Sometimes subjects are given a series of pretest trials to neutralize this effect, and data are collected only after performance has stabilized. A retest score can also be influenced by a subject's effort to improve on the first score. This is especially relevant for variables such as strength, where motivation plays an important role. Researchers may not let subjects know their first score to control for this effect.
It is also possible for the characteristic being measured to be changed by the first test. A strength test might cause pain in the involved joint and alter responses on the second trial. Range of motion testing can stretch soft tissue structures around a joint, increasing the arc of motion on subsequent testing. When the test itself is responsible for observed changes in a measured variable, the change is considered a testing effect. Oftentimes, such effects will be manifested as systematic error, creating consistent changes across all subjects. Such an effect will not necessarily affect reliability coefficients, for reasons we have already discussed.
Reliability Coefficients for Test-Retest Reliability
Test-retest reliability has traditionally been analyzed using the Pearson product-moment coefficient of correlation (for interval-ratio data) or the Spearman rho (for ordinal data). As correlation coefficients, however, they are limited as estimates of reliability. The intraclass correlation coefficient (ICC) has become the preferred index, as it reflects both correlation and agreement. With nominal data, percent agreement can be determined and the kappa statistic applied. In situations where the stability of a response is questioned, the standard error of measurement (SEM) can be applied.
Many clinical measurements require that a human observer, or rater, be part of the measurement system. In some cases, the rater is the actual measuring instrument, such as in a manual muscle test or joint mobility assessment. In other situations, the rater must observe performance and apply operational criteria to subjective observations, as in a gait analysis or functional assessment. Sometimes a test necessitates the physical application of a tool, and the rater becomes part of the instrument, as in the use of a goniometer or taking of blood pressure. Raters may also be required simply to read or interpret the output from another instrument, such as an electromyogram, or force recordings on a dynamometer. However the measurements are taken, the individual performing the ratings must be consistent in the application of criteria for scoring responses.
This aspect of reliability is of major importance to the validity of any research study involving testers, whether one individual does all the testing or several testers are involved. Data cannot be interpreted with confidence unless those who collect, record and reduce the data are reliable. In many studies, raters undergo a period of training, so that techniques are standardized. This is especially important when measuring devices are new or unfamiliar, or when subjective observations are used. Even when raters are experienced, however, rater reliability should be documented as part of the research protocol.
To establish rater reliability, the instrument and the response variable are assumed to be stable, so that any differences between scores can be attributed to rater error. In many situations this is a strong assumption, and the researcher must understand the nature of the test variables and the instrumentation before concluding that the rater is the true source of observed error.
Intrarater reliability refers to the stability of data recorded by one individual across two or more trials. When carryover or practice effects are not an issue, intrarater reliability is usually assessed using trials that follow each other at short intervals. Reliability is best established with multiple trials (more than two), although the number of trials needed depends on the expected variability in the response. In a test-retest situation, when a rater's skill is relevant to the accuracy of the test, intrarater reliability and test-retest reliability are essentially the same estimate; the effects of the rater and the test cannot be separated.
Researchers may assume that intrarater reliability is achieved simply by having one experienced individual perform all measurements; however, the objective nature of scientific inquiry demands that rater reliability be evaluated even under expert conditions. Expertise by clinical standards may not always match the level of precision needed for research documentation. When statistical reliability has been established, those who critique the research cannot question the measurement accuracy of the data, and the research conclusions will be strengthened.
Rater Bias. We must also consider the possibility of bias when one rater takes two measurements. Raters can be influenced by their memory of the first score. This is most relevant in cases where human observers use subjective criteria to rate responses, but it can operate in any situation where a tester must read a score from an instrument. The most effective way to control for this type of error is to blind the tester in some way, so that the first score remains unknown until after the second trial is completed; however, because most clinical measurements are observational, such a technique is often unreasonable. For instance, we could not blind a clinician to measures of balance, function, muscle testing or gait, where the tester is an integral part of the measurement system. The major protections against tester bias are to develop grading criteria that are as objective as possible, to train the testers in the use of the instrument, and to document reliability across raters.
Interrater reliability concerns variation between two or more raters who measure the same group of subjects. Even with detailed operational definitions and equal skill, different raters are not always in agreement about the quality or quantity of the variable being assessed. Intrarater reliability should be established for each individual rater before comparing raters to each other.
Interrater reliability is best assessed when all raters are able to measure a response during a single trial, where they can observe a subject simultaneously and independently. This eliminates true differences in scores as a source of measurement error when comparing raters' scores. Videotapes of patients performing activities have proved useful for allowing multiple raters to observe the exact same performance.3,4,5 Simultaneous scoring is not possible, however, for many variables that require interaction of the tester and subject. For example, range of motion and manual muscle testing could not be tested simultaneously by two clinicians. With these types of measures, rater reliability may be affected if the true response changes from trial to trial. For instance, actual range of motion may change if the joint tissues are stretched from the first trial. Muscle force can decrease if the muscle is fatigued from the first trial.
Researchers will often decide to use one rater in a study, to avoid the necessity of establishing interrater reliability. Although this is useful for attempting consistency within the study, it does not strengthen the generalizability of the research outcomes. If interrater reliability of measurement has not been established, we cannot assume that other raters would have obtained similar results. This, in turn, limits the application of the findings to other people and situations. Interrater reliability allows the researcher to assume that the measurements obtained by one rater are likely to be representative of the subject's true score, and therefore, the results can be interpreted and applied with greater confidence.
Reliability Coefficients for Rater Reliability
The intraclass correlation coefficient (ICC) should be used to evaluate rater reliability. For interrater reliability, ICC model 2 or 3 can be used, depending on whether the raters are representative of other similar raters (model 2) or no generalization is intended (model 3). For intrarater reliability, model 3 should be used (see Chapter 26).
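The distinction between the ICC models can be illustrated by computing both from the same subjects-by-raters table using the two-way ANOVA mean squares. The sketch below is a simplified implementation of the single-measure forms ICC(2,1) and ICC(3,1); the ratings for five subjects scored by three raters are invented for illustration.

```python
def icc_two_way(data):
    """Compute ICC(2,1) and ICC(3,1) from a subjects-by-raters table,
    using the two-way ANOVA mean squares."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)      # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)      # between raters
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual
    # Model 2 treats raters as random (generalizes to similar raters);
    # model 3 treats these raters as the only ones of interest.
    icc2 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    icc3 = (msr - mse) / (msr + (k - 1) * mse)
    return icc2, icc3

# Hypothetical scores: 5 subjects, each rated by the same 3 raters
ratings = [
    [10, 11, 10],
    [14, 15, 14],
    [8, 8, 9],
    [12, 13, 12],
    [9, 10, 9],
]
icc2, icc3 = icc_two_way(ratings)
print(f"ICC(2,1) = {icc2:.3f}, ICC(3,1) = {icc3:.3f}")
```

Because model 2 retains systematic rater differences in its error term, ICC(2,1) will be lower than ICC(3,1) whenever one rater scores consistently higher or lower than the others, as in these data.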
Many measuring instruments exist in two or more versions, called equivalent, parallel or alternate forms. Interchange of these alternate forms can be supported only by establishing their parallel reliability. Alternate forms reliability testing is often used as an alternative to test-retest reliability with paper-and-pencil tests, when the nature of the test is such that subjects are likely to recall their responses to test items. For example, we are all familiar with standardized tests such as the Scholastic Aptitude Test (SAT) and the Graduate Record Examination (GRE), professional licensing exams or intelligence tests, which are given several times a year, each time in a different form. These different versions of the tests are considered reliable alternatives based on their statistical equivalence. This type of reliability is established by administering two alternate forms of a test to the same group, usually in one sitting, and correlating paired observations. Because the tests are ostensibly different, they can be given at essentially the same time without fear of bias from one to the other. Although the idea of alternate forms has been applied mostly to educational and psychological testing, there are many examples in clinical practice. For example, clinicians use parallel forms of gait evaluations, tests of motor development, strength tests, functional evaluations, and range of motion tests. Many of these have not been tested for alternate forms reliability.
The importance of testing alternate forms reliability has been illustrated in studies of hand dynamometers. Several models are available, each with slightly different design features. Because these tools are often used to take serial measurements, patients might appear to be stronger or weaker simply because of error if different instruments were used. Studies comparing various models have shown that some instruments generate significantly different strength scores,6 while others have shown comparable values.7 Establishing this method comparison is necessary if absolute values are to be compared or equated across tests, and to generalize findings from one study to another or from research to practice.
Reliability Coefficients for Alternate Forms Reliability
Correlation coefficients have been used most often to examine alternate forms reliability. The determination of limits of agreement has been proposed as a useful estimate of the range of error expected when using two different versions of an instrument. This estimate is based on the standard deviation of difference scores between the two instruments (see Chapter 26).
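The limits-of-agreement calculation described above can be sketched in a few lines. In this illustrative Python example, the paired grip-strength readings from two hypothetical dynamometer models are invented; the 95% limits are the mean difference (bias) plus or minus 1.96 standard deviations of the differences.

```python
import statistics as st

# Hypothetical grip-strength readings (kg) from two dynamometer models
device_a = [30.1, 42.5, 25.3, 38.0, 33.2, 45.8, 28.7]
device_b = [31.0, 41.2, 26.1, 39.4, 32.5, 44.9, 29.8]

# Difference scores between the two instruments
diffs = [a - b for a, b in zip(device_a, device_b)]
bias = st.mean(diffs)      # systematic difference between instruments
sd_diff = st.stdev(diffs)  # SD of the difference scores

# 95% limits of agreement: bias +/- 1.96 SD of the differences
lower, upper = bias - 1.96 * sd_diff, bias + 1.96 * sd_diff
print(f"bias = {bias:.2f} kg, limits of agreement = ({lower:.2f}, {upper:.2f})")
```

The clinical question is then whether a difference as large as the computed limits would matter when scores from the two devices are interchanged.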
Instruments such as questionnaires, written examinations and interviews are ideally composed of a set of questions or items designed to measure particular knowledge or attributes. Internal consistency, or homogeneity, reflects the extent to which items measure various aspects of the same characteristic and nothing else. For example, if a professor gives an exam to assess students' knowledge of research design, the items should reflect a summary of that knowledge; the test should not include items on anthropology or health policy. If we assess a patient's ability to perform daily tasks using a physical function scale, then the items on the scale should relate to aspects of physical function only. If some items evaluated psychological or social characteristics, then the items would not be considered homogeneous. The scale should, therefore, be grounded in theory that defines the dimension of physical function, thereby distinguishing it from other dimensions of function.
The most common approach to testing internal consistency involves looking at the correlation among all items in a scale. For most instruments, it is desirable to see some relationship among items, to reflect measurement of the same attribute, especially if the scale score is summed. Therefore, for inventories that are intended to be multidimensional, researchers generally establish subscales that are homogenous on a particular trait (even though items are often mixed when the test is administered). For example, the Short-Form 36-item (SF-36) health status measure is composed of eight subscales, including physical function, limitations in physical role, pain, social function, mental health, limitations in emotional role, vitality and general health perception.8 Each of these subscales has been evaluated separately for internal consistency.9
If we wanted to establish the reliability of a questionnaire, it would be necessary to administer the instrument on two separate occasions, essentially a test-retest situation. Oftentimes, the interval between testing is relatively brief, to avoid the possibility for true change. Recall of responses, then, becomes a potential threat, as it might influence the second score, making it impossible to get a true assessment of reliability. One solution to this problem is the use of parallel forms, but this shifts the measure of reliability to a comparison of instruments, rather than reliability of a single instrument.
A simpler approach combines the two sets of items into one longer instrument, with half the items being redundant of the other half. One group of subjects takes the test in a single session. The items are then divided into two comparable halves for scoring, creating two separate scores for each subject. Typically, questions are divided according to odd and even items. This is considered preferable to comparing the first half of the test with the second half, as motivation, fatigue and other psychological elements can influence performance over time, especially with a long test. Reliability is then assessed by correlating the results of the two halves of the test. If each subject's half-test scores are highly correlated, the whole test is considered reliable. This is called split-half reliability. This value will generally underestimate the true reliability of the scale, because reliability is proportional to the total number of items in a scale; each half-test contains only half the items of the full instrument, so the resulting coefficient is too low.
The obvious problem with the split-half approach is the need to determine that the two halves of the test are actually measuring the same thing. In essence, the two halves can be considered alternate forms of the same test; however, the split-half method is considered superior to test-retest and alternate forms procedures because there is no time lag between tests, and the same physical, mental and environmental influences will affect the subjects as they take both sections of the test.
Reliability Coefficients for Internal Consistency
The statistic most often used for internal consistency is Cronbach's coefficient alpha (α).10 This statistic can be used with items that are dichotomous or that have multiple choices.* Conceptually, coefficient α is the average of all possible split-half reliabilities for the scale. This statistic evaluates the items in a scale to determine if they are measuring the same construct or if they are redundant, suggesting which items could be discarded to improve the homogeneity of the scale. Cronbach's α will be affected by the number of items in a scale. The longer the scale, the more homogeneous it will appear, simply because there are more items.
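Coefficient alpha can be computed directly from the variance of each item and the variance of the total score: α = [k/(k − 1)](1 − Σσ²ᵢ/σ²ₜₒₜₐₗ), where k is the number of items. The Python sketch below applies this formula to invented responses from six subjects on a five-item scale.

```python
import statistics as st

# Hypothetical responses: rows are subjects, columns are the 5 items of a scale
items = [
    [4, 5, 4, 4, 5],
    [2, 1, 2, 3, 1],
    [5, 5, 4, 5, 5],
    [3, 2, 3, 3, 2],
    [1, 2, 1, 1, 2],
    [4, 3, 4, 4, 3],
]
k = len(items[0])

# Variance of each item across subjects, and variance of the total score
item_vars = [st.variance([row[j] for row in items]) for j in range(k)]
total_var = st.variance([sum(row) for row in items])

# Cronbach's coefficient alpha
alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(f"Cronbach's alpha = {alpha:.3f}")
```

When items covary strongly, the total-score variance greatly exceeds the sum of the item variances, and alpha approaches 1; uncorrelated items drive alpha toward zero.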
For split-half reliability, the Spearman-Brown prophecy statistic is used to estimate the reliability of the full-length test from the correlation between its two halves.
We can also assess internal consistency by conducting an item-to-total correlation; that is, we can examine how each item on the test relates to the instrument as a whole. To perform an item-to-total correlation, each individual item is correlated with the total score, omitting that item from the total. If an instrument is homogeneous, we would expect these correlations to be high. With this approach it is not necessary to create a doubly long test. The Pearson product-moment correlation coefficient is appropriate for this analysis (see Chapter 23).