A scale is an ordered system based on a series of questions or items that provide an overall rating that represents the degree to which a respondent possesses a particular attitude, value or characteristic. The purpose of a scale is to distinguish among people who demonstrate different intensities of the characteristic that is being measured. Scales have been developed to measure attitudes, function, health and quality of life, pain, exertion and other physical, physiological and psychological variables.
Categorical scales are based on nominal measurement. A question asks the respondent to assign himself to one of several classifications. This type of scale is used with variables such as gender, diagnosis, religion or race. These data are expressed as frequency counts or percentages.
Most scales represent a characteristic that exists on a continuum. Continuous scales may be measured using interval or ratio values, such as age, blood pressure or years of experience. An ordinal scale requires that a continuous variable be collapsed into ranks. For instance, pain can be measured as "minimal, moderate, severe," or function as "independent, minimal assist, moderate assist, maximal assist, dependent." Scale items should represent the full range of values that represent the characteristic being measured.
Scales are created so that a summary score can be obtained from a series of items, indicating the extent to which an individual possesses the characteristic of interest. Because item scores are combined to make this total, it is important that the scale is structured around only one dimension; that is, all items should reflect different elements of a single characteristic. A summative scale is one that presents a total score with all items contributing equal weight to the total. A cumulative scale demonstrates an accumulated characteristic, with each item representing an increasing amount of the attribute being measured.
We will describe several scaling models used to summarize respondent characteristics: Likert scales, the semantic differential, visual analogue scales, cumulative scales and Rasch models.
A Likert scale is a summative scale, most often used to assess attitudes or values. A series of statements is presented expressing a viewpoint, and respondents are asked to select an appropriately ranked response that reflects their agreement or disagreement with each one. For example, Figure 15.3 shows a set of statements that evaluate students' opinions about including a research course in an entry-level professional curriculum. Likert's original scale included five categories: strongly agree (SA), agree (A), neutral (N), disagree (D), and strongly disagree (SD).19 Many modifications to this model have been used, sometimes extending it to seven categories (including "somewhat disagree" and "somewhat agree") or four categories (eliminating "neutral").
There is no consensus regarding the number of response categories that should be used. Some researchers believe the "neutral" option should be omitted so that the respondents are forced to make a choice, rather than allowing them an "out" so that they do not have to take sides on an issue. Others feel that respondents who do not have strong feelings should be given a viable option to express that attitude. When the forced choice method is used, responses that are left blank are generally interpreted as "neutral."
Each choice along the scale is assigned a point value, based on the degree to which the item represents a favorable or unfavorable characteristic. For example, we could rate SA = 5, A = 4, N = 3, D = 2, SD = 1, or we could use codes such as SA = 2, A = 1, N = 0, D = −1, SD = −2. The actual values are unimportant, as long as the items are consistently scored; that is, agreement with favorable items should always be scored higher than agreement with unfavorable items. Therefore, if positively phrased items are coded 5 through 1, then negatively phrased items must be coded 1 through 5.
An overall score is computed for each respondent by adding points for each item. Creating such a total assumes that the items are measuring the same things and that each item reflects equal elements of the characteristic being studied; that is, one item should not carry any more weight than the others.
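This scoring logic, including the reverse-coding of negatively phrased items, can be sketched briefly. The item responses and the choice of which items are negatively phrased here are hypothetical, for illustration only:

```python
# Sketch of Likert scoring with reverse-coded items (hypothetical data).
# Responses are coded SA=5, A=4, N=3, D=2, SD=1 for favorable items;
# negatively phrased items are reverse-coded so that agreement with an
# unfavorable statement lowers the total.

CODES = {"SA": 5, "A": 4, "N": 3, "D": 2, "SD": 1}

def score_likert(responses, negative_items):
    """responses: list of category labels, one per item;
    negative_items: set of 0-based indices of negatively phrased items."""
    total = 0
    for i, r in enumerate(responses):
        value = CODES[r]
        if i in negative_items:
            value = 6 - value  # reverse-code: 5<->1, 4<->2, 3 stays 3
        total += value
    return total

# One respondent answering four items; items 1 and 3 are negatively phrased.
print(score_likert(["SA", "SD", "A", "D"], negative_items={1, 3}))  # 5+5+4+4 = 18
```

Note that strong disagreement with a negative statement contributes the same 5 points as strong agreement with a positive one, which is exactly the consistency requirement described above.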
Constructing a Likert scale requires more than just listing a group of statements. A large pool of items should be developed, usually 10 to 20, reflecting an equal number of favorable and unfavorable attitudes. It is generally not necessary to include items intended to elicit neutral responses, because these will not help to distinguish respondents. The scale should be validated by performing item analyses to indicate which items truly discriminate between those with positive and those with negative attitudes. These items are retained in the final version of the scale, and the others are eliminated. If respondents are equally likely to agree with both favorable and unfavorable statements, the scale is not providing a valid assessment of their feelings about the issue. The basis of the item analysis is that there should be a correlation between an individual's total score and each item response: those who score highest should also agree with positively worded statements, and those who obtain the lowest total scores should disagree. Items that generate agreement from respondents with both high and low total scores are probably irrelevant to the characteristic being studied and should be omitted.
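One common way to perform this item analysis is a corrected item-total correlation: each item's scores are correlated with the total of the remaining items. The sketch below uses a small hypothetical data set; a real validation study would use a much larger sample and formal significance criteria:

```python
# Sketch of an item analysis: correlate each item's scores with the total
# score of the remaining items (a "corrected" item-total correlation).
# The response data are hypothetical, for illustration only.
from statistics import mean, pstdev

def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# Rows are respondents, columns are item scores (1-5).
data = [
    [5, 4, 5, 2],
    [4, 4, 5, 3],
    [2, 1, 2, 4],
    [1, 2, 1, 3],
    [3, 3, 3, 2],
]

for item in range(len(data[0])):
    item_scores = [row[item] for row in data]
    rest_totals = [sum(row) - row[item] for row in data]
    print(f"item {item + 1}: r = {pearson(item_scores, rest_totals):.2f}")
```

In this toy data set the fourth item correlates poorly with the rest of the scale, flagging it as a candidate for removal.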
Attitudes have also been evaluated using a technique called the semantic differential.20 This method tries to measure the individual's feelings about a particular object or concept based on a continuum that extends between two extreme opposites. For example, we could ask respondents to rate their feelings about natural childbirth by checking the space that reflects their attitude on the following scale:
The semantic differential is composed of a set of these scales, using pairs of words that reflect opposite feelings. Typically a 7-point scale is used, as just shown, with the middle representing a neutral position. This scale is different from the Likert scale in two ways. First, only the two extremes are labeled. Second, the continuum is not based on agree/disagree, but on opposite adjectives that should express the respondent's feelings about the concept. Figure 15.4 illustrates a semantic differential to explore self-image in a group of elderly women who reside in a nursing home.
Example of a semantic differential for testing self-image. Dimensions of evaluation (E), potency (P), and activity (A) are indicated, although these designations would not appear in an actual test.
Research has demonstrated that the adjective pairs used in this scale tend to fall along three underlying dimensions, which have been labeled evaluation, potency and activity.20,21 Evaluation is associated with adjectives such as nice-awful, good-bad, clean-dirty, valuable-worthless and helpful-unhelpful. Some concepts that lie on the positive side of this dimension are doctor, family, peace, success and truth. Negative evaluation concepts include abortion, disease, war and failure. Potency is associated with pairs such as big-little, powerful-powerless, strong-weak, large-small and deep-shallow. Strong potency concepts include bravery, duty, law, power and science. Negative potency concepts include baby, love and art. The activity dimension is characterized by fast-slow, alive-dead, noisy-quiet, young-old, active-passive and sharp-dull. Strong activity concepts are danger, anger, fire and child. Concepts that lie toward the negative activity side are calm, death, rest and sleep. The ratings shown in Figure 15.4 are labeled according to their respective dimensions. It is a good idea to vary the order in which the dimensions are presented when listing the scales.
The semantic differential is scored by assigning values from 1 to 7 to each of the spaces within each adjective pair, with 1 representing the most negative response and 7 indicating the positive extreme. To avoid biases or a tendency to just check the same column in each scale, the order of negative and positive responses should be randomly varied. For instance, in Figure 15.4, ratings of weak-strong, slow-fast and ugly-beautiful place the negative value on the left; all other scales have the positive value on the left. A total score can be obtained by summing the scores for each rating. Lower total scores will reflect generally negative feelings toward the concept being assessed, and higher scores represent generally positive feelings. Statistical procedures, such as factor analysis, can be applied to the scale ratings to determine if the evaluation, potency, and activity ratings tend to go together (see Chapter 29 for a description of factor analysis). In this way, the instrument can be used to explore theoretical constructs.
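Putting these rules together, scoring involves flipping the scales whose positive pole sits on the left and then summing within each dimension. The sketch below is hypothetical: the particular adjective pairs chosen, their orientation, and the marks are illustrative, not taken from Figure 15.4:

```python
# Sketch of scoring a semantic differential (hypothetical scales and data).
# Raw marks are column positions 1-7 from left to right. Scales whose
# positive pole is printed on the left are flipped so that, after scoring,
# 1 is always the most negative response and 7 the most positive.
# Each scale is tagged with evaluation (E), potency (P), or activity (A).

scales = [
    ("good-bad", "E", True),         # positive pole (good) on the left
    ("weak-strong", "P", False),     # negative pole (weak) on the left
    ("slow-fast", "A", False),
    ("ugly-beautiful", "E", False),
    ("nice-awful", "E", True),       # positive pole (nice) on the left
]

def dimension_scores(marks):
    """marks: raw 1-7 column positions, in the same order as `scales`."""
    totals = {"E": 0, "P": 0, "A": 0}
    for (name, dim, pos_on_left), mark in zip(scales, marks):
        value = 8 - mark if pos_on_left else mark  # flip reversed scales
        totals[dim] += value
    return totals

print(dimension_scores([1, 7, 6, 7, 2]))  # {'E': 20, 'P': 7, 'A': 6}
```

Keeping subtotals per dimension, rather than one grand total, preserves the evaluation, potency, and activity structure for later analysis.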
A visual analogue scale (VAS) is one of the simplest methods to assess the intensity of a subjective experience. A line is drawn, usually fixed at 100 mm in length, with word anchors on either end that represent extremes of the characteristic. The intermediate levels along the line are not defined. Respondents are asked to place a mark along the line corresponding to their perceived level for that characteristic. The VAS is scored by measuring the distance of the mark from the left-hand anchor in millimeters. This method has also been used to measure a variety of characteristics,22 most extensively for pain,23,24 as shown in Figure 15.5. The VAS can be used to evaluate a variable at a given point in time or its degree of change over time.
A 100 mm visual analogue scale for pain, showing a mark at 27 mm.
The scores obtained with a VAS have generally been treated as ratio-level data, measured in millimeters.25,26,27 This assumption permits VAS scores to be averaged and subjected to parametric statistical procedures. Some have argued that the scores are only pseudo-ratio and should be treated as ordinal, handled with nonparametric statistics.28 They suggest that the individual marking the line is not truly able to appreciate the full continuum, as evidenced by ceiling effects29 and a tendency to cluster marks at certain points.30 Therefore, even though the actual readings from the scale are obviously at the ratio level, the true measurement properties may be less precise. This dilemma will continue to emerge in studies using the VAS.31
The simple format of the VAS continues to make it a popular method for assessing unidimensional characteristics. This points out one disadvantage of the technique, however, in that each VAS is only capable of evaluating one dimension of a trait. Researchers often incorporate several VAS lines, each with different anchors, to assess related aspects of the characteristic being measured.32,33
In a summative scale, several item scores are added to create a total score. One limitation of this type of measure is that the total score can be interpreted in more than one way. Suppose we have a scale, scored from 0 to 100, that measures physical function, including elements related to locomotion, personal hygiene, dressing and feeding. Two individuals who achieve a score of 50 may have obtained this score for very different reasons. One may be able to walk, but is unable to perform the necessary upper extremity movements for self-care. Another may be in a wheelchair, but is able to take care of his personal needs. Therefore, a summed score can be ambiguous. This ambiguity arises because the items within the scale actually represent different components or dimensions of the trait being measured, in this case physical function, which are not all equal.
Cumulative scales (also called Guttman scales) provide an alternative approach, wherein a set of statements is presented that reflects increasing intensities of the characteristic being measured. This technique is designed to ensure that there is only one dimension within a set of responses; that is, there is only one unique combination of responses that can achieve a particular score. For instance, in a cumulative scale a respondent who agrees with item 2 will also have had to agree with item 1; one who agrees with item 3 will have had to agree with items 1 and 2; and so on. Therefore, although there may be several combinations of responses that will result in a total score of 10 for a summative scale, there is only one way to achieve that score on a cumulative scale. Consider the following statements which were included in a self-assessment interview of elderly people concerning their functional health status.34
I can go to the movies, church or visiting without help.
I can walk up and down to the second floor without help.
I can walk half a mile without help.
I am not limited in any activities.
I have no physical conditions or illnesses now.
I am still healthy enough to do heavy work around the house without help.
If these items represent a cumulative scale, then all those who can walk half a mile can also climb stairs to the second floor and go out visiting. Those who cannot walk half a mile should not be able to do heavy housework and probably have some limiting illness or physical condition. The development of this scale is, therefore, based on a theoretical premise that there is a hierarchy to this dimension of health.
Each item in the cumulative scale is scored as 1 = agree or 0 = disagree. A total cumulative score is then computed for all items. The maximum score will be equal to the number of items in the scale. A respondent who achieves a score of 2 would have had to agree only with items 1 and 2. If he agreed with items 1 and 3 only, the scale would be faulty because the set of statements would not constitute a hierarchy in terms of the characteristic being assessed. In reality, such scales are not free of error, and some of the subjects can be expected to present inconsistent patterns of response. In the analysis of the response categories for functional health, researchers found that most of their subjects could participate in social activities (86%) and that the fewest could do heavy work around the house, like shoveling snow and washing walls (21%).34 The frequencies for other responses ranged between these two extremes, supporting the cumulative scale.
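The logic of checking such a scale can be sketched as follows. Each response pattern is compared against the ideal cumulative pattern (all agreements before any disagreements), inconsistencies are counted as errors, and a coefficient of reproducibility summarizes how well the hierarchy holds; a conventional rule of thumb accepts values above roughly 0.90. The response data below are hypothetical:

```python
# Sketch of checking a cumulative (Guttman) scale. Items are ordered from
# easiest to hardest; a perfectly scalable pattern has all 1s (agree)
# before any 0s (disagree). The coefficient of reproducibility is
# CR = 1 - errors / (people * items).
# Response data here are hypothetical, for illustration only.

def guttman_errors(pattern):
    """Minimum number of responses to flip to make the pattern cumulative."""
    n = len(pattern)
    best = n
    for cut in range(n + 1):  # try every possible 'true score' cut point
        errors = pattern[:cut].count(0) + pattern[cut:].count(1)
        best = min(best, errors)
    return best

responses = [  # each row: one person, items ordered easiest -> hardest
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 0, 0, 0],  # inconsistent: agreed with item 3 but not item 2
    [1, 1, 1, 1, 0, 0],
]

total_errors = sum(guttman_errors(p) for p in responses)
n_cells = len(responses) * len(responses[0])
cr = 1 - total_errors / n_cells
print(f"errors = {total_errors}, CR = {cr:.2f}")  # errors = 1, CR = 0.97
```

Only the fourth respondent breaks the hierarchy, so the reproducibility remains high and the cumulative structure is supported.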
The issues of hierarchical assessment extend to many of the questionnaire instruments that have been developed to assess functional and health outcomes. In most such scales, items are marked using ordinal values, and a total score is generated. For example, we could ask elderly patients if their health limits their function based on several ADL items, as follows:
| | (1) Limited a Lot | (2) Limited a Little | (3) Not Limited |
| --- | --- | --- | --- |
| Eating | □ | □ | □ |
| Walking indoors | □ | □ | □ |
| Climbing stairs | □ | □ | □ |
Although this is an obviously abbreviated scale for the sake of example, a person who is independent in all three items would obtain a total score of 9, and a person who is severely limited in all three tasks would receive a total score of 3. For this total score to be meaningful, however, three criteria must be met. First, the scale items must reflect a unidimensional construct. For instance, the ability to eat is not necessarily related to the ability to walk indoors or climb stairs; that is, these items may be part of different dimensions of function.35 If so, the sum of scores on these items would not reflect a unified construct, akin to adding apples and oranges. Therefore, two patients who obtain a score of 5 may not demonstrate the same functional profile.
Second, the items must progress according to a hierarchical model from easy to difficult, so we can determine whether someone has more or less of the trait. This also means that the order of difficulty of the items is consistent for all patients and that the range of the scale incorporates the extremes.36 Therefore, if our sample functional scale were properly arranged, eating would be easier than walking indoors, and walking indoors would be easier than climbing stairs, for everyone.
Third, we need a scale that will allow us to measure change within or across patients. As we have noted before, ordinal values may present problems in this regard because they have limited sensitivity and precision. A patient might improve in his ability to climb stairs, but not enough to be scored at a higher level of independence. Therefore, for a score to be meaningful, units of measurement must have equal intervals along the scale, to account for magnitude of change. These objectives can be achieved using a technique called Rasch analysis, which statistically manipulates ordinal data to create a linear measure on an interval scale.37,38,39,40,41
We can describe this process using items from the Functional Independence Measure (FIM), a popular instrument for assessing function in rehabilitation settings. The FIM is an 18-item scale designed to evaluate the amount of assistance a patient needs to accomplish activities of daily living (ADLs).42 The items measure both motor and cognitive functions. Each item is scored on an ordinal scale from 1 (total assist) to 7 (total independence). The larger the total score, the less assistance the patient requires. Theoretically, then, if the scale represents a singular construct of function, the total score should reflect an "amount" of independence; that is, we can think of individuals as being "more" or "less" independent.
Now let us conceive of function as a line, representing the continuum of function, as shown in Figure 15.6. For this example we will use the five items in the cognitive subscale, listed in Table 15.1.∗ We construct the line using items in the scale, with easier items at the base, and harder items at the top. Using data from a patient sample and specialized computer programs,43 the Rasch analysis determines the order of difficulty of the items, and locates them along this continuum, to show how they fit a unidimensional model of function. The analysis will also position patients along this line according to "how much" or "how little" cognitive function they have. The arrangement of items in Figure 15.6 illustrates these concepts based on a study by Heinemann et al44 who performed a Rasch analysis on the motor and cognitive portions of the FIM. The figure illustrates two facets of this scale: On the right, the items are ranked in relation to their difficulty; and on the left, the patients are positioned relative to their abilities.45 The more difficult items have a higher score, and patients who have these abilities (they are more independent) will also be placed near the top of the scale. Patients who are less functional are placed toward the bottom of the scale, as they are only able to complete the easier items. In Figure 15.6, patient 5 was only able to achieve the auditory comprehension and verbal expression items, while patients 1 and 2 were able to achieve all five items.
Example of a two-facet linear functional scale for mobility, showing the placement of scale items and patients according to a Rasch analysis. The increments represent item difficulty on the logit scale, with higher values representing greater difficulty.
TABLE 15.1 ITEMS FROM THE COGNITIVE SUBSCALE OF THE FUNCTIONAL INDEPENDENCE MEASURE
| Item | Logit |
| --- | --- |
| 1. Problem solving | 0.53 |
| 2. Memory | 0.30 |
| 3. Social interaction | 0.00 |
| 4. Auditory comprehension | −0.40 |
| 5. Verbal expression | −0.45 |
If the scale truly represents one functional construct, it should meet three measurement principles.14,45 First, the total score on the scale should reflect level of function implied by the items; second, the items will range in difficulty; and third, the rank order of difficulty will not change from person to person. The results of the computerized Rasch analysis will show where each individual respondent fits along the continuum; the level of difficulty achieved by each item on an interval scale; and goodness-of-fit of the model, showing how well each item matches the cumulative scale.14
Several criteria are used to judge the adequacy of a scale as part of a Rasch analysis:
1. Item difficulty refers to the position of items within the hierarchical scale. It is expressed as a logit†, or log-odds unit, with a central zero point, allowing items to be scaled as positive or negative. The items are ordered so that the degree of function becomes systematically greater as the items become harder; that is, patients who have greater functional ability will "pass" the more difficult items. Therefore, it becomes possible to determine how close or far apart items are in difficulty, not just their rank order of difficulty. Ideally, items are positioned evenly across the scale, without large gaps. As shown in Table 15.1 and Figure 15.6, the five items on the FIM range in difficulty from −0.45 to 0.53, with a reasonable spread of scores. The most difficult item is problem solving, and the easiest item is verbal expression. If gaps are identified, they suggest where items need to be added to better reflect the continuum.
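The logit metric has a concrete probabilistic meaning. In the dichotomous Rasch model, the probability that a person of ability θ passes an item of difficulty b (both in logits) is P = e^(θ−b) / (1 + e^(θ−b)). The sketch below applies this formula to the Table 15.1 difficulties; the person ability value is hypothetical:

```python
# Sketch of the dichotomous Rasch model: the probability that a person of
# ability theta "passes" an item of difficulty b (both in logits) is
#   P = exp(theta - b) / (1 + exp(theta - b)).
# Item difficulties are taken from Table 15.1; theta is hypothetical.
import math

items = {
    "problem solving": 0.53,
    "memory": 0.30,
    "social interaction": 0.00,
    "auditory comprehension": -0.40,
    "verbal expression": -0.45,
}

def p_pass(theta, b):
    return math.exp(theta - b) / (1 + math.exp(theta - b))

theta = 0.30  # a person of moderate cognitive ability (hypothetical)
for name, b in items.items():
    print(f"{name:22s} b={b:+.2f}  P(pass)={p_pass(theta, b):.2f}")
```

When ability equals item difficulty (here, the memory item), the probability of passing is exactly 0.50; easier items yield higher probabilities, harder items lower ones.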
2. Item fit is the extent to which the individual items conform to the unidimensional model. Person fit represents the extent to which individuals fit the model. The Rasch analysis develops a probability model that predicts what scores should be for each item and person. If we look at the continuum for the construct of cognitive function, for example, a good fit means that each item represents a level on the scale that will discriminate between those who require less assistance and those who require more assistance. Patients who are less functional will be placed toward the bottom, and those who are more functional will be placed toward the top; that is, the more functional individual will pass more of the items (and more of the difficult items), and the less functional individual will fail more of the items.
When expected relationships are not found, the responses are considered a misfit.‡ For example, a Rasch analysis for the entire FIM scale has shown that combining all 18 items resulted in a large proportion of misfitting items;44 that is, some of the more difficult items and more functional patients were not placed at the top (supporting the separation of motor and cognitive subscales).
Fit statistics are calculated for each item to reflect how well the items conform to the hierarchical model. These statistics are expressed as a mean square residual (MNSQ), based on the difference between the observed scores and the scores expected by the model. If the observed and expected values are the same, the MNSQ will equal 1.0.§ Higher MNSQ values indicate greater discrepancy from the model; that is, the item is not consistent in its level of difficulty across patients.** It would then be reasonable to consider revising the scale, either by eliminating the item or rewording it to remove ambiguity. If patients misfit the model, the researcher must examine their characteristics, potentially identifying subgroups in the population. In the FIM study by Heinemann and colleagues,44 several patient groups were evaluated, demonstrating that differently ordered cognitive scales were needed to represent groups with and without brain dysfunction. For instance, patients with right- and left-sided strokes did not demonstrate similar difficulty with verbal expression.
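A simplified version of this calculation can be sketched for one item. The code below computes an unweighted ("outfit") mean square: each person's residual (observed score minus model-expected probability) is squared and standardized by the model variance P(1 − P), and the MNSQ is the average of these values. The abilities and response patterns are hypothetical, and a full Rasch program would estimate them jointly rather than take them as given:

```python
# Simplified sketch of an item fit statistic (unweighted "outfit" MNSQ).
# Residual = observed 0/1 score minus the Rasch-expected probability,
# standardized by the model variance P(1-P); MNSQ is the mean of the
# squared standardized residuals. Values near 1.0 indicate good fit.
# Person abilities and responses here are hypothetical.
import math

def p_pass(theta, b):
    return math.exp(theta - b) / (1 + math.exp(theta - b))

def outfit_mnsq(observed, thetas, b):
    """observed: 0/1 scores on one item; thetas: matching person abilities."""
    z_sq = []
    for x, theta in zip(observed, thetas):
        p = p_pass(theta, b)
        z_sq.append((x - p) ** 2 / (p * (1 - p)))
    return sum(z_sq) / len(z_sq)

thetas = [-1.0, -0.5, 0.0, 0.5, 1.0]     # five persons, low to high ability
well_fitting = [0, 0, 1, 1, 1]           # passes line up with ability
misfitting = [1, 1, 0, 0, 0]             # high-ability persons fail, low pass

b = 0.0  # item difficulty in logits (hypothetical)
print(f"well-fitting MNSQ = {outfit_mnsq(well_fitting, thetas, b):.2f}")
print(f"misfitting MNSQ = {outfit_mnsq(misfitting, thetas, b):.2f}")
```

The reversed response pattern produces a mean square well above 1.0, which is the kind of discrepancy that would flag an item for revision.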
3. Item separation reflects the spread of items, and person separation represents the spread of individuals. Ideally, the analysis will show that items can be separated into at least three strata that represent low, medium and high difficulty,48 although a good scale may actually delineate many strata to clarify the construct. Statistically, this spread is related to measurement error or reliability; that is, the more reliable a scale, the more likely the item or person score represents the true score. Measurement error should be small so that segments of the scale are separated by distances greater than their measurement error alone. Separation statistics may be expressed as a reliability coefficient, or the ratio of the sample standard deviation to the standard error of the test.39 Conceptually, this is a ratio of the true spread of scores divided by the measurement error.
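The separation idea described above can be expressed numerically. The observed spread of person (or item) measures includes measurement error, so the "true" variance is estimated by subtracting the error variance; separation is the ratio of the true standard deviation to the root-mean-square error, and reliability is the ratio of true to observed variance. The particular values below are hypothetical:

```python
# Sketch of separation statistics (hypothetical values, in logits).
# True variance = observed variance minus error variance.
# Separation G = true SD / RMSE; reliability = true variance / observed
# variance, which is algebraically equal to G^2 / (1 + G^2).
import math

observed_sd = 1.2   # SD of person measures (hypothetical)
rmse = 0.4          # root-mean-square standard error (hypothetical)

true_var = observed_sd**2 - rmse**2
separation = math.sqrt(true_var) / rmse
reliability = true_var / observed_sd**2

print(f"separation G = {separation:.2f}")   # 2.83
print(f"reliability  = {reliability:.2f}")  # 0.89
```

A separation near 3 means the measures span roughly three statistically distinct strata, consistent with the low/medium/high criterion mentioned above.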
An understanding of measurement principles applied to questionnaires is essential if we want to use scores as part of our patient evaluations or to look at group performance over time. We must consider the potential for misinference when ordinal scales are used. Rasch Item Response Theory provides an important technique for testing our assumptions in clinical measurement. Several useful examples of Rasch analysis can be found in the literature.49,50,51,52,53