Construct validity reflects the ability of an instrument to measure an abstract concept, or construct. The process of construct validation presents a considerable challenge to the researcher because constructs are not "real"; that is, they are not directly observable, and exist only as concepts that are constructed to represent an abstract trait. Because constructs are typically multidimensional, it is not easy to determine if an instrument is actually measuring the variable of interest.
For example, everyone agrees that "health" is an important clinical construct, but because of its complexity, clinicians are generally unable to agree on how it should be defined or measured. Therefore, the definition of a construct like "health status" can be determined only by the instrument used to measure it. A test that focuses on physical activity alone will suggest a very different definition than a more global test that also incorporates cognitive, social and psychological elements. Similarly, a scale that looks at activities of daily living (ADL) according to categories of self-care, transfers and dressing will provide a different perception of function than one that also evaluates locomotion, housekeeping and recreation skills. An instrument that evaluates ADL according to an individual's perception of the difficulty performing given tasks will produce a measurement that is interpreted differently than one which focuses on the time needed to perform, the assistance required, or another that assesses the level of pain associated with specific tasks. Each of these provides a different theoretical foundation for defining the construct of function.
Part of construct validity, therefore, is based on content validity; that is, one must be able to define the content universe that represents that construct to develop a test to measure it. Beyond content, however, constructs must also be defined according to their underlying theoretical context. Thus, the "meaning" of a construct is based on assumptions about how an individual with that trait would behave under given conditions and how the various dimensions that form the construct interrelate. One can generate hypotheses regarding the overt behaviors of individuals with high and low scores on the test. An instrument is said to be a valid measure of a construct when its measurements support these theoretical assumptions.
For example, pain is a difficult construct to define, as it represents a subjective phenomenon rather than a performance behavior. However, we may also question whether "pain" is a stimulus, a perception, a response or a behavior. Looking at the construct of pain, then, requires that we conceptualize what is actually being evaluated. For instance, Sim and Waterfield16 discuss the experience of pain as a subjective outcome that varies from individual to individual. They describe the pain experience as having sensory, affective, evaluative, cognitive and behavioral dimensions, with sensory, emotional and physiological outcomes (Figure 6.2). Further analysis suggests the need to look at memory, cultural factors, social networks, sex and age, personality and other elements that contribute to the individual perception of pain. The differentiation between chronic and acute pain is more than just the time over which the pain occurs. Then there are characteristics of pain, such as intensity, quality, location, and duration.
Figure 6.2 Theoretical model of the multidimensional nature of the experience of pain, illustrating how the construct of pain may be conceptualized. Several dimensions contribute to the individual nature of the experience, as well as how the outcomes of the pain experience are perceived. (Adapted from Sim J, Waterfield J. Validity, reliability and responsiveness in the assessment of pain. Physiother Theory Pract 1997; 13:23–37.)
How one chooses to "measure" pain, therefore, will affect how the outcome will be interpreted. For instance, a study of patients in cancer trials looked at several outcome measures to evaluate pain treatment.17 A visual analog scale (VAS) using the anchors of "no pain" to "pain as bad as it could be" focused solely on intensity. A Pain Relief Scale assessed complete relief to worsening of pain. A Patient Satisfaction Scale rated how satisfied patients were with their treatment, and pain management scales were based on medication use. The authors showed that the adequacy of treatment for pain varied from 16% to 91%, depending on the type of outcome measure used. The construct is defined, therefore, by the instrument used to measure it. Different elements may be important, depending on the clinical or research situation.
Methods of Construct Validation
Construct validation provides evidence to support or refute the theoretical framework behind the construct. Construct validation is an ongoing process, wherein we are continually learning more about the construct and testing its predictions. This evidence can be gathered by a variety of methods. Some of the more commonly used procedures include the known groups method, convergence and discrimination, factor analysis, hypothesis testing and criterion validation.
Known Groups Method

The most general type of evidence in support of construct validity is provided when a test can discriminate between individuals who are known to have the trait and those who do not. Using the known groups method, a criterion is chosen that can identify the presence or absence of a particular characteristic, and the theoretical context behind the construct is used to predict how different groups are expected to behave. Therefore, the validity of a particular test is supported if the test's results document these known differences. For example, Megens and associates18 examined the construct validity of the Harris Infant Neuromotor Test (HINT), a screening tool to identify neuromotor or cognitive/behavioral problems in infants who are healthy or at risk within the first year of life. They studied 412 low-risk infants and 54 infants who were identified as high risk based on preterm birth weight or exposure to drugs or alcohol in utero. The researchers found that the HINT distinguished between the two groups of infants in their mean scores, supporting the construct validity of the tool.
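The logic of a known groups comparison can be sketched with a few lines of code. The scores below are invented for illustration, not data from the HINT study; the point is only that a valid tool should produce clearly different mean scores in the two groups, which a simple two-sample (Welch) t statistic can summarize.

```python
from statistics import mean, stdev

# Hypothetical screening scores (higher = more mature neuromotor behavior).
# These are illustrative values only, not data from the HINT study.
low_risk = [62, 65, 68, 70, 64, 67, 71, 66]
high_risk = [48, 52, 55, 50, 57, 53, 49, 54]

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    return (mean(a) - mean(b)) / (va / len(a) + vb / len(b)) ** 0.5

t = welch_t(low_risk, high_risk)
# A large positive t supports the prediction that the known groups differ
# on the trait the instrument claims to measure.
```

In a real validation study the group sizes would be far larger and the comparison would be reported with a p value or confidence interval, but the directional prediction being tested is exactly this one.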
Convergence and Discrimination
Campbell and Fiske19 have suggested that the construct validity of a test can be evaluated in terms of how its measures relate to other tests of the same and different constructs. In other words, it is important to determine what a test does measure as well as what it does not measure. This determination is based on the concepts of convergence and discrimination.
Convergent validity indicates that two measures believed to reflect the same underlying phenomenon will yield similar results or will correlate highly. For instance, if two health status scales are valid methods for measuring quality of life, they should produce correlated scores. Convergence also implies that the theoretical context behind the construct will be supported when the test is administered to different groups in different places at different times. Convergence is not a sufficient criterion for construct validity, however. It is also necessary to show that a construct can be differentiated from other constructs.
Discriminant validity indicates that different results, or low correlations, are expected from measures that are believed to assess different characteristics. Therefore, the results of an intelligence test should not be expected to correlate with results of a test of gross motor skill. To illustrate these concepts, the Sickness Impact Profile (SIP) has been compared to several other measures of function in an effort to establish its construct validity. The SIP is a health status measure that indicates the changes in a person's behavior due to sickness, scored on the total scale as well as on separate physical and psychosocial subscales.20 Convergent validity has been supported by a high correlation between the physical dimensions of the SIP scale and the SF-36 health survey questionnaire.21 Discriminant validity is illustrated by a lower correlation between the physical SIP scale and the Carroll Rating Scale for Depression.22
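The correlational logic behind convergence and discrimination can be sketched with hypothetical scores. The numbers and labels below are invented (they are not actual SIP, SF-36 or Carroll data); the sketch only shows that two measures of the same construct should yield a high Pearson correlation while a measure of a different construct should yield a low one.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for six patients on three measures.
physical_a = [12, 30, 45, 50, 66, 80]   # physical function, instrument A
physical_b = [15, 28, 48, 55, 60, 78]   # physical function, instrument B (same construct)
depression = [40, 10, 35, 20, 15, 30]   # a different construct

r_convergent = pearson_r(physical_a, physical_b)    # expected: high
r_discriminant = pearson_r(physical_a, depression)  # expected: low
```

A high `r_convergent` alone is not sufficient evidence; it is the contrast with a low `r_discriminant` that supports the claim that the instrument measures this construct and not another.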
Campbell and Fiske19 also suggest that validity of a test should be evaluated in terms of both the characteristic being measured and the method used to measure it. They call this a trait-method unit; that is, a trait cannot be assessed independently of some method. Therefore, the validity of the assessment must take both elements into account. On the basis of this concept, a validation process was proposed that incorporates an analysis of two or more traits measured by two or more methods. The intercorrelations of variables within and between methods are arranged in a matrix called a multitrait-multimethod matrix (see Figure 6.3). By arranging scores in this way, we can verify that tests measuring the same trait produce high correlations, demonstrating convergent validity, and those that measure different traits produce low correlations, demonstrating discriminant validity.
Figure 6.3 A multitrait-multimethod matrix, showing the relationship between reliability and validity, and the concepts of convergent and discriminant validity. The physical scale of the Sickness Impact Profile (SIP) shows high correlations (convergent validity) with the physical scale of the SF-36 Health Status Questionnaire, but low correlations (discriminant validity) with different measures of depression.
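A miniature version of such a matrix can be built from hypothetical data: two traits (physical function and depression), each measured by two methods (self-report and interview). All scores below are fabricated for illustration. The entries on the "validity diagonal" (same trait, different method) should be high, and the heterotrait entries should be low.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical scores for six patients: two traits x two methods.
measures = {
    ("physical", "self-report"):   [10, 20, 30, 40, 50, 60],
    ("physical", "interview"):     [12, 18, 33, 41, 47, 62],
    ("depression", "self-report"): [30, 5, 25, 10, 35, 15],
    ("depression", "interview"):   [28, 8, 22, 12, 33, 18],
}

# The full matrix of intercorrelations among all trait-method units.
mtmm = {(a, b): pearson_r(measures[a], measures[b])
        for a in measures for b in measures}

# Validity diagonal: same trait, different method (convergent evidence).
conv_phys = mtmm[("physical", "self-report"), ("physical", "interview")]
conv_dep = mtmm[("depression", "self-report"), ("depression", "interview")]
# Heterotrait, heteromethod entry (discriminant evidence).
disc = mtmm[("physical", "self-report"), ("depression", "interview")]
```

Arranging the correlations this way makes the Campbell and Fiske criteria directly checkable: convergent entries should exceed every correlation that shares neither trait nor method.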
Factor Analysis

Another common approach to construct validation is the use of a statistical procedure called factor analysis. The concept of factor analysis is based on the idea that a construct contains one or more underlying dimensions, or different theoretical components. For example, Wessel and associates23 used the Western Ontario Rotator Cuff (WORC) Index to study the quality of life of individuals with rotator cuff disorders. The index is composed of 21 items that were originally designed to reflect five dimensions: (1) pain and physical symptoms, (2) sports and recreation, (3) work, (4) lifestyle, and (5) emotions. Using a factor analysis, the researchers were able to recombine these variables as three factors: Emotions and Symptoms, Disability-Strength Activities, and Disability-Daily Activities. These separate groupings of correlated variables represent subsets of test items or behaviors that are related to each other, but are not related to items in other factors; that is, each factor represents a unique combination of items that reflects a different theoretical component of the construct. The statistical basis for this process is quite complex and beyond our current discussion, but we will devote considerable attention to it in Chapter 29.
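The grouping idea can be illustrated without the actual statistics: items that correlate highly with one another, but not with the remaining items, form a cluster that behaves like a factor. Real factor analysis extracts factors from the eigenstructure of the correlation matrix; the greedy grouping and the six invented items below are only a crude stand-in that shows the kind of result the procedure yields.

```python
def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Six hypothetical questionnaire items scored by six respondents.
# Items e1-e3 track one latent dimension, items a1-a3 another.
items = {
    "e1": [1.0, 4.0, 2.0, 5.0, 3.0, 6.0],
    "e2": [1.3, 3.8, 2.2, 5.1, 2.9, 6.2],
    "e3": [0.8, 4.2, 1.9, 4.8, 3.2, 5.9],
    "a1": [2.0, 6.0, 4.0, 1.0, 5.0, 3.0],
    "a2": [2.2, 5.8, 4.1, 1.3, 4.9, 3.1],
    "a3": [1.9, 6.1, 3.8, 0.9, 5.2, 2.8],
}

def group_items(scores, threshold=0.7):
    """Greedily cluster items whose pairwise correlations all exceed threshold."""
    groups = []
    for name in scores:
        for g in groups:
            if all(pearson_r(scores[name], scores[member]) > threshold
                   for member in g):
                g.append(name)
                break
        else:
            groups.append([name])
    return groups

factors = group_items(items)  # two clusters of mutually correlated items
```

Each resulting cluster corresponds to a subset of items that are related to each other but not to items in other clusters, which is precisely the interpretation given to factors in the WORC example.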
Hypothesis Testing

Because constructs have a theoretical basis, an instrument's validity can also be assessed by using it to test specific hypotheses that support the theory. For instance, the construct validity of the Functional Independence Measure (FIM) was assessed by Dodds et al.,24 based on the assumption that the instrument should be able to distinguish functional differences between people with varied clinical conditions. The construct of function that forms the foundation for the FIM relates to the burden of care, or the degree of assistance needed for a patient to fulfill activities in ADL, mobility and cognitive domains. Using this theoretical premise, the authors proposed three hypotheses: (1) that FIM scores should decrease with increasing age and comorbidities, (2) that the score would be related to a patient's discharge destination according to the level of care provided in that setting (such as home or skilled nursing facility), and (3) that there would be a relationship between FIM scores and degree of severity for patients with amputations, spinal cord injury and stroke. Using data collected on more than 11,000 patients, their results supported some hypotheses better than others, demonstrating strong relationships between FIM scores and discharge destination and between FIM scores and the severity of spinal cord injury and stroke. This type of analysis provides distinct evidence of construct validity for the instrument, but it leaves unanswered many theoretical questions regarding its use over the broad range of rehabilitation situations. Therefore, it also points to the need for continued testing to determine how the FIM score relates to various diagnoses and clinical findings.
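Hypothesis-driven validation ultimately reduces to checking directional predictions against data. The sketch below tests two FIM-like predictions on invented numbers (not data from the Dodds et al. study): scores should fall with increasing age, and patients discharged home should score higher than those discharged to a skilled nursing facility.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Hypothetical data for six patients (illustrative values only).
ages = [25, 35, 45, 55, 65, 75]
scores = [120, 115, 108, 100, 92, 85]   # functional scores for the same patients

# Hypothesis 1: scores decrease with increasing age -> negative correlation.
r_age = pearson_r(ages, scores)

# Hypothesis 2: higher scores (lower burden of care) should go with
# discharge home rather than to a skilled nursing facility.
home = [112, 105, 118, 109, 115]
skilled_nursing = [78, 85, 72, 80, 76]
hypothesis_supported = mean(home) > mean(skilled_nursing)
```

Each confirmed prediction adds a piece of evidence for construct validity; a failed prediction signals that either the instrument or the theory behind the construct needs revision.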
Criterion Validation

Construct validity can also be supported by comparison of test results with those of relevant criterion tests. This approach is not used as often as other approaches, because it is typically difficult to find a suitable criterion. In most cases, when a new instrument is developed to measure a construct, it is because no other acceptable instruments are available. Therefore, no standard can be applied to test it; however, it is often possible to find criterion tests that can be applied to subparts of the overall instrument. For example, Podsiadlo and Richardson10 used the Berg Balance Scale, gait speed and an ADL scale as criterion values to establish the construct validity of the timed "Up and Go" test. These individual criterion tests were assumed to represent components of the overall construct of functional mobility that the "Up and Go" test was intended to measure. Through a series of correlations the authors were able to demonstrate that each criterion test was related to the outcome variable, and although these were not perfect correlations, taken together they supported the overall concept that was being evaluated.