F. Detection of Differential Item/Test Functioning (DIF/DTF) Using IRT
Meaningful comparisons of differences across organizational, cultural, ethnic, and gender groups require that measurement equivalence holds. Drasgow (1984) states that measurement equivalence exists "when the relations between observed test scores and the latent attribute measured by (a) test are identical across subpopulations" (p. 134). Violations of measurement equivalence are evidenced by differential item/test functioning.
Differential item functioning (DIF) refers to a difference in the probability of endorsing an item for members of a reference group (e.g., US workers) and a focal group (e.g., Chinese workers) having the same standing on the latent attribute measured by a test. Differential test functioning (DTF) refers to a difference in the test characteristic curves obtained by summing the item response functions computed in the respective groups.
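The two definitions above can be illustrated with a minimal sketch. It assumes a two-parameter logistic (2PL) model without the 1.7 scaling constant, and the item parameters for both groups are hypothetical values chosen only for illustration, already expressed on a common metric. DIF is shown as the difference between the two groups' IRFs at the same value of theta, and DTF as the difference between the test characteristic curves obtained by summing those IRFs.

```python
import math

def irf_2pl(theta, a, b):
    """2PL item response function: probability of a positive response at theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def tcc(theta, a_params, b_params):
    """Test characteristic curve at theta: the sum of the item response functions."""
    return sum(irf_2pl(theta, a, b) for a, b in zip(a_params, b_params))

# Hypothetical item parameters (assumed to be on a common metric already).
ref_a, ref_b = [1.0, 1.2, 0.8], [0.0, -0.5, 0.5]   # reference group
foc_a, foc_b = [1.0, 1.2, 0.8], [0.5, -0.5, 0.5]   # focal group: item 1 is harder

# Grid of latent attribute values from -3 to 3.
thetas = [-3.0 + 0.1 * i for i in range(61)]

# DIF for item 1: difference in endorsement probability at equal theta.
dif_item1 = [irf_2pl(t, ref_a[0], ref_b[0]) - irf_2pl(t, foc_a[0], foc_b[0])
             for t in thetas]

# DTF: difference between the two test characteristic curves.
dtf = [tcc(t, ref_a, ref_b) - tcc(t, foc_a, foc_b) for t in thetas]

print(f"largest item-1 DIF: {max(dif_item1):.3f}")
print(f"largest DTF:        {max(dtf):.3f}")
```

Because only item 1 differs between the groups in this example, the DTF curve coincides with the item 1 DIF curve. With several DIF items, differences in opposite directions could partly cancel at the test level.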
Many parametric and nonparametric methods have been developed for detecting DIF/DTF. Among applied researchers, parametric methods are more appealing because item parameters may have simple psychological interpretations. However, parametric models are less flexible than their nonparametric counterparts, because they make stronger assumptions about scale dimensionality and the shapes of IRFs. Nonparametric methods, in contrast, typically assume only monotonicity. In this tutorial, the parametric Lord's chi-square and the nonparametric SIBTEST methods for detecting DIF are discussed.
In conducting an IRT-based DIF analysis, it is conventional to divide samples of respondents into a base (reference) group and one or more comparison (focal) groups. A focal group is commonly a subpopulation of interest to the researcher, and the reference group serves as the standard for comparison. For example, Chinese workers may constitute the focal group and US workers may serve as the reference group. The IRFs of the focal groups are compared to those of the reference group after placing the distributions on a common metric by iterative linking (Candell & Drasgow, 1988) or ability purification (Stocking & Lord, 1983). An item is said to display DIF when individuals from different subpopulations, who are equal on the measured underlying attribute, have different probabilities of making a correct or positive response. In other words, when no DIF is present, the IRFs, and therefore the item parameters, should be the same for all groups within the range of estimation error (Hulin et al., 1983). Impact, in contrast, is defined as a difference in item performance across subpopulations due to real differences in the underlying target attribute (Camilli, 1992). Therefore, impact and DIF are conceptually distinct.
To clarify the difference between DIF and impact, consider the following example. Mean differences across groups on cognitive ability tests are often interpreted as evidence of bias. However, mean differences, which affect selection and promotion, are correctly called impact. Impact concerns the fair use of tests. Tests that cause impact do not necessarily exhibit bias in the psychometric sense. Psychometric bias is associated with a lack of measurement equivalence, so it becomes manifest as DIF/DTF.
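A small numerical sketch may make the distinction concrete. It assumes a 2PL item whose parameters are identical in both groups (so there is no DIF), normally distributed latent attributes, and hypothetical group means of 0.0 and -0.5. The helper name `expected_proportion` and all numeric values are illustrative assumptions, not part of the original tutorial.

```python
import math

def irf_2pl(theta, a, b):
    """2PL item response function: probability of a positive response at theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def expected_proportion(a, b, mean, sd=1.0, n_points=201):
    """Average the IRF over a normal(mean, sd) attribute distribution.

    Uses a simple weighted grid spanning mean +/- 4 sd (an approximation).
    """
    lo = mean - 4.0 * sd
    step = 8.0 * sd / (n_points - 1)
    num = den = 0.0
    for i in range(n_points):
        theta = lo + i * step
        weight = math.exp(-0.5 * ((theta - mean) / sd) ** 2)
        num += weight * irf_2pl(theta, a, b)
        den += weight
    return num / den

# The same item parameters in both groups: the IRFs are identical (no DIF).
a, b = 1.0, 0.0

# Hypothetical group means on the latent attribute (a real group difference).
p_ref = expected_proportion(a, b, mean=0.0)
p_foc = expected_proportion(a, b, mean=-0.5)

print(f"reference group proportion positive: {p_ref:.3f}")
print(f"focal group proportion positive:     {p_foc:.3f}")
```

The two groups differ in their expected proportion of positive responses even though the item functions identically at every value of theta. Under these assumptions the difference reflects impact only.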
If the dimensionality of response data is established and the appropriate unidimensional or multidimensional IRT model is applied to estimate item parameters, then IRT-based methods of assessing DIF do not confound DIF with impact. Consequently, IRT-based methods of assessing DIF are superior to comparisons of classical p-values (the proportions of respondents in each group who answer an item correctly or positively) across reference and focal groups. Comparisons of p-values often lead to inflated Type I and Type II error rates by, respectively, identifying items that do not display real DIF and failing to detect items that do (Lim & Drasgow, 1990). The major advantage of IRT, however, is that IRFs and item parameters are subpopulation invariant, whereas p-values are subpopulation dependent. Thus, to identify DIF, one can compare the IRFs of the reference and focal groups using Lord's chi-square statistic for differences in the estimated item parameters, or one can use an alternative statistic for testing the differences in area between the IRFs.
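As a rough sketch of the parameter-comparison approach, Lord's chi-square for a single item is commonly written as the quadratic form d'(S_R + S_F)^(-1)d, where d is the vector of differences between the reference and focal parameter estimates and S_R and S_F are their sampling covariance matrices. The degrees of freedom equal the number of parameters compared. The example below assumes NumPy is available, that the estimates are already on a common metric, and that the function name `lord_chi_square` and all numeric inputs are hypothetical.

```python
import numpy as np

def lord_chi_square(est_ref, est_foc, cov_ref, cov_foc):
    """Lord's chi-square for one item.

    est_ref, est_foc: item parameter estimates (e.g., [a, b] for the 2PL).
    cov_ref, cov_foc: sampling covariance matrices of those estimates.
    Returns the statistic and its degrees of freedom.
    """
    d = np.asarray(est_ref, dtype=float) - np.asarray(est_foc, dtype=float)
    pooled = np.asarray(cov_ref, dtype=float) + np.asarray(cov_foc, dtype=float)
    # Solve pooled * x = d rather than inverting the matrix explicitly.
    stat = float(d @ np.linalg.solve(pooled, d))
    return stat, d.size

# Hypothetical 2PL estimates [a, b] and covariance matrices for one item.
est_ref = [1.0, 0.0]
est_foc = [1.0, 0.5]
cov_ref = [[0.01, 0.0], [0.0, 0.02]]
cov_foc = [[0.01, 0.0], [0.0, 0.02]]

chi2_stat, df = lord_chi_square(est_ref, est_foc, cov_ref, cov_foc)

# Chi-square critical value for df = 2 at alpha = .05.
CRITICAL_DF2_ALPHA05 = 5.991
flagged = chi2_stat > CRITICAL_DF2_ALPHA05

print(f"chi-square = {chi2_stat:.2f}, df = {df}, flagged for DIF: {flagged}")
```

In practice the covariance matrices would come from the estimation software, and the critical value would be chosen to match the number of parameters compared and the desired significance level.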
Procedure for DIF Detection Using Lord's Chi-Square (2PL and 3PL Models)
Detection of DIF Using the SIBTEST Procedure
Detection of DTF Using the DFIT Procedure