IRT Modeling Lab

E. Assessment of Model-data Fit

Conceptual Issues of Fit

Introduction

IRT offers several advantages over classical test theory methods for test construction, identification of bias, and scoring. However, unless the IRT model used for parameter estimation adequately fits the data, the benefits of IRT methods may not be realized. Some applications may require closer correspondence between theoretical and empirical observations than others.

There is no a priori reason to expect a model to describe data adequately. More general models that have less restrictive assumptions will generally fit better; however, they may require larger samples to estimate their larger numbers of parameters. Therefore, it may be impractical for developers to employ these models in small-scale research and assessment programs. Ultimately, the choice of a model must be based on theoretical and empirical grounds.

For cognitive ability data, a multitude of studies has shown that the 3PL model fits very well. However, numerous questions must still be answered regarding the fit of IRT models to personality and, more generally, noncognitive data. The model-data fit issue can be addressed in two ways. First, the data must conform to model assumptions about dimensionality. Second, predictions, based on the estimated model, should be examined in cross-validation samples. Drasgow et al. (1995) advocated a combination of complementary graphical and statistical methods for examining fit.

Graphical Methods

Fit plots are one of the most widely used methods for examining model-data fit. Ideally, one would compare item/option response functions, estimated from a calibration sample, to empirical proportions of positive responses obtained from a cross-validation sample. However, in many applications, a cross-validation sample is not used. For example, the DOS version of BILOG, discussed in this tutorial, produces fit plots, but it computes empirical and theoretical proportions using the same (calibration) sample. This may yield an unrealistically good representation of the fit of the model to the data because it capitalizes on chance. As in regression or structural equation modeling, fit should be examined using a cross-validation sample.

The simplest way to construct a fit plot is to divide the theta continuum into, say, 25 strata. Then theta is estimated for each examinee, and the total number of examinees in each stratum is counted. An empirical proportion can then be computed as the number of examinees in the stratum who selected the positive option, divided by the total number of examinees in the stratum. The problem with this straightforward approach is that the theta estimate for an individual is hardly ever equal to the true theta, due to estimation error. Thus, even with a very large sample and perfectly estimated response functions, a fit plot may differ systematically from the true response function. This problem is especially pronounced for short tests, where theta estimates have larger error.
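The stratified approach described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the item parameters, the sample size, and the use of "true theta plus normal noise" as a stand-in for estimated thetas are all hypothetical, and the function names are not taken from BILOG or MODFIT.

```python
import math
import random


def irf_3pl(theta, a, b, c):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def stratified_proportions(theta_hats, responses, n_strata=25, lo=-3.0, hi=3.0):
    """Return (stratum midpoint, examinee count, proportion positive)
    for every non-empty stratum of the theta continuum."""
    width = (hi - lo) / n_strata
    counts = [0] * n_strata
    positives = [0] * n_strata
    for theta_hat, u in zip(theta_hats, responses):
        k = int(math.floor((theta_hat - lo) / width))
        k = min(max(k, 0), n_strata - 1)  # estimates outside [lo, hi] go to the end strata
        counts[k] += 1
        positives[k] += u
    return [
        (lo + (k + 0.5) * width, counts[k], positives[k] / counts[k])
        for k in range(n_strata)
        if counts[k] > 0
    ]


# Hypothetical demonstration data (assumed values, not from the tutorial).
rng = random.Random(0)
a, b, c = 1.2, 0.0, 0.2
true_thetas = [rng.gauss(0.0, 1.0) for _ in range(5000)]
responses = [1 if rng.random() < irf_3pl(t, a, b, c) else 0 for t in true_thetas]
# Assumption: estimation error is mimicked by adding normal noise to the true theta.
theta_hats = [t + rng.gauss(0.0, 0.5) for t in true_thetas]

points = stratified_proportions(theta_hats, responses)
for midpoint, n, proportion in points:
    print(f"{midpoint:6.2f}  n={n:4d}  empirical={proportion:.3f}  "
          f"IRF={irf_3pl(midpoint, a, b, c):.3f}")
```

Comparing the two printed columns gives a rough sense of how error in the theta estimates can pull the empirical proportions away from the response function that generated the data.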

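Levine and Williams (1991, 1993) found an elegant solution to this problem. They proposed computing empirical points at values of the true theta, using posterior densities rather than theta estimates. Essentially, an empirical point for a target item i at trait level t, $\hat{P}_i(t)$, is computed by taking the ratio of two sums of posterior densities, as shown:

$$\hat{P}_i(t) = \frac{\sum_{A \in N^{+}} f(t \mid u^{*}_{A})}{\sum_{A \in N} f(t \mid u^{*}_{A})}$$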


The posterior density in the numerator is computed using respondents who answered the target item positively (N+), whereas the posterior density in the denominator is computed using the total sample (N) of respondents. A is just an index for summing examinees, and u*A is a particular examinee's response pattern.
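A minimal sketch of this ratio of posterior densities is given here, assuming a standard normal prior, a discrete grid of theta values, and posteriors computed from each examinee's full response pattern. The item parameters and simulated data are hypothetical, and the code is not a reproduction of the Levine and Williams implementation.

```python
import math
import random


def irf(theta, a, b, c):
    """3PL item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def posterior_on_grid(pattern, items, grid):
    """Normalized posterior weights of theta on the grid for one response pattern."""
    weights = []
    for t in grid:
        value = math.exp(-0.5 * t * t)  # standard normal prior, up to a constant
        for u, (a, b, c) in zip(pattern, items):
            p = irf(t, a, b, c)
            value *= p if u == 1 else 1.0 - p
        weights.append(value)
    total = sum(weights)
    return [w / total for w in weights]


def empirical_points(patterns, items, target, grid):
    """Ratio of summed posteriors: positive responders over the total sample."""
    numerator = [0.0] * len(grid)
    denominator = [0.0] * len(grid)
    for pattern in patterns:
        posterior = posterior_on_grid(pattern, items, grid)
        for q, w in enumerate(posterior):
            denominator[q] += w
            if pattern[target] == 1:
                numerator[q] += w
    return [n / d for n, d in zip(numerator, denominator)]


# Hypothetical 10-item test and simulated sample (assumed values).
rng = random.Random(1)
items = [(1.0 + 0.1 * j, -1.0 + 0.2 * j, 0.2) for j in range(10)]
thetas = [rng.gauss(0.0, 1.0) for _ in range(500)]
patterns = [
    [1 if rng.random() < irf(t, a, b, c) else 0 for (a, b, c) in items]
    for t in thetas
]
grid = [-3.0 + 0.25 * q for q in range(25)]

emp = empirical_points(patterns, items, target=0, grid=grid)
for t, e in zip(grid, emp):
    print(f"{t:5.2f}  EMP={e:.3f}  IRF={irf(t, *items[0]):.3f}")
```

Because the posteriors are computed under the same model whose fit is being examined, the empirical points produced by such a sketch are model-specific, which is the circularity discussed in the next paragraph.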

It is important to note that when using fit plots to examine the correspondence between theoretical and empirical proportions, circularity is involved. In each method, item parameters, theta estimates, and posterior densities are obtained under the assumptions of a particular model. This affects not only the theoretical, but also the empirical proportions. Thus, the true shapes of the empirical response functions are never known; they are model-specific. The shapes of the empirical response functions will change for each model examined, so they must always be recomputed, even though the response data are identical. What is important is the relative correspondence of theoretical and empirical proportions across the models examined.

The figure below shows an example fit plot for the 3PL model computed using the MODFIT computer program. The blue line, labeled IRF, is a theoretical item response function computed from a calibration sample. The red line, referred to as EMP, is an empirical item response function computed from a cross-validation sample. The vertical lines in the figure show the approximate 95% confidence intervals for the empirical points. It can be seen that there is a close correspondence between the IRF and EMP curves, which suggests that the 3PL model fits the data well.

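[Figure: example MODFIT fit plot for the 3PL model, showing the theoretical IRF curve and the empirical EMP curve with approximate 95% confidence intervals]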


Statistical Methods

Statistical tests of goodness of fit (e.g., chi-square fit statistics) are probably the most widely used approach in applied research. Unfortunately, they are often viewed as inconclusive evidence of adequate fit because of their sensitivity to sample size and their insensitivity to certain forms of misfit.

Numerous statistical methods for examining model-data fit have been developed (e.g., Glas, 1988; Orlando & Thissen, 2000; Van den Wollenberg, 1982; Yen, 1981; see also Hambleton & Swaminathan, 1991). The ordinary chi-square for an individual item i is given by

$$\chi^2_i = \sum_{k=1}^{s} \frac{\left[O_i(k) - E_i(k)\right]^2}{E_i(k)}$$


where s is the number of keyed options, Oi(k) is the observed frequency of endorsing option k, and Ei(k) is the expected frequency of option k under a particular IRT model. The expected frequency of respondents selecting an option is computed as

$$E_i(k) = N \int P(u_i = k \mid \theta = t)\, f(t)\, dt$$

where f(t) is the theta density, usually taken to be the standard normal, because item/option response functions are scaled in reference to that distribution. Van den Wollenberg (1982) showed that chi-square statistics for single items are in many instances insensitive to unidimensionality violations. Moreover, chi-square statistics for single items are insensitive to the type of misfit shown below.
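The single-item statistic and its expected frequencies can be sketched as follows for a dichotomous item. The integral over the standard normal theta density is approximated on an evenly spaced grid, which is an assumption of this sketch (MODFIT may use a different quadrature), and the item parameters and observed counts are hypothetical.

```python
import math


def irf_3pl(theta, a, b, c):
    """Three-parameter logistic item response function."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def normal_quadrature(n_points=81, lo=-4.0, hi=4.0):
    """Evenly spaced nodes with normalized standard normal weights."""
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + q * step for q in range(n_points)]
    density = [math.exp(-0.5 * t * t) for t in nodes]
    total = sum(density)
    return nodes, [d / total for d in density]


def expected_frequencies(n, a, b, c, nodes, weights):
    """Expected counts [negative, positive]: N times the IRF integrated over f(t)."""
    p_positive = sum(w * irf_3pl(t, a, b, c) for t, w in zip(nodes, weights))
    return [n * (1.0 - p_positive), n * p_positive]


def item_chi_square(observed, expected):
    """Ordinary chi-square summed over response options."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical item and observed counts (assumed values).
nodes, weights = normal_quadrature()
expected = expected_frequencies(1000, a=1.0, b=0.0, c=0.0, nodes=nodes, weights=weights)
observed = [480, 520]
chi_sq = item_chi_square(observed, expected)
print(expected, chi_sq)  # expected counts are about 500 each, so chi-square is about 1.6
```

Only the marginal counts enter this statistic, so any response function that integrates to the same proportion produces the same expected frequencies.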

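[Figure: fit plot in which the empirical IRF lies above the estimated IRF at low trait levels and below it at high trait levels]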


In this example, the empirical IRF consistently lies above the estimated IRF at low trait levels and below it at high trait levels. Although it is visually clear that the data do not fit the IRT model, the chi-square for an individual item will be close to zero. The statistic is marginal: the estimated IRF is integrated against a normal theta density, and many different integrands can integrate to the same constant (i.e., the observed marginal number endorsing the item). To avoid these problems, the chi-square statistic should be computed for pairs and triples of items. Pairs and triples of items with similar misfits will have large chi-square statistics. The expected frequency for a pair of items in the (k, k')th cell of the two-way table for items i and i' is computed as follows,

$$E_{i,i'}(k, k') = N \int P(u_i = k \mid \theta = t)\, P(u_{i'} = k' \mid \theta = t)\, f(t)\, dt$$

and the observed frequencies are counted in each cell (see Drasgow et al. [1995] for details). Some cells are combined so that the expected frequencies exceed 5. The usual chi-square for a two-way table is then calculated. A similar procedure is carried out with item triples. Algebraically, if model-data misfit occurs for an item pair or triple at the same trait level, the chi-square will increase even for the kinds of misfit shown above.
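The contrast between single-item and pairwise statistics can be illustrated with a small idealized sketch. It assumes two dichotomous items whose fitted 2PL curves are steeper than the curves that actually generated the data, and it uses population-level (non-integer) "observed" counts so that no sampling noise is involved. All parameter values are hypothetical, and the cell-combining step for small expected frequencies is omitted.

```python
import math


def logistic_irf(theta, a, b):
    """Two-parameter logistic response function."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def normal_quadrature(n_points=81, lo=-4.0, hi=4.0):
    """Evenly spaced nodes with normalized standard normal weights."""
    step = (hi - lo) / (n_points - 1)
    nodes = [lo + q * step for q in range(n_points)]
    density = [math.exp(-0.5 * t * t) for t in nodes]
    total = sum(density)
    return nodes, [d / total for d in density]


def pair_table(n, item_i, item_j, nodes, weights):
    """2x2 table of counts; cell [k][k2] is option k on item i and k2 on item j."""
    table = [[0.0, 0.0], [0.0, 0.0]]
    for t, w in zip(nodes, weights):
        p_i = logistic_irf(t, *item_i)
        p_j = logistic_irf(t, *item_j)
        probs_i = (1.0 - p_i, p_i)
        probs_j = (1.0 - p_j, p_j)
        for k in (0, 1):
            for k2 in (0, 1):
                table[k][k2] += n * w * probs_i[k] * probs_j[k2]
    return table


def chi_square(observed, expected):
    """Ordinary chi-square over matching cells of two flat lists."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


nodes, weights = normal_quadrature()
fitted = (1.5, 0.0)     # assumed estimated (a, b): too steep
generating = (0.5, 0.0)  # assumed true (a, b): flatter curve

expected = pair_table(3000, fitted, fitted, nodes, weights)
observed = pair_table(3000, generating, generating, nodes, weights)

# Single-item statistic: only the row margins of the table are compared.
single_chi = chi_square([sum(r) for r in observed], [sum(r) for r in expected])
# Pairwise statistic: all four cells of the two-way table are compared.
pair_chi = chi_square(
    [cell for row in observed for cell in row],
    [cell for row in expected for cell in row],
)
print(single_chi, pair_chi)
```

In this setup both curves integrate to the same marginal proportion, so the single-item value is essentially zero, while the pairwise value is large because the two items misfit in the same direction at the same trait levels.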

To facilitate comparisons of chi-squares based on different sample sizes, Drasgow et al. (1995) advocated reporting chi-squares adjusted to a common sample size (say, 3,000) and divided by their degrees of freedom. Based on numerous studies, they found that good model-data fit is associated with adjusted chi-square to degrees of freedom ratios of less than 3 for item singles, doubles, and triples. Large ratio statistics for doubles and triples may indicate violations of local independence or unidimensionality.
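The adjustment itself is not written out in the text above. One form commonly attributed to this approach rescales the part of the chi-square that exceeds its degrees of freedom to the target sample size; the sketch below assumes that form, so it should be checked against Drasgow et al. (1995) or the MODFIT documentation before use. The function name and example values are hypothetical.

```python
def adjusted_chi_square_ratio(chi_square, df, n, target_n=3000):
    """Adjusted chi-square to degrees-of-freedom ratio.

    Assumed form of the adjustment (not stated in the tutorial text):
        adjusted = target_n * (chi_square - df) / n + df
    """
    adjusted = target_n * (chi_square - df) / n + df
    return adjusted / df


# Hypothetical values: chi-square of 30 on 10 df from a sample of 1,000.
ratio_large = adjusted_chi_square_ratio(30.0, 10, 1000)  # (3000 * 20 / 1000 + 10) / 10 = 7.0
# A chi-square equal to its degrees of freedom is unchanged by the adjustment.
ratio_null = adjusted_chi_square_ratio(10.0, 10, 1000)   # 10 / 10 = 1.0
print(ratio_large, ratio_null)
```

Under the rule of thumb given above, the first ratio would signal misfit and the second would not.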

Computing Chi-Square Statistics and Fit-Plots using the MODFIT Program