IRT offers several advantages over classical test theory methods for test construction,
identification of bias, and scoring. However, unless the IRT model used for parameter
estimation adequately fits the data, the benefits of IRT methods may not be realized. Some
applications may require closer correspondence between theoretical and empirical
observations than others.
There is no a priori reason to expect that a given model will describe the data adequately.
More general models that have less restrictive assumptions will generally fit better; however,
they may require larger samples to estimate their increasing numbers of parameters.
Therefore, it may be impractical for developers to employ these models in small scale
research and assessment programs. Ultimately, the choice of a model must be based
on theoretical and empirical grounds.
For cognitive ability data, a multitude of studies has shown that the 3PL model fits
very well. However, numerous questions must still be answered regarding the fit of
IRT models to personality and, more generally, noncognitive data.
The model-data fit issue can be addressed in two ways. First, the data must conform
to model assumptions about dimensionality. Second, predictions based on the
estimated model should be examined in cross-validation samples. Drasgow et al. (1995)
advocated a combination of complementary graphical and statistical methods for
examining fit.
Graphical Methods
Fit plots are one of the most widely used methods for examining model-data fit.
Ideally, one would compare item/option response functions, estimated from a
calibration sample, to empirical proportions of positive responses obtained
from a cross-validation sample. However, in many applications, a cross-validation
sample is not used. For example, the DOS version of BILOG, discussed in this
tutorial, produces fit plots, but it computes empirical and theoretical proportions
using the same (calibration) sample. This may yield an unrealistically good
representation of the fit of the model to the data because it capitalizes on chance.
As in regression or structural equation modeling, fit should be examined using a
cross-validation sample.
The simplest way to construct a fit plot is to divide the theta continuum into, say,
25 strata. Then theta is estimated for each examinee, and the total number of
examinees in each stratum is counted. An empirical proportion can then be computed
as the number of examinees in the stratum who selected the positive option divided by
the total number of examinees in the stratum. The problem with this straightforward approach
is that the theta estimate for an individual is hardly ever equal to the true theta
due to estimation error. Thus, even with a very large sample and perfectly estimated
response functions, a fit plot may differ systematically from the true response
function. This problem is especially pronounced for short tests where theta
estimates have larger error.
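The stratified approach just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `stratum_proportions`, the theta range of -3 to 3, and the clamping of out-of-range estimates into the outermost strata are choices made here, not part of any program discussed in this tutorial.

```python
def stratum_proportions(theta_hats, responses, lo=-3.0, hi=3.0, n_strata=25):
    """Return (stratum midpoint, empirical proportion, count) per non-empty stratum.

    theta_hats: estimated thetas, one per examinee.
    responses:  0/1 responses to the target item, in the same order.
    The range [lo, hi] and the clamping rule are assumptions of this sketch.
    """
    width = (hi - lo) / n_strata
    counts = [0] * n_strata
    positives = [0] * n_strata
    for theta, u in zip(theta_hats, responses):
        idx = int((theta - lo) / width)
        # Clamp estimates outside [lo, hi) into the first or last stratum.
        idx = min(max(idx, 0), n_strata - 1)
        counts[idx] += 1
        positives[idx] += u
    points = []
    for j in range(n_strata):
        if counts[j] > 0:
            midpoint = lo + (j + 0.5) * width
            points.append((midpoint, positives[j] / counts[j], counts[j]))
    return points
```

Because the strata are formed from theta estimates, every point returned by this sketch inherits the estimation error described above.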
Levine and Williams (1991, 1993) found an elegant solution to this problem.
They proposed computing empirical points using true thetas, rather than theta
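estimates. Essentially, an empirical point for a target item i at trait level t,
P_i(t), is computed by taking the ratio of two posterior densities, as shown:

$$P_i(t) = \frac{\sum_{A=1}^{N_+} P(t \mid u^*_A)}{\sum_{A=1}^{N} P(t \mid u^*_A)}$$
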
The posterior density in the numerator is computed using respondents who answered
the target item positively (N+), whereas the posterior density in the
denominator is computed using the total sample (N) of respondents. A is just an
index for summing examinees, and u*A is a particular examinee's response
pattern.
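A minimal sketch of this posterior-ratio idea follows. It rests on several assumptions that the text does not specify: a 3PL model with scaling constant D = 1.7, a standard normal prior evaluated on a discrete theta grid, and posteriors computed from each examinee's full response pattern. The function names are hypothetical and this is not the MODFIT implementation.

```python
import math

def p3pl(t, a, b, c, D=1.7):
    """3PL probability of a positive response at trait level t (D = 1.7 assumed)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (t - b)))

def posterior(pattern, items, grid):
    """Discrete posterior over the theta grid for one 0/1 response pattern."""
    weights = []
    for t in grid:
        value = math.exp(-0.5 * t * t)  # unnormalized standard normal prior
        for u, (a, b, c) in zip(pattern, items):
            p = p3pl(t, a, b, c)
            value *= p if u == 1 else 1.0 - p
        weights.append(value)
    total = sum(weights)
    return [w / total for w in weights]

def empirical_points(patterns, items, target, grid):
    """Ratio of summed posteriors: positive responders over the total sample."""
    num = [0.0] * len(grid)
    den = [0.0] * len(grid)
    for pattern in patterns:
        post = posterior(pattern, items, grid)
        for q, w in enumerate(post):
            den[q] += w
            if pattern[target] == 1:
                num[q] += w
    return [n / d if d > 0 else float("nan") for n, d in zip(num, den)]
```

Here `items` is a list of (a, b, c) parameter tuples from the calibration sample and `patterns` holds the 0/1 response patterns from the sample used for the empirical curve.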
It is important to note that when using fit plots to examine the correspondence
between theoretical and empirical proportions, circularity is involved.
In each method, item parameters, theta estimates, and posterior densities are
obtained under the assumptions of a particular model. This affects not only the
theoretical, but also the empirical proportions. Thus, the true shapes of the
empirical response functions are never known; they are model-specific.
The shapes of the empirical response functions will change for each model examined,
so they must always be recomputed, even though the response data are identical.
What is important is the relative correspondence of theoretical and empirical
proportions across the models examined.
The figure below shows an example fit plot for the 3PL model computed using the
MODFIT computer program. The blue line, labeled IRF, is a theoretical item
response function computed from a calibration sample. The red line, referred to as
EMP, is an empirical item response function computed from a cross-validation sample.
The vertical lines in each figure describe the approximate 95% confidence intervals
for the empirical points. It can be seen that there is a close correspondence between
the IRF and EMP curves, which suggests that the 3PL model fits the data well.
Statistical Methods
Statistical tests of goodness of fit (i.e., chi-square fit statistics) are probably the most
widely used fit indices in applied research. Unfortunately, they are often viewed as inconclusive
evidence of adequate fit because of their sensitivity to sample size and their
insensitivity to certain forms of misfit.
Numerous statistical methods for examining model-data fit have been developed
(e.g., Glas, 1988; Orlando & Thissen, 2000; Van der Wollenberg, 1982; Yen, 1981;
see also Hambleton & Swaminathan, 1991). The ordinary chi-square for an individual item i
is given by

$$\chi^2_i = \sum_{k=1}^{s} \frac{[O_i(k) - E_i(k)]^2}{E_i(k)}$$

where s is the number of keyed options, Oi(k) is the observed frequency of endorsing
option k, and Ei(k) is the expected frequency of option k under a particular IRT model.
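The expected frequency of respondents selecting option k is computed as

$$E_i(k) = N \int P(u_i = k \mid t)\, f(t)\, dt$$

where N is the number of respondents, P(u_i = k | t) is the response function for option k
of item i, and f(t) is the theta density, usually taken to be the standard normal,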
because item/option response functions are scaled in reference to that distribution.
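The single-item computation can be sketched as follows for a dichotomous item (s = 2). The integral over theta is approximated here by a weighted sum on an evenly spaced grid with normalized standard normal weights. That quadrature rule and the function names are assumptions of this sketch, not a description of any particular program.

```python
import math

def normal_weights(grid):
    """Normalized standard normal weights on a theta grid (simple quadrature)."""
    dens = [math.exp(-0.5 * t * t) for t in grid]
    total = sum(dens)
    return [d / total for d in dens]

def expected_single(n, irf, grid):
    """Expected counts [negative, positive] for one dichotomous item.

    irf is any function returning the probability of a positive response at t.
    """
    weights = normal_weights(grid)
    p_pos = sum(irf(t) * w for t, w in zip(grid, weights))
    return [n * (1.0 - p_pos), n * p_pos]

def chi_square(observed, expected):
    """Ordinary Pearson chi-square summed over response options."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

For example, with expected counts of 50 and 50, observed counts of 60 and 40 give a chi-square of 4.0.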
Van der Wollenberg (1982) showed that chi-square statistics for single items are in
many instances insensitive to unidimensionality violations. Moreover, chi-square
statistics for single items are insensitive to the type of misfit shown below.
In this example, the empirical IRF consistently lies above the estimated IRF at low
trait levels and below it at high trait levels. Although it is visually clear that
the data do not fit the IRT model, the chi-square for an individual item will be close to
zero, because it is a marginal statistic, and the estimated IRF is integrated against
a normal theta density; consequently, many different integrands can integrate to the
same constant (i.e., the observed marginal number endorsing the item). To avoid
these problems, the chi-square statistic should be computed for pairs and triples of items.
Pairs and triples of items with similar misfits will have large chi-square statistics.
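A small numerical illustration may help here. The two curves below are hypothetical: a fitted logistic IRF and a flatter "empirical" curve that lies above it at low theta and below it at high theta. Under a standard normal theta density (approximated on a symmetric grid, an assumption of this sketch), both curves give the same marginal proportion, so a single-item chi-square cannot separate them. The probability that two such items are both answered positively does differ.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Symmetric theta grid with normalized standard normal weights.
grid = [q / 10.0 for q in range(-40, 41)]
dens = [math.exp(-0.5 * t * t) for t in grid]
total = sum(dens)
weights = [d / total for d in dens]

def estimated_irf(t):
    """Hypothetical fitted IRF."""
    return logistic(1.7 * t)

def empirical_irf(t):
    """Hypothetical flatter curve: higher at low theta, lower at high theta."""
    return logistic(0.6 * t)

def marginal(irf):
    """Marginal proportion positive: the quantity a single-item chi-square sees."""
    return sum(irf(t) * w for t, w in zip(grid, weights))

def joint_positive(irf):
    """Proportion positive on both of two items that share the same curve."""
    return sum(irf(t) ** 2 * w for t, w in zip(grid, weights))
```

In this sketch, `marginal(estimated_irf)` and `marginal(empirical_irf)` agree to within rounding error, while `joint_positive` is clearly larger for the steeper curve.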
The expected frequency for a pair of items in the (k, k')th cell of the two-way
table for items i and i' is computed as follows,

$$E_{i,i'}(k, k') = N \int P(u_i = k \mid t)\, P(u_{i'} = k' \mid t)\, f(t)\, dt$$

and the observed frequencies are counted in each cell (see Drasgow et al. [1995]
for details). Some cells are combined so that the expected frequencies exceed 5.
The usual chi-square for a two-way table is then calculated. A similar procedure is carried
out with item triples. Algebraically, if model-data misfit occurs for an item pair
or triple at the same trait level, the chi-square will increase even for the kinds of misfit
shown above.
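The pair computation can be sketched along the same lines for two dichotomous items. As before, the grid quadrature and the function names are assumptions of this sketch. The step that combines cells with small expected frequencies is omitted for brevity.

```python
import math

def normal_weights(grid):
    """Normalized standard normal weights on a theta grid (simple quadrature)."""
    dens = [math.exp(-0.5 * t * t) for t in grid]
    total = sum(dens)
    return [d / total for d in dens]

def expected_pair(n, irf_i, irf_j, grid):
    """Expected counts for the four cells (k, k') of a dichotomous item pair."""
    weights = normal_weights(grid)
    cells = {}
    for k in (0, 1):
        for kp in (0, 1):
            total = 0.0
            for t, w in zip(grid, weights):
                p_i = irf_i(t) if k == 1 else 1.0 - irf_i(t)
                p_j = irf_j(t) if kp == 1 else 1.0 - irf_j(t)
                total += p_i * p_j * w  # local independence at fixed theta
            cells[(k, kp)] = n * total
    return cells

def observed_pair(patterns, i, j):
    """Count examinees in each cell of the two-way table for items i and j."""
    cells = {(k, kp): 0 for k in (0, 1) for kp in (0, 1)}
    for pattern in patterns:
        cells[(pattern[i], pattern[j])] += 1
    return cells

def pair_chi_square(observed, expected):
    """Usual chi-square for the two-way table (no cell combining in this sketch)."""
    return sum((observed[c] - expected[c]) ** 2 / expected[c] for c in expected)
```

A triple of items would extend `expected_pair` with a third response function inside the same weighted sum.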
To facilitate comparisons of chi-squares based on different sample sizes,
Drasgow et al. (1995) advocated reporting chi-squares, adjusted for sample size
(say, 3000) and divided by their degrees of freedom. Based on numerous studies,
they found that good model-data fit is associated with adjusted chi-square
to degrees of freedom ratios of less than 3 for item singles, doubles and triples.
Large ratio statistics for doubles and triples may indicate violations of local
independence or unidimensionality.
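The text above does not give the adjustment formula. One commonly cited form, used here as an assumption, rescales the excess of the chi-square over its degrees of freedom linearly in sample size and then adds the degrees of freedom back. The function name and the default target of 3000 are illustrative.

```python
def adjusted_ratio(chi_sq, df, n, target_n=3000):
    """Adjusted chi-square to degrees of freedom ratio.

    Assumed form (not stated in the text above):
        adjusted = target_n * (chi_sq - df) / n + df
    The result is divided by df so it can be compared with the cutoff of 3.
    """
    adjusted = target_n * (chi_sq - df) / n + df
    return adjusted / df
```

Under this assumed form, a chi-square of 10 with 1 degree of freedom from a sample of 6000 gives an adjusted ratio of 5.5, which would exceed the cutoff of 3 mentioned above.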