IRT Modeling Lab

2. Test Construction Using IRT


Step 1. Item Writing

As with CTT approaches to test construction, write about two to three times as many items as desired in the final form(s). Administer items to a large heterogeneous calibration sample. More complex IRT models require larger samples for parameter estimation.

To create a 15-item test of verbal ability for screening out, say, 50% of the applicant pool, the 3PL model might be used. This model has been shown to fit cognitive ability data well. Generally, accurate 3PL parameter estimates can be obtained using about 1000 examinees.

Step 2. Initial Item Selection

Before estimating item parameters, do a classical test theory item analysis to eliminate items that have near-zero or negative item-total correlations. These items may cause convergence problems during calibration. Of the 40 example items shown earlier, four were eliminated before parameter estimation (Items 16, 18, 28, 37).
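The screening step above can be sketched in a few lines of Python. This is only an illustration: it assumes dichotomous (0/1) responses stored as one list per examinee, uses the corrected item-total correlation (item versus the total of the remaining items), and flags items below a cutoff of 0.10. The function names and the cutoff are assumptions chosen for the example, not values taken from the lab.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def corrected_item_total(responses):
    """Correlate each item with the total score computed from the other items."""
    totals = [sum(row) for row in responses]
    n_items = len(responses[0])
    result = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [t - row[j] for t, row in zip(totals, responses)]
        result.append(pearson(item, rest))
    return result

def flag_items(responses, threshold=0.10):
    """Return 1-based numbers of items whose correlation falls below the
    (assumed) threshold; these are candidates for removal before calibration."""
    r = corrected_item_total(responses)
    return [j + 1 for j, value in enumerate(r) if value < threshold]
```

Calling flag_items on the scored response matrix returns the item numbers to review. In practice the decision to drop an item would also consider its content and its p-value, not the correlation alone.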

Now use an IRT calibration program, such as BILOG, to estimate item discrimination (a), difficulty (b), and pseudo-guessing (c) parameters. Then check model-data fit using the MODFIT program. Assuming good fit was found, you can begin initial item selection for your 15-item form.
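For reference, the role of the three parameters can be seen in a minimal sketch of the 3PL item response function. The scaling constant D = 1.7 is an assumption made here for illustration; check the metric your calibration program reports before reusing it. The function name is hypothetical.

```python
import math

D = 1.7  # assumed scaling constant (normal-ogive metric); may differ in your software

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    a: discrimination, b: difficulty, c: pseudo-guessing (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
```

At theta = b the probability is c + (1 - c) / 2, which is halfway between the guessing floor and 1.0, so p_3pl(0.0, 1.2, 0.0, 0.2) evaluates to 0.6.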

When constructing a test form using IRT, item information functions are often used to select items. Because item information varies across trait levels, it is possible to select items that provide high measurement precision at specific points on the trait continuum. Items that have larger discrimination parameters provide more information for scoring examinees and, thus, higher precision. For the 1PL and 2PL models, maximum item information occurs exactly at the difficulty parameter; for the 3PL model, the maximum occurs slightly above b, and the shift grows as c increases. To construct a test that screens out approximately 50% of the examinees, you should select items that have large a-parameters and b-parameters near 0.0.
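A short sketch of the 3PL item information function may help make this concrete. It uses the standard form I(theta) = D^2 a^2 (Q/P) ((P - c)/(1 - c))^2 and the closed-form location of the maximum, theta_max = b + ln((1 + sqrt(1 + 8c)) / 2) / (D a). The constant D = 1.7 and the function names are assumptions for illustration.

```python
import math

D = 1.7  # assumed scaling constant

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info_3pl(theta, a, b, c):
    """Item information for the 3PL model at trait level theta."""
    p = p_3pl(theta, a, b, c)
    q = 1.0 - p
    return (D * a) ** 2 * (q / p) * ((p - c) / (1.0 - c)) ** 2

def theta_max_info(a, b, c):
    """Trait level where 3PL item information peaks (equals b when c = 0)."""
    return b + math.log((1.0 + math.sqrt(1.0 + 8.0 * c)) / 2.0) / (D * a)
```

With c = 0 the peak sits at b and the maximum information is D^2 a^2 / 4. With c > 0 the peak moves a little to the right of b and the maximum is lower, which is why items with large a and small c are the most useful near a cutoff.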

Below are the 3PL parameters for the 36 items calibrated using BILOG.

Item Parameters


Below is an example of three item information functions (IIFs). It can be seen that Items 8 and 24 are good for this type of test because they provide high information at theta values near the cutoff score of 0.0. Note that although Item 8 has a b-parameter of zero, it provides very little information because its a-parameter is only .48. Such an item contributes little to measurement precision and discriminates poorly among examinees.
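The comparison of items at the cutoff can also be done numerically. The sketch below ranks a small pool by information at theta = 0.0 and keeps the best items. The pool is entirely hypothetical (it is not the lab's 36-item table), D = 1.7 is assumed, and the helper names are invented for the example.

```python
import math

D = 1.7  # assumed scaling constant

def info_3pl(theta, a, b, c):
    """3PL item information at trait level theta."""
    p = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def select_items(params, cutoff=0.0, n_select=2):
    """Return the ids of the n_select items with the most information at the cutoff."""
    ranked = sorted(
        params,
        key=lambda item_id: info_3pl(cutoff, *params[item_id]),
        reverse=True,
    )
    return ranked[:n_select]

# Hypothetical (a, b, c) values chosen only to illustrate the ranking.
pool = {
    "A": (0.48, 0.0, 0.20),   # b at the cutoff but low discrimination
    "B": (1.40, 0.1, 0.15),   # high a, b near the cutoff
    "C": (1.10, -0.2, 0.20),  # moderate a, b near the cutoff
    "D": (1.30, 1.8, 0.15),   # high a but far too difficult
}

print(select_items(pool))
```

Item A shows the pattern described above: its difficulty is right at the cutoff, yet its low a-parameter leaves it well behind items B and C in information at theta = 0.0.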

Graph


To divide examinees into two groups, above and below theta = 0, we suggest selecting items with high a-parameters and b-parameters near zero, as shown. Note that the item response functions tend to cluster near zero, and the resulting test characteristic curve (TCC, shown in aqua) rises fairly sharply between theta = 0 and theta = 1, indicating good discrimination among examinees in the target range.
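The TCC is the sum of the item response functions, so it gives the expected number-correct score at each theta. A minimal sketch, assuming D = 1.7 and items supplied as (a, b, c) tuples (both are illustrative assumptions):

```python
import math

D = 1.7  # assumed scaling constant

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def tcc(theta, items):
    """Test characteristic curve: expected number-correct score at theta."""
    return sum(p_3pl(theta, a, b, c) for a, b, c in items)
```

Evaluating tcc over a grid of theta values (for example, -3.0 to 3.0 in steps of 0.1) reproduces the kind of curve shown in the graph. The steeper the rise near the cutoff, the better the form separates examinees just above and just below it.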

Graph


Finally, you can examine the test information function (TIF), which is the sum of the item information functions. The standard error function, plotted on the same graph, is 1 over the square root of test information: SE(theta) = 1 / sqrt(TIF(theta)). Note that the standard error is below 0.5 between theta values of -0.5 and 1, but it rises sharply toward the end points, where the items contribute little information.

Graph


Parallel Forms
