1. Test Development Using Classical Test Theory
CTT procedures mainly involve selecting items using item difficulty values and item-total correlations. During test development, items with the highest item-total correlations are retained to form a scale with high internal consistency, which minimizes the contribution of random error to test scores.
In addition, the distribution of total test scores is obtained and compared to the distribution desired by the test developer. Then, some item switching may be done to obtain the closest possible match between the desired and obtained total score distributions. Item difficulty is often used as a criterion for item replacement. Parallel forms are typically created to have identical test score distributions. Equivalence of test score means, variances, and errors is interpreted as evidence that tests are parallel.
Example of CTT-Based Test Development
Task
Your organization wants to use a 15-item verbal ability test for initial screening of job applicants. The test should be reliable, easy to score, and relatively difficult to pass (only 50 percent of applicants should answer 8 or more items correctly).
Step 1. Writing Items
- Before item writing begins, a test developer must have a good understanding of the construct(s) to be measured. Theory should guide item writing and, ultimately, item selection. With respect to measuring verbal ability, multiple-choice items that cover topics such as spelling, vocabulary, and grammar have been shown to perform well historically.
- As a rule of thumb, it is generally desirable to write at least twice as many items as needed for the final test (Nunnally et al., 1977). Obviously, an even larger number of items is needed if multiple forms must be developed.
- Once the items are written, they must be pretested using a sample that is similar to your applicant population. This sample, which we refer to later as a calibration sample, should be large enough to provide stable CTT item statistics.
Step 2. Initial Item Selection
The most important item statistic for CTT-based test construction is the item-total test score correlation. Items with high item-total correlations should be included in the test because they increase the scale's internal consistency (reliability), thus reducing the standard error of measurement. In addition, item difficulty (p-value, the proportion of examinees answering the item correctly) should be considered to create a test with the desired total score distribution. For example, if you want to screen out 90% of your applicant population, your test should contain items with low p-values.
To demonstrate the principles described above, consider the following example. Suppose your initial item pool consisted of 40 multiple-choice items, which were administered to 450 individuals. Item responses were then entered into an SPSS database for analysis (download the example.sav file here).
- To obtain item difficulties and item-total correlations for all 40 items, run a Reliability Analysis, as shown in the following SPSS syntax:

- Organize your output as shown in the table below. Identify items with low or negative item-total correlations (below .2) and with very extreme p-values (below .1 or above .9). Very low or negative item-total correlations may indicate scoring mistakes or problems with item content; items with very high p-values may be too easy and provide little information about test takers.

Based on the item analysis above, five items were identified as having low item-total correlations and/or extreme p-values. These "poor" items should be deleted.
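The item screening described above can also be sketched outside SPSS. The following is a minimal Python illustration, not the tutorial's own syntax: the function names `item_analysis` and `flag_poor_items` are hypothetical, responses are assumed to be scored 0/1, and the item-total correlation is assumed to be the corrected version (the item is correlated with the total of the remaining items). The cutoffs are the ones given in the text (.2, .1, and .9).

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable has no variance."""
    mx, my = mean(x), mean(y)
    sx, sy = pstdev(x), pstdev(y)
    if sx == 0 or sy == 0:
        return 0.0
    cov = mean([(a - mx) * (b - my) for a, b in zip(x, y)])
    return cov / (sx * sy)


def item_analysis(responses):
    """responses: one list of 0/1 item scores per examinee.

    Returns one dict per item with its p-value (proportion correct) and its
    corrected item-total correlation (item vs. total of the other items).
    """
    n_items = len(responses[0])
    totals = [sum(person) for person in responses]
    stats = []
    for j in range(n_items):
        item = [person[j] for person in responses]
        rest = [t - i for t, i in zip(totals, item)]  # total score without item j
        stats.append({"item": j + 1, "p": mean(item), "r_it": pearson(item, rest)})
    return stats


def flag_poor_items(stats, min_r=0.2, min_p=0.1, max_p=0.9):
    """Return item numbers with a low item-total correlation or an extreme p-value."""
    return [s["item"] for s in stats
            if s["r_it"] < min_r or s["p"] < min_p or s["p"] > max_p]
```

Calling `flag_poor_items(item_analysis(responses))` on the pretest data returns the item numbers to review for deletion. Whether a flagged item is actually dropped remains a judgment call that should also take item content into account.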
- Next, sort the table of item-total correlations in descending order, and select the top 15 items to retain. Then, for those 15 items, compute scale score statistics, such as the mean, standard deviation, and internal consistency reliability (alpha). You should also plot the total score frequency distribution. See the SPSS syntax below.
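An equivalent computation can be sketched in Python. This is an illustrative sketch under stated assumptions rather than the tutorial's SPSS syntax: the names `top_items`, `cronbach_alpha`, and `scale_summary` are hypothetical, item scores are assumed to be 0/1, and `item_stats` is assumed to be a list of dicts with an `"item"` number (1-based) and an `"r_it"` item-total correlation.

```python
from collections import Counter
from statistics import mean, stdev, variance


def top_items(item_stats, n_keep):
    """Return the item numbers of the n_keep items with the highest r_it."""
    ranked = sorted(item_stats, key=lambda s: s["r_it"], reverse=True)
    return [s["item"] for s in ranked[:n_keep]]


def cronbach_alpha(responses):
    """Coefficient alpha for a list of per-examinee item score lists."""
    k = len(responses[0])
    item_vars = [variance([person[j] for person in responses]) for j in range(k)]
    total_var = variance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def scale_summary(responses, items):
    """Mean, SD, alpha, and total score frequencies for the selected items.

    items: 1-based item numbers to include in the scale.
    """
    subset = [[person[i - 1] for i in items] for person in responses]
    totals = [sum(person) for person in subset]
    return {
        "mean": mean(totals),
        "sd": stdev(totals),  # sample SD (n - 1 denominator)
        "alpha": cronbach_alpha(subset),
        "freq": dict(sorted(Counter(totals).items())),  # score -> count
    }
```

For the task in this example, `scale_summary(responses, top_items(item_stats, 15))` would give the scale statistics for the 15 retained items, and the `freq` entry can be plotted as the total score frequency distribution.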

The results of the above analyses should be as follows:
Scale reliability = .85
Scale score frequency distribution:

Step 3. Obtaining the Desired Scale Score Distribution
Because your organization wanted a mean scale score of 8.00, some item switching is necessary to increase the expected mean total score from 7.25 to 8.00. In this case, you want to replace low p-value (harder) items with high p-value (easier) items. To minimize the impact of item switching on the scale's reliability, try replacing items with low item-total correlations before eliminating more highly discriminating items. We suggest replacing items 9, 30, and 2 with items 33, 27, and 3. Note that in practice, some content balancing may also be needed.
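The effect of a swap on the mean can be checked before rerunning the full analysis, because for items scored 0/1 the expected mean total score equals the sum of the item p-values. The sketch below illustrates this; the function names and the p-values in the usage example are hypothetical and are not taken from the example data set.

```python
def expected_mean(p_values, items):
    """Expected mean total score: the sum of p-values of the selected items.

    p_values: dict mapping item number -> proportion correct.
    items: item numbers currently on the scale.
    """
    return sum(p_values[i] for i in items)


def swap_items(items, remove, add):
    """Return a new item list with `remove` taken out and `add` appended."""
    kept = [i for i in items if i not in remove]
    return kept + list(add)


# Hypothetical p-values for a four-item pool (illustration only).
p_values = {1: 0.40, 2: 0.30, 3: 0.80, 4: 0.60}
scale = [1, 2]
revised = swap_items(scale, remove=[2], add=[3])
gain = expected_mean(p_values, revised) - expected_mean(p_values, scale)
print(f"Mean changes by {gain:+.2f} points after the swap")
```

Each swap raises the expected mean by the difference between the incoming and outgoing p-values, so the total gain needed (here, 8.00 minus 7.25) indicates how large the p-value differences across the swapped items must be. Reliability still has to be recomputed on the revised scale.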
Below is the scale score distribution for the revised 15-item test (alpha = .84).

Assuming you intended to measure only one factor, verbal ability, the revised scale should be factor analyzed to ensure that only one dimension underlies item responses. Afterwards, the test form is ready for operational use.
Constructing Parallel Forms
To construct two parallel forms, a similar procedure can be used. The test developer again sorts all items in descending order by item-total correlation, then identifies pairs of items with similar item difficulty values and item-total correlations. Finally, one item from each pair is assigned to each form, and the expected test score distributions of the two forms are compared.
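One way to make the pairing step concrete is a simple greedy match, sketched below. This is only one possible pairing rule, not a procedure prescribed by the tutorial: the function names are hypothetical, each item is assumed to be a dict with `"item"`, `"p"`, and `"r_it"` keys, and similarity is assumed to be the sum of absolute differences in p-value and item-total correlation.

```python
def pair_items(stats):
    """Greedily pair items that are close in p-value and item-total correlation.

    Items are visited in descending order of r_it; each is paired with the
    remaining item that is nearest in (p, r_it).
    """
    remaining = sorted(stats, key=lambda s: s["r_it"], reverse=True)
    pairs = []
    while len(remaining) >= 2:
        first = remaining.pop(0)
        partner = min(
            remaining,
            key=lambda s: abs(s["p"] - first["p"]) + abs(s["r_it"] - first["r_it"]),
        )
        remaining.remove(partner)
        pairs.append((first, partner))
    return pairs


def build_forms(pairs):
    """Split each pair across two forms, alternating which form gets the
    higher-ranked member so neither form is systematically favored."""
    form_a, form_b = [], []
    for k, (x, y) in enumerate(pairs):
        if k % 2 == 0:
            form_a.append(x["item"])
            form_b.append(y["item"])
        else:
            form_a.append(y["item"])
            form_b.append(x["item"])
    return form_a, form_b
```

After the forms are built, their means, variances, and reliabilities should still be compared, since a greedy match does not guarantee equivalent score distributions.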
Limitations of the CTT Approach
- CTT statistics are subpopulation dependent. They may vary across groups of examinees that differ in mean scores on the attribute being measured. Thus, test developers must be careful when selecting samples for item calibration. If calibration samples are different from operational samples, the psychometric properties of the test may change dramatically.
- In CTT, the measurement precision of a test (standard error of measurement) is implicitly averaged across all ability levels. Thus, the measurement precision at particular score levels is unknown.
IRT methods can circumvent these limitations.
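To illustrate the second limitation, the CTT standard error of measurement is a single number computed from the scale's standard deviation and reliability, SEM = SD × sqrt(1 − reliability). The sketch below uses the revised scale's alpha of .84 from the example; the standard deviation of 3.0 is a hypothetical value chosen for illustration, and the function names are not from the tutorial.

```python
from math import sqrt


def standard_error_of_measurement(sd, reliability):
    """CTT standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * sqrt(1 - reliability)


def score_band(observed, sem, z=1.96):
    """Approximate 95% band around an observed score, using the same SEM
    at every score level (which is exactly the CTT limitation)."""
    return observed - z * sem, observed + z * sem


sem = standard_error_of_measurement(sd=3.0, reliability=0.84)  # sd is hypothetical
for score in (2, 8, 14):
    low, high = score_band(score, sem)
    print(f"score {score}: band {low:.2f} to {high:.2f}")
```

The band has the same width for a low, middle, and high score, even though a 15-item test is unlikely to measure equally well at all three levels.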