Computerized adaptive testing
Computerized adaptive testing (CAT) is a form of computer-based test that adapts to the examinee's ability level; for this reason, it has also been called tailored testing. In a CAT, the next item or set of items administered depends on the correctness of the test taker's responses to the items administered so far.
Description
CAT successively selects questions with the goal of maximizing the precision of the exam based on what is known about the examinee from previous questions. From the examinee's perspective, the difficulty of the exam seems to tailor itself to their level of ability: if an examinee performs well on an item of intermediate difficulty, they are then presented with a more difficult question; if they perform poorly, they are presented with a simpler one. Compared to static tests, which nearly everyone has experienced, with a fixed set of items administered to all examinees, computer-adaptive tests require fewer items to arrive at equally accurate scores.

The basic computer-adaptive testing method is an iterative algorithm with the following steps:
- The pool of available items is searched for the optimal item, based on the current estimate of the examinee's ability
- The chosen item is presented to the examinee, who then answers it correctly or incorrectly
- The ability estimate is updated, based on all prior answers
- Steps 1–3 are repeated until a termination criterion is met
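The iterative loop above can be sketched in a few lines of code. The following is a minimal illustration only, assuming a Rasch (one-parameter logistic) model, a simple damped Newton-Raphson maximum-likelihood ability update, and termination after a fixed number of items; the item bank, function names, and parameter values are all illustrative choices, not part of any particular operational CAT.

```python
import math
import random

def p_correct(theta, b):
    """Rasch model: probability that an examinee of ability theta
    answers an item of difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def estimate_theta(responses, max_iter=50):
    """Maximum-likelihood ability estimate via damped Newton-Raphson.
    responses: list of (difficulty b, score u in {0, 1})."""
    theta = 0.0
    for _ in range(max_iter):
        grad = sum(u - p_correct(theta, b) for b, u in responses)
        info = sum(p_correct(theta, b) * (1 - p_correct(theta, b))
                   for b, u in responses)
        if info < 1e-9:
            break
        step = grad / info
        theta += max(-1.0, min(1.0, step))  # damp large steps
        if abs(step) < 1e-4:
            break
    return max(-4.0, min(4.0, theta))  # clamp (MLE diverges on all-correct/all-wrong)

def run_cat(item_bank, answer_fn, test_length=15):
    """The iterative CAT loop: select the most informative item,
    administer it, and re-estimate ability from all answers so far."""
    theta = 0.0
    administered = []
    available = list(item_bank)
    for _ in range(test_length):
        # Step 1: search the pool for the item with maximum information
        # at the current ability estimate (Rasch information = P(1-P))
        best = max(available,
                   key=lambda b: p_correct(theta, b) * (1 - p_correct(theta, b)))
        available.remove(best)
        # Step 2: present the item; the examinee answers it
        u = answer_fn(best)
        administered.append((best, u))
        # Step 3: update the ability estimate from all prior answers
        theta = estimate_theta(administered)
    # Step 4: the fixed test length serves as the termination criterion here
    return theta, administered

# Simulated examinee with a true ability of 1.0
random.seed(1)
bank = [x / 10.0 for x in range(-30, 31)]  # difficulties from -3.0 to 3.0
theta_hat, history = run_cat(
    bank, lambda b: int(random.random() < p_correct(1.0, b)))
```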
As a result of adaptive administration, different examinees receive quite different tests. Although examinees are typically administered different tests, their ability scores are comparable to one another. The psychometric technology that allows equitable scores to be computed across different sets of items is item response theory (IRT). IRT is also the preferred methodology for selecting optimal items, which are typically chosen on the basis of information rather than difficulty per se.
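As a toy numeric illustration of information-based selection (a sketch assuming a two-parameter logistic model; the item parameters are invented), the most informative item at a given ability is not necessarily the one whose difficulty is closest to that ability, because item discrimination also matters:

```python
import math

def p2pl(theta, a, b):
    """Two-parameter logistic model: a is discrimination, b is difficulty."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def info_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P)."""
    p = p2pl(theta, a, b)
    return a * a * p * (1.0 - p)

theta = 0.0
items = [
    # (label, discrimination a, difficulty b)
    ("on-target difficulty, low discrimination", 0.5, 0.0),
    ("slightly off-target, high discrimination", 1.8, 0.4),
]
# Selecting by information picks the high-discrimination item even though
# its difficulty does not exactly match the current ability estimate.
best = max(items, key=lambda it: info_2pl(theta, it[1], it[2]))
```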
A related methodology called multistage testing (MST) or CAST is used in the Uniform Certified Public Accountant Examination. MST avoids or reduces some of the disadvantages of CAT, as described below.
Examples
CAT has existed since the 1970s, and many assessments now use it:
- Graduate Management Admission Test
- MAP test from NWEA
- SAT
- National Council Licensure Examination
- Armed Services Vocational Aptitude Battery
Advantages
Adaptive tests can provide uniformly precise scores for most test-takers. In contrast, standard fixed tests almost always provide the best precision for test-takers of medium ability and increasingly poorer precision for test-takers with more extreme scores.

An adaptive test can typically be shortened by 50% and still maintain a higher level of precision than a fixed version. This translates into time savings for the test-taker, who does not waste time attempting items that are too hard or trivially easy. The testing organization also benefits from the time savings: the cost of examinee seat time is substantially reduced. However, because developing a CAT involves much more expense than a standard fixed-form test, a large examinee population is necessary for a CAT testing program to be financially viable.
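The precision claim can be illustrated numerically under a Rasch model (a hypothetical sketch; the item counts and difficulties are invented for illustration). The standard error of measurement is 1/√(test information), so ten items matched to a high-ability examinee can measure that examinee more precisely than twenty medium-difficulty items:

```python
import math

def rasch_info(theta, b):
    """Rasch item information at ability theta: P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def sem(theta, difficulties):
    """Standard error of measurement: 1 / sqrt(total test information)."""
    return 1.0 / math.sqrt(sum(rasch_info(theta, b) for b in difficulties))

fixed_form = [0.0] * 20  # 20 medium-difficulty items, as in a static test
tailored   = [2.0] * 10  # 10 items matched to a high-ability examinee

theta = 2.0  # an examinee well above average ability
# The tailored half-length test yields a smaller standard error at theta = 2
# than the full-length fixed form does.
```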
Large target populations are common in scientific and clinical research, where CAT may be used to detect the early onset of disabilities or diseases. The use of CAT in these fields has grown greatly in the past decade; once not accepted in medical facilities and laboratories, CAT is now encouraged in the scope of diagnostics.
Like any computer-based test, adaptive tests may show results immediately after testing.
Depending on the item selection algorithm, adaptive testing may reduce the exposure of some items, because examinees typically receive different sets of items rather than the whole population being administered a single set. However, it may increase the exposure of others.
Disadvantages
The first issue encountered in CAT is the calibration of the item pool. In order to model the characteristics of the items, all items must be pre-administered to a sizable sample and then analyzed. To achieve this, new items must be mixed into the operational items of an exam, a practice called "pilot testing", "pre-testing", or "seeding". This presents logistical, ethical, and security issues. For example, it is impossible to field an operational adaptive test with brand-new, unseen items; all items must be pretested with a large enough sample to obtain stable item statistics. This sample may need to be as large as 1,000 examinees. Each program must decide what percentage of the test can reasonably be composed of unscored pilot items.

Although adaptive tests include exposure control algorithms to prevent overuse of a few items, the exposure conditioned upon ability is often not controlled and can easily become close to 1. That is, it is common for some items to become very common on tests for people of the same ability. This is a serious security concern, because groups sharing items may well have a similar functional ability level. In fact, a completely randomized exam is the most secure.
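One common mitigation, sketched here as the "randomesque" method (the item bank and parameter values below are illustrative, not from any real program), is to select at random from among the several most informative items rather than always the single best, trading a little measurement efficiency for lower predictability of item exposure:

```python
import math
import random

def item_info(theta, b):
    """Rasch item information at ability theta: P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def randomesque_select(theta, available, k=5, rng=random):
    """Pick one of the k most informative items at random, instead of
    always the single best, so that examinees of similar ability do not
    all receive an identical item sequence."""
    ranked = sorted(available, key=lambda b: item_info(theta, b), reverse=True)
    return rng.choice(ranked[:min(k, len(ranked))])

bank = [-2.0, -1.0, -0.4, 0.0, 0.3, 0.9, 1.7]
selected = randomesque_select(0.0, bank, k=3, rng=random.Random(0))
```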
Review of past items is generally disallowed, as adaptive tests tend to administer easier items after a person answers incorrectly. Supposedly, an astute test-taker could use such clues to detect incorrect answers and correct them. Or, test-takers could be coached to deliberately pick a greater number of wrong answers leading to an increasingly easier test. After tricking the adaptive test into building a maximally easy exam, they could then review the items and answer them correctly—possibly achieving a very high score. Test-takers frequently complain about the inability to review.
Because of its sophistication, the development of a CAT has a number of prerequisites. The large sample sizes required by IRT calibration must be available. Items must be scorable in real time so that the next item can be selected instantaneously. Psychometricians experienced with IRT calibration and CAT simulation research are needed to provide validity documentation. Finally, a software system capable of true IRT-based CAT must be available.
In a CAT with a time limit, examinees cannot accurately budget the time to spend on each item or determine whether they are on pace to complete a timed section. They may thus be penalized for spending too much time on a difficult question presented early in a section and then failing to complete enough questions to accurately gauge their proficiency in areas left untested when time expires. While untimed CATs are excellent tools for formative assessments that guide subsequent instruction, timed CATs are unsuitable for high-stakes summative assessments used to measure aptitude for jobs and educational programs.
Components
There are five technical components in building a CAT (this list does not include practical issues, such as item pretesting or live field release):
- Calibrated item pool
- Starting point or entry level
- Item selection algorithm
- Scoring procedure
- Termination criterion
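As one concrete example of the last component (a sketch under a Rasch model; the standard-error target and item cap below are arbitrary illustrative values), variable-length CATs often terminate once the standard error of the ability estimate, 1/√(test information), falls below a target, with a maximum test length as a safety cap:

```python
import math

def rasch_info(theta, b):
    """Rasch item information at ability theta: P * (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return p * (1.0 - p)

def should_stop(theta, administered_bs, se_target=0.30, max_items=30):
    """Variable-length stopping rule: halt once the standard error of the
    ability estimate falls below se_target, or the item cap is reached."""
    total_info = sum(rasch_info(theta, b) for b in administered_bs)
    if total_info <= 0:
        return len(administered_bs) >= max_items
    return (1.0 / math.sqrt(total_info) <= se_target
            or len(administered_bs) >= max_items)
```

With well-targeted items each contributing information 0.25, the standard error after n items is 2/√n, so a 0.5 target is reached after 16 such items.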
Calibrated item pool