Item response theory
In psychometrics, item response theory (IRT) is a paradigm for the design, analysis, and scoring of tests, questionnaires, and similar instruments measuring abilities, attitudes, or other variables. It is a theory of testing based on the relationship between individuals' performances on a test item and the test takers' levels of performance on an overall measure of the ability that item was designed to measure. Several different statistical models are used to represent both item and test taker characteristics. Unlike simpler alternatives for creating scales and evaluating questionnaire responses, it does not assume that each item is equally difficult. This distinguishes IRT from, for instance, Likert scaling, in which "All items are assumed to be replications of each other or in other words items are considered to be parallel instruments". By contrast, item response theory treats the difficulty of each item as information to be incorporated in scaling items.
It is based on the application of related mathematical models to testing data. Because it is often regarded as superior to classical test theory, it is the preferred method for developing scales in the United States, especially when optimal decisions are demanded, as in so-called high-stakes tests, e.g., the Graduate Record Examination and Graduate Management Admission Test.
The name item response theory is due to the focus of the theory on the item, as opposed to the test-level focus of classical test theory. Thus IRT models the response of each examinee of a given ability to each item in the test. The term item is generic, covering all kinds of informative items. They might be multiple choice questions that have incorrect and correct responses, but are also commonly statements on questionnaires that allow respondents to indicate level of agreement, or patient symptoms scored as present/absent, or diagnostic information in complex systems.
IRT is based on the idea that the probability of a correct/keyed response to an item is a mathematical function of person and item parameters. The person parameter is construed as a single latent trait or dimension. Examples include general intelligence or the strength of an attitude. Parameters on which items are characterized include their difficulty; discrimination, representing how steeply the rate of success of individuals varies with their ability; and a pseudoguessing parameter, characterising the asymptote at which even the least able persons will score due to guessing.
Overview
The concept of the item response function was around before 1950. The pioneering work of IRT as a theory occurred during the 1950s and 1960s. Three of the pioneers were the Educational Testing Service psychometrician Frederic M. Lord, the Danish mathematician Georg Rasch, and Austrian sociologist Paul Lazarsfeld, who pursued parallel research independently. Key figures who furthered the progress of IRT include Benjamin Drake Wright and David Andrich. IRT did not become widely used until the late 1970s and 1980s, when, on the one hand, practitioners were told of the "usefulness" and "advantages" of IRT and, on the other, personal computers gave many researchers access to the computing power necessary for IRT. In the 1990s Margaret Wu developed two item response software programs that analyse PISA and TIMSS data: ACER ConQuest and the R package TAM.
Among other things, the purpose of IRT is to provide a framework for evaluating how well assessments work, and how well individual items on assessments work. The most common application of IRT is in education, where psychometricians use it for developing and designing exams, maintaining banks of items for exams, and equating the difficulties of items for successive versions of exams.
IRT models are often referred to as latent trait models. The term latent is used to emphasize that discrete item responses are taken to be observable manifestations of hypothesized traits, constructs, or attributes, not directly observed, but which must be inferred from the manifest responses. Latent trait models were developed in the field of sociology, but are virtually identical to IRT models.
IRT is generally claimed to be an improvement over classical test theory (CTT). For tasks that can be accomplished using CTT, IRT generally brings greater flexibility and provides more sophisticated information. Some applications, such as computerized adaptive testing, are enabled by IRT and cannot reasonably be performed using only classical test theory. Another advantage of IRT over CTT is that the more sophisticated information IRT provides allows a researcher to improve the reliability of an assessment.
IRT entails three assumptions:
- A unidimensional trait denoted by $\theta$;
- Local independence of items;
- The response of a person to an item can be modeled by a mathematical item response function.
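Taken together, these assumptions imply that, for a fixed trait level, the probability of a whole response pattern is the product of the per-item probabilities. The following is a minimal sketch of that consequence, assuming a logistic item response function of the kind described in the next section; the function names and the item parameter values are hypothetical and chosen only for illustration.

```python
import math


def irf(theta, a, b, c):
    """Assumed logistic item response function: P(correct | theta)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def pattern_likelihood(theta, items, responses):
    """Probability of a 0/1 response pattern at a single trait level theta.

    Local independence is what allows the per-item probabilities
    to be multiplied together.
    """
    likelihood = 1.0
    for (a, b, c), u in zip(items, responses):
        p = irf(theta, a, b, c)
        likelihood *= p if u == 1 else (1.0 - p)
    return likelihood


# Hypothetical item parameters (a, b, c); not taken from any real test.
items = [(1.0, -1.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
responses = [1, 1, 0]  # correct, correct, incorrect

for theta in (-1.0, 0.0, 1.0):
    print(theta, round(pattern_likelihood(theta, items, responses), 4))
```

Evaluating the same pattern at several trait levels, as in the loop above, is the basic ingredient of likelihood-based ability estimation.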
The item response function
The item response function (IRF) gives the probability that a person with a given ability level will answer correctly. Persons with lower ability have less of a chance, while persons with high ability are very likely to answer correctly; for example, students with higher math ability are more likely to get a math item correct. The exact value of the probability depends, in addition to ability, on a set of item parameters for the IRF.
Three parameter logistic model
For example, in the three parameter logistic model (3PL), the probability of a correct response to a dichotomous item $i$, usually a multiple-choice question, is:
$$p_i(\theta) = c_i + \frac{1 - c_i}{1 + e^{-a_i(\theta - b_i)}}$$
where $\theta$ is the person's ability; for the purpose of estimating the item parameters, the abilities are modeled as a sample from a normal distribution. After the item parameters have been estimated, the abilities of individual people are estimated for reporting purposes. $a_i$, $b_i$, and $c_i$ are the item parameters. The item parameters determine the shape of the IRF. Figure 1 depicts an ideal 3PL item characteristic curve (ICC).
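As a minimal sketch, the three parameter logistic response function can be evaluated directly in its usual form, c + (1 - c) / (1 + exp(-a(θ - b))). The function name `p_3pl` is hypothetical; the parameter values are those of the example item discussed in this section (a = 1.0, b = 0.0, and c = 0.25 for a four-option item).

```python
import math


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model.

    theta: person ability; a: discrimination; b: difficulty;
    c: pseudo-guessing (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Example item from the text: a = 1.0, b = 0.0, c = 0.25.
for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"theta={theta:+.1f}  P(correct)={p_3pl(theta, 1.0, 0.0, 0.25):.3f}")
```

The printed probabilities rise monotonically with ability, starting a little above the guessing level and approaching 1.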
The item parameters can be interpreted as changing the shape of the standard logistic function:
$$P(t) = \frac{1}{1 + e^{-t}}.$$
In brief, the parameters are interpreted as follows (with the item subscript dropped for legibility); b is the most basic, hence listed first:
- b – difficulty, item location: $p(b) = (1 + c)/2$, the half-way point between $c$ (min) and 1 (max), also where the slope is maximized.
- a – discrimination, scale, slope: the maximum slope, $p'(b) = a \cdot (1 - c)/4$.
- c – pseudo-guessing, chance, asymptotic minimum: $p(-\infty) = c$.
In other words, the standard logistic function has an asymptotic minimum of 0 ($c = 0$), is centered around 0 ($b = 0$, where its value is $1/2$), and has maximum slope $1/4$ ($a = 1$). The $a$ parameter stretches the horizontal scale, the $b$ parameter shifts the horizontal scale, and the $c$ parameter compresses the vertical scale from $[0, 1]$ to $[c, 1]$. This is elaborated below.
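The parameter interpretations above can be checked numerically. The sketch below assumes the usual 3PL form and uses hypothetical parameter values (a = 1.5, b = 0.5, c = 0.2); the slope is approximated with a central finite difference rather than computed analytically.

```python
import math


def p_3pl(theta, a, b, c):
    """3PL item response function (assumed logistic form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def slope(theta, a, b, c, h=1e-5):
    """Central finite-difference approximation of dP/dtheta."""
    return (p_3pl(theta + h, a, b, c) - p_3pl(theta - h, a, b, c)) / (2.0 * h)


# Hypothetical item parameters, for illustration only.
a, b, c = 1.5, 0.5, 0.2

midpoint = p_3pl(b, a, b, c)      # should equal (1 + c) / 2
max_slope = slope(b, a, b, c)     # should be close to a * (1 - c) / 4
floor = p_3pl(-20.0, a, b, c)     # should be close to c

print("value at theta = b:", midpoint, "expected:", (1 + c) / 2)
print("slope at theta = b:", max_slope, "expected:", a * (1 - c) / 4)
print("value at very low theta:", floor, "expected:", c)
```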
The parameter $b_i$ represents the item location which, in the case of attainment testing, is referred to as the item difficulty. It is the point on $\theta$ where the IRF has its maximum slope, and where the value is half-way between the minimum value of $c_i$ and the maximum value of 1. The example item is of medium difficulty since $b_i = 0.0$, which is near the center of the distribution. Note that this model scales the item's difficulty and the person's trait onto the same continuum. Thus, it is valid to talk about an item being about as hard as Person A's trait level or of a person's trait level being about the same as Item Y's difficulty, in the sense that successful performance of the task involved with an item reflects a specific level of ability.
The item parameter $a_i$ represents the discrimination of the item: that is, the degree to which the item discriminates between persons in different regions on the latent continuum. This parameter characterizes the slope of the IRF where the slope is at its maximum. The example item has $a_i = 1.0$, which discriminates fairly well; persons with low ability do indeed have a much smaller chance of correctly responding than persons of higher ability. This discrimination parameter corresponds to the weighting coefficient of the respective item or indicator in a standard weighted linear regression and hence can be used to create a weighted index of indicators for unsupervised measurement of an underlying latent concept.
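To make the effect of discrimination concrete, the sketch below compares two hypothetical items that differ only in their discrimination value (0.5 versus 2.0, with difficulty 0 and no guessing). It measures how far apart the success probabilities are for a person one unit below and a person one unit above the item's difficulty. The function names are illustrative assumptions.

```python
import math


def p_3pl(theta, a, b, c):
    """3PL item response function (assumed logistic form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def probability_gap(a, b=0.0, c=0.0, low=-1.0, high=1.0):
    """Difference in success probability between a higher- and a lower-ability person."""
    return p_3pl(high, a, b, c) - p_3pl(low, a, b, c)


# Two hypothetical items that differ only in discrimination.
for a in (0.5, 2.0):
    print(f"a={a}: gap between theta=+1 and theta=-1 is {probability_gap(a):.3f}")
```

The item with the larger discrimination value separates the two persons much more sharply, which is what a steeper IRF means in practice.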
For items such as multiple choice items, the parameter $c_i$ is used in an attempt to account for the effects of guessing on the probability of a correct response. It indicates the probability that very low ability individuals will get this item correct by chance, mathematically represented as a lower asymptote. A four-option multiple choice item might have an IRF like the example item; there is a 1/4 chance of an extremely low ability candidate guessing the correct answer, so the $c_i$ would be approximately 0.25. This approach assumes that all options are equally plausible, because if one option made no sense, even the lowest ability person would be able to discard it, so IRT parameter estimation methods take this into account and estimate a $c_i$ based on the observed data.
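The guessing argument can be illustrated with a short sketch. It assumes the usual 3PL form and the equal-plausibility assumption described above; the helper `chance_floor` is hypothetical and simply returns one over the number of options that a very low ability person cannot rule out. In practice the lower asymptote is estimated from observed data rather than fixed this way.

```python
import math


def p_3pl(theta, a, b, c):
    """3PL item response function (assumed logistic form)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def chance_floor(n_plausible_options):
    """Blind-guessing success rate when all remaining options are equally plausible."""
    return 1.0 / n_plausible_options


# Four equally plausible options: lower asymptote of 0.25.
c = chance_floor(4)
for theta in (-10.0, -4.0, 0.0, 4.0):
    print(f"theta={theta:+.1f}  P(correct)={p_3pl(theta, 1.0, 0.0, c):.4f}")

# If one option can be discarded by everyone, the chance level rises.
print("floor with 3 plausible options:", chance_floor(3))
```

At very low ability the probability settles near the chance floor instead of falling to zero, which is the behaviour the lower asymptote is meant to capture.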