From the Item Bank
The Professional Testing Blog
Multiple Choice Items – Can You Guess for Success?
September 23, 2016
Co-authored by Joy Matthews-López and David Cox

One of the most common criticisms of the standard multiple-choice (MC) item format is that test takers might easily guess their way to a passing score. Often these discussions are limited to a purely probabilistic point of view based on the odds of randomly guessing the answer to a single question (i.e., a one-in-four chance of guessing the correct answer to a four-option question), rather than the more complex calculation of the odds of randomly guessing enough correct answers across a series of questions to achieve a passing score. While these arguments are simplistic, under particular conditions the underlying concern can indeed be a plausible one. A more objective analysis must take into account a variety of interrelated factors that contribute to the probability of success in guessing, including the ability of candidates, test length, passing point, and item quality. This blog explores the conditions under which guessing can impact the pass/fail outcome of an MC test session. Is a test composed exclusively of MC items "less safe" than an equivalent test that contains items less likely to be guessed, such as multiple-response items?

Most ability measures, such as test scores, can be considered continuous variables, where the lowest possible number-correct score is 0 and the highest is k, for a test containing k equally weighted items. The probability of a low-ability examinee passing a reasonably long criterion-referenced MC test by guessing alone is demonstrably low. For example, on a test with k 4-option MC items, the probability of all items being guessed correctly is (0.25)^k, which is asymptotically equivalent to zero when k is reasonably large. However, candidates don't need to get all items right to pass a test. They only need to answer enough items correctly to meet the cut point. So, where the cut point is located matters. Also, the length of the test significantly affects the probability of a candidate guessing a path to success, so short tests will be less robust than long forms. In addition, it is unreasonable to assume that candidates who have met all requirements to sit for a test are of zero ability, so calculations need to be adjusted for some level of mastery. Additionally, not all items are stellar. Some may have suboptimal distractors, which will affect the probability stated above.

Poorly constructed MC questions are also susceptible to test-wiseness. Test-wise candidates may improve their test scores by recognizing and exploiting logical cues in the test items, format, or testing situation. Test-wiseness is considered independent of the examinee's knowledge of the subject matter the items are intended to measure, but when combined with partial knowledge it significantly increases the probability of selecting a correct response. Given these issues, the real question is this: are tests composed exclusively of MC items robust enough with respect to guessing to realistically prevent a low-ability candidate from passing a reasonably long, well-constructed MC test?
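As a quick aside before addressing that question: the "all k items guessed correctly" probability mentioned above collapses very quickly as test length grows. The following is a minimal sketch added here for illustration (not from the original post), assuming independent 4-option items with a guessing probability of 0.25 per item:

```python
# Probability that a zero-knowledge examinee guesses every item correctly
# on a k-item, 4-option MC test (independent items, guess probability 0.25).
for k in (10, 25, 50, 100):
    p_all_correct = 0.25 ** k
    print(f"k = {k:3d}: P(all items correct by guessing) = {p_all_correct:.2e}")
```

Even at k = 25 the probability is already on the order of 10^-16, so the all-correct case is never the practical worry; the location of the cut score is, as the example below shows.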
The question is relevant because converting away from traditional MC items requires work and resources, such as retraining of subject matter experts, an updated item banking tool and/or additional training on that tool, updated psychometric considerations (item analyses, flagging criteria, etc.), and the money to fund such changes.

To explore these issues and answer our question, let us consider the following example. Let Test A consist of 100 4-option MC items, where all items have equal weight for scoring purposes and the set of items are locally independent of each other. The probability of an examinee correctly guessing the answer to a 4-option MC item is 0.25 (1 out of 4). Given that the test in this example contains 100 questions, the probability of guessing the correct answer to all 100 questions is (0.25)^100, or approximately 6.2 × 10^-61 – a number with 60 zeros between the decimal point and its first significant digit. For practical purposes, this is zero.

However, as mentioned earlier, examinees don't need to get all items correct to pass a test. For the sake of this example, let us consider a raw percent-correct cut point of 60%. For our 100-item test, this translates to a raw score of at least 60 correct responses out of 100 needed to pass. The probability of an examinee achieving a passing score by random guessing alone increases (see Fig. 1), but only to approximately 1.3322 × 10^-13. Again, we see a near-zero probability of passing by random guessing alone.

However, it is not reasonable to assume that an examinee who has met eligibility requirements to sit for a certification or licensure exam would be totally void of subject matter knowledge or requisite skills. Even low-ability examinees have some level of mastery. When we adjust accordingly, say by a base mastery level of 25%, the probability of passing the test described above (by a combination of some non-zero level of knowledge/skill and some random guessing) changes from 1.3322 × 10^-13 to 0.000039 (or 3.9 × 10^-5). A short calculation reproducing these figures is sketched after the take-away below. Obviously, the longer the test, the lower the probability that a non-master will guess their way to a passing score, assuming that all items on the test function as designed (i.e., all distractors function well and each item discriminates well).

To put these outcomes in perspective, we would expect fewer than 4 low-ability examinees out of a pool of 100,000 candidates to pass the test through this combination of partial knowledge and guessing. Is 4 too many? If the test in question is for certification (or licensure) in a field that does not require periodic testing to recertify or maintain licensure, then yes, perhaps any at all would be too many.

It is important to note that as ability increases from low to borderline (say, just below the cut point), the probability of correctly guessing the few items that fall between what a candidate can legitimately answer and what (s)he needs to guess correctly increases drastically. In this scenario, MC items alone may be too risky.

There are many reasons why an exam program may want to introduce alternative item types, such as to bring an element of "real world" fidelity to the exam or to use items that are less memorable (or memorizable). We can add a validity argument to this list. As demonstrated here, the probability of a borderline candidate passing a short, 4-option MC exam that has a cut point near the average ability of borderline examinees is high enough to warrant the use of non-MC items.

So what is the take-away message?
MC items are a reasonable item type to use unless (1) the test is very short, or (2) the average ability of a program’s borderline group is close to the exam cut point.
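For readers who want to check the figures quoted in the example, here is a minimal sketch added for illustration (not code from the original post). It assumes 100 locally independent 4-option items with a guessing probability of 0.25 per item and a cut score of 60 correct, and it models the 25% base mastery level as 25 items answered correctly from knowledge, with the remaining items guessed at random; the binomial survival function from scipy does the tail calculation.

```python
# Sketch of the binomial tail calculations described in the example above.
# Assumptions: 100 independent 4-option items (guess probability 0.25),
# cut score of 60 correct; "25% base mastery" modeled as 25 items known outright.
from scipy.stats import binom

n_items, p_guess, cut = 100, 0.25, 60

# P(at least 60 of 100 items correct by random guessing alone)
p_guess_only = binom.sf(cut - 1, n_items, p_guess)

# With 25 items known, the examinee must guess at least 35 of the remaining 75.
known = 25
p_partial = binom.sf(cut - known - 1, n_items - known, p_guess)

print(f"Pass by guessing alone:           {p_guess_only:.4e}")
print(f"Pass with 25% mastery + guessing: {p_partial:.4e}")
```

Under these assumptions, the two survival-function calls should return values close to the 1.3322 × 10^-13 and 3.9 × 10^-5 figures quoted above; changing the number of items, the cut score, or the assumed mastery level shows how quickly the risk of a guessed pass grows on short tests or for borderline candidates.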