Anatomy of a Clerkship Test

Emily L. Senecal, MD, Kim Askew, MD, Barbara Gorney, PhD, Michael S. Beeson, MD, MBA, and David E. Manthey, MD

Abstract: Written examinations are frequently used to assess medical student performance. Within emergency medicine (EM), a National Board of Medical Examiners (NBME) subject examination for EM clerkships does not exist. As a result, clerkship directors frequently generate examinations within their institution. This article reviews the literature behind the use of standardized examinations in evaluating medical student performance, describes methods for generating well-written test questions, reviews the statistical concepts of reliability and validity that are necessary to evaluate an examination, and proposes future directions for testing EM students.

ACADEMIC EMERGENCY MEDICINE 2010; 17:S31–S37 © 2010 by the Society for Academic Emergency Medicine

Keywords: emergency medicine, student, examination, test, evaluation, clerkship

From the Department of Emergency Medicine, Harvard Medical School, Massachusetts General Hospital (ELS), Boston, MA; the Department of Emergency Medicine (KA, DEM) and Office of Medical Education (BG), Wake Forest University School of Medicine, Winston-Salem, NC; and the Department of Emergency Medicine, Akron General Medical Center (MSB), Akron, OH. Received April 22, 2010; revision received July 7, 2010; accepted July 11, 2010. Supervising Editor: Nicole DeIorio, MD. Address for correspondence and reprints: Emily L. Senecal, MD; e-mail: [email protected]. doi: 10.1111/j.1553-2712.2010.00880.x

Assessment of medical student performance on clinical rotations is a complex and inherently subjective task. Clerkship directors use a variety of assessment tools to this end, including written examinations, oral examinations, direct observation, standardized patient examinations, simulated patient examinations, and formal case presentations. Numerous studies have demonstrated that using more than one assessment tool is preferable to relying on a single tool.1–3 Common to all of these studies is the inclusion of a written examination. A written examination is considered desirable because it can assess medical knowledge, differentiate students based on knowledge level, encourage students to study independently, and assess the curricular goals of the clerkship. However, a test is limited in its ability to assess clinical and procedural skills and higher levels of medical decision-making. Additionally, written examinations do not assess professionalism or interpersonal and communication skills. Despite these limitations, written examinations are frequently used to assess medical student performance. This article will assess the effect of clerkship tests on a rotation grade; review criteria for writing test questions; discuss validity, reliability, and question statistics; and explore the use of a national test.

IMPACT OF CLERKSHIP TESTS

The accuracy and reliability with which written examinations assess medical knowledge are not clearly defined. Even National Board of Medical Examiners (NBME) subject examinations are valid only to the extent that their content is congruent with school course objectives.4 Studies of the ability of written examinations to assess medical knowledge have found varying results. For example, a study of students on an obstetrics and gynecology rotation found that scores on the NBME subject examination had the highest correlation with overall clerkship performance when compared with clinical performance scores (as rated by residents, fellows, and faculty) and formal case presentations.2 Likewise, a study of students on an internal medicine rotation found that performance on the NBME examination identified students at risk for poor performance on the U.S. Medical Licensing Examination (USMLE) Step 2 examination.5 In contrast, students on a pediatric rotation were found to have only a weak correlation between their NBME score and their final clerkship grade.1 Similarly, the NBME scores for students on a surgery rotation were noted to negatively predict their overall clerkship grade and to correlate only minimally with fund-of-knowledge scores as assessed on ward evaluations.6


The use of written examinations at the end of a clerkship has been studied in several disciplines. The use of the NBME examination on internal medicine rotations has steadily increased, from 66% of clerkships in 19907 to 83% in 20028 and 88% in 2009.9 In the most recent survey (2009), 99% of internal medicine clerkships used the NBME score to help determine the student's final grade. There was considerable variation in how much the NBME score contributed to the final clerkship grade, ranging from 10% to 50%; in the majority of clerkships, however, the NBME score accounted for 20% to 25% of the final grade.9 In psychiatry, 69% of clerkship directors surveyed in 2005 used the NBME examination, 34% of whom were required to do so by their medical schools.10 The median (±SD) weight assigned to the NBME score in determining the final psychiatry clerkship grade was 25% (±16.5%).

Although use of NBME examinations is prevalent, they are currently available only for the disciplines of medicine, surgery, pediatrics, family practice, psychiatry, obstetrics and gynecology, and clinical neurology.11 Clerkship directors without access to NBME examinations, whether because of lack of availability or cost, have the option of developing an examination within their department. Additionally, validated assessment instruments have been developed in two fields for which NBME examinations do not exist: urology12 and emergency medicine (EM).13 Within EM, 59% of clerkship directors use a final examination,14 and many have developed their own departmental written examination. To that end, this paper covers some of the basics of test question writing and test item evaluation.

DEVELOPMENT OF EFFECTIVE TEST QUESTIONS

The importance of writing nonflawed test items is shown in work by Downing,15 in which review of several tests showed that up to 10% to 15% of students classified as failing an examination may have failed at least in part because of flaws in item writing. Numerous authors and groups, including the NBME, have published guidelines to help reduce the number of flaws in written test questions.16–20 These published guidelines are based on the groups' and authors' experience and on educational research. For the purposes of this paper, the focus is limited to multiple-choice questions (MCQs), although some clerkship directors may incorporate short-answer, fill-in-the-blank, essay, and other question types.

When writing an MCQ, two criteria must be met. The first is to identify important concepts to be tested, based on the educational goals and objectives of the course or lecture. The second is to write well-structured questions that avoid pitfalls and flaws that may lead to an examination producing invalid scores.17 Such flaws can occur when writing the stem (the statement, question, and lead-in to the answers), the keyed response (the correct answer), and the foils/distracters (the incorrect options) for the test item (the entire test question). The next few sections cover the pros and cons of the two types of MCQs (true/false and single best answer) and review the published guidelines for writing effective MCQs.


True/False Questions

As the name implies, true/false questions (TFQs) require the test taker to identify one or more answer options as true. There are several variations of the TFQ, from simple true/false options to the type-C true/false format (A/B/both/neither), in which the test taker must determine whether several options are true, as opposed to just one answer being correct. Although these questions appear easy to write, they carry several difficulties that have led numerous organizations, such as the NBME, either to stop using them or to remove large numbers of them from their question banks.17 Several of the difficulties associated with TFQs are reviewed in Table 1. Despite the difficulty of writing clear TFQs, they are good for testing knowledge content and recall and can cover a large amount of content in fewer questions.18

Single Best Answer

Multiple-choice questions that require the test taker to choose the single best answer are the most frequently used form of test question. These questions can be formulated to test knowledge as well as knowledge application. There are numerous forms of the single-best-answer format, from questions with three to five options to extended matching questions with sets of 2 to 20 items. Extended matching questions have been shown to have better discrimination and to be more reliable than other question types.19 However, flaws in stem and option construction can lead to questions that are unreliable.

Accepted Stem Writing Guidelines

Stems may vary based on the content being examined, as well as the setting of the examination. For example, basic science questions may pose a simple question, whereas clinical examinations may have stems with vignettes and images. Regardless of the setting, several guidelines exist for stem writing (Table 2). Stems should ask a clear question in a complete statement or question that can be answered without reviewing the options. Stems should contain all necessary information but avoid verbosity. Negative stems are discouraged by most test-writing experts.17 Research on the effects of using negative stems has found mixed results in terms of discrimination and difficulty, but several experts have recommended multiple TFQs over negative-stem items.18

Table 1. Difficulties in Writing True/False Questions17
• Unclear or ambiguous stem that forces the test taker to judge what the writer was thinking when writing the question
• Requires additional judgment beyond clinical experience or knowledge
• Answer options are vague or ambiguous
• Answer options that are not completely true or false
• Typically assesses recall of isolated facts, not the application of knowledge


Table 2. General Guidelines for Writing Stems for Test Items17,18,20
• Ensure that directions in the stem are clear and that the item can be answered without looking at the answer options.
• Include the central idea in the stem.
• Stems should contain as much of the item as possible, but avoid verbosity.
• Stems should be a complete statement or question rather than a sentence completion.
• The stem should be stated so that only one option is correct without dispute. If more than one option contains truth, the stem should ask the test taker to choose the best answer.
• Avoid the use of negative stems (all of the following EXCEPT, or which of the following is NOT true). If used, capitalize, underline, or italicize the negative word to draw attention to it.

Table 3. Examples of Clinical Vignette Lead-ins17
• Which of the following is the most likely diagnosis?
• Which of the following is most likely to confirm the diagnosis?
• Which of the following is the first priority in caring for this patient?
• Which of the following is the most likely explanation for these findings?

In both the basic science and clinical science arenas, clinical vignettes have become more popular. Clinical vignettes typically include some combination of age, sex, presenting health care location (e.g., emergency department), presenting complaint, duration of complaint, patient history, physical findings, initial laboratory findings, and treatments. Using a vignette in the stem makes the item more patient oriented and moves it from knowledge recall to knowledge application. Vignettes may increase item difficulty because patient findings are presented in a less interpreted manner; however, the difference in discrimination does not reach statistical significance.17 When writing vignettes, several rules should be followed. The stem should contain a clear, specified problem, and the lead-in should pose a clear question that can be answered without looking at the options. Writers should avoid using real patient experiences and patients' own words.17 Commonly used lead-ins are listed in Table 3.

Accepted Option Writing Guidelines

When writing options, more is not always better. Research has shown that three options are adequate for a single item, with four options providing no better discrimination than three.18,20 Although obvious, only one of the options should be the indisputably correct answer. Options should be placed in alphabetical, logical, or numerical order rather than in random order. Options should not overlap (e.g., 10% to 20%, >15%) and should be expressed in the same units if they contain numerical values.


Several flaws have been shown to allow the "test-wise" student to identify an option as either the keyed response or a foil/distracter. First, all options should follow the stem grammatically (proper tense, plurality, subject-verb agreement); when an option does not follow the same grammatical format, the test taker can identify it as a distracter. Second, absolute terms, such as always and never, should be avoided. Absolutes rarely exist; therefore, answers with such absolute terms can be identified as distracters. Third, when developing options, test writers often make the keyed response more specific or longer to ensure the option is correct. In doing so, the test-wise student can identify that option as the correct one. Fourth, repeating words from the stem in an option occurs most often with the keyed response, making it more likely to be identified as such. Last, use of "all of the above" simplifies an item, as a test taker who can identify two correct options will ultimately identify the keyed response as "all of the above." By avoiding these flaws, the item is more likely to ensure that the test taker identified the keyed response based on knowledge recollection and comprehension.

Other flaws in option writing introduce irrelevant difficulty and are likely to adversely affect test takers' performance. Long, complicated options turn the item into more of a reading test than a knowledge examination;17,18 therefore, options should be kept relatively short. Another common flaw is to present numeric data in options in different values or units or in a nonlogical order. This flaw requires the test taker to spend more time assessing and converting options and may result in the item testing the student's knowledge of conversion rather than of content. The classic "none of the above" option adds another degree of complexity to a test item. With this option, the test taker must use judgment similar to that required by TFQs in determining what the item writer was thinking when writing the question; use of "none of the above" essentially changes a single-best-option item into a TFQ.17 Finally, vague frequency terms (rarely, usually, frequently) should be avoided because they mean different things to different people. In a study by Case,21 when respondents were asked what "frequently" meant, the median response was 70% of the time, with half of respondents placing it between 45% and 75%. Vague terms add another dimension of complexity, as the test taker may interpret the term differently than the item writer intended.

The best distracters are typically errors commonly made by examinees in relation to the content being tested. Each foil or distracter should be plausible but incorrect, or should not meet the full requirements of the stem or problem. Distracters should also be related to one another by falling into the same category as the correct answer (tests, laboratory values, organisms, etc.). By following these recommendations, the item is less likely to introduce irrelevant difficulty while ensuring that items test content, not test-taking skills.
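
As a purely illustrative aid, the sketch below shows one way a clerkship director might screen draft items for a few of the mechanical flaws described above (absolute terms, vague frequency terms, "all/none of the above" options, unemphasized negative stems, and options out of alphabetical order). This is a minimal sketch under assumed conventions; the item structure, function name, and flaw list are hypothetical and do not come from any published guideline or tool.

```python
# Hypothetical sketch: automated screening of draft MCQ items for a few
# mechanical writing flaws discussed in the text. The item structure and
# rules are illustrative assumptions, not an established tool.

ABSOLUTE_TERMS = {"always", "never", "all", "none"}
VAGUE_FREQUENCY_TERMS = {"rarely", "usually", "frequently", "often", "sometimes"}

def screen_item(stem: str, options: list[str]) -> list[str]:
    """Return a list of potential flaws found in a draft multiple-choice item."""
    flaws = []
    lowered = [opt.lower() for opt in options]

    for opt in lowered:
        # Flag "all of the above" / "none of the above" options.
        if "all of the above" in opt or "none of the above" in opt:
            flaws.append(f'avoid option: "{opt}"')
        # Flag absolute and vague frequency terms in options.
        words = set(opt.replace(",", " ").split())
        if words & ABSOLUTE_TERMS:
            flaws.append(f'absolute term in option: "{opt}"')
        if words & VAGUE_FREQUENCY_TERMS:
            flaws.append(f'vague frequency term in option: "{opt}"')

    # Flag negative stems that are not emphasized with capitals.
    if (" not " in stem.lower() or "except" in stem.lower()) and \
            not any(word in stem for word in ("NOT", "EXCEPT")):
        flaws.append("negative stem without capitalized NOT/EXCEPT")

    # Flag options that are not in alphabetical order (a proxy for logical ordering).
    if lowered != sorted(lowered):
        flaws.append("options are not in alphabetical order")

    return flaws

if __name__ == "__main__":
    stem = "Which of the following is not an appropriate first step?"
    options = ["Obtain an ECG", "Always intubate", "All of the above"]
    for flaw in screen_item(stem, options):
        print(flaw)
```

Such a screen catches only surface-level flaws; judgment about content, plausibility of distracters, and congruence with objectives still rests with faculty reviewers.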


Table 4. Summary of Test Item Writing Guidelines17,18,20

Test items in general
• Develop items from the educational objectives of the curriculum or lecture
• Examine different levels of knowledge (recollection, application, problem solving)
• Avoid cueing and hinging

Stems
• Contain a complete, clear statement with relevant information (not verbose)
• Avoid negative terms (i.e., except, not true) and ask for the correct answer
• Contain as much of the item as possible
• Use vignettes to make items patient oriented and to promote knowledge application

Options
• Avoid "none of the above" and "all of the above" options
• Avoid imprecise terms (frequently, usually, etc.)
• Relate options to one another in type (tests, laboratory values, etc.)
• Follow the stem grammatically
• Place options in logical order (numerical, alphabetical, etc.)
• Keep options brief, of equal length, and independent of one another

Item Compilation Into Test

After ensuring that individual items are not flawed, the clerkship director needs to be aware of the effects of "cueing" and "hinging." Cueing occurs when one option in one item provides a hint to the answer to another item.20 Hinging occurs when one question requires students to know the answer to another item. Test items should be independent of one another: cueing may give the test-wise student an unearned advantage, while hinging may adversely affect a student's performance.

Although most physicians have little experience in writing test questions, monitoring for certain flaws in stem and option writing can lead to test items that are reliable and valid without introducing irrelevant difficulty. Table 4 reviews several of these recommendations for item writing and test compilation. After developing items that have been reviewed for flaws, the clerkship director should review several statistics for each item and for the test as a whole to determine its effectiveness.

STATISTICAL ANALYSIS OF A TEST

Reliability
Reliability can be characterized as the level of confidence users can have that test scores accurately reflect test takers' ability in whatever the test is measuring; it is an indication of how much error there is in the scores. Validity relates to ensuring that what is being measured is what was intended to be measured. Scores that are not reliable cannot, by definition, be valid. The Kuder-Richardson 20 (KR-20) reliability method is designed for use with questions that are scored dichotomously (i.e., all right or all wrong; 1 or 0). KR-20 and Cronbach's alpha are commonly reported by test scoring programs and produce the same value when dichotomous scoring is used. KR-20 is a lower bound of reliability; that is, the true reliability may be higher than the value reported. Its value ranges from 0 to 1, with higher values reflecting higher reliability. The minimum acceptable value depends on the consequences of the scores for the test takers. When scores are used to make consequential decisions, such as passing or failing a course, the minimum recommended KR-20 value for departmentally developed tests is 0.80.22

Reliability is affected by test length, variation in the test scores, the relatedness of the questions' content, and the quality of the questions. To increase reliability, the test content should address closely related objectives, contain a sufficient number of questions to adequately represent the content being tested, and produce scores that reflect the different ability levels of the test takers (i.e., variation in scores). Very easy or very difficult tests will have a low reliability coefficient. The standard error of measurement (SEM) is also frequently reported and can be used to construct a confidence interval (CI) around a score. The smaller the SEM, the narrower the CI and the less error there is in the score. The SEM is larger in the middle of the distribution than in the tails. The establishment of minimum scores to categorize test takers into groups should consider the SEM.
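
To show how these statistics fit together, the following sketch computes KR-20 (equivalent to Cronbach's alpha for dichotomously scored items) and the SEM from a matrix of scored responses, using the standard formulas KR-20 = [k/(k−1)] × (1 − Σ p_i·q_i / σ²_total) and SEM = SD_total × √(1 − reliability), where k is the number of items, p_i the proportion answering item i correctly, and q_i = 1 − p_i. It is a minimal sketch with made-up data; the function names and six-student response matrix are hypothetical.

```python
# Minimal sketch (hypothetical data): KR-20 reliability and SEM for a test
# scored dichotomously (1 = correct, 0 = incorrect). Rows are students,
# columns are items.
import numpy as np

def kr20(scores: np.ndarray) -> float:
    """Kuder-Richardson 20; equals Cronbach's alpha for 0/1-scored items."""
    k = scores.shape[1]                        # number of items
    p = scores.mean(axis=0)                    # item difficulties (proportion correct)
    q = 1.0 - p
    total_var = scores.sum(axis=1).var(ddof=0)  # variance of total scores (conventions vary)
    return (k / (k - 1)) * (1.0 - (p * q).sum() / total_var)

def sem(scores: np.ndarray, reliability: float) -> float:
    """Standard error of measurement: SD of total scores * sqrt(1 - reliability)."""
    sd = scores.sum(axis=1).std(ddof=0)
    return sd * np.sqrt(1.0 - reliability)

if __name__ == "__main__":
    # Six students by five items of made-up responses.
    responses = np.array([
        [1, 1, 1, 1, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [1, 0, 1, 1, 1],
        [0, 1, 0, 0, 0],
        [1, 1, 1, 1, 1],
    ])
    r = kr20(responses)
    e = sem(responses, r)
    total = responses.sum(axis=1)
    # A 95% CI around a student's observed score is roughly score +/- 1.96 * SEM.
    print(f"KR-20 = {r:.2f}, SEM = {e:.2f}")
    print(f"Student 1 score {total[0]} (95% CI ~ {total[0] - 1.96 * e:.1f} to {total[0] + 1.96 * e:.1f})")
```

With only six examinees the estimate is, of course, unstable, echoing the caution later in this section about interpreting statistics from small groups of test takers.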


Validity
Although there are different types of validity, content validity is the main concern in testing. Content validity depends on how well the content and number of the questions represent the entire domain being tested, including the cognitive processes required of the examinees. Unlike reliability, however, no single number measures content validity; rather, evidence is provided to support an argument for validity and to discredit competing arguments. Demonstration of factors irrelevant to examinee ability in the domain being tested constitutes a competing argument, and poorly written questions can introduce such irrelevant factors. The judgment of experts is the major component of content validity evidence. Questions should be written by content experts and reviewed by nonauthor content experts. To ensure congruence of content and curriculum, examinations should be reviewed by a faculty group prior to each academic year and whenever curricular changes occur. A test blueprint in which each question is matched to a learning objective demonstrates content validity. The level of cognitive process (e.g., knowledge application, analysis, and comprehension) required to answer correctly should also be determined. In this way, the breadth and depth to which the test addresses the domain can be demonstrated.

Question or Item Analysis
Test scoring software commonly provides information at the question level. Typically reported are the proportion of students who answered correctly (the question difficulty) and an index of discrimination. The latter may be labeled biserial, point-biserial, or discrimination, but all indicate how well the question distinguished between students who did well on the test and those who did poorly. Values range from –1 to +1. Discrimination values are not reliable for very easy or very difficult questions. For departmentally developed tests, discrimination values ≥ +0.2 are reasonably good and ≥ +0.3 are considered good.22,23 Incorrect answer options (distracters) should have negative or zero discrimination values; larger negative values indicate better distracters. Questions where the correct answer has negative discrimination or a distracter has positive discrimination are suspect and should be examined.

Question statistics are useful for identifying optimally (and suboptimally) performing test questions, but they should not replace faculty judgment about what is a fair question. Neither easy questions nor difficult questions (with good discrimination) should be removed from an examination if they cover important content. One way of increasing test score reliability is to use highly discriminating questions. However, item statistics are not a fixed property of the question; they vary with the examinee group and with the set of questions in which the item appears. Item statistics and reliability coefficients based on a small number of test takers are unstable. For examinations where small groups of students take the same test over time, collect at least 30 test results before heeding the statistics.
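
To illustrate how these item statistics are typically derived, the sketch below computes item difficulty and a corrected point-biserial discrimination (each item correlated with the total score of the remaining items) for the same kind of 0/1 response matrix used above. The data and function names are hypothetical, and commercial scoring software may apply slightly different corrections.

```python
# Hypothetical sketch: item difficulty and corrected point-biserial
# discrimination for dichotomously scored items (rows = students, columns = items).
import numpy as np

def item_analysis(scores: np.ndarray) -> list[dict]:
    """Return difficulty and discrimination for each item."""
    n_students, n_items = scores.shape
    results = []
    for i in range(n_items):
        item = scores[:, i]
        difficulty = item.mean()  # proportion of students answering correctly
        # Correlate the item with the total of the *other* items ("corrected"
        # point-biserial), so the item does not inflate its own discrimination.
        rest = scores.sum(axis=1) - item
        if item.std() == 0 or rest.std() == 0:
            discrimination = float("nan")  # undefined if everyone got the item right/wrong
        else:
            discrimination = float(np.corrcoef(item, rest)[0, 1])
        results.append({"item": i + 1, "difficulty": difficulty,
                        "discrimination": discrimination})
    return results

if __name__ == "__main__":
    responses = np.array([
        [1, 1, 1, 1, 0],
        [1, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [1, 0, 1, 1, 1],
        [0, 1, 0, 0, 0],
        [1, 1, 1, 1, 1],
    ])
    for row in item_analysis(responses):
        flag = " (review)" if not row["discrimination"] >= 0.2 else ""
        print(f"Item {row['item']}: difficulty {row['difficulty']:.2f}, "
              f"discrimination {row['discrimination']:.2f}{flag}")
```

The 0.2 flag here simply mirrors the guideline cited above for departmentally developed tests; as the text notes, such statistics become trustworthy only after enough examinees have accumulated.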


DESIGN OF A TEST BASED ON THE NATIONAL CURRICULUM

Because of the breadth of skills and knowledge required in EM, clerkship directors often have too much information to cover during a 4-week clerkship. They are constantly faced with deciding which material is important to cover and which can be left for future training and independent learning. The same is true when developing an examination for an EM clerkship. To make this process easier for the clerkship director and fairer to students, objectives for the course should be established, with test material arising from these objectives, as supported by the NBME.17

The question becomes how and when to determine content and objectives. To help answer it, the Clerkship Directors in Emergency Medicine (CDEM), an academy of the Society for Academic Emergency Medicine (SAEM), developed a national curriculum describing which educational topics should be covered, based on a consensus syllabus.24,25 The premise of this document was to describe the content to which students rotating on EM clerkships should be exposed, in order to better equip them to handle emergent conditions later in practice as residents and practicing physicians. In other words, all clerkships should cover these critical topics, and all students should understand them well enough to become better physicians.

Although testing was not part of the major thrust of the curriculum, development of a test based on the national curriculum has its advantages. First, the requirement of testing from course objectives is met, as the student is aware of what material is important and expected to be understood. Second, a national test based on the national curriculum allows a student's performance to become generalizable. With a departmentally developed test, it is impossible to know the level of difficulty of the test or how it compares with a similarly developed test at another institution.


However, if the test is developed according to the guidelines noted above and is based on a curriculum that each institution is addressing, in part by using the same educational material, then student performance on the test can be compared more reliably between institutions. Some evidence shows that, with careful design and delivery of the curriculum, it is possible to deliver an equivalent educational experience at two different schools, as judged by both written and clinical examination performance.26

Another method that has been used in the past is to develop a test based on information felt to be important and then to build the curriculum and course around that examination. This process has been used with national examinations: the examinations were written based on what experts felt was important content, and it was then up to clerkship directors using the national examination to prepare their students for it, which led to clerkships using the examination to determine clerkship content and objectives. Although using such examinations allows for generalizability, there has recently been some controversy over the manner in which these tests were developed. For example, the NBME internal medicine examination is widely used by internal medicine clerkships, yet recent studies have shown that clerkship directors wished the NBME examination more closely followed the consensus national curriculum published by the Clerkship Directors in Internal Medicine (CDIM) and the Society of General Internal Medicine.27,28 This curriculum, like the EM national curriculum, was developed by undergraduate and graduate medical educators to emphasize core areas for medical students, and the NBME and CDIM are now working together on this issue. The same can be seen in other areas of medicine, such as the board certification process in radiology.29 Also, students have been shown to score better on national examinations not attached to clerkship curricula later in their training and after taking national examinations in other specialty areas.30 If testing is based on a set curriculum and objectives, then students early in the year should be just as prepared as those later in the year, although studies addressing this issue would need to be performed.

Therefore, based on experience in other specialties, it appears that developing an examination from the national curriculum has advantages over using an examination to define content. Lessons learned in other specialties should be applied to EM clerkships as clerkship directors evaluate testing options, be it a department-based examination or a national examination. Use of the national syllabus and curriculum allows clerkships to become more uniform in nature, with the goal of a generalizable test in the future.

RATIONALE FOR A SINGLE NATIONAL EXAMINATION

In summary, a written examination is used by most clerkships in EM. Currently, two options exist for clerkship directors who wish to administer a written examination: 1) an online testing mechanism provided by SAEM13 or 2) a clerkship test developed internally by the individual institution.


The inherent problems of an internally developed test include the small numbers of students taking the test, leading to reliability issues; a limited number of content experts to provide content validity; the potential failure to match test items to the educational goals and objectives of the clerkship; and a potential lack of expertise in writing good questions. Additionally, use of departmentally developed examinations does not permit comparison of student performance between institutions.

Given these limitations, future directions should include development of a clerkship rotation test based on a suggested national curriculum and coordination of the development and presentation of a national curriculum in EM. Although a national curriculum has been developed by the CDEM academy within SAEM, online teaching content is still being developed in the form of presentation slides, readings, interactive clinical cases, and other resources. Until there is uniformity in the educational content provided to medical students regardless of clerkship location, a national EM clerkship test will have problems with reliability and validity. For these reasons, the most effective approach to a reliable and valid clerkship examination is the development of a standardized test based on an accepted curriculum. For this to occur, test development must be coordinated with the development of national curriculum teaching tools. Additionally, faculty development must be coordinated with the curriculum so that curricular topics are presented consistently regardless of clerkship. To this end, it may be appropriate for teaching modules that teach "how to teach" the curricular topics to be developed and presented at national EM educational conferences. Until this occurs, clerkships should strive to ensure that the testing mechanism they use follows published guidelines on question writing and is frequently assessed for difficulty, reliability, and validity.

References

1. Greenberg L, Getson P. Assessing student performance on a pediatric clerkship. Arch Pediatr Adolesc Med. 1996; 150:1209–12.
2. Nahum G. Evaluating medical student obstetrics and gynecology clerkship performance: which assessment tools are most reliable? Am J Obstet Gyn. 2004; 191:1762–71.
3. Schmahmann J, Neal M, MacMore J. Evaluation of the assessment and grading of medical students on a neurology clerkship. Neurology. 2008; 70:706–12.
4. National Board of Medical Examiners. Subject Examination Program Information Guide. Available at: http://www.nbme.org/PDF/2007subexaminfoguide.pdf. Accessed Apr 19, 2010.
5. Ripkey D, Case S, Swanson DB, et al. Identifying students at risk for poor performance on the USMLE step 2. Acad Med. 1999; 74(10 Suppl):S45–8.
6. Hermanson B, Firpo M, Cochran A, et al. Does the National Board of Medical Examiners' Surgery Subtest level the playing field? Am J Surg. 2004; 188:520–1.


7. Magarian G, Mazur D. Evaluation of students in medicine clerkships. Acad Med. 1990; 65:341–5.
8. Hemmer P, Szauter K, Allbritton TA, et al. Internal medicine clerkship directors' use of and opinions about clerkship examinations. Teach Learn Med. 2002; 14:229–35.
9. Torre D, Papp K, Elnicki M, et al. Clerkship directors' practices with respect to preparing students for and using the National Board of Medical Examiners subject exam in medicine: results of a United States and Canadian survey. Acad Med. 2009; 84:867–71.
10. Levine R, Carlson D, Rosenthal RH, et al. Usage of the National Board of Medical Examiners subject test in psychiatry by U.S. and Canadian clerkships. Acad Psychiatry. 2005; 29:52–7.
11. National Board of Medical Examiners. NBME Subject Examinations. Available at: http://www.nbme.org/Schools/SubjectExams/PaperBased.html. Accessed Jun 28, 2010.
12. Kerfoot B, Baker H, Volkan K, et al. Development of a validated instrument to measure medical student learning in clinical urology: a step toward evidence based education. J Urol. 2004; 172:282–5.
13. Senecal E, Thomas S, Beeson M. A four-year perspective of Society for Academic Emergency Medicine tests: an online testing tool for medical students. Acad Emerg Med. 2009; 16(Suppl 2):S42–5.
14. Wald D, Manthey D, Kruus L, et al. The state of the clerkship: a survey of emergency medicine clerkship directors. Acad Emerg Med. 2007; 14:629–34.
15. Downing S. The effects of violating standard item writing principles on tests and students: the consequences of using flawed test items on achievement examinations in medical education. Adv Health Sci Educ Theory Pract. 2005; 10:133–43.
16. Braddom C. A brief guide to writing better test questions. Am J Phys Med Rehabil. 1997; 76:514–6.
17. Case S, Swanson D. Constructing Written Test Questions for the Basic and Clinical Sciences. Available at: http://www.nbme.org/PDF/ItemWriting_2003/2003IWGwhole.pdf. Accessed Mar 30, 2010.
18. Haladyna T, Downing S, Rodriguez MC. A review of multiple-choice item-writing guidelines for classroom assessment. Appl Meas Educ. 2002; 15:309–34.
19. McCoubrie P. Improving the fairness of multiple-choice questions: a literature review. Med Teach. 2004; 26:709–12.
20. Collins J. Education techniques for lifelong learning: writing multiple-choice questions for continuing medical education activities and self-assessment modules. Radiographics. 2006; 26:543–51.
21. Case S. The use of imprecise terms in examination questions: how frequent is frequently? Acad Med. 1994; 69:S4–6.
22. Worthen B, Borg W, White KR. Measurement and Evaluation in the Schools. New York, NY: Longman Publishing, 1993.
23. Crocker L, Algina J. Introduction to Classical and Modern Test Theory. Philadelphia, PA: Holt, Rinehart & Winston, 1986.
24. Manthey D, Coates W, Ander DS, et al. Report of the task force on the national fourth year medical student emergency medicine curriculum guide. Ann Emerg Med. 2006; 47:e1–7.


25. Manthey D, Ander D, Gordon DC, et al. Emergency medicine clerkship curriculum: an update and revision. Acad Emerg Med. 2010; 17:638–43.
26. McKendree J. Can we create an equivalent educational experience on a two campus medical school? Med Teacher. 2009; 31:e202–5.
27. Elnicki DM, Lescisin DA, Case S. Improving the National Board of Medical Examiners Internal Medicine Subject Exam for use in clerkship evaluation. J Gen Intern Med. 2002; 17:435–40.


28. Hemmer PA, Szauter K, Allbritton TA, Elnicki DM. Internal medicine clerkship directors' use of and opinions about clerkship examinations. Teach Learn Med. 2002; 14:229–35.
29. Goske MJ, Reid JR. Define a national curriculum for radiology residents and test from it. Acad Radiol. 2004; 11:596–9.
30. Reteguiz JA, Crosson J. Clerkship order and performance on family medicine and internal medicine National Board of Medical Examiners Exams. Fam Med. 2002; 34:604–8.
