Toward Automation to Support Creation and Evaluation of Pedagogically Valid Multiple-Choice Question Assessments at Scale
Multiple-choice questions (MCQs) are the predominant form of assessment in educational environments, valued for their efficiency and scalability. Traditionally, these questions are crafted by instructors, a process that, despite instructors' expertise, often results in inconsistencies and errors. In response to these limitations and the need for scalability, learnersourcing, which involves students in the question creation process, has been leveraged. Although this method capitalizes on the diverse perspectives of students, it also leads to significant variability in the quality of the questions produced. Additionally, while recent advances in artificial intelligence have enabled more scalable and automated methods for generating MCQs, these AI-driven methods still suffer from many of the same shortcomings as human-authored questions. Current evaluation methods for MCQs rely predominantly on human judgment, which introduces subjectivity and lacks scalability. Automated evaluation methods provide scalability, but they fall short in assessing the educational value of questions, focusing instead on surface-level features that do not align with expert evaluation.
In this thesis, I explore various methods for creating and evaluating educational content, grounded in learning science research and guided by the use of rubrics. I demonstrate that students, with minimal scaffolding and technological support, are capable of generating high-quality assessments. I have also investigated the potential of involving both students and crowdworkers in generating and evaluating the skills required to solve problems. Building on this, we developed a new method that leverages large language models (LLMs) to enhance the efficiency and accuracy of these processes. Furthermore, I have shown that crowdworkers can effectively use rubrics to evaluate questions with a level of accuracy comparable to that of human experts. Through these crowdsourcing and learnersourcing studies, I examine how specialized knowledge and expertise influence the success of content creation and evaluation. This work culminates in the proposal of the Scalable Automatic Question Usability Evaluation Toolkit (SAQUET), a new standardized method for evaluating educational MCQs.
This work contributes to the fields of educational technology, learning sciences, and human-computer interaction. By harnessing the capabilities of crowdsourcing, learnersourcing, and generative AI, this research demonstrates how the generation and evaluation of educational content can be vastly improved. It introduces a standardized approach to assessment processes, enhancing the quality and consistency of educational evaluations across various domains. By providing a scalable framework that leverages advancements in generative AI, this work propels the field of educational technology forward, addressing critical challenges related to the creation and evaluation of assessments. Ultimately, these contributions offer a foundation for future innovations in educational content development and quality assurance.
Date
- 2024-09-01
Degree Type
- Dissertation
Department
- Human-Computer Interaction Institute
Degree Name
- Doctor of Philosophy (PhD)