Carnegie Mellon University

Toward Automation to Support Creation and Evaluation of Pedagogically Valid Multiple-Choice Question Assessments at Scale

Thesis, posted on 2024-11-21, 19:54, authored by Steven Moore

Multiple-choice questions (MCQs) are the predominant form of assessment in educational environments, known for their efficiency and scalability. Traditionally, these questions are crafted by instructors, a method that, despite drawing on instructor expertise, often results in inconsistencies and errors. In response to these limitations and the need for scalability, learnersourcing, which involves students in the question creation process, has been leveraged. Although this method capitalizes on the diverse perspectives of students, it also leads to significant variability in the quality of the questions produced. Additionally, while recent advances in artificial intelligence have enabled more scalable and automated methods for generating MCQs, these AI-driven methods still suffer from many of the same shortcomings as questions created by humans. Current evaluation methods for MCQs rely predominantly on human judgment, which introduces subjectivity and lacks scalability. Automated evaluation methods provide scalability, but they fall short in assessing the educational value of questions, focusing instead on surface-level features that do not align with expert evaluation.

In this thesis, I explore various methods for creating and evaluating educational content, grounded in learning science research and guided by the use of rubrics. I demonstrate that students, with minimal scaffolding and technological support, are capable of generating high-quality assessments. I have also investigated the potential of involving both students and crowdworkers in generating and evaluating the skills required to solve problems. Building on this, we developed a new method that leverages large language models (LLMs) to enhance the efficiency and accuracy of these processes. Furthermore, I have shown that crowdworkers can effectively use rubrics to evaluate questions with a level of accuracy comparable to that of human experts. Through these crowdsourcing and learnersourcing studies, I examine how specialized knowledge and expertise influence the success of content creation and evaluation. This work culminates in the proposal of the Scalable Automatic Question Usability Evaluation Toolkit (SAQUET), a new standardized method for evaluating educational MCQs.

This work contributes to the fields of educational technology, learning sciences, and human-computer interaction. By harnessing the capabilities of crowdsourcing, learnersourcing, and generative AI, this research demonstrates how the generation and evaluation of educational content can be vastly improved. It introduces a standardized approach to assessment processes, enhancing the quality and consistency of educational evaluations across various domains. By providing a scalable framework that leverages advancements in generative AI, this work propels the field of educational technology forward, addressing critical challenges related to the creation and evaluation of assessments. Ultimately, these contributions offer a foundation for future innovations in educational content development and quality assurance.

History

Date

2024-09-01

Degree Type

  • Dissertation

Department

  • Human-Computer Interaction Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

John Stamper
