Can the Internet grade math? Crowdsourcing a complex scoring task and picking the optimal crowd size
This paper presents crowdsourcing as a novel approach to reducing the grading burden of constructed response assessments. We find that the average rating of 10 workers from a commercial crowdsourcing service can grade student work cheaply ($0.35 per student response) and reliably (Pearson correlation with the teacher's gold-standard scores, ρ = .86 ± .04 for Grade 3 to ρ = .72 ± .07 for Grade 8). The specific context of our proof-of-concept dataset is 3rd-8th grade constructed response math questions. A secondary contribution of the paper is a novel subsampling procedure that splits a large data-collection experiment into many smaller pseudo-experiments in a way that respects within-worker and between-worker variance. The subsampling procedure plays a key role in our finding that the average of 10 workers' scores suffices to produce reliable grades.
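The core measurement above can be illustrated with a minimal sketch: subsample k workers' ratings per response, average them, and correlate the crowd average with the teacher's gold-standard scores. The data below are simulated (ratings modeled as gold score plus noise) and all names and parameters are hypothetical, not the paper's dataset or procedure.

```python
import random
import statistics

def pearson(xs, ys):
    # Pearson correlation between two equal-length lists.
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical data: gold-standard teacher scores on a 0-4 rubric, and a
# pool of simulated crowd-worker ratings per student response.
random.seed(0)
n_responses, pool_size, k = 50, 30, 10
gold = [random.randint(0, 4) for _ in range(n_responses)]
# Each worker's rating = gold score plus noise, clipped to the rubric range.
ratings = [
    [min(4, max(0, g + random.choice([-1, 0, 0, 0, 1]))) for _ in range(pool_size)]
    for g in gold
]

# Subsample k workers per response, average their ratings, then correlate
# the crowd average with the teacher's scores.
crowd_avg = [statistics.mean(random.sample(r, k)) for r in ratings]
print(round(pearson(crowd_avg, gold), 2))
```

Averaging over k workers shrinks per-worker noise roughly by a factor of sqrt(k), which is why a moderate crowd size already tracks the gold-standard scores closely in this simulation.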