In this supplemental material to our paper “Think-aloud interviews: A tool for exploring student statistical reasoning,” we provide question-level summaries of student performance on our assessment questions. The question texts are provided separately in other Supplemental Material.

Assessment metadata

The data presented here comes from the following courses and times:

Course Pre-test date Post-test date Notes
CMU 200 (spring) 2019-01-18 to 2019-01-25 2019-04-23 to 2019-05-07 Pre-test given in Homework 1
CMU 200 (summer) 2019-05-22 to 2019-05-25 2019-06-22 to 2019-06-27 Homework 1 and Homework 8
CMU 200 (fall) 2019-09-05 to 2019-09-11 2019-12-01 to 2019-12-06 Homework 1 and final homework
CMU 202 2019-02-06 to 2019-02-13 Treated as a post-test
Colby 212 (spring) 2019-02-08 to 2019-02-15 2019-04-26 to 2019-05-08 Pre-test given in Homework 1, post in Homework 9
Colby 212 (fall) 2019-08-27 to 2019-09-13 2019-11-15 to 2019-12-03 First and last homework

Because 202 is the next course students take after 200, we treated the 202 assessment, given near the beginning of the semester, as a “post-test”: it represents students who have completed 200.

The summer session of 200 is only six weeks long, so the interval between its pre-test and post-test is much shorter than in the other sections.

All assessments were completed online, through ISLE. Table 2 in the paper indicates how many students completed each assessment round.

Changes in score from pre- to post-test

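The table below reports, for each question, the fraction of correct answers and the mean confidence rating on the pre-test and on the post-test, averaged across all course sections (CMU and Colby), together with the number of responses (N) and the post-test minus pre-test differences. Values are rounded to two decimal places; the difference columns appear to have been computed before rounding, so they can differ by 0.01 from the difference of the displayed values (for example, ap-cs). Note that this table does not account for the respondent bias caused by some students not completing the post-test, as discussed in the paper.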

Confidence is recorded on the scale 0 = Guessed, 1 = Somewhat sure, 2 = Confident.

Question Pre-test correct Pre-test confidence N Post-test correct Post-test confidence N Post - pre correct Post - pre confidence
ap-cs 0.50 1.12 216 0.60 1.27 247 0.09 0.15
apples-oranges 0.25 0.89 252 0.30 1.10 243 0.05 0.21
books 0.78 1.45 231 0.85 1.51 247 0.07 0.06
box-bets 0.72 1.28 227 0.79 1.43 241 0.07 0.15
brown-vegas 0.68 0.98 234 0.76 1.23 244 0.08 0.25
candies 0.49 1.01 241 0.54 1.30 243 0.05 0.28
candy-test 0.52 1.04 100 0.52 1.27 83 0.00 0.23
cmu-pitt 0.66 1.08 239 0.72 1.39 225 0.06 0.30
coffee-headlines 0.78 1.34 96 0.80 1.41 105 0.02 0.07
coin-likely 0.22 1.30 224 0.29 1.47 235 0.06 0.16
coke-pepsi 0.25 1.04 238 0.36 1.26 225 0.12 0.22
colored-coins 0.05 1.09 242 0.09 1.36 219 0.04 0.27
commutes 0.56 1.02 234 0.62 1.33 249 0.06 0.31
corr-match 0.59 1.23 247 0.74 1.57 242 0.16 0.34
cows-chickens 0.32 0.82 211 0.34 1.13 227 0.03 0.31
dice-bet 0.13 0.91 228 0.11 1.29 234 -0.02 0.38
dice-even-out 0.45 0.97 239 0.39 1.22 241 -0.06 0.24
diet-pills 0.56 1.00 96 0.50 1.22 101 -0.07 0.22
euro-coin 0.02 0.75 261 0.02 1.30 234 -0.01 0.55
eyeball-sd 0.60 0.85 188 0.72 1.24 235 0.12 0.39
farm-areas 0.39 0.54 223 0.28 1.09 230 -0.11 0.55
farm-corn 0.53 0.99 90 0.47 1.25 104 -0.06 0.26
fixitol-solvix 0.71 1.38 238 0.66 1.55 240 -0.05 0.17
forecasts 0.54 0.98 241 0.56 1.23 232 0.03 0.25
fruit 0.65 1.57 237 0.74 1.67 245 0.10 0.10
greatest-variation 0.15 0.65 254 0.21 0.83 242 0.07 0.18
handmade-candies 0.61 1.26 228 0.67 1.28 229 0.06 0.02
height-weight 0.81 1.43 237 0.85 1.64 259 0.05 0.21
heights 0.99 1.66 187 0.96 1.69 214 -0.03 0.03
hist-outliers 0.68 1.42 238 0.81 1.60 255 0.14 0.18
horse-races 0.82 0.77 238 0.86 0.91 245 0.04 0.14
investment-success 0.22 0.72 258 0.31 0.99 249 0.09 0.27
lost-draw 0.64 1.04 246 0.72 1.31 227 0.08 0.26
math-courses 0.26 1.10 221 0.37 1.32 236 0.11 0.22
mosaic-independence-I 0.52 0.54 222 0.66 0.92 244 0.14 0.37
mosaic-independence-II 0.19 0.54 222 0.28 0.91 243 0.09 0.37
new-class 0.51 1.03 229 0.59 1.17 254 0.08 0.13
plot-match-test 0.15 0.25 127 0.31 0.75 133 0.16 0.50
plot-match-test-boxplot 0.23 0.28 100 0.68 1.15 87 0.45 0.87
plot-match-test-mosaic 0.36 0.27 97 0.66 1.12 100 0.30 0.85
plot-match-test-scatter 0.57 0.40 89 0.82 1.27 88 0.25 0.87
plot-matching-I 0.90 NaN 216 0.84 NaN 242 -0.06 NaN
plot-matching-II 0.33 NaN 216 0.64 NaN 241 0.31 NaN
plot-matching-III 0.40 NaN 217 0.65 NaN 243 0.25 NaN
plot-matching-IV 0.57 NaN 217 0.78 NaN 244 0.21 NaN
pools 0.70 1.12 97 0.78 1.43 91 0.08 0.30
pym-pike 0.71 1.32 222 0.76 1.44 247 0.05 0.12
quiz-questions 0.11 0.30 231 0.19 0.74 247 0.09 0.44
sample-space 0.74 1.34 245 0.80 1.52 227 0.06 0.18
sd-converge 0.19 0.81 133 0.15 1.23 150 -0.04 0.42
skin-cream 0.61 1.03 239 0.67 1.07 232 0.06 0.05
sleep-attention 0.73 NaN 101 0.86 NaN 90 0.12 NaN
sleep-attention-I 0.72 0.74 137 0.88 1.28 146 0.16 0.55
sleep-attention-II 0.43 0.73 139 0.57 1.27 147 0.14 0.54
sleep-attention-three-groups 0.45 0.90 89 0.62 1.32 87 0.17 0.42
snake-eyes 0.32 1.00 237 0.40 1.14 251 0.09 0.14
study-time 0.67 1.12 214 0.57 1.23 248 -0.10 0.11
sum-box-bets 0.56 0.88 234 0.64 1.23 247 0.08 0.34
test-variation 0.27 0.69 96 0.45 0.86 93 0.18 0.17
ticket-dependence 0.81 1.46 226 0.79 1.60 246 -0.02 0.13
u-correlation 0.54 0.62 240 0.80 1.23 251 0.26 0.61
u-scatter 0.84 1.58 232 0.81 1.57 231 -0.03 0.00
vitamin-c 0.34 1.11 231 0.41 1.34 230 0.07 0.23
vitamin-randomization 0.12 1.07 108 0.15 1.32 92 0.03 0.24
wacky-alpha 0.13 0.48 224 0.12 1.15 250 -0.02 0.67
win-half 0.16 1.16 244 0.13 1.38 211 -0.03 0.22

(Due to technical limitations, confidence on the multi-part plot-matching question is not available here; missing confidence values are shown as NaN.)
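As an illustration of how one row of such a table could be assembled, the following minimal Python sketch computes the fraction correct, the mean confidence on the 0 to 2 scale, and N for a single question. The record layout (pairs of correctness and confidence label), the function name `summarize`, and the example responses are all assumptions made for illustration; they are not the study's data or its actual processing code.

```python
import math

# Numeric coding of the confidence scale stated in the text.
CONFIDENCE_SCALE = {"Guessed": 0, "Somewhat sure": 1, "Confident": 2}


def summarize(responses):
    """Summarize one question in one test round.

    `responses` is a list of (is_correct, confidence_label) pairs. The label
    may be None when confidence was not recorded (hypothetical layout).
    Returns (fraction_correct, mean_confidence, n).
    """
    n = len(responses)
    if n == 0:
        return (math.nan, math.nan, 0)
    fraction_correct = sum(1 for ok, _ in responses if ok) / n
    scores = [CONFIDENCE_SCALE[label] for _, label in responses if label is not None]
    # With no recorded confidence at all, report NaN rather than a number.
    mean_confidence = sum(scores) / len(scores) if scores else math.nan
    return (fraction_correct, mean_confidence, n)


# Made-up responses for illustration only; these are not study data.
pre = [(True, "Confident"), (False, "Guessed"),
       (False, "Somewhat sure"), (True, "Somewhat sure")]
post = [(True, "Confident"), (True, "Confident"),
        (False, "Somewhat sure"), (True, "Somewhat sure")]

pre_correct, pre_conf, pre_n = summarize(pre)
post_correct, post_conf, post_n = summarize(post)
print(f"pre:  correct={pre_correct:.2f} confidence={pre_conf:.2f} N={pre_n}")
print(f"post: correct={post_correct:.2f} confidence={post_conf:.2f} N={post_n}")
print(f"post - pre: correct={post_correct - pre_correct:.2f} "
      f"confidence={post_conf - pre_conf:.2f}")
```

Taking differences on the unrounded summaries and rounding only for display, as in this sketch, is one way the final two columns can differ slightly from the difference of the rounded columns.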

Pre-test stability

The figure for this section compares pre-test question “difficulty” by school and course (CMU’s 36-200 vs. Colby’s SC212). Before we can safely compare post-tests and gains across schools, we must first explore whether pre-tests are similarly difficult at different schools. This helps evaluate whether the student populations entering each school’s introductory course are very different.

We find that most questions have similar pre-test results: only a few questions vary by as much as 25 percentage points between Colby’s SC212 and CMU’s 36-200. Curiously, the Colby students tended to do a bit better on the easier questions, while CMU students tended to do a bit better on the harder questions.

However, most of these questions have fairly small samples, so we also compute marginal 95% confidence intervals (that is, computed separately for each question) for the between-schools difference. Even the apparently large differences are not statistically significant, apart from the following questions: plot-match-test-boxplot, coke-pepsi, farm-corn, ap-cs, sum-box-bets, skin-cream, fruit, fixitol-solvix, and height-weight. Most questions have a confidence interval that is 20 or 30 percentage points wide. In other words, our current samples are too small to meaningfully determine whether CMU and Colby students have equivalent performance on the pre-test.
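To make the interval width concrete, here is a minimal Python sketch of a 95% confidence interval for the difference between two schools' proportions of correct answers. This supplement does not state which interval method was used; the sketch assumes a simple Wald (normal-approximation) interval for two independent proportions. The function name `diff_proportion_ci` and the counts are hypothetical and are not study data.

```python
from math import sqrt
from statistics import NormalDist


def diff_proportion_ci(correct_a, n_a, correct_b, n_b, level=0.95):
    """Wald interval for p_a - p_b, the difference in proportion correct.

    Assumption: the two groups are independent and large enough for the
    normal approximation to be reasonable.
    """
    p_a = correct_a / n_a
    p_b = correct_b / n_b
    # Standard error of the difference of two independent proportions.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Two-sided critical value, about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    diff = p_a - p_b
    return diff - z * se, diff + z * se


# Hypothetical counts: 60 of 100 correct at one school, 40 of 100 at the other.
low, high = diff_proportion_ci(60, 100, 40, 100)
# The difference is "significant" at this level when the interval excludes 0.
significant = not (low <= 0.0 <= high)
print(f"difference CI: ({low:.3f}, {high:.3f}), "
      f"width = {high - low:.3f}, significant = {significant}")
```

With 100 hypothetical students per school, the interval is roughly 27 percentage points wide, which shows how samples of this order can produce intervals in the 20 to 30 point range.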

Confidence

We can plot the change in rate of correct answers (from pre-test to post-test) against the change in student confidence between the tests. The change in rate of correct answers is normalized by the maximum gain that was possible given the pre-test score:

\[ \text{normalized gain} = \frac{\text{post} - \text{pre}}{1 - \text{pre}} \]
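The normalized gain defined above can be sketched in a few lines of Python. The inputs below are the rounded pre-test and post-test fractions from the table of results, so the outputs are approximate; the function name `normalized_gain` and the choice to return NaN when the pre-test score is already 1 are assumptions of this sketch.

```python
import math


def normalized_gain(pre, post):
    """Return (post - pre) / (1 - pre), the gain relative to the possible gain."""
    if pre >= 1.0:
        # No gain is possible when everyone was already correct (assumed handling).
        return math.nan
    return (post - pre) / (1.0 - pre)


# Rounded (pre, post) fractions correct, copied from the table of results.
rows = {
    "u-correlation": (0.54, 0.80),
    "farm-areas": (0.39, 0.28),
    "euro-coin": (0.02, 0.02),
}

for question, (pre, post) in rows.items():
    print(f"{question}: normalized gain = {normalized_gain(pre, post):.2f}")
```

On these rounded inputs, u-correlation recovers about 0.57 of its possible gain, farm-areas has a negative normalized gain of about -0.18, and euro-coin has a gain of 0.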

We would ideally expect most questions to fall along the diagonal, indicating that students became both more confident and more often correct during the semester. However, certain questions do not, notably wacky-alpha, farm-areas, and euro-coin. The farm-areas question is discussed in Section 4 of the paper and appears to reflect a misconception that is reinforced during the course. The wacky-alpha and euro-coin questions both test difficult details of the interpretation of p-values and confidence intervals.

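As shown in the table of results above, all three questions had low pre-test average scores (0.13 for wacky-alpha, 0.39 for farm-areas, and 0.02 for euro-coin), so gain was possible but was not attained, even though average confidence rose by at least 0.55 on each.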