Understanding and Mitigating Biases in Evaluation
Many real-world applications involve collecting and aggregating evaluations from people, such as hiring, peer grading, and conference peer review. In this thesis, we focus on three sources of bias that arise in such problems: people, estimation, and policies. Specifically, people provide the evaluation data; estimation procedures perform inference and draw conclusions from the provided data; and policies specify the details needed to execute the entire process. We model and analyze these biases, and subsequently propose methods to mitigate them.

First, we study human bias, that is, the bias in the evaluation data introduced by human evaluators. We consider the miscalibration aspect, meaning that different
people have different calibration scales. We propose randomized algorithms that provably extract useful information under a general model we propose for arbitrary
miscalibration. Building upon these results, we also propose a heuristic that is applicable to a broader range of settings. In addition to miscalibration, we also consider
the bias induced by the “outcome” experienced by people. As an example, when students rate their course instructors, the students’ ratings are influenced by the grades
that the students receive in these courses. We make mild assumptions to model such biases, and propose an adaptive algorithm that corrects this bias using knowledge
about the “outcomes”.

Second, we study estimation bias, that is, when algorithms exhibit different behaviors on different subgroups of the population. We consider the problem of estimating the quality of individual items from pairwise comparison data. We analyze the statistical bias (defined as the expectation of the estimated value minus the true value) of the maximum-likelihood estimator, and then propose a simple modification to the estimator that reduces this bias.

Third, we study policy bias, that is, when the rules dictating the evaluation process induce undesirable outcomes. We examine large-scale multi-attribute evaluation tasks. For example, in graduate admissions, the evaluation criteria often consist of multiple attributes, such as school GPAs, standardized test scores, recommendation letters, and research experience. The number of applications is large, so the evaluation task needs to be divided and assigned to many reviewers in a distributed fashion. It is common practice to assign each reviewer a subset of the applications, and to ask them to assess all relevant information for their assigned subset.
In contrast, we propose an alternative approach where each reviewer evaluates more applicants but fewer attributes per applicant. We establish various tradeoffs between
these two approaches, and identify conditions under which our proposed approach results in better evaluation.
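The abstract does not spell out the thesis's model or the exact conditions of these tradeoffs, so the following is only a toy sketch under assumed ingredients: a per-reviewer additive miscalibration offset and i.i.d. reading noise (the quality model, noise levels, and all names are invented for illustration). It contrasts the common per-applicant assignment with a per-attribute assignment at the same total reading budget.

```python
import numpy as np

rng = np.random.default_rng(0)

n_applicants, n_attrs, n_reviewers = 300, 4, 30
attrs = rng.normal(size=(n_applicants, n_attrs))   # true attribute values
quality = attrs.mean(axis=1)                       # true overall quality

bias = rng.normal(scale=1.0, size=n_reviewers)     # per-reviewer offset (miscalibration)
noise = 0.3                                        # per-reading noise level

def read(i_app, i_attr, i_rev):
    """One reviewer's noisy, miscalibrated reading of one attribute."""
    return attrs[i_app, i_attr] + bias[i_rev] + rng.normal(scale=noise)

# Approach A (common practice): one reviewer reads *all* attributes
# of each applicant assigned to them.
score_a = np.empty(n_applicants)
for i in range(n_applicants):
    rev = i % n_reviewers
    score_a[i] = np.mean([read(i, a, rev) for a in range(n_attrs)])

# Approach B (attribute split): each attribute of an applicant goes to a
# different reviewer, so each reviewer covers more applicants but fewer
# attributes per applicant. Total number of readings is the same as in A.
score_b = np.empty(n_applicants)
for i in range(n_applicants):
    revs = rng.choice(n_reviewers, size=n_attrs, replace=False)
    score_b[i] = np.mean([read(i, a, r) for a, r in zip(range(n_attrs), revs)])

def rank_corr(x, y):
    """Spearman correlation via ranks (no SciPy needed; ties are unlikely here)."""
    rx, ry = np.argsort(np.argsort(x)), np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

print("rank corr, per-applicant split:", round(rank_corr(quality, score_a), 3))
print("rank corr, per-attribute split:", round(rank_corr(quality, score_b), 3))
```

In this toy regime the attribute split averages each applicant's score over several reviewers' offsets, so miscalibration cancels faster and the recovered ranking is closer to the truth; under other assumptions, such as attributes that require cross-attribute context to judge, the comparison could go the other way, which is the kind of tradeoff condition the thesis characterizes.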

Finally, we briefly describe our outreach efforts to improve the peer review process – reducing the bias caused by alphabetical ordering of authors in scientific publications, and analyzing the gender distribution of the recipients of conference paper awards.
- Doctor of Philosophy (PhD)