Posted on 2004-10-01, 00:00. Authored by Pinar Donmez, Jaime G. Carbonell
Active learning consists of principled on-line sampling over
unlabeled data to optimize supervised learning rates as a function
of the number of labels requested from an external oracle.
A new sampling technique for active learning is developed
based on two key principles: 1) Balanced sampling on both
sides of the decision boundary is more effective than sampling
one side disproportionately, and 2) exploiting the natural
grouping (clustering) of unlabeled data establishes a more
meaningful non-Euclidean distance function with respect to
estimated category membership. Our new paired-sampling
density-sensitive method, which embodies these principles, yields
significantly better performance on multiple active learning
data sets than every other sampling method in our comparative
study: representative sampling, uncertainty sampling,
density-based sampling, and random sampling.
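
To make the first principle concrete, the following minimal Python sketch contrasts plain uncertainty sampling (one of the baselines named above) with a simple balanced pick that takes one point from each side of the decision boundary. It is an illustration under stated assumptions, not the authors' method: the function names, the 0.5 threshold on a binary classifier's estimated probabilities, and the toy pool are all hypothetical, and the density-sensitive distance of the second principle is not modeled here.

```python
def uncertainty_sample(probs, k=1):
    """Return indices of the k pool points whose estimated P(y=1) is closest to 0.5.

    `probs` is a list of probabilities from some binary classifier (assumed).
    """
    order = sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))
    return order[:k]


def balanced_pair_sample(probs):
    """Return the most uncertain index on each side of the 0.5 boundary.

    Hypothetical illustration of balanced sampling; a side with no points
    contributes nothing.
    """
    negative_side = [i for i, p in enumerate(probs) if p < 0.5]
    positive_side = [i for i, p in enumerate(probs) if p >= 0.5]
    picks = []
    for side in (negative_side, positive_side):
        if side:
            picks.append(min(side, key=lambda i: abs(probs[i] - 0.5)))
    return picks


# Toy pool: estimated P(y=1) for five unlabeled points (made-up values).
pool_probs = [0.05, 0.45, 0.48, 0.70, 0.95]

print(uncertainty_sample(pool_probs, k=2))  # [2, 1]: both picks fall on the negative side
print(balanced_pair_sample(pool_probs))     # [2, 3]: one pick from each side
```

On this toy pool, uncertainty sampling spends both label requests on the same side of the boundary, while the balanced pick queries one point per side, which is the kind of imbalance the first principle is meant to avoid.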