Maria-Florina Balcan, Avrim Blum, Yang Ke
Co-training is a method for combining labeled and unlabeled data when each
example can be thought of as containing two distinct sets of features (for
instance, classifying web pages using the page text as one view and the
anchor text of inbound hyperlinks as the other). It has had a number of
practical successes, yet previous theoretical analyses have required very
strong assumptions on the data that are unlikely to be satisfied in practice.
In this paper, we propose a much weaker "expansion" assumption on the
underlying data distribution, which we prove is sufficient for iterative
co-training to succeed given appropriately strong PAC-learning algorithms
on each feature set, and which to some extent is necessary as well. This
expansion assumption in fact motivates the iterative nature of the original
co-training algorithm, unlike stronger assumptions (such as independence
given the label) that allow a simpler one-shot co-training to succeed.
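
As a concrete illustration of the iterative scheme, here is a minimal sketch, assuming scikit-learn-style classifiers and NumPy feature arrays; the function name and parameters are our own for exposition, not from the paper. Each half-round, a classifier trained on one view promotes its most confident predictions to labeled data, which the other view then trains on.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X1, X2, y_labeled, labeled_idx, rounds=10, add_per_round=5):
    """Grow the labeled set by alternating between the two feature views.

    X1, X2: (n, d) arrays, one row per example in each view.
    y_labeled: labels for labeled_idx; all other examples start unlabeled.
    """
    labeled = set(labeled_idx)
    labels = dict(zip(labeled_idx, y_labeled))
    for _ in range(rounds):
        for X in (X1, X2):  # alternate between the two views
            unlabeled = [i for i in range(len(X1)) if i not in labeled]
            if not unlabeled:
                return labels
            idx = sorted(labeled)
            clf = GaussianNB().fit(X[idx], [labels[i] for i in idx])
            proba = clf.predict_proba(X[unlabeled])
            conf = proba.max(axis=1)
            # Promote the most confident predictions to (self-)labeled data,
            # which the other view trains on in the next half-round.
            for j in np.argsort(-conf)[:add_per_round]:
                i = unlabeled[j]
                labels[i] = clf.classes_[proba[j].argmax()]
                labeled.add(i)
    return labels
```

Expansion is what makes this loop productive: it guarantees that the set of examples at least one view can confidently label keeps growing from round to round.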
We also give a heuristic analysis of how noise in the data affects
performance. The predicted behavior is qualitatively matched in synthetic
experiments on expander graphs.
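
The abstract does not specify the experimental setup, so the following is only a hypothetical sketch of the qualitative prediction: on a random d-regular graph, which is an expander with high probability, a small set of confidently labeled nodes should grow geometrically when confidence propagates along edges.

```python
# Hypothetical illustration, not the paper's actual experiment: simulate
# how a confidently labeled set grows round by round on an expander.
import networkx as nx

def coverage_history(n=1000, d=4, n_seeds=5, rounds=15, rng_seed=0):
    # Random d-regular graphs are expanders with high probability.
    g = nx.random_regular_graph(d, n, seed=rng_seed)
    confident = set(range(n_seeds))  # initially labeled nodes
    history = [len(confident)]
    for _ in range(rounds):
        # Every neighbor of a confident node becomes confident.
        confident |= {v for u in confident for v in g.neighbors(u)}
        history.append(len(confident))
    return history

print(coverage_history())  # expect geometric growth, then saturation
```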