Posted on 2002-09-01, 00:00. Authored by Maria-Florina Balcan, Avrim Blum, Nathan Srebro.
We continue the investigation of natural conditions
for a similarity function to allow learning, without
requiring the similarity function to be a valid kernel,
or referring to an implicit high-dimensional
space. We provide a new notion of a “good similarity
function” that builds upon the previous definition
of Balcan and Blum (2006) but improves
on it in two important ways. First, as with the previous
definition, any large-margin kernel is also a
good similarity function in our sense, but the translation
now results in a much milder increase in the
labeled sample complexity. Second, we prove that
for distribution-specific PAC learning, our new notion
is strictly more powerful than the traditional
notion of a large-margin kernel. In particular, we
show that for any hypothesis class C there exists
a similarity function under our definition allowing
learning with O(log |C|) labeled examples. However,
in a lower bound which may be of independent
interest, we show that for any class C of pairwise
uncorrelated functions, there is no kernel with
margin γ ≥ 8/√|C| for all f ∈ C, even if one
allows average hinge-loss as large as 0.5. Thus,
the sample complexity for learning such classes
with SVMs is Ω(|C|). This extends work of Ben-
David et al. (2003) and Forster and Simon (2006)
who give hardness results with comparable margin
bounds, but at much lower error rates.
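Restating the quantitative gap described above (this is only a summary of the bounds already stated, not an additional result):

\[
\underbrace{O(\log|C|)}_{\substack{\text{labeled examples with a good}\\ \text{similarity function (}L_1\text{ learning)}}}
\qquad \text{vs.} \qquad
\underbrace{\Omega(|C|)}_{\substack{\text{labeled examples with SVMs, since no kernel}\\ \text{has margin } \gamma \ge 8/\sqrt{|C|} \text{ on all of } C}}
\]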
Our new notion of similarity relies upon L1-regularized
learning, and our separation result is related
to a separation between what is learnable
with L1 versus L2 regularization.
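As a rough illustration of the kind of learning rule this notion supports (a minimal sketch, not the authors' algorithm: the similarity function, toy data, landmark count, and regularization strength below are all illustrative assumptions), one can map each example to its similarities with a set of unlabeled landmark points and then fit an L1-regularized linear separator over that representation:

```python
# Sketch of similarity-based learning with L1 regularization.
# Assumptions (not from the paper): an RBF-style similarity `sim`,
# Gaussian-blob toy data, d = 50 landmarks, and C = 1.0 for the L1 penalty.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sim(a, b):
    # Illustrative bounded similarity in (0, 1]; any pairwise similarity
    # function could be plugged in here -- it need not be a valid kernel.
    return np.exp(-np.sum((a - b) ** 2, axis=-1))

# Toy data: two Gaussian blobs with labels in {-1, +1}.
n, dim = 400, 5
X = np.vstack([rng.normal(-1, 1, (n // 2, dim)),
               rng.normal(+1, 1, (n // 2, dim))])
y = np.array([-1] * (n // 2) + [+1] * (n // 2))

# Unlabeled "landmark" points drawn from the same distribution
# (here simply a random subset of the data).
d = 50
landmarks = X[rng.choice(n, size=d, replace=False)]

def phi(X):
    # Empirical similarity map: phi(x) = (K(x, l_1), ..., K(x, l_d)).
    return np.stack([sim(X, l) for l in landmarks], axis=1)

# L1-regularized linear learning over the similarity features.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(phi(X), y)
print("training accuracy:", clf.score(phi(X), y))
print("nonzero landmark weights:", np.count_nonzero(clf.coef_))
```

The L1 penalty is what makes the weight vector over landmarks sparse, which is in the spirit of the abstract's reliance on L1-regularized rather than L2-regularized (kernel/SVM-style) learning.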