posted on 2008-01-01, 00:00authored byShobha Venkataraman, Avrim Blum, Dawn Song
Automatic signature generation is necessary because
there may often be little time between the discovery of a
vulnerability, and exploits developed to target the vulnerability.
Much research effort has focused on patternextraction
techniques to generate signatures. These have
included techniques that look for a single large invariant
substring of the byte sequences, as well as techniques that
look for many short invariant substrings. Pattern-extraction
techniques are attractive because signatures can be generated
and matched efficiently, and earlier work has shown
the existence of invariants in exploits.
In this paper, we show fundamental limits on the
accuracy of pattern-extraction algorithms for signaturegeneration
in an adversarial setting. We formulate a framework
that allows a unified analysis of these algorithms, and
prove lower bounds on the number of mistakes any patternextraction
learning algorithm must make under common assumptions,
by showing how to adapt results from learning
theory. While previous work has targeted specific algorithms,
our work generalizes these attacks through theoretical
analysis to any algorithm with similar assumptions, not
just the techniques developed so far. We also analyze when
pattern-extraction algorithms may work, by showing conditions
under which these lower bounds are weakened. Our
results are applicable to other kinds of signature-generation
algorithms as well, those that use properties of the exploit
that can be manipulated.