Many experts believe that new malware is created at a faster rate than legitimate software. For example, in
2007 a major security solution vendor collected over one million new malware samples. It is
often speculated, though to the best of our knowledge unproven, that new malware is produced by modifying
existing malware through simple tweaks, code composition, or a variety of other techniques. Moreover,
when buggy code is copied from one program to another, both the original and the new program have
to be patched, yet code copying is typically not recorded. Code reuse is therefore a recurring problem in
security.
In this paper we propose BitShred, a fast, scalable algorithm for automatic code reuse detection in binary
code. BitShred identifies the amount of code shared between binaries by calculating the similarity
between them. It can be applied to many security problems, such as malware
clustering and bug finding. We developed a prototype implementation to evaluate our algorithm. The
experimental results show that BitShred is able to detect plagiarism among malware samples and cluster
them efficiently.