
The Great Fuzzy Hashing Debate

posted on 2024-04-26, 21:03, authored by Edward Schwartz

In the first post in this series, we introduced the use of hashing techniques to detect similar functions in reverse engineering scenarios. We described position-independent code (PIC) hashing, the hashing technique we use in SEI Pharos, as well as some terminology and metrics for evaluating how well a hashing technique works. We left off after showing that PIC hashing performs poorly in some cases and wondering aloud whether it is possible to do better. In this post, we will try to answer that question by introducing and experimenting with a very different type of hashing called fuzzy hashing. Like regular hashing, there is a hash function that reads a sequence of bytes and produces a hash. Unlike regular hashing, though, you don't compare fuzzy hashes for equality. Instead, there is a similarity function that takes two fuzzy hashes as input and returns a number between 0 and 1, where 0 means completely dissimilar and 1 means completely similar.
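
To make that interface concrete, here is a small sketch of what a fuzzy-hashing API looks like. The names fuzzy_hash and similarity and the shingle-set digest are illustrative placeholders, loosely in the spirit of LZJD's Jaccard-style comparison, not the API of any particular library:

```python
# Illustrative sketch only: fuzzy_hash and similarity are placeholder names,
# not the API of a real fuzzy-hashing library.

def fuzzy_hash(data: bytes) -> set:
    """A toy digest: the set of (hashed) 4-byte windows in the input."""
    return {hash(data[i:i + 4]) for i in range(max(len(data) - 3, 0))}

def similarity(a: set, b: set) -> float:
    """Jaccard similarity of two digests: 0.0 = dissimilar, 1.0 = identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

h1 = fuzzy_hash(b"\x55\x48\x89\xe5\x48\x83\xec\x10")  # some instruction bytes
h2 = fuzzy_hash(b"\x55\x48\x89\xe5\x48\x83\xec\x20")  # same bytes, one byte changed
print(similarity(h1, h2))  # high, but less than 1.0
```

A real fuzzy hash such as LZJD builds its digest more carefully (from a Lempel-Ziv dictionary of the input, min-hashed down to a fixed size), but the contract is the same: compute a digest per function, then compare digests with the similarity function.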

My colleague Cory Cohen and I debated whether there is utility in applying fuzzy hashes to instruction bytes, and that debate motivated this blog post. I thought there would be a benefit; Cory felt there would not. Hence, these experiments. For this blog post, I'll be using the Lempel-Ziv Jaccard Distance (LZJD) fuzzy hash because it's fast, whereas most fuzzy hash algorithms are slow. A fast fuzzy hashing algorithm opens up the possibility of using fuzzy hashes to search for similar functions in a large database, among other interesting possibilities. As a baseline, I'll also be using Levenshtein distance, which measures how many changes you need to make to one string to transform it into another. For example, the Levenshtein distance between "cat" and "bat" is 1 because you only need to change the first letter. Levenshtein distance lets us define an optimal notion of similarity at the instruction-byte level. The tradeoff is that it's very slow, so it's only useful as a baseline in our experiments.
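
To show what this baseline actually computes, here is a minimal sketch of the textbook dynamic-programming Levenshtein distance, along with one way to normalize it into a 0-to-1 similarity score so it can be compared against fuzzy hash similarities. The function names and the normalization by the longer input's length are my own choices for illustration, not necessarily what the experiments use:

```python
# Minimal sketch of the Levenshtein-distance baseline. This is the O(n*m)
# dynamic-programming version: fine as a baseline, far too slow at scale.

def levenshtein(a: bytes, b: bytes) -> int:
    # prev[j] holds the edit distance between a[:i-1] and b[:j]; curr builds row i.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(a: bytes, b: bytes) -> float:
    # One possible normalization into [0, 1]: 1 minus distance over the longer length.
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

print(levenshtein(b"cat", b"bat"))             # 1
print(levenshtein_similarity(b"cat", b"bat"))  # ~0.67
```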

Publisher Statement

This material is based upon work funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. The views, opinions, and/or findings contained in this material are those of the author(s) and should not be construed as an official Government position, policy, or decision, unless designated by other documentation. References herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise do not necessarily constitute or imply its endorsement, recommendation, or favoring by Carnegie Mellon University or its Software Engineering Institute. This report was prepared for the SEI Administrative Agent, AFLCMC/AZS, 5 Eglin Street, Hanscom AFB, MA 01731-2100.

NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN "AS-IS" BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT.

[DISTRIBUTION STATEMENT A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

Copyright Statement

Copyright 2024 Carnegie Mellon University.
