Scaling Software Security Analysis to Millions of Malicious Programs and Billions of Lines of Code

Jang, Jiyong

doi:10.1184/R1/6721400.v1

thesis

posted on 2013-08-01, 00:00 authored by Jiyong Jang

Software security is a big data problem. The volume of new software artifacts created far outpaces the current capacity of software analysis. This gap poses an urgent challenge for our security community: scalability. If our techniques cannot cope with an ever-increasing volume of software, we will always be one step behind attackers. Developing scalable analysis to bridge the gap is therefore essential.

In this dissertation, we argue that automatic code reuse detection enables an efficient data reduction of a high volume of incoming malware for downstream analysis and enhances software security by efficiently finding known vulnerabilities across large code bases. In order to demonstrate the benefits of automatic software similarity detection, we discuss two representative problems that are remedied by scalable analysis: malware triage and unpatched code clone detection.

First, we tackle the onslaught of malware. Although over one million new malware samples are reported each day, existing research shows that most malware is not written from scratch; instead, most samples are automatically generated variants of existing malware. When groups of highly similar variants are clustered together, new malware stands out more easily. Unfortunately, current systems struggle to handle this high volume of malware. We scale clustering using feature hashing and perform semantic analysis using co-clustering. Our evaluation demonstrates that these techniques are an order of magnitude faster than previous systems and automatically discover highly correlated features and malware groups. Furthermore, we design algorithms to infer evolutionary relationships among malware, which helps analysts understand trends over time and make informed decisions about which malware to analyze first.
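As an illustration of the feature hashing idea mentioned above, the following is a minimal sketch and not the dissertation's implementation. It assumes byte n-grams as features, a 1024-bit fingerprint, and Jaccard similarity over set bits; the function names and parameters are hypothetical choices made for this example.

```python
import hashlib


def byte_ngrams(data: bytes, n: int = 4) -> set:
    """Return the set of byte n-grams in data (assumed feature type)."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def hash_features(features, num_bits: int = 1024) -> int:
    """Hash each feature to one bit position of a fixed-size fingerprint.

    The fingerprint is stored as a Python int used as a bit vector, so its
    size is bounded by num_bits no matter how many features a sample has.
    """
    bits = 0
    for feature in features:
        digest = hashlib.sha1(feature).digest()
        index = int.from_bytes(digest[:8], "big") % num_bits
        bits |= 1 << index
    return bits


def jaccard_bits(a: int, b: int) -> float:
    """Approximate Jaccard similarity computed on two bit-vector fingerprints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # two empty fingerprints are treated as identical
    return bin(a & b).count("1") / union


def similarity(x: bytes, y: bytes) -> float:
    """Convenience wrapper: fingerprint both inputs and compare them."""
    return jaccard_bits(hash_features(byte_ngrams(x)),
                        hash_features(byte_ngrams(y)))
```

Because every sample is reduced to a fingerprint of the same size, pairwise comparison becomes a few bitwise operations, which is the property that makes clustering large collections cheaper. Hash collisions make the score an approximation of the exact set similarity.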

Second, we address the problem of detecting unpatched code clones at scale. When buggy code gets copied from project to project, eventually all projects will need to be patched. We call clones of buggy code that have been fixed in only a subset of projects unpatched code clones. Unfortunately, code copying is usually ad-hoc and is often not tracked, which makes it challenging to identify all unpatched vulnerabilities in code bases at the scale of entire OS distributions. We scale unpatched code clone detection to spot over 15,000 latent security vulnerabilities in 2.1 billion lines of code from the Linux kernel, all Debian and Ubuntu packages, and all C/C++ projects in SourceForge in three hours on a single machine. To the best of our knowledge, this is the largest set of bugs ever reported in a single paper.
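To make the notion of an unpatched code clone concrete, here is a small sketch under stated assumptions. It is not the dissertation's tool. It assumes that the buggy code is known from a patch, that code is compared as windows of consecutive whitespace-normalized lines, and that a target file is indexed in a Bloom filter (a technique named in the keywords). The class, the function names, and the window size of 3 are hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: no false negatives, rare false positives."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item: str):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def normalize(line: str) -> str:
    """Drop all whitespace and lowercase, so formatting changes are ignored."""
    return "".join(line.split()).lower()


def windows(lines, n: int = 3):
    """Return every run of n consecutive non-empty normalized lines."""
    norm = [normalize(line) for line in lines]
    norm = [line for line in norm if line]
    return ["\n".join(norm[i:i + n]) for i in range(len(norm) - n + 1)]


def looks_unpatched(buggy_lines, target_lines, n: int = 3) -> bool:
    """True if every window of the known-buggy code appears in the target."""
    buggy = windows(buggy_lines, n)
    if not buggy:
        return False  # snippet shorter than one window: nothing to match
    index = BloomFilter()
    for window in windows(target_lines, n):
        index.add(window)
    return all(window in index for window in buggy)
```

A file that still contains the buggy lines, even with different indentation or spacing, is reported as a candidate, while a file where those lines were changed by the fix is not. Since Bloom filters can yield false positives, a real workflow would confirm each candidate with an exact comparison.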

History

Date

2013-08-01

Degree Type

Dissertation

Department

Electrical and Computer Engineering

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

David Brumley

Keywords

Malware Triage; Feature Hashing; Co-clustering; Hadoop; Unpatched Code Clone; Bloom Filter; Lineage; Binary Analysis; Code Reuse; Big Data

Licence

In Copyright
