Evaluating Static Analysis Alerts with LLMs
For safety-critical systems in areas such as defense and medical devices, software assurance is crucial. Static analysis tools let analysts evaluate source code without running it and identify potential vulnerabilities. Despite their usefulness, the current generation of heuristic static analysis tools requires significant manual effort and is prone to producing both false positives (spurious warnings) and false negatives (missed warnings). Recent SEI research estimates that these tools can flag up to one candidate error ("weakness") every three lines of code, so engineers often choose to prioritize fixing only the most common and severe errors.

Less common errors can still lead to critical vulnerabilities, however. For example, a "flooding" attack on a network-based service can overwhelm a target with requests, causing the service to crash. Yet neither of the related weaknesses ("improper resource shutdown or release" and "allocation of resources without limits or throttling") appears on the 2023 Top 25 Most Dangerous CWEs list, the Known Exploited Vulnerabilities (KEV) Top 10 list, or the 2019-2023 Stubborn Top 25 CWEs list.

In our research, large language models (LLMs) have shown promising initial results in adjudicating static analysis alerts and providing rationales for those adjudications, offering possibilities for better vulnerability detection. In this blog post, we discuss our initial experiments using GPT-4 to evaluate static analysis alerts. We also explore the limitations of using LLMs for static analysis alert evaluation and opportunities for collaborating with us on future work.
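To make the flooding example above concrete, the fragment below is a minimal, hypothetical C sketch (the function and constant names are ours, not drawn from any analyzed codebase) of the kind of pattern behind "allocation of resources without limits or throttling": every incoming request triggers a heap allocation with no cap on outstanding requests, so a flood of requests can exhaust memory and bring the service down. A static analysis tool might report the unbounded allocation as a candidate weakness, and an analyst (or an LLM) would then have to adjudicate whether the alert is a true or false positive in context.

```c
/* Illustrative sketch of an unthrottled per-request allocation.
 * Nothing limits how many request objects can be outstanding at once,
 * so a request flood can exhaust memory and crash the service. */
#include <stdlib.h>
#include <string.h>

#define REQUEST_BUFFER_SIZE 4096

struct request {
    char *payload;   /* heap buffer, one per request */
    size_t length;   /* bytes copied into the buffer */
};

/* Hypothetical handler: allocates for every request without checking
 * how many requests are already in flight (no quota, no throttling). */
struct request *handle_request(const char *data, size_t len) {
    struct request *req = malloc(sizeof(*req));
    if (req == NULL) {
        return NULL;
    }

    req->payload = malloc(REQUEST_BUFFER_SIZE);  /* unbounded allocation */
    if (req->payload == NULL) {
        free(req);
        return NULL;
    }

    req->length = (len < REQUEST_BUFFER_SIZE) ? len : REQUEST_BUFFER_SIZE;
    memcpy(req->payload, data, req->length);

    /* Caller must release the request; a missed release is the companion
     * weakness, improper resource shutdown or release. */
    return req;
}
```

Whether this particular allocation is exploitable depends on the surrounding code (request rate limiting elsewhere, process memory limits, and so on), which is exactly the kind of contextual judgment that alert adjudication requires.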