Carnegie Mellon University

Navigating Challenges with LLM-based Code Generation using Software-specific Insights

thesis
posted on 2025-06-27, 18:59, authored by Nikitha Rao
<p dir="ltr">The software development process is rapidly evolving with the advancement of Large Language Models (LLMs). LLMs are not only transforming the way code is written but are also increasingly integrated into AI programming tools, such as ChatGPT and GitHub Copilot, to enhance developer productivity by generating pro- grams from natural language instructions, identifying and fixing bugs, generating documentation and so on. </p><p dir="ltr">These LLMs are pretrained on large volumes of natural language and code data. They are trained using cross-entropy and preference losses that have no coefficient for correctness and only optimize for matching the ground truth. Therefore, despite their proficiency in learning code syntax, they fall short in capturing semantic signals. To date, the main focus of efforts to improve these models has been training larger models and collecting more human preference data. However, user studies have found notable issues with the usability of these larger models, including difficulty in understanding the generated code, the presence of subtle bugs that are hard to find, and a lack of verification of the generated code</p><p dir="ltr">This dissertation demonstrates that integrating domain insights from software engineering into AI-based code generation can enhance reliability and utility for developers. This is done by empowering the model to take on a more active role in building valid and usable code, instilling greater trust among users in the capabilities of the model. I focus on three main challenges identified by prior work and propose solutions using software-specific insights. </p><p dir="ltr">(1) The generated code can be difficult to understand and manipulate, especially for non-expert programmers. To address this, I contribute LOWCODER, a tool that abstracts away the syntactic complexity associated with traditional code and pro vides a more user-friendly interface using drag-and-drop functionality. As a result, LOWCODER provides a trusted environment where users can leverage the capabilities of AI without the need for extensive coding knowledge. </p><p dir="ltr">(2) Verifying the correctness of the generated code is hard. While LLMs excel at generating code, they are lacking when it comes to generating tests. This is largely because current models are trained on individual files and therefore can not consider the code under test context. To overcome this, I contribute CAT-LM, a LLM trained to explicitly consider the mapping between code and test files. CAT-LM can there fore help users with verifying code that they or other models generate, by generating tests that align more coherently with the underlying code. </p><p dir="ltr">(3) The generated code often has subtle bugs that are hard to find. To address this, I contribute DIFFSPEC, a framework for generating differential tests with LLMs using prompt chaining to verify code correctness. DIFFSPEC makes use of various software artifacts like natural language specification documents, source code, existing tests, and previous bug reports to generate tests to not only verify code correctness, but also checks for conformance against the specification. 
(3) The generated code often has subtle bugs that are hard to find. To address this, I contribute DIFFSPEC, a framework for generating differential tests with LLMs using prompt chaining. DIFFSPEC draws on various software artifacts, including natural language specification documents, source code, existing tests, and previous bug reports, to generate tests that not only verify code correctness but also check conformance against the specification. By highlighting meaningful behavioral differences between implementations (see the sketch following this abstract), DIFFSPEC can enhance the overall reliability of even extensively tested software systems.

The goal of my dissertation is to demonstrate the significance of integrating software-specific insights when training models to make code generation more reliable and useful for developers. My dissertation work contributes several artifacts, including datasets, evaluation frameworks, and models that are trained by integrating software-specific insights to improve the quality of generated code. Importantly, these models are all quite small relative to cutting-edge general-purpose models like GPT-4. While large, general models can also be very useful for these tasks, they have their own limitations: few companies can afford the immense resources required to train such large models, and most of these models are closed-source and provide only limited (free) access to the community, which can be unreliable. In contrast, my work produces smaller open-source models that are specialized to perform various programming-related tasks, resulting in tools that make code generation more reliable and useful for developers.
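To illustrate the differential-testing idea behind DIFFSPEC, here is a minimal, self-contained Python sketch. DIFFSPEC itself chains LLM prompts over specifications, code, existing tests, and bug reports to derive its tests; the harness below instead uses a hand-written input generator and two toy leap-year implementations, all of which are illustrative assumptions rather than parts of the actual framework.

import random

def differential_test(impl_a, impl_b, gen_input, trials=1000, seed=0):
    # Run both implementations on the same inputs and collect any
    # behavioral differences: mismatched outputs, or one side raising
    # an exception where the other does not.
    rng = random.Random(seed)
    differences = []
    for _ in range(trials):
        x = gen_input(rng)
        outcomes = []
        for impl in (impl_a, impl_b):
            try:
                outcomes.append(("ok", impl(x)))
            except Exception as exc:
                outcomes.append(("raised", type(exc).__name__))
        if outcomes[0] != outcomes[1]:
            differences.append((x, outcomes[0], outcomes[1]))
    return differences

def is_leap_gregorian(year):
    # Reference implementation following the Gregorian calendar rules.
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

def is_leap_naive(year):
    # Buggy variant: wrong for century years such as 1900.
    return year % 4 == 0

if __name__ == "__main__":
    diffs = differential_test(
        is_leap_gregorian, is_leap_naive,
        gen_input=lambda rng: rng.randrange(1, 10000),
    )
    print(f"{len(diffs)} diverging inputs found")
    for year, ref, naive in diffs[:5]:
        print(year, ref, naive)

Running the harness surfaces years like 1900 where the two implementations disagree, which is the essence of differential testing: divergence between implementations pinpoints a bug or a specification-conformance gap without needing a hand-written oracle for every input.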

History

Date

2025-05-01

Degree Type

  • Dissertation

Thesis Department

  • Software and Societal Systems (S3D)

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Vincent J. Hellendoorn, Claire Le Goues
