Exploiting Test Structure to Enhance Language Models for Software Testing
Software testing is an integral part of software development. However, writing high-quality tests is time-consuming and difficult, which often leads to poorly maintained test suites and lower overall software quality. Prior work on automatic test generation, such as EvoSuite and Randoop, can generate high-coverage tests, but these tests are often hard to read, unrealistic, or incorrect, requiring additional developer effort to verify them. In contrast, language models have shown promise in generating human-like, high-quality code, powering code generation tools like Copilot.
However, language models are far less successful at generating tests, struggling both with hallucination and with correctly invoking internal methods of the code under test. This is because code generation language models are typically trained primarily for code generation and code completion. Existing benchmarks also do not resemble real-world development; they consist of simple programming or LeetCode-style problems. To help overcome these limitations, I focus on incorporating domain-specific properties of testing, such as the strong coupling between source and test files and important test execution data, to improve the evaluation and application of language models to software testing. I also examine how to better evaluate test generation approaches, using metrics that are more meaningful to developers and evaluating on larger codebases that more closely resemble real-world development. My thesis statement is: We can exploit the structure of test code and the close relationship between code and test files to improve the evaluation and application of language models to software testing, in both pretraining and fine-tuning. This insight can (a) generate useful unit test cases, (b) identify weaknesses in existing test suites, (c) build more realistic test generation benchmarks, and (d) generate test suites for large-scale projects.
My thesis will make the following contributions:
- It presents a new method for pretraining models for test generation that considers the relationship between source code and test code.
- It provides an approach that automatically classifies mutants as detected or undetected without executing the test suite, by leveraging additional test context (a minimal sketch of this formulation follows this list).
- It evaluates all proposed techniques with metrics and experiments that are practically meaningful to developers but were not considered in prior work.
- It introduces a benchmark for evaluating test generation approaches that is sourced from large-scale open-source repositories and thus more closely resembles real-world test generation.
- It demonstrates the effectiveness of adding execution context to test generation models, which enables generating high-quality test suites for large-scale projects.
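To make the mutant classification contribution concrete, the sketch below frames it as sequence classification: a fine-tuned model receives a mutated source method paired with a test method and predicts whether the test would detect (kill) the mutant, without running the test suite. This is only an illustrative sketch under stated assumptions, not the thesis implementation; the checkpoint name `my-org/pmt-classifier` and the label ordering are placeholders.

```python
# Illustrative sketch of predictive mutation testing as sequence classification.
# Assumption: "my-org/pmt-classifier" is a placeholder for a model fine-tuned
# on (mutated source, test method) pairs labeled detected/undetected.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-org/pmt-classifier")
model = AutoModelForSequenceClassification.from_pretrained("my-org/pmt-classifier")
model.eval()

def predict_mutant_detected(mutated_source: str, test_method: str) -> bool:
    """Predict whether the test would detect the mutant, without executing it."""
    # Encode the mutated source and the test context as a sentence pair;
    # the tokenizer inserts separator tokens between the two segments.
    inputs = tokenizer(mutated_source, test_method,
                       truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Assumed label convention: 1 = detected, 0 = undetected.
    return int(logits.argmax(dim=-1)) == 1
```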
My work (ASE 2023) demonstrated that pretraining language models on the dual objectives of code and test generation significantly improves unit test generation. I also leveraged the joint relationship between code and tests (FSE 2023) to improve predictive mutation testing techniques, modeling mutants at the token level and incorporating both source and test methods during fine-tuning. I improved test generation evaluation (ICLR 2025) by introducing a large test generation benchmark, TestGenEval, sourced from large-scale open-source repositories. Finally, I built a test generation agent (submitted to ICSE 2026) that incorporates execution feedback while scaling to the large open-source repositories in TestGenEval.
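As a rough illustration of the execution-feedback loop behind the agent, the sketch below generates a candidate test, runs it with pytest, and feeds any failure output back into the next generation attempt. The `generate_test` callable stands in for a hypothetical model call; the loop is a simplification, not the agent's actual implementation.

```python
# Simplified sketch of test generation with execution feedback.
# Assumption: generate_test(source_code, feedback) is a hypothetical model call
# that returns candidate test code as a string.
import pathlib
import subprocess
import tempfile

def run_pytest(test_code: str) -> tuple[bool, str]:
    """Run a candidate test file with pytest and return (passed, output)."""
    with tempfile.TemporaryDirectory() as tmp:
        test_file = pathlib.Path(tmp) / "test_candidate.py"
        test_file.write_text(test_code)
        proc = subprocess.run(["pytest", str(test_file), "-q"],
                              capture_output=True, text=True)
        return proc.returncode == 0, proc.stdout + proc.stderr

def generate_with_feedback(generate_test, source_code: str, max_rounds: int = 3) -> str:
    """Iteratively refine a generated test using pytest output as feedback."""
    test_code, feedback = "", ""
    for _ in range(max_rounds):
        test_code = generate_test(source_code, feedback)
        passed, output = run_pytest(test_code)
        if passed:
            break
        feedback = output  # execution context for the next attempt
    return test_code
```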
Date
- 2025-04-08

Degree Type
- Dissertation

Thesis Department
- Software and Societal Systems (S3D)

Degree Name
- Doctor of Philosophy (PhD)