GenAI for Code Review of C++ and Java
Since the release of OpenAI’s ChatGPT, many companies have released their own large language models (LLMs) that engineers can use to improve the code development process. Although ChatGPT remains the most popular for general use cases, there are now tools tailored specifically to programming, such as GitHub Copilot and Amazon Q Developer. Inspired by Mark Sherman’s blog post analyzing the effectiveness of ChatGPT-3.5 for C code analysis, this post details our experiment testing and comparing ChatGPT-3.5 and ChatGPT-4o for C++ and Java code review.

We collected examples from the SEI CERT Secure Coding standards for C++ and Java. Each rule in the standards contains a title, a description, noncompliant code examples, and compliant solutions. We analyzed whether ChatGPT-3.5 and ChatGPT-4o would correctly identify errors in the noncompliant code and correctly recognize the compliant code as error-free.

Overall, we found that both models are better at identifying mistakes in noncompliant code than at confirming the correctness of compliant code. They can accurately discover and correct many errors but have a hard time recognizing compliant code as such. Comparing the two, GPT-4o had higher correction rates on noncompliant code and hallucinated less when responding to compliant code. Both GPT-3.5 and GPT-4o were more successful at correcting coding errors in C++ than in Java. In categories where both models often missed errors, prompt engineering improved results by directing the LLM to focus on specific issues when providing fixes or suggestions for improvement.
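For context, the inputs the models reviewed look roughly like the following minimal C++ sketch, written in the spirit of a CERT noncompliant example and its compliant solution (an illustrative use-after-free case similar to MEM50-CPP, not a verbatim excerpt from the standard):

```cpp
#include <iostream>
#include <string>

// Noncompliant (in the spirit of MEM50-CPP): the string is deleted and
// then dereferenced, which is undefined behavior.
void noncompliant() {
    std::string *s = new std::string("hello");
    delete s;
    std::cout << *s << '\n';  // use after free
}

// Compliant solution: use automatic storage so no manual delete is needed
// and the object remains valid for the entire scope.
void compliant() {
    std::string s{"hello"};
    std::cout << s << '\n';
}
```

Each model was asked to review snippets of this kind and either flag the defect in the noncompliant version or confirm that the compliant version is correct.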