Large language models (LLMs) have shown tremendous potential across a wide range of applications. However, deploying LLMs without human supervision and evaluation could lead to significant errors and, in the worst case, loss of life. In this blog post from the Carnegie Mellon University Software Engineering Institute (SEI), we outline the fundamentals of LLM evaluation for text summarization in high-stakes applications such as intelligence report summarization. We first discuss the challenges of LLM evaluation, then give an overview of the current state of the art, and finally detail how we are filling the identified gaps at the SEI.