Carnegie Mellon University

Practical Supervised Machine Learning Classification of Highly Imbalanced Text

report
posted on 2025-05-22, 13:23 authored by Austin Whisnant

As the insider threat problem grows and becomes more widely understood, software vendors have started offering more solutions for detecting, preventing, and evaluating the risks of insiders. It is important that these and future solutions are founded on reliable data and evidence-based research. This paper describes research into how to efficiently collect and classify United States Attorneys’ Office (USAO) press releases to determine which ones describe an insider threat. The goal is an automated process that collects as many insider threat court cases as possible, building a repository of such cases to support ongoing research. SEI researchers used a machine learning model that gathered and encoded data from USAO press releases. They applied this model to a corpus of over 200,000 press releases and classified over 24,000 of them, dating back to 2013, as discussing insider threat. They will continue to use the model to collect new cases for the SEI insider threat repository.

Publisher Statement

This material is based upon work funded and supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. The view, opinions, and/or findings contained in this material are those of the author(s) and should not be construed as an official Government position, policy, or decision, unless designated by other documentation. References herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by Carnegie Mellon University or its Software Engineering Institute. This report was prepared for the SEI Administrative Agent AFLCMC/AZS 5 Eglin Street Hanscom AFB, MA 01731-2100. NO WARRANTY. THIS CARNEGIE MELLON UNIVERSITY AND SOFTWARE ENGINEERING INSTITUTE MATERIAL IS FURNISHED ON AN "AS-IS" BASIS. CARNEGIE MELLON UNIVERSITY MAKES NO WARRANTIES OF ANY KIND, EITHER EXPRESSED OR IMPLIED, AS TO ANY MATTER INCLUDING, BUT NOT LIMITED TO, WARRANTY OF FITNESS FOR PURPOSE OR MERCHANTABILITY, EXCLUSIVITY, OR RESULTS OBTAINED FROM USE OF THE MATERIAL. CARNEGIE MELLON UNIVERSITY DOES NOT MAKE ANY WARRANTY OF ANY KIND WITH RESPECT TO FREEDOM FROM PATENT, TRADEMARK, OR COPYRIGHT INFRINGEMENT. [DISTRIBUTION STATEMENT A] This material has been approved for public release and unlimited distribution. Please see Copyright notice for non-US Government use and distribution.

Copyright Statement

Copyright 2025 Carnegie Mellon University.
