Carnegie Mellon University

Robust Learning with Highly Skewed Category Distributions

thesis
posted on 2025-05-20, 20:49 authored by Selen Uguroglu

Highly skewed category distributions are abundant in many real-world tasks in data mining, such as medical diagnosis (rare diseases), text categorization (rare topics), and fraud detection (when most transactions are legitimate). Under extreme class skew, most supervised learning algorithms tend to minimize loss by labeling every instance with the majority class(es), leading to poor recall on the minority class(es). However, true misclassification costs may be much greater when minority class instances are missed, e.g. a massive but rare fraud missed, or an uncommon life-threatening condition misclassified as benign. Hence, a means of detecting rare but consequential classes is required, and that is the topic of this dissertation.
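The failure mode described above can be made concrete with a small sketch. The class counts (990 majority, 10 minority) and the function name `majority_baseline_metrics` are illustrative assumptions, not taken from the dissertation; the point is only that a classifier that always predicts the majority class scores high accuracy while recalling no minority instances.

```python
def majority_baseline_metrics(n_majority, n_minority):
    """Accuracy and minority-class recall for a classifier that
    labels every instance with the majority class (label 0)."""
    y_true = [0] * n_majority + [1] * n_minority
    y_pred = [0] * len(y_true)  # always predict the majority class

    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))

    accuracy = correct / len(y_true)
    recall = true_positives / n_minority if n_minority else 0.0
    return accuracy, recall


# Hypothetical 99:1 skew, chosen only for illustration.
accuracy, recall = majority_baseline_metrics(990, 10)
print(f"accuracy={accuracy:.2f}, minority recall={recall:.2f}")
```

Under this assumed skew the accuracy looks strong even though every minority instance is missed, which is why recall-sensitive measures such as F-1 are the more informative yardstick in this setting.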

Although learning under extreme class skew has been investigated previously, many challenges remain, e.g., disjunctive majority classes and minority-majority class overlap. Prior research did not consider incorporating the structure of the minority class into the learning process. In this dissertation, we address class imbalance under the compactness hypothesis, i.e., that the minority class forms one or more compact clusters in the feature space. Furthermore, we introduce several learning algorithms to address class imbalance under two other assumptions: a disjunctive majority class and overlapping classes. We also propose new active learning strategies for cases in which there are too few labeled minority class instances to learn accurate concept descriptions in highly skewed settings. Our algorithms are based on a variety of methods and paradigms, including multiple kernel learning, maximum mean discrepancy, and cost-sensitive learning.
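Of the paradigms named above, maximum mean discrepancy (MMD) is the easiest to illustrate in isolation. The following is a minimal sketch of the standard biased empirical MMD² statistic with an RBF kernel; it is not the dissertation's algorithm, and the function names, the `gamma` value, and the toy point sets are all assumptions made for illustration.

```python
import math


def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel exp(-gamma * ||a - b||^2) between two points."""
    squared_distance = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * squared_distance)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased empirical estimate of MMD^2 between samples xs and ys:
    mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, gamma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)


# Toy data (hypothetical): a compact cluster and a distant cluster.
compact_cluster = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1)]
distant_cluster = [(3.0, 3.0), (3.1, 2.9), (2.9, 3.1)]

same = mmd_squared(compact_cluster, compact_cluster)
far = mmd_squared(compact_cluster, distant_cluster)
print(f"MMD^2 (same sample)={same:.3f}, MMD^2 (distant samples)={far:.3f}")
```

The statistic is zero when a sample is compared with itself and grows as the two samples separate in feature space, which is the property that makes it useful for comparing class distributions.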

We evaluate the new and baseline methods on several real-world datasets, with a particular focus on the Women's Ischemia Syndrome Evaluation (WISE) dataset, to demonstrate a practical application in medical diagnosis. We show that when the assumptions are satisfied, leveraging the structure of the classes, such as a compact minority class or a disjunctive majority class, leads to better prediction performance, as quantified by improvements in the F-1 and AUC measures. Our empirical results show improvements in F-1 of as much as 28%.

History

Date

2013-12-01

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Jaime Carbonell
