Carnegie Mellon University

Toward a Phish Free World: A Feature-type-aware Cascaded Learning Framework for Phish Detection

thesis
posted on 2025-04-18, 18:42 authored by Guang Xiang

In the real world, many fields have highly skewed class distributions and features that vary dramatically in their classification and runtime performance. Given the huge volume of data on the web, such fields typically require machine learning (ML) techniques with low latency and high performance. Anti-phishing is one such field: it requires a very low False Positive Rate (FP), a reasonably high True Positive Rate (TP), and a fast response time.

In many of these areas, including anti-phishing, however, almost all existing ML-based approaches have focused simply on designing features and building a monolithic model that uses them all at once. A fast response time is of paramount importance to the user experience in a live scenario, and naively extracting values for all features upfront is often overkill.

In our previous work, we proposed a number of anti-phishing approaches that either extend existing URL blacklists in a probabilistic fashion or enhance feature-based anti-phishing methods with novel features. In this thesis, we build on that experience and propose a feature-type-aware cascaded learning framework for a variety of domains with skewed class distributions and features of varying classification and runtime performance, in an effort to achieve a good balance among the three desiderata of TP, FP, and latency. By utilizing lightweight features in the early stages of the cascade and postponing prohibitively expensive ones to later stages, our approach achieves superior runtime performance in general, and can be further improved via parallelization in a distributed computing environment. Moreover, our approach scales to more features and can be optimized in favor of FP or TP depending on the specific domain. In the context of anti-phishing, our cascaded approach achieves a 55.7% reduction in runtime on average over traditional single-stage models, with a low FP of 0.65% and a TP of 83.34%, and thus provides a fast and reliable solution for live detection scenarios.
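The cascade idea described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the stage names, the toy scoring functions, the thresholds, and the `FAKE_PAGE_CONTENT` lookup are all hypothetical stand-ins chosen only to show how cheap features can settle easy cases early so that costly features are computed only for the remainder.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    """One cascade stage (hypothetical structure, for illustration only)."""
    name: str
    score: Callable[[str], float]  # returns a phish score in [0, 1]
    accept_below: float            # score <= this: label legitimate and stop
    reject_above: float            # score >= this: label phish and stop


def cheap_url_score(url: str) -> float:
    """Toy lexical scorer: looks only at the URL string, so it is fast."""
    signals = [
        "@" in url,
        url.count("-") >= 3,
        "login" in url.lower(),
        not url.startswith("https://"),
    ]
    return sum(signals) / len(signals)


# Stand-in for page content that a real system would have to fetch (assumed data).
FAKE_PAGE_CONTENT = {
    "http://example.test/login": "Please enter your password to continue",
}


def expensive_content_score(url: str) -> float:
    """Toy stand-in for a costly feature such as fetching and analysing the page."""
    page = FAKE_PAGE_CONTENT.get(url, "")
    return 0.9 if "password" in page.lower() else 0.1


def classify(url: str, stages: List[Stage],
             final_threshold: float = 0.5) -> Tuple[str, List[str]]:
    """Run stages in order and stop at the first confident decision.

    Returns the label and the names of the stages that were actually run.
    """
    used: List[str] = []
    score = 0.0
    for stage in stages:
        used.append(stage.name)
        score = stage.score(url)
        if score <= stage.accept_below:
            return "legitimate", used
        if score >= stage.reject_above:
            return "phish", used
    # No stage was confident: fall back to the last score.
    label = "phish" if score >= final_threshold else "legitimate"
    return label, used


# Cheap features first, expensive features last.
STAGES = [
    Stage("url_lexical", cheap_url_score, accept_below=0.0, reject_above=0.75),
    Stage("page_content", expensive_content_score, accept_below=0.2, reject_above=0.8),
]

if __name__ == "__main__":
    for u in ["https://example.org/about", "http://example.test/login"]:
        print(u, classify(u, STAGES))
```

In this sketch, a URL that the lexical stage scores confidently never triggers the page-content stage, which is where the runtime saving would come from. Raising `reject_above` in the early stage trades some of that saving for a lower false positive rate, one way such a cascade might be tuned toward FP or TP.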

History

Date

2013-01-01

Degree Type

  • Dissertation

Thesis Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Jason Hong, Carolyn Rosé
