Toward a Phish Free World: A Feature-type-aware Cascaded Learning Framework for Phish Detection
In the real world, many fields have highly skewed class distributions and features that vary dramatically in their classification and runtime performance. With a huge volume of data on the web, such fields typically require machine learning (ML) techniques with low latency and high performance. Anti-phishing is one such field: it requires a very low False Positive Rate (FP), a reasonably high True Positive Rate (TP) and a fast response time.
In many of these areas, including anti-phishing, however, almost all existing ML-based approaches have simply focused on designing features and building a monolithic model that uses them all at once. A fast response time is of paramount importance to the user experience in a live scenario, and naively extracting values for all features upfront is often overkill.
In our previous work, we proposed a number of anti-phishing approaches that either extend existing URL blacklists in a probabilistic fashion or enhance feature-based anti-phishing methods with novel features. In this thesis, we build on that experience and propose a feature-type-aware cascaded learning framework for a variety of domains with skewed class distributions and features of varying classification and runtime performance, in an effort to achieve a good balance between the three desiderata of TP, FP and latency. By utilizing lightweight features in the early stages of the cascade and postponing prohibitively expensive ones to later stages, our approach achieves superior runtime performance in general, and can be further improved via parallelization in a distributed computing environment. Moreover, our approach scales to more features, and can be optimized in favor of FP or TP depending on the specific domain. In the context of anti-phishing, our cascaded approach achieves a 55.7% reduction in runtime on average over traditional single-stage models, with a low FP of 0.65% and a TP of 83.34%, and thus provides a fast and reliable solution for live detection scenarios.
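The cascading idea described above can be illustrated with a minimal sketch. This is not the thesis's implementation: the stage names, the URL heuristics, the thresholds and the lookup table standing in for an expensive page-content feature are all hypothetical, chosen only to show how a cheap stage can resolve confident cases early so that costly features are computed only for the remainder.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    name: str
    score: Callable[[str], float]  # phish score in [0, 1]
    reject_below: float            # score <= this: label legitimate, stop
    accept_above: float            # score >= this: label phish, stop


def cheap_url_score(url: str) -> float:
    # Hypothetical lightweight features computed from the URL string alone.
    signals = [
        "@" in url,
        url.count(".") > 4,
        len(url) > 75,
        not url.startswith("https://"),
    ]
    return sum(signals) / len(signals)


def costly_content_score(url: str) -> float:
    # Stand-in for an expensive feature such as fetching and analysing
    # the page. A fixed lookup keeps the example offline and deterministic.
    fake_page_scores = {"http://paypa1-login.example.com/verify": 0.9}
    return fake_page_scores.get(url, 0.1)


def classify(url: str, stages: List[Stage]) -> Tuple[str, List[str]]:
    """Run stages in order; return the label and the stages actually used."""
    used: List[str] = []
    score = 0.0
    for stage in stages:
        used.append(stage.name)
        score = stage.score(url)
        if score <= stage.reject_below:
            return "legitimate", used
        if score >= stage.accept_above:
            return "phish", used
    # No stage exited early: fall back to a plain 0.5 threshold (assumption).
    return ("phish" if score >= 0.5 else "legitimate"), used


# Hypothetical two-stage cascade: cheap URL features first, costly ones last.
CASCADE = [
    Stage("url", cheap_url_score, reject_below=0.0, accept_above=0.75),
    Stage("content", costly_content_score, reject_below=0.5, accept_above=0.5),
]

if __name__ == "__main__":
    for u in ["https://example.com/", "http://paypa1-login.example.com/verify"]:
        print(u, classify(u, CASCADE))
```

In this sketch, raising `accept_above` in the early stage makes the cascade more conservative about labelling a URL as phish, which favors a low FP, while lowering it trades FP for TP and for fewer calls to the expensive stage.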
History
Date
2013-01-01
Degree Type
- Dissertation
Thesis Department
- Language Technologies Institute
Degree Name
- Doctor of Philosophy (PhD)