Making Logistic Regression A Core Data Mining Tool
Journal contribution posted on 01.01.2005, 00:00 by Paul Komarek, Andrew W Moore
Binary classification is a core data mining task. For large datasets or real-time applications, desirable classifiers are accurate, fast, and automatic (i.e., require no parameter tuning). Naive Bayes and decision trees are fast and parameter-free, but their accuracy is often below the state of the art. Linear support vector machines (SVMs) are fast and have good accuracy, but current implementations are sensitive to the capacity parameter. SVMs with radial basis function (RBF) kernels are accurate but slow, and have multiple parameters that require tuning. In this paper we demonstrate that a very simple parameter-free implementation of logistic regression (LR) is sufficiently accurate and fast to compete with state-of-the-art binary classifiers on large real-world datasets. The accuracy is comparable to per-dataset tuned linear SVMs and, in higher dimensions, to tuned RBF SVMs. A combination of regularization, truncated-Newton methods, and iteratively re-weighted least squares (IRLS) makes this implementation faster than SVMs and relatively insensitive to parameters. Our fitting procedure, TR-IRLS, appears to outperform several common LR fitting procedures in our experiments. TR-IRLS is robust to linear dependencies and scaling problems in the data, and no data preprocessing is necessary. TR-IRLS is easy to implement and can be used anywhere that IRLS is used. Convergence guarantees can be stated for generalized linear models with canonical links.
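To make the combination named in the abstract concrete, the following is a minimal sketch of ridge-regularized logistic regression fitted by IRLS, where each weighted least-squares subproblem is solved only approximately by a capped number of linear conjugate gradient iterations (the truncated-Newton idea). It is not the authors' TR-IRLS code. The function names (`sigmoid`, `cg_solve`, `fit_logistic_irls_cg`), the penalty `lam=10.0`, the iteration caps, and the tolerances are all illustrative assumptions, and the sketch omits an intercept term and any of the paper's specific termination rules.

```python
import numpy as np

def sigmoid(t):
    # Clip to avoid overflow in exp for extreme linear predictors.
    return 1.0 / (1.0 + np.exp(-np.clip(t, -30.0, 30.0)))

def cg_solve(matvec, b, x0, max_iter=30, tol=1e-8):
    """Linear conjugate gradient for A x = b, with A given only via matvec.

    Stopping after max_iter iterations is what makes the outer Newton
    step 'truncated': the linear system is solved only approximately.
    """
    x = x0.copy()
    r = b - matvec(x)          # residual
    p = r.copy()               # search direction
    rs = r @ r
    for _ in range(max_iter):
        if np.sqrt(rs) < tol:
            break
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

def fit_logistic_irls_cg(X, y, lam=10.0, max_irls=30, max_cg=30, tol=1e-6):
    """IRLS for ridge-penalized logistic regression (hypothetical helper).

    Each iteration solves (X^T W X + lam I) beta = X^T W z approximately,
    where W = diag(mu (1 - mu)) and z = X beta + (y - mu) / w.
    """
    n, d = X.shape
    beta = np.zeros(d)
    for _ in range(max_irls):
        eta = X @ beta
        mu = sigmoid(eta)
        w = mu * (1.0 - mu)
        # X^T W z, written so that we never divide by w.
        b = X.T @ (w * eta + (y - mu))

        def matvec(v, w=w):
            # Applies (X^T W X + lam I) without forming the d x d matrix.
            return X.T @ (w * (X @ v)) + lam * v

        beta_new = cg_solve(matvec, b, beta, max_iter=max_cg)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta
```

Calling `fit_logistic_irls_cg(X, y)` with a NumPy design matrix `X` of shape `(n, d)` and a 0/1 label vector `y` returns the coefficient vector. Because the system matrix is applied only through matrix-vector products, the same structure carries over to sparse `X`, and the ridge term keeps the system positive definite even when columns of `X` are linearly dependent.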