Improving Text Classification by Shrinkage in a Hierarchy of Classes

McCallum, Andrew; Rosenfeld, Ronald; Mitchell, Thomas; Ng, Andrew Y

doi:10.1184/R1/21708647.v1

hier-icml98.pdf (208.81 kB)

conference contribution

posted on 2022-12-21, 17:41 authored by Andrew McCallum, Ronald Rosenfeld, Thomas Mitchell, Andrew Y Ng

When documents are organized in a large number of topic categories, the categories are often arranged in a hierarchy. The U.S. patent database and Yahoo are two examples. This paper shows that the accuracy of a naive Bayes text classifier can be significantly improved by taking advantage of a hierarchy of classes. We adopt an established statistical technique called shrinkage that smooths the parameter estimates of a data-sparse child class toward those of its parent in order to obtain more robust parameter estimates. The approach is also employed in deleted interpolation, a technique for smoothing n-grams in language modeling for speech recognition. Our method scales well to large data sets, with numerous categories in large hierarchies. Experimental results on three real-world data sets from UseNet, Yahoo, and corporate web pages show improved performance, with a reduction in error of up to 29% over the traditional flat classifier.
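The shrinkage idea in the abstract can be illustrated with a minimal Python sketch. It mixes the maximum-likelihood word probabilities of a sparse leaf class with those of its parent and with a uniform distribution. The function names, the toy vocabulary, the counts, and the hand-picked mixture weights are all assumptions made for illustration; the abstract does not say how the weights are chosen, so this should not be read as the authors' implementation.

```python
from collections import Counter

def ml_estimate(counts, vocab):
    """Maximum-likelihood word probabilities from raw counts."""
    total = sum(counts[w] for w in vocab)
    return {w: (counts[w] / total if total else 0.0) for w in vocab}

def shrinkage_estimate(path_counts, vocab, weights):
    """Mix the estimates along a leaf-to-root path with a uniform distribution.

    path_counts: list of Counters, leaf first, root last (hypothetical layout).
    weights: one weight per node on the path plus a final weight for the
             uniform distribution; assumed fixed by hand here and must sum to 1.
    """
    assert len(weights) == len(path_counts) + 1
    assert abs(sum(weights) - 1.0) < 1e-9
    estimates = [ml_estimate(c, vocab) for c in path_counts]
    uniform = 1.0 / len(vocab)
    return {
        # zip stops after the last path node; the final weight goes to uniform
        w: sum(lam * est[w] for lam, est in zip(weights, estimates))
        + weights[-1] * uniform
        for w in vocab
    }

# Toy data (invented): a sparse child class and a parent with more counts.
vocab = ["ball", "goal", "vote", "law"]
leaf_counts = Counter({"ball": 3, "goal": 1})
parent_counts = Counter({"ball": 40, "goal": 30, "vote": 20, "law": 10})

smoothed = shrinkage_estimate([leaf_counts, parent_counts], vocab, [0.5, 0.4, 0.1])
```

In this toy setup the leaf class has never seen "vote" or "law", so its unsmoothed estimate gives them probability zero; after mixing with the parent and the uniform distribution, every word has nonzero probability and the result still sums to one.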

History

Date

1998-01-01

Keywords

bayes text classifier document hierarchy

Licence

CC BY 4.0
