NetCube: A Scalable Tool for Fast Data Mining and Compression

Margaritis, Dimitris; Faloutsos, Christos; Thrun, Sebastian

doi:10.1184/R1/6607658.v1

file.pdf (171.26 kB)

NetCube: A Scalable Tool for Fast Data Mining and Compression

journal contribution

posted on 2008-06-20, 00:00 authored by Dimitris Margaritis, Christos Faloutsos, Sebastian Thrun

We propose an novel method of computing and storing DataCubes. Our idea is to use Bayesian Networks, which can generate approximate counts for any query combination of attribute values and “don’t cares.” A Bayesian network represents the underlying joint probability distribution of the data that were used to generate it. By means of such a network the proposed method, NetCube, exploits correlations among attributes. Our proposed preprocessing algorithm scales linearly on the size of the database, and is thus scalable; it is also parallelizable with a straightforward parallel implementation. Moreover, we give an algorithm to estimate counts of arbitrary queries that is fast (constant on the database size). Experimental results show that NetCubes have fast generation and use (a few minutes preprocessing time per 100,000 records and less than a second query time), achieve excellent compression (at least 1800:1 compression ratios on real data) and have low reconstruction error (less than 5% on average). Moreover, our method naturally allows for visualization and data mining, at no extra cost.

History

Date

2008-06-20

Usage metrics

Keywords

computer sciences

Licence

In Copyright

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

NetCube: A Scalable Tool for Fast Data Mining and Compression

History

Date

Usage metrics

Categories

Keywords

Licence

Exports