Posted on 2008-06-20. Authored by Dimitris Margaritis, Christos Faloutsos, and Sebastian Thrun.
We propose a novel method of computing
and storing DataCubes. Our idea is to use
Bayesian networks, which can generate approximate
counts for any query combining attribute
values and "don't cares." A Bayesian network
represents the underlying joint probability
distribution of the data from which it was generated.
Through such a network, the proposed
method, NetCube, exploits correlations among attributes.
Our proposed preprocessing algorithm
scales linearly with the size of the database, and
is thus scalable; it is also easily parallelizable, with a
straightforward parallel implementation. Moreover,
we give an algorithm for estimating the counts
of arbitrary queries that is fast (constant in the
database size). Experimental results show that
NetCubes are fast to build and to query (a few minutes of preprocessing time per 100,000 records
and less than a second per query), achieve excellent
compression (ratios of at least 1800:1 on real data),
and have low reconstruction
error (less than 5% on average). Moreover, our
method naturally supports visualization and data
mining at no extra cost.
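To make the core idea concrete, the sketch below (not the authors' implementation) shows how a count query with "don't cares" can be answered from a Bayesian network: the network gives the joint probability of any attribute configuration, and a count is estimated as the database size times the probability of the configurations matching the query. The toy network, its conditional probability tables, and the attribute names are all hypothetical; a real system would use Bayesian-network inference instead of the brute-force marginalization shown here.

```python
# Minimal sketch: estimating a DataCube count from a Bayesian network.
# The network structure (A -> B, A -> C), the CPTs, and N_RECORDS are
# illustrative assumptions, not values from the paper.
from itertools import product

N_RECORDS = 100_000  # assumed database size

# Toy network over three binary attributes.
P_A1 = 0.4                        # prior P(A = 1)
P_B1_given_A = {0: 0.2, 1: 0.7}   # P(B = 1 | A)
P_C1_given_A = {0: 0.5, 1: 0.9}   # P(C = 1 | A)

def joint(a, b, c):
    """Joint probability P(A=a, B=b, C=c) under the toy network."""
    pa = P_A1 if a == 1 else 1 - P_A1
    pb = P_B1_given_A[a] if b == 1 else 1 - P_B1_given_A[a]
    pc = P_C1_given_A[a] if c == 1 else 1 - P_C1_given_A[a]
    return pa * pb * pc

def estimate_count(query):
    """query maps attribute name -> value, or '*' for a don't care.
    The estimated count is N * P(query), marginalizing over the
    don't-care attributes (brute force here, for clarity only)."""
    names = ["A", "B", "C"]
    total = 0.0
    for assignment in product([0, 1], repeat=3):
        values = dict(zip(names, assignment))
        if all(query[n] == "*" or query[n] == values[n] for n in names):
            total += joint(*assignment)
    return round(N_RECORDS * total)

# Example: how many records have A=1 and C=1, with B a don't care?
print(estimate_count({"A": 1, "B": "*", "C": 1}))  # ~ 100000 * 0.4 * 0.9
```

Note that the estimate depends only on the network parameters, not on the stored records, which is what makes query time independent of the database size.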