Finding (recently) frequent items in distributed data streams

Manjhi, Amit

doi:10.1184/R1/6605651.v1

file.pdf (854.79 kB)

Finding (recently) frequent items in distributed data streams

journal contribution

posted on 2006-01-01, 00:00 authored by Amit Manjhi

Abstract: "We consider the problem of maintaining frequency counts for items occurring frequently in the union of multiple distributed data streams. Nai╠êve methods of combining approximate frequency counts from multiple nodes tend to result in excessively large data structures that are costly to transfer among nodes. To minimize communication requirements, the degree of precision maintained by each node while counting item frequencies must be managed carefully. We introduce the concept of a precision gradient for managing precision when nodes are arranged in a hierarchical communication structure. We then study the optimization problem of how to set the precision gradient so as to minimize communication, and provide optimal solutions that minimize worst-case communication load over all possible inputs. We then introduce a variant designed to perform well in practice, with input data that does not conform to worst-case characteristics. We verify the effectiveness of our approach empirically using real-world data, and show that our methods incur substantially less communication than nai╠êve approaches while providing the same error guarantees on answers. In addition, we extend techniques for maintaining frequency counts of high-frequency items in one or more streams by making them time-sensitive. Time-sensitivity is achieved by associating weights with items that decay exponentially with time. We analyze the error bounds and worst-case space bounds for the extended algorithms."

History

Date

2006-01-01

Usage metrics

Keywords

Electronic data processing Distributed processing.Data structures (Computer science)

Licence

In Copyright

Exports

RefWorks

BibTeX

Ref. manager

Endnote

DataCite

NLM

DC

Finding (recently) frequent items in distributed data streams

History

Date

Usage metrics

Categories

Keywords

Licence

Exports