K-groups : tractable group detection on large link data sets

Kubica, Jeremy; Moore, Andrew W.; Schneider, Jeff

doi:10.1184/R1/6555272.v1

journal contribution

posted on 2003-01-01, 00:00 authored by Jeremy Kubica, Andrew W. Moore, Jeff Schneider

Abstract: "Discovering underlying structure from co-occurrence data is an important task in many fields, including insurance, intelligence, criminal investigation, epidemiology, human resources, and marketing. For example, a store may wish to identify underlying sets of items purchased together, or a human resources department may wish to identify groups of employees that collaborate with each other. Previously, Kubica et al. presented the group detection algorithm (GDA) -- an algorithm for finding underlying groupings of entities from co-occurrence data. This algorithm is based on a probabilistic generative model and produces coherent groups that are consistent with prior knowledge. Unfortunately, the optimization used in GDA is slow, making it potentially infeasible for many real-world data sets. For example, in the co-publication domain the MEDLINE database of medical publications alone contains over 2 million papers published within just a 5-year period, 1995-1999 [14]. To this end, we present k-groups -- an algorithm that uses an approach similar to that of k-means (hard clustering and localized updates) to significantly accelerate the discovery of the underlying groups while retaining GDA's probabilistic model. In addition, we show that k-groups is guaranteed to converge to a local minimum. We also compare the performance of GDA and k-groups on several real-world and artificial data sets, showing that k-groups' sacrifice in solution quality is significantly offset by its increase in speed. This trade-off makes group detection tractable on significantly larger data sets."
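
The abstract describes the approach only at a high level (hard clustering with localized updates, in the style of k-means). The sketch below is a toy illustration of that general idea on co-occurrence data; it is not the authors' k-groups algorithm and does not use GDA's probabilistic model. The objective (total weight of co-occurring pairs split across groups), the function names, and the assumption that entities are integers 0..n-1 are all hypothetical choices made for illustration. Each move strictly lowers the toy objective, which is the usual argument for why this kind of local search stops at a local minimum.

```python
from collections import defaultdict
from itertools import combinations


def pair_weights(links):
    """Count how often each unordered pair of entities co-occurs in a link."""
    weights = defaultdict(lambda: defaultdict(int))
    for link in links:
        for a, b in combinations(sorted(set(link)), 2):
            weights[a][b] += 1
            weights[b][a] += 1
    return weights


def split_cost(weights, groups):
    """Toy objective (an assumption, not the paper's model):
    total weight of co-occurring pairs assigned to different groups."""
    return sum(
        w
        for a, nbrs in weights.items()
        for b, w in nbrs.items()
        if a < b and groups[a] != groups[b]
    )


def local_hard_clustering(links, groups, k, max_sweeps=100):
    """Hard assignment with localized updates.

    `groups[i]` is the single group (0..k-1) of entity i. Each update
    looks only at the links that contain the entity being moved.
    """
    weights = pair_weights(links)
    groups = list(groups)
    for _ in range(max_sweeps):
        moved = False
        for entity in range(len(groups)):
            # Co-occurrence weight between this entity and each group.
            support = [0] * k
            for other, w in weights[entity].items():
                support[groups[other]] += w
            best = max(range(k), key=lambda g: support[g])
            # Move only on a strict improvement, so the cost always drops.
            if support[best] > support[groups[entity]]:
                groups[entity] = best
                moved = True
        if not moved:  # no single move helps: a local minimum
            break
    return groups


links = [{0, 1, 2}, {0, 1}, {1, 2}, {3, 4, 5}, {3, 4}, {4, 5}]
initial = [0, 0, 1, 1, 1, 0]
final = local_hard_clustering(links, initial, k=2)
print(final, split_cost(pair_weights(links), final))
```

Like k-means, a search of this kind depends on the starting assignment and can settle on a poor local minimum, which is consistent with the speed versus solution-quality trade-off the abstract mentions.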