Electricity Based External Similarity of Categorical Attributes
journal contributionposted on 01.06.2013 by Christopher R. Palmer, Christos Faloutsos
Any type of content formally published in an academic journal, usually following a peer-review process.
Similarity or distance measures are fundamental and critical properties for data mining tools. Categorical attributes abound in databases. The Car Make, Gender, Occupation, etc. fields in a automobile insurance database are very informative. Sadly, categorical data is not easily amenable to similarity computations. A domain expert might manually specify some or all of the similarity relationships, but this is error-prone and not feasible for attributes with large domains, nor is it useful for cross-attribute similarities, such as between Gender and Occupation. External similarity functions define a similarity between, say, Car Makes by looking at how they co-occur with the other categorical attributes. We exploit a rich duality between random walks on graphs and electrical circuits to develop REP, an external similarity function. REP is theoretically grounded while the only prior work was ad-hoc. The usefulness of REP is shown in two experiments. First, we cluster categorical attribute values showing improved inferred relationships. Second, we use REP effectively as a nearest neighbour classifier.