posted on 2013-06-01, 00:00authored byChristopher R. Palmer, Christos Faloutsos
Similarity or distance measures are fundamental and critical
properties for data mining tools. Categorical attributes abound in
databases. The Car Make, Gender, Occupation, etc. fields in a automobile
insurance database are very informative. Sadly, categorical data is
not easily amenable to similarity computations. A domain expert might
manually specify some or all of the similarity relationships, but this is
error-prone and not feasible for attributes with large domains, nor is it
useful for cross-attribute similarities, such as between Gender and Occupation.
External similarity functions define a similarity between, say,
Car Makes by looking at how they co-occur with the other categorical
attributes. We exploit a rich duality between random walks on graphs
and electrical circuits to develop REP, an external similarity function.
REP is theoretically grounded while the only prior work was ad-hoc. The
usefulness of REP is shown in two experiments. First, we cluster categorical
attribute values showing improved inferred relationships. Second,
we use REP effectively as a nearest neighbour classifier.