Network Aided Classification and Detection of Data

2015-05-01T00:00:00Z (GMT) by Vinay Uday Prabhu
Two important technological aspects of the Big data paradigm have been the emergence of massive
scale Online Social Networks (OSNs) (such as Facebook and Twitter), and the rise of the
open data movement that has resulted in the creation of richly structured online datasets, such as
Wikipedia, Citeseer and the US federal government’s data.gov initiative. The examples of OSNs
and online datasets cited above share the common feature that they can be thought of as Online
Information Graphs, in the sense that the information embedded in them has a natural graph
structure.
In this thesis, we consider using this underlying Online Information Graph as a statistical prior
to enhance classification accuracy of some hard machine learning problems. Specifically, we look
at instances where the graph is undirected and propose using the graph to define an Ising -
Markov Random Field (MRF) prior. To begin with, we validate the Ising prior using a novel hypothesis
testing framework based approach. Having validated the Ising prior, we demonstrate its
utility by showcasing Network Aided Vector classification (NAC) of real world data from fields
as varied as vote prediction in the US senate, movie earnings level classification (using IMDb
dataset) and county crime-level classification (using the US census data). We then consider a
special case of the classification problem which involves Network Aided Detection (NAD) of a
global sentiment in an OSN. To this end, we consider Latent Sentiment (LS) detection as well as
Majority Sentiment detection. We analyze the performance of the trivial sentiment detector for
LS detection using a novel communications-oriented viewpoint, where we view the underlying
network as providing a weak channel code that transmits one bit of information (the binary
sentiment) and perform error exponent analysis for various underlying graph models. We also
address the problem of optimal Maximum A posterior Probability (MAP) detection of majority
sentiment in the highly noisy labels weak network effect (NW) scenario, deriving the High Temperature
(HT) expansion formula for the partial partition function of the Ising model using the
code-puncturing idea from coding theory and then proposing an approximate MAP detector that
outperforms the Maximum Likelihood (ML) detector and the trivial detector.