posted on 2015-05-01, 00:00authored byVinay Uday Prabhu
Two important technological aspects of the Big data paradigm have been the emergence of massive scale Online Social Networks (OSNs) (such as Facebook and Twitter), and the rise of the open data movement that has resulted in the creation of richly structured online datasets, such as Wikipedia, Citeseer and the US federal government’s data.gov initiative. The examples of OSNs and online datasets cited above share the common feature that they can be thought of as Online Information Graphs, in the sense that the information embedded in them has a natural graph structure. In this thesis, we consider using this underlying Online Information Graph as a statistical prior to enhance classification accuracy of some hard machine learning problems. Specifically, we look at instances where the graph is undirected and propose using the graph to define an Ising - Markov Random Field (MRF) prior. To begin with, we validate the Ising prior using a novel hypothesis testing framework based approach. Having validated the Ising prior, we demonstrate its utility by showcasing Network Aided Vector classification (NAC) of real world data from fields as varied as vote prediction in the US senate, movie earnings level classification (using IMDb dataset) and county crime-level classification (using the US census data). We then consider a special case of the classification problem which involves Network Aided Detection (NAD) of a global sentiment in an OSN. To this end, we consider Latent Sentiment (LS) detection as well as Majority Sentiment detection. We analyze the performance of the trivial sentiment detector for LS detection using a novel communications-oriented viewpoint, where we view the underlying network as providing a weak channel code that transmits one bit of information (the binary sentiment) and perform error exponent analysis for various underlying graph models. We also address the problem of optimal Maximum A posterior Probability (MAP) detection of majority sentiment in the highly noisy labels weak network effect (NW) scenario, deriving the High Temperature (HT) expansion formula for the partial partition function of the Ising model using the code-puncturing idea from coding theory and then proposing an approximate MAP detector that outperforms the Maximum Likelihood (ML) detector and the trivial detector.