Three Essays on Enterprise Information System Mining for Business Intelligence

Sahoo

Nachiketa

2009

This dissertation proposal consists of three essays on data mining in the context of enterprise information systems. The first essay develops a clustering algorithm to discover topic hierarchies in text document streams. The key property of this method is that it processes each text document only once, assigning it to the appropriate place in the topic hierarchy as it arrives. This is achieved by making a distributional assumption about word occurrences and storing the sufficient statistics at each topic node. The algorithm is evaluated using two standard datasets: Reuters newswire data (RCV1) and MEDLINE journal abstracts data (OHSUMED). The results show that using Katz’s distribution to model word occurrences improves cluster quality in the majority of cases relative to the commonly used Normal distribution assumption. The second essay develops a collaborative filter for recommender systems using ratings by users on multiple aspects of an item. The key challenge in developing this method was the correlated nature of the component ratings due to the halo effect. This challenge is overcome by identifying the dependency structure between the component ratings using a dependency tree search algorithm and accounting for it in a mixture model. The algorithm is evaluated using a multi-component rating dataset collected from Yahoo! Movies. The results show that using multi-component ratings improves the retrieval performance of the collaborative filter. We also find that when the goal is to accurately predict the rating of an unseen user-item pair, using multiple components leads to better performance when the training data is sparse, but once more than a certain amount of training data is available, using only one component rating leads to more accurate rating prediction. The third essay develops a framework for analyzing conversations taking place in online social networks. It encodes the text of the conversation and the participating actors in a tensor.
Using blog data collected from a large IT services firm, it shows that tensor factorization can identify significant topics of conversation as well as the important actors in each. In addition, it proposes three extensions to this study: 1) evaluating the tensor factorization approach by measuring its accuracy in topic discovery and community discovery, 2) incorporating blog reading data, which is unique because it measures consumption of post topics, and 3) studying the interdependence of reading, posting, and citation activity in a blog social network.
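
To make the tensor encoding in the third essay concrete, the following is a minimal sketch and not the dissertation's method. It assumes a three-way tensor whose modes are post author, commenter, and word; the abstract does not state the actual modes. The toy data and the helper names `build_tensor` and `rank1_factor` are hypothetical. In place of a full factorization it extracts a single rank-1 component by alternating power iteration, which is enough to show how one component yields weights over actors and words at the same time.

```python
import numpy as np

# Hypothetical toy conversation records: (post author, commenter, word).
triples = [
    ("alice", "bob", "cloud"),
    ("alice", "bob", "cloud"),
    ("alice", "bob", "cloud"),
    ("alice", "carol", "cloud"),
    ("dave", "bob", "payroll"),
]

def build_tensor(records):
    """Count co-occurrences into an author x commenter x word tensor."""
    authors = sorted({a for a, _, _ in records})
    commenters = sorted({c for _, c, _ in records})
    words = sorted({w for _, _, w in records})
    a_idx = {a: i for i, a in enumerate(authors)}
    c_idx = {c: i for i, c in enumerate(commenters)}
    w_idx = {w: i for i, w in enumerate(words)}
    tensor = np.zeros((len(authors), len(commenters), len(words)))
    for a, c, w in records:
        tensor[a_idx[a], c_idx[c], w_idx[w]] += 1.0
    return tensor, authors, commenters, words

def rank1_factor(tensor, iters=50):
    """Dominant rank-1 component via alternating power iteration.

    Assumes a non-zero, non-negative tensor, so the norms below are positive.
    """
    u = np.ones(tensor.shape[0])
    v = np.ones(tensor.shape[1])
    w = np.ones(tensor.shape[2])
    for _ in range(iters):
        # Update each mode's vector by contracting the tensor with the other two.
        u = np.einsum("ijk,j,k->i", tensor, v, w)
        u /= np.linalg.norm(u)
        v = np.einsum("ijk,i,k->j", tensor, u, w)
        v /= np.linalg.norm(v)
        w = np.einsum("ijk,i,j->k", tensor, u, v)
        w /= np.linalg.norm(w)
    return u, v, w

T, authors, commenters, words = build_tensor(triples)
u, v, w = rank1_factor(T)

# The largest entries of each vector name the leading actors and word
# of this single component.
top_author = authors[int(np.argmax(u))]
top_commenter = commenters[int(np.argmax(v))]
top_word = words[int(np.argmax(w))]
print(top_author, top_commenter, top_word)
```

A single component only captures the dominant conversation. Recovering several topics, each with its own actors, would require a multi-component decomposition, and which decomposition suits the blog data is left open here.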