Three Essays on Enterprise Information System Mining for Business Intelligence
journal contribution
posted on 2009-02-02, 00:00 authored by Nachiketa Sahoo
This dissertation proposal consists of three essays on data mining in the context of
enterprise information systems.
The first essay develops a clustering algorithm to discover topic hierarchies in text
document streams. The key property of this method is that it processes each text document
only once, assigning it to the appropriate place in the topic hierarchy as it
arrives. This is done by making a distributional assumption about the word occurrences and
by storing the sufficient statistics at each topic node. The algorithm is evaluated using
two standard datasets: Reuters newswire data (RCV1) and MEDLINE journal abstracts
data (OHSUMED). The results show that using Katz’s distribution to model
word occurrences improves cluster quality in the majority of cases compared with
the commonly used Normal distribution assumption.
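The single-pass idea can be illustrated with a minimal sketch. It is not the essay's algorithm: it is flat rather than hierarchical, it treats words as independent, and it assumes a two-parameter, zero-inflated geometric form in the spirit of Katz's model (the exact parameterization used in the essay is not stated here). The names `TopicNode` and `cluster_stream`, the smoothing constant `eps`, and the rule of opening a new node when an empty node scores best are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


class TopicNode:
    """Keeps only per-word sufficient statistics, never the documents themselves."""

    def __init__(self):
        self.n_docs = 0
        self.docs_with_word = defaultdict(int)  # number of docs containing the word
        self.total_count = defaultdict(int)     # total occurrences of the word

    def add(self, counts):
        """Fold one document's word counts into the node's statistics."""
        self.n_docs += 1
        for word, k in counts.items():
            self.docs_with_word[word] += 1
            self.total_count[word] += k

    def log_likelihood(self, counts, vocab, eps=0.5):
        """Assumed model: P(0) = p0 and P(k) = (1 - p0)(1 - p) p^(k - 1) for k >= 1."""
        ll = 0.0
        for word in vocab:
            m = self.docs_with_word.get(word, 0)
            s = self.total_count.get(word, 0)
            p0 = (self.n_docs - m + eps) / (self.n_docs + 2 * eps)  # smoothed P(word absent)
            p = (s - m + eps) / (s + 2 * eps)                       # smoothed repeat probability
            k = counts.get(word, 0)
            if k == 0:
                ll += math.log(p0)
            else:
                ll += math.log(1 - p0) + math.log(1 - p) + (k - 1) * math.log(p)
        return ll


def cluster_stream(documents):
    """Assign each document once, in arrival order, to its best-scoring node."""
    nodes, vocab, labels = [], set(), []
    for text in documents:
        counts = Counter(text.lower().split())
        vocab.update(counts)
        candidates = nodes + [TopicNode()]  # the last candidate is an empty "new topic" node
        scores = [node.log_likelihood(counts, vocab) for node in candidates]
        best = max(range(len(candidates)), key=scores.__getitem__)
        if best == len(nodes):              # the empty node scored best: open a new topic
            nodes.append(candidates[best])
        nodes[best].add(counts)
        labels.append(best)
    return labels, nodes
```

Because each node stores only three kinds of counts, a document can be discarded as soon as it has been assigned, which is what makes a one-pass treatment of a stream possible.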
The second essay develops a collaborative filter for recommender systems using
ratings by users on multiple aspects of an item. The key challenge in developing this
method was the correlated nature of the component ratings due to the halo effect. This
challenge is overcome by identifying the dependency structure among the component
ratings using a dependency tree search algorithm and modeling it in a mixture
model. The algorithm is evaluated using a multicomponent rating dataset collected
from Yahoo! Movies. The results show that we can improve the retrieval performance
of the collaborative filter by using multicomponent ratings. We also find that when
our goal is to accurately predict the rating of an unseen user-item pair, using multiple
components leads to better performance when the training data is sparse, but when
there is more than a certain amount of training data, using only one component rating
leads to more accurate rating prediction.
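One well-known way to search for a tree-shaped dependency structure among discrete variables is the Chow-Liu procedure: compute the mutual information of every pair of variables and keep the maximum-weight spanning tree. Whether the essay uses exactly this procedure is an assumption, and the sketch below covers only the structure search, not the mixture model. The component names and the ratings in `toy_ratings` are made up for illustration.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(xs, ys):
    """Empirical mutual information (in nats) between two equal-length rating lists."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def dependency_tree(columns):
    """columns: dict of component name -> list of ratings.

    Returns the edges of the maximum mutual-information spanning tree,
    built with Kruskal's algorithm and a small union-find.
    """
    names = list(columns)
    edges = sorted(((mutual_information(columns[a], columns[b]), a, b)
                    for a, b in combinations(names, 2)), reverse=True)
    parent = {name: name for name in names}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree = []
    for _, a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:               # the edge joins two separate components
            parent[root_a] = root_b
            tree.append((a, b))
    return tree


# Hypothetical toy data: each component mostly follows the overall rating.
toy_ratings = {
    "overall": [1, 1, 1, 1, 5, 5, 5, 5],
    "story":   [5, 1, 1, 1, 5, 5, 5, 5],
    "acting":  [1, 1, 1, 1, 1, 5, 5, 5],
}
tree = dependency_tree(toy_ratings)
```

On this toy data both selected edges touch `overall`, which is the kind of hub structure one would expect if a halo effect drives the correlation among component ratings.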
The third essay develops a framework for analyzing conversations taking place in
online social networks. It encodes the text of the conversation and the participating
actors in a tensor. Using blog data collected from a large IT services firm, it
shows that tensor factorization can identify significant topics of conversation
as well as the important actors in each. In addition, it proposes three extensions
to this study: 1) evaluation of the tensor factorization approach by measuring its accuracy
in topic discovery and community discovery, 2) extension of the study by incorporating
blog reading data, which is unique because it measures consumption
of post topics, and 3) a study of the interdependence of reading, posting, and citation activity
in a blog social network.
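To make the tensor encoding concrete, the sketch below builds a sparse three-way count tensor and extracts a single dominant component with alternating power-style updates, which is a rank-1 CP approximation. The choice of modes (author, responder, term), the helper names, and the restriction to one component are assumptions for illustration; the abstract does not specify the tensor layout or the factorization method used in the essay.

```python
import math
from collections import defaultdict


def build_tensor(posts):
    """posts: iterable of (author, responder, text) triples.

    Returns sparse counts keyed by (author, responder, term).
    """
    tensor = defaultdict(float)
    for author, responder, text in posts:
        for term in text.lower().split():
            tensor[(author, responder, term)] += 1.0
    return tensor


def _normalize(vec):
    """Scale a sparse vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {key: v / norm for key, v in vec.items()}


def rank1_factor(tensor, iters=20):
    """Dominant rank-1 component of a non-negative sparse 3-way tensor.

    Each mode's vector is updated in turn while the other two are held fixed.
    """
    a = _normalize({i: 1.0 for i, _, _ in tensor})
    b = _normalize({j: 1.0 for _, j, _ in tensor})
    c = _normalize({k: 1.0 for _, _, k in tensor})
    for _ in range(iters):
        new_a = defaultdict(float)
        for (i, j, k), v in tensor.items():
            new_a[i] += v * b[j] * c[k]
        a = _normalize(new_a)
        new_b = defaultdict(float)
        for (i, j, k), v in tensor.items():
            new_b[j] += v * a[i] * c[k]
        b = _normalize(new_b)
        new_c = defaultdict(float)
        for (i, j, k), v in tensor.items():
            new_c[k] += v * a[i] * b[j]
        c = _normalize(new_c)
    return a, b, c
```

In a component of this kind, the largest entries of the term vector describe a topic, and the largest entries of the two actor vectors indicate who drives the conversation on it.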