On Learning from Collective Data

Xiong, Liang

doi:10.1184/R1/6720908.v1

thesis

Posted on 2013-12-01, 00:00; authored by Liang Xiong

In many machine learning problems and application domains, the data are naturally organized into groups. For example, a video sequence is a group of images, an image is a group of patches, a document is a group of paragraphs or words, and a community is a group of people. We call such data collective data. In this thesis, we study how and what we can learn from collective data. Machine learning usually focuses on individual objects, each of which is described by a feature vector and studied as a point in some metric space. When approaching collective data, researchers often reduce the groups to vectors to which traditional methods can be applied. We instead develop machine learning methods that respect the collective nature of the data and learn from the groups directly. We take several different approaches to this learning problem. When a group consists of unordered discrete data points, it can naturally be characterized by its sufficient statistic, the histogram. For this case we develop efficient methods, based on matrix and tensor factorization, to address outliers and temporal effects in the data. To learn from groups that contain multi-dimensional real-valued vectors, we develop both generative methods based on hierarchical probabilistic models and discriminative methods using group kernels based on new divergence estimators. With these tools, we can accomplish various tasks such as classification, regression, clustering, anomaly detection, and dimensionality reduction on collective data. We further consider the practical side of the divergence-based algorithms. To reduce their time and space requirements, we evaluate and identify methods that can effectively reduce the size of the groups with little impact on accuracy. We also propose the conditional divergence, along with an efficient estimator, to correct sampling biases that might be present in the data. Finally, we develop methods to learn in cases where some divergences are missing, because of either insufficient computational resources or extreme sampling biases. In addition to designing new learning methods, we use them to help the scientific discovery process. In our collaboration with astronomers and physicists, we see that the new techniques can indeed help scientists make the best of their data.

History

Date

2013-12-01

Degree Type

Dissertation

Department

Machine Learning

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Jeff Schneider

Keywords

Collective data; grouped data; point sets; low-rank decomposition; robust methods; anomaly detection; Knowledge Representation and Machine Learning

Licence

In Copyright
