Carnegie Mellon University

Multi-view Clustering of Social-based Data

thesis
posted on 2020-10-21, 14:56 authored by Iain Cruickshank
Real-world social phenomena produce various types of data, such as explicit networks or user-emitted text. When different sets of data describe the same entities, the data is termed multi-view or multi-modal. A distinct advantage of multi-view data is that different views may better capture different aspects of the latent structure of the data. However, combining the views to produce a single result, such as a clustering of the data, is difficult. Multi-view clustering techniques, primarily developed for image, biological, or network-only use cases, have typically not been applied to social-based use cases. I investigate the use of multi-view clustering on various social-based, multi-view data sets, and propose new techniques for multi-view clustering of social-based data.

In the first part of this thesis I discuss the use of multi-view clustering for social-based data, and propose a new paradigm and new techniques for multi-view clustering. In chapter two I propose a new hybrid paradigm of multi-view clustering, which combines elements of late-paradigm and intermediate-paradigm integration. I test the various intermediate, late, and hybrid paradigm algorithms on a wide range of benchmark data sets from social-based data scenarios. The empirical results demonstrate a wide variance in the performance of multi-view clustering techniques. This is in part because social-based data often have high inter- and intra-view variances that are not present in other data scenarios, which presents difficulties for existing techniques. Only two techniques proposed in the chapter perform well across all of the data sets and are robust to inter- and intra-view differences. Building on the results of chapter two, in chapter three I devise a new algorithm based on network modularity and graph learning to cluster multi-view social data. I present the results of a series of empirical tests of the new technique, as well as possible variations on the technique. The results demonstrate that the presented technique often performs well across a wide range of social scenarios that give rise to multi-view data, is scalable to large data sets, and is robust to inter- and intra-view variance.

In the second part of this thesis I use the new techniques to perform clustering analyses of real-world data. In chapter four I use multi-view clustering on Twitter data collected during the initial stages of the COVID-19 pandemic. This analysis is the first use of multi-view clustering to cluster hashtags from large social-media data sets. The results show that hashtags form topical clusters and that these topical clusters changed over the course of the pandemic. In chapter five I use multi-view clustering to cluster malware samples. The results demonstrate that a multi-view clustering of malware samples provides insight into communities of malware use, and confirm that the techniques developed can be applied to a wide range of social-based data scenarios.

In sum, I demonstrate the suitability of, and create techniques for, multi-view clustering of complex, multi-view, social-based data. This thesis advances practical clustering analyses of large-scale, noisy, social-based data and contributes to the field of multi-view clustering in general.
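
The abstract refers to combining views and to clustering by network modularity. As a rough illustration of those two ideas only, and not the algorithm developed in the thesis, the sketch below builds a k-nearest-neighbor graph from a feature view, sums its edge weights with an explicit network view, and then greedily merges communities while Newman modularity improves. The toy data, the function names (`knn_graph`, `combine_views`, `modularity`, `greedy_modularity_clusters`), and the choice of summing edge weights are all assumptions made for this example.

```python
from itertools import combinations
from math import dist


def knn_graph(points, k):
    """Symmetric k-nearest-neighbor graph as {frozenset({i, j}): weight}."""
    edges = {}
    for i, p in enumerate(points):
        others = [j for j in range(len(points)) if j != i]
        nearest = sorted(others, key=lambda j: dist(p, points[j]))[:k]
        for j in nearest:
            edges[frozenset((i, j))] = 1.0
    return edges


def combine_views(graphs):
    """Sum edge weights across views (an assumed, very simple integration rule)."""
    combined = {}
    for graph in graphs:
        for edge, weight in graph.items():
            combined[edge] = combined.get(edge, 0.0) + weight
    return combined


def modularity(edges, labels):
    """Newman modularity Q of a weighted undirected graph under a labeling."""
    total = sum(edges.values())
    degree, inside = {}, 0.0
    for edge, weight in edges.items():
        i, j = tuple(edge)
        degree[i] = degree.get(i, 0.0) + weight
        degree[j] = degree.get(j, 0.0) + weight
        if labels[i] == labels[j]:
            inside += weight
    community_degree = {}
    for node, d in degree.items():
        c = labels[node]
        community_degree[c] = community_degree.get(c, 0.0) + d
    expected = sum((d / (2 * total)) ** 2 for d in community_degree.values())
    return inside / total - expected


def greedy_modularity_clusters(edges, n):
    """Naive agglomeration: apply the best merge until Q stops improving."""
    labels = list(range(n))
    best = modularity(edges, labels)
    while True:
        candidate = None
        for a, b in combinations(sorted(set(labels)), 2):
            trial = [a if label == b else label for label in labels]
            q = modularity(edges, trial)
            if q > best + 1e-12:
                best, candidate = q, trial
        if candidate is None:
            return labels, best
        labels = candidate


# Hypothetical toy data: six entities described by two views.
feature_view = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
network_view = {frozenset(e): 1.0
                for e in [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (3, 5)]}

combined = combine_views([knn_graph(feature_view, k=2), network_view])
labels, q = greedy_modularity_clusters(combined, n=len(feature_view))
print(labels, round(q, 3))
```

On this toy input the network view contains a bridging edge between the two groups, while the feature view does not. Summing the views down-weights that bridge relative to the edges both views agree on, so the greedy search separates the first three entities from the last three. The naive merge loop re-evaluates modularity from scratch for every candidate pair, so it is only suitable for very small graphs.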

History

Date

2020-07-01

Degree Type

  • Dissertation

Thesis Department

  • Institute for Software Research

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Kathleen M. Carley
