Accelerating Text-as-Data Research in Computational Social Science

2020-03-13T17:35:09Z (GMT) by Dallas Card
Natural language corpora are phenomenally rich resources for learning about people and society, and have long been used as such in disciplines including history
and political science. Recent advances in machine learning and natural language processing are creating remarkable new possibilities for how scholars might analyze such
corpora, but working with textual data brings its own challenges, and much of the research in computer science may not align with the desiderata of social scientists.
In this thesis, I present a line of work on developing methods for computational social science focused primarily on observational research using natural language text.
Throughout, I take seriously the concerns and priorities of the social sciences, leading to a focus on aspects of machine learning that are otherwise sometimes treated as secondary,
including calibration, interpretability, and transparency. Two ideas which unify this work are the problems of exploration and measurement, and as a running example I consider the problem of analyzing how news sources frame contemporary political issues. Following the introduction, I devote one chapter to providing the necessary background on computational social science, framing, and the “text as data” paradigm. Subsequent chapters each focus on a particular model or method that strives to address some aspect of research which may be of particular interest to social scientists. Chapters 3 and 4 focus on the unsupervised setting, with the former presenting a model for learning archetypal character representations, and the latter presenting a framework for neural document models which can flexibly incorporate metadata. Chapters 5 and
6 focus on the supervised setting and present, respectively, a method for measuring label proportions in text in the presence of domain shift and a variation on deep learning
classifiers which produces more transparent and robust predictions. The final chapter concludes with implications for computational social science and possible directions
for future work.