Accelerating Text-as-Data Research in Computational Social Science

Card, Dallas

doi:10.1184/R1/11935419.v1

cdallas_MachineLearning_2019.pdf (1.37 MB)

thesis

posted on 2020-03-13, 17:35, authored by Dallas Card

Natural language corpora are phenomenally rich resources for learning about people and society, and have long been used as such in disciplines such as history and political science. Recent advances in machine learning and natural language processing are creating remarkable new possibilities for how scholars might analyze such corpora, but working with textual data brings its own unique challenges, and much of the research in computer science may not align with the desiderata of social scientists. In this thesis, I present a line of work on developing methods for computational social science, focused primarily on observational research using natural language text. Throughout, I take seriously the concerns and priorities of the social sciences, leading to a focus on aspects of machine learning that are otherwise sometimes secondary, including calibration, interpretability, and transparency. Two ideas that unify this work are the problems of exploration and measurement, and as a running example I consider the problem of analyzing how news sources frame contemporary political issues. Following the introduction, I devote one chapter to providing the necessary background on computational social science, framing, and the “text as data” paradigm. Subsequent chapters each focus on a particular model or method that strives to address some aspect of research that may be of particular interest to social scientists. Chapters 3 and 4 focus on the unsupervised setting, with the former presenting a model for learning archetypal character representations, and the latter presenting a framework for neural document models that can flexibly incorporate metadata. Chapters 5 and 6 focus on the supervised setting and present, respectively, a method for measuring label proportions in text in the presence of domain shift, and a variation on deep learning classifiers that produces more transparent and robust predictions. The final chapter concludes with implications for computational social science and possible directions for future work.

History

Date

2019-08-16

Degree Type

Dissertation

Department

Machine Learning

Degree Name

Doctor of Philosophy (PhD)

Advisor(s)

Noah A. Smith

Keywords

natural language processing, machine learning, computational social science, graphical models, interpretability, calibration, conformal methods

Licence

In Copyright
