Carnegie Mellon University

Minimizing the Costs in Generalized Interactive Annotation Learning

Thesis posted on 2025-04-24, authored by Shilpa Arora

Supervised learning involves collecting unlabeled data, defining features to represent an instance, obtaining annotations for the unlabeled instances, and learning a classifier from the annotated data. Each of these steps has an associated cost. In this thesis, our goal is to reduce the total cost of achieving a desired level of performance in supervised learning. Specifically, we focus on reducing the cost of feature engineering and the total annotation cost.

An instance in supervised learning is represented by a feature vector. For a text instance, a bag-of-words feature representation is commonly used, since word segmentation is straightforward in most languages, and word features have been found to provide good performance for several learning tasks. However, words are limited in the information they provide about the meaning of a text. Hand-crafted structured features based on linguistic annotations, such as parts of speech, semantic roles, and syntactic parse trees, have been found to improve performance beyond bag-of-words features. Such manually engineered features, however, require substantial effort from the expert. In this work, we propose a generic annotation graph representation for linguistic annotations, and use a frequent subgraph mining algorithm to automatically extract structured features from the annotation graphs. For a sentiment classification task and a protein-protein interaction extraction task, we show that these automatically extracted structured features provide a significant improvement in performance over bag-of-words features.
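
To make the idea concrete, here is a minimal Python sketch of mining structured features from annotation graphs. It simplifies the pattern language to single labeled edges, whereas a full frequent subgraph miner such as gSpan grows multi-edge patterns; the graph construction, the edge labels ("has-pos", "next"), and the toy sentences are illustrative assumptions, not the thesis's actual representation.

```python
from collections import Counter

# Each annotation graph is represented as a set of labeled edges:
# (source node label, edge label, target node label). Node labels mix
# token text and linguistic annotations (here, POS tags).

def sentence_to_graph(tokens, pos_tags):
    """Build a toy annotation graph: word->POS links plus word adjacency."""
    edges = set()
    for i, (tok, pos) in enumerate(zip(tokens, pos_tags)):
        edges.add((tok, "has-pos", pos))
        if i + 1 < len(tokens):
            edges.add((tok, "next", tokens[i + 1]))
    return edges

def mine_frequent_patterns(graphs, min_support):
    """Keep edge patterns that occur in at least min_support graphs.

    A real frequent subgraph miner (e.g., gSpan) would extend these
    single-edge seeds into larger connected subgraphs; this sketch
    stops at size one to stay short.
    """
    support = Counter()
    for g in graphs:
        for edge in g:          # set membership => counted once per graph
            support[edge] += 1
    return [e for e, c in support.items() if c >= min_support]

graphs = [
    sentence_to_graph(["the", "movie", "was", "great"],
                      ["DT", "NN", "VBD", "JJ"]),
    sentence_to_graph(["the", "plot", "was", "dull"],
                      ["DT", "NN", "VBD", "JJ"]),
]
features = mine_frequent_patterns(graphs, min_support=2)
print(features)  # [('the', 'has-pos', 'DT'), ('was', 'has-pos', 'VBD')]
```

Each frequent pattern then serves as a binary feature: an instance has the feature if and only if its annotation graph contains the pattern.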

Training a classifier involves learning a function of the features that approximates the target variable. Learning a good approximation requires many labeled instances, and labeling an instance may require substantial annotation effort, called the annotation cost. To reduce the total annotation cost for the desired performance, the user could, in addition to labeling instances, provide information about the features directly. Direct feedback on features has been shown to reduce the number of labeled instances required to achieve the desired performance. However, such feedback is restricted to simple features, such as words. Linguistic features, whether hand-crafted or automatically extracted, are often difficult to visualize and present to the user for feedback. Similarly, images are commonly represented by features such as pixel values, which the user may not be familiar enough with to give feedback on. An alternative is for the user to indicate the parts of an instance that are rationales for its class label: for example, the sentences in a document, the segments in an image, or the scenes in a video that justify its label. Rationales provide indirect feature feedback: features that overlap with the rationales are likely to be important for the classification task, but the user never evaluates the features themselves. Annotating rationales may incur additional cost, which may vary across annotators, instances, annotation tasks, user interface designs, etc. We compare the two annotation strategies of providing the instance's label only (LO) and the instance's label together with rationales (LR), under different additional costs for annotating rationales. For a sentiment classification task and an aviation incident cause identification task, we show that rationales provide better performance for a given annotation cost when annotating them incurs only a small extra cost.
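
One simple way to exploit rationales as indirect feature feedback is to boost the weight of features that overlap the marked spans when vectorizing an instance. The sketch below assumes character-offset rationale spans, whitespace tokenization, and a hypothetical boost hyperparameter; it illustrates the intuition rather than the thesis's exact weighting scheme.

```python
from collections import defaultdict

def bag_of_words(text, rationale_spans, boost=2.0):
    """Bag-of-words vector where tokens inside any rationale span
    (given as (start, end) character offsets) get a boosted weight.

    The boost factor is a hypothetical hyperparameter; the intuition
    is that features overlapping a rationale should matter more.
    """
    vec = defaultdict(float)
    pos = 0
    for tok in text.split():
        start = text.index(tok, pos)
        end = start + len(tok)
        pos = end
        inside = any(s < end and start < e for s, e in rationale_spans)
        vec[tok.lower()] += boost if inside else 1.0
    return dict(vec)

text = "The acting was superb but the plot dragged on"
# The annotator marked "acting was superb" as the rationale (LR strategy).
spans = [(4, 21)]
print(bag_of_words(text, spans))  # 'acting', 'was', 'superb' get weight 2.0
```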

Annotation cost may vary across instances, annotators, and annotation strategies. It is often not known a priori, but it can be estimated, and the estimate can be used to selectively query the annotator so as to directly minimize the total annotation cost for the desired performance. We propose a supervised regression model that uses characteristics of the instance, the annotator, and the annotation strategy to estimate the annotation cost in a multi-annotator environment with indirect feature feedback through rationales. For data collected from multiple annotators on a sentiment classification task, we show that the annotation cost estimate from the proposed approach outperforms simpler estimates based on any one of these characteristics alone.
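
A minimal sketch of such a cost model, assuming a handful of illustrative features (instance length, annotator identity, and an LO/LR strategy flag), fabricated timing data, and scikit-learn's ordinary linear regression; the thesis's actual feature set and regression model may differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data, fabricated purely for illustration. Each row
# combines characteristics of the instance (token count), the
# annotator (1 = annotator a2, 0 = annotator a1), and the strategy
# (1 = label + rationales, 0 = label only).
X = np.array([
    [120, 0, 1],
    [ 80, 1, 0],
    [200, 0, 1],
    [150, 1, 0],
    [ 60, 0, 0],
    [180, 1, 1],
], dtype=float)
y = np.array([95, 40, 150, 70, 30, 160], dtype=float)  # observed seconds

model = LinearRegression().fit(X, y)

# Estimated cost of asking annotator a1 for label + rationales on a
# 100-token instance; a selective querying policy can use such
# estimates to prefer cheaper (instance, annotator, strategy) triples.
print(f"estimated cost: {model.predict([[100, 0, 1]])[0]:.1f} s")
```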

History

Date

2012-08-27

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Eric H. Nyberg
