Carnegie Mellon University
Crowd-Sourced Wrapper Construction with End Users

thesis
posted on 2022-12-02, 21:21 authored by Steven Gardiner

The web contains a tremendous number of data sets presented visually, which computers cannot currently read. Most people, however, understand these data sets with little difficulty, suggesting the potential for applying crowd-sourcing techniques to the problem of understanding web data sets. In this thesis we study several issues in crowd-sourcing a collection of wrappers, small programs that map data sets to their logical structure, from a crowd of end users. We pay special attention to the majority of users who are not programmers.

We present a prototype system, Mixer, that allows end users to demonstrate and execute repetitive ad hoc data retrieval actions over multiple data sources. The evaluation of the prototype suggests that end users are able to construct and combine data from multiple data sources, under the strong assumption that the input to the individual query systems, as well as their output, is fully understood. Furthermore, we present another prototype system, SmartWrap, showing that end users, explicitly including non-programmers, can demonstrate actions sufficient to construct a wrapper for a data set, i.e., the instructions needed to understand the data set. A pilot crowd is able to construct wrappers for most requested data sets, but the pilot gives no guidance as to whether the wrapped data sets are useful or relevant to anyone. To narrow the search for relevant data sets, we turn to an audience that theory predicts will make use of additional structure in web pages: blind people. We present the theory and the results of a preliminary study demonstrating the concrete benefits that non-visual users of the web stand to gain from increased structure in web pages and, more specifically, from the introduction of web tables in place of template-driven visual data sets.

History

Date

2016-08-05

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Anthony Tomasic
