Crowd-Sourced Wrapper Construction with End Users
The web contains a tremendous number of data sets that are presented visually and that computers cannot currently read. Most people, however, understand these data sets with little difficulty, suggesting the potential for applying crowd-sourcing techniques to the problem of understanding web data sets. In this thesis, we study several issues in crowd-sourcing a collection of wrappers, small programs that map data sets to their logical structure, from a crowd of end users. We pay special attention to the majority of users, who are not programmers.
We present a prototype system, Mixer, that allows end users to demonstrate and execute repetitive ad hoc data retrieval actions over multiple data sources. The evaluation of the prototype suggests that end users are able to construct and combine data from multiple data sources, under the strong assumption that the input to the individual query systems, as well as their output, is fully understood. Furthermore, we present another prototype system, SmartWrap, showing that end users, explicitly including non-programmers, can demonstrate actions sufficient to construct a wrapper for a data set, i.e., the instructions needed to understand the data set. A pilot crowd is able to construct wrappers for most requested data sets, but this result gives no guidance as to whether the wrapped data sets are useful or relevant to anyone. To narrow the search for relevant data sets, we turn to an audience that theory predicts will make use of additional structure in web pages: blind people. We present the theory and the results of a preliminary study demonstrating the concrete benefits that non-visual users of the web stand to gain from increased structure in web pages and, more specifically, from the introduction of web tables in place of template-driven visual data sets.
History
Date
2016-08-05
Degree Type
- Dissertation
Department
- Language Technologies Institute
Degree Name
- Doctor of Philosophy (PhD)