Typesetting for Improved Readability using Lexical and Syntactic Information
journal contribution
posted on 2013-08-04, 00:00 authored by Ahmed Salama, Kemal Oflazer, Susan Hagan

We present results from a study that
uses syntactically and semantically motivated
information to group segments of
sentences into unbreakable units for the
purpose of typesetting those sentences in
a region of a fixed width, using an otherwise
standard dynamic programming line
breaking algorithm, to minimize raggedness.
In addition to a rule-based baseline
segmenter, we use a very modest size
text, manually annotated with positions of
breaks, to train a maximum entropy classifier,
relying on an extensive set of lexical
and syntactic features, which can then
predict whether or not to break after a certain
word position in a sentence. We also
use a simple genetic algorithm to search
for a subset of the features optimizing F1,
to arrive at a set of features that delivers
89.2% Precision, 90.2% Recall (89.7%
F1) on a test set, improving the rule-based
baseline by about 11 points and the classifier trained on all features by about 1 point
in F1.
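
To make the typesetting step concrete, here is a minimal sketch of a dynamic programming line breaker that treats each segment as an unbreakable unit. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the cost function, so this sketch assumes raggedness is the sum of squared unused space on every line except the last. The names `break_lines`, `chunks`, and `width` are hypothetical.

```python
def break_lines(chunks, width):
    """Arrange unbreakable chunks into lines of at most `width` characters.

    Assumption (not from the source): raggedness is the sum of squared
    trailing space over all lines except the last, which costs nothing.
    Each chunk is a string of one or more words that must share a line.
    """
    n = len(chunks)
    lengths = [len(c) for c in chunks]

    def line_len(i, j):
        # Width of chunks[i:j] joined by single spaces.
        return sum(lengths[i:j]) + (j - i - 1)

    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: minimal cost of laying out chunks[i:]
    nxt = [n] * (n + 1)     # nxt[i]: index where the line starting at i ends
    best[n] = 0
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            length = line_len(i, j)
            if length > width:
                break  # adding more chunks only makes the line longer
            cost = 0 if j == n else (width - length) ** 2
            if cost + best[j] < best[i]:
                best[i] = cost + best[j]
                nxt[i] = j
    if best[0] == inf:
        raise ValueError("at least one chunk is wider than the region")

    # Follow the stored break points to recover the lines.
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(chunks[i:nxt[i]]))
        i = nxt[i]
    return lines


if __name__ == "__main__":
    # Hypothetical segmentation; in the paper it would come from the
    # rule-based segmenter or the trained classifier.
    demo = ["We present", "results", "from our study", "of line breaking"]
    for line in break_lines(demo, 20):
        print(line)
```

In this sketch the segmenter's output enters only through `chunks`: a predicted "no break" between two words simply keeps them in the same string, so the line breaker itself stays standard.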