We present results from our study, which
uses syntactically and semantically motivated
information to group segments of
sentences into unbreakable units. The
sentences are then typeset in a region of
fixed width using an otherwise standard
dynamic programming line-breaking
algorithm that minimizes raggedness.
In addition to a rule-based baseline
segmenter, we use a very modestly sized
text, manually annotated with break
positions, to train a maximum entropy
classifier that relies on an extensive set of
lexical and syntactic features to predict
whether or not to break after a given
word position in a sentence. We also
use a simple genetic algorithm to search
for a subset of the features that optimizes F1,
arriving at a feature set that delivers
89.2% precision and 90.2% recall (89.7%
F1) on a test set, improving on the rule-based
baseline by about 11 points and on the
classifier trained on all features by about
1 point in F1.
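
As a rough illustration of the typesetting step described above, the following is a minimal sketch of a dynamic programming line breaker that treats each pre-grouped segment as an unbreakable unit. It is not the authors' implementation: the function name `break_lines`, the squared-slack cost, the choice to leave the last line unpenalized, and the sample segmentation are all assumptions made for this example.

```python
def break_lines(units, width):
    """Lay out unbreakable units in lines of at most `width` characters.

    `units` is a list of strings; each string may contain spaces but is
    never split across lines. Raggedness is measured (by assumption) as
    the sum of squared trailing slack over all lines except the last.
    """
    n = len(units)
    inf = float("inf")
    best = [inf] * (n + 1)   # best[i]: minimal cost of typesetting units[i:]
    nxt = [n] * (n + 1)      # nxt[i]: index of the first unit on the next line
    best[n] = 0.0

    for i in range(n - 1, -1, -1):
        length = -1  # cancels the space counted before the first unit
        for j in range(i, n):
            length += 1 + len(units[j])
            if length > width and j > i:
                break  # adding unit j overflows the line
            # A single unit wider than the region gets a line to itself.
            slack = max(width - length, 0)
            cost = 0.0 if j == n - 1 else float(slack * slack)
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                nxt[i] = j + 1

    # Follow the stored break points to recover the lines.
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(units[i:nxt[i]]))
        i = nxt[i]
    return lines


if __name__ == "__main__":
    # Hypothetical segmentation; in the paper the units come from the
    # rule-based segmenter or the trained classifier.
    sample = ["the quick", "brown fox", "jumps", "over", "the lazy dog"]
    for line in break_lines(sample, 16):
        print(line)
```

In this sketch the segmenter's only role is to decide which strings appear in `units`; the line breaker itself is unchanged from the standard formulation over individual words, which is the sense in which the algorithm is "otherwise standard".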
Publisher Statement
Published in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 719–724,
Sofia, Bulgaria, August 4–9, 2013.