Typesetting for Improved Readability using Lexical and Syntactic Information
journal contributionposted on 04.08.2013, 00:00 by Ahmed Salama, Kemal Oflazer, Susan Hagan
We present results from our study of which uses syntactically and semantically motivated information to group segments of sentences into unbreakable units for the purpose of typesetting those sentences in a region of a fixed width, using an otherwise standard dynamic programming line breaking algorithm, to minimize raggedness. In addition to a rule-based baseline segmenter, we use a very modest size text, manually annotated with positions of breaks, to train a maximum entropy classifier, relying on an extensive set of lexical and syntactic features, which can then predict whether or not to break after a certain word position in a sentence. We also use a simple genetic algorithm to search for a subset of the features optimizing F1, to arrive at a set of features that delivers 89.2% Precision, 90.2% Recall (89.7% F1) on a test set, improving the rule-based baseline by about 11 points and the classifier trained on all features by about 1 point in F1.