Typesetting for Improved Readability using Lexical and Syntactic Information

Salama, Ahmed; Oflazer, Kemal; Hagan, Susan

doi:10.1184/R1/6368081.v1

journal contribution

Posted on 2013-08-04; authored by Ahmed Salama, Kemal Oflazer, Susan Hagan

We present results from a study that uses syntactically and semantically motivated information to group segments of sentences into unbreakable units, so that those sentences can be typeset in a region of fixed width with an otherwise standard dynamic programming line-breaking algorithm that minimizes raggedness. In addition to a rule-based baseline segmenter, we use a very modest amount of text, manually annotated with break positions, to train a maximum entropy classifier that relies on an extensive set of lexical and syntactic features and predicts whether or not to break after a given word position in a sentence. We also use a simple genetic algorithm to search for a subset of the features that optimizes F1, arriving at a feature set that delivers 89.2% precision and 90.2% recall (89.7% F1) on a test set, improving on the rule-based baseline by about 11 points and on the classifier trained on all features by about 1 point in F1.
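The line-breaking step described in the abstract can be illustrated with a minimal sketch. It assumes the sentence has already been segmented into unbreakable units (by the rule-based segmenter or the classifier), and it assumes a raggedness cost equal to the sum of squared trailing spaces on every line except the last. The paper's exact cost function is not given here, and the function name `break_lines` is hypothetical.

```python
from typing import List


def break_lines(chunks: List[str], width: int) -> List[str]:
    """Place unbreakable chunks on lines of at most `width` characters.

    Assumed cost (not taken from the paper): squared trailing spaces,
    summed over all lines except the last one.
    """
    n = len(chunks)
    inf = float("inf")
    # best[i]: minimal cost of laying out chunks[i:]
    # nxt[i]:  index of the chunk that starts the line after the one starting at i
    best = [inf] * (n + 1)
    nxt = [n] * (n + 1)
    best[n] = 0.0

    for i in range(n - 1, -1, -1):
        length = -1  # cancels the space counted before the first chunk
        for j in range(i, n):
            length += 1 + len(chunks[j])
            if length > width and j > i:
                break  # chunks[i..j] no longer fit on one line
            # A single chunk wider than the region is placed alone on its line.
            slack = max(width - length, 0)
            cost = 0 if j == n - 1 else slack ** 2
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                nxt[i] = j + 1

    lines = []
    i = 0
    while i < n:
        lines.append(" ".join(chunks[i:nxt[i]]))
        i = nxt[i]
    return lines


if __name__ == "__main__":
    units = ["the quick", "brown fox", "jumps", "over the lazy dog"]
    for line in break_lines(units, 20):
        print(line)
```

With a width of 20, a greedy filler would leave "jumps" alone on the second line. The dynamic program instead moves "brown fox" down to join it, which gives a less ragged right edge while never splitting a unit.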

History

Publisher Statement

Published in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 719–724, Sofia, Bulgaria, August 4-9 2013.

Date

2013-08-04

Keywords

Readability; Optimal typesetting; Machine Learning

Licence

CC BY-NC-SA 4.0

