Arabic script writing is typically underspecified
for short vowels and other markup,
referred to as diacritics. Apart from the
lexical ambiguity found in words, similar
to that exhibited in other languages, the
lack of diacritics in written Arabic script
adds another layer of ambiguity which is
an artifact of the orthography. Diacritization
of written text has a significant impact
on Arabic NLP applications. In this
paper, we present a pilot study on building
a diacritized multi-genre corpus in
Arabic. We annotate a sample of non-diacritized
words extracted from five text
genres. We explore different annotation
strategies: Basic (only the bare undiacritized
forms are shown to the annotators),
Intermediate (the Basic forms plus their POS
tags), and Advanced (automatically diacritized
words). We report the impact of
the annotation strategy on annotation quality.
Moreover, we study different diacritization
schemes in the process.
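
To make the orthographic ambiguity described above concrete, the following minimal sketch strips the basic Arabic diacritics (Unicode U+064B through U+0652) from two fully diacritized words and shows that both collapse to the same bare form. The function name and the two example words (kataba, "he wrote", and kutub, "books") are illustrative assumptions and are not taken from the paper or its corpus.

```python
# Illustrative sketch only: the helper name and example words are assumptions,
# not part of the study described in the abstract.

def strip_diacritics(text: str) -> str:
    """Drop the basic Arabic diacritics (tanween, short vowels, shadda, sukun)."""
    return "".join(ch for ch in text if not ("\u064B" <= ch <= "\u0652"))

# kaf + fatha, ta + fatha, ba + fatha: kataba, "he wrote"
kataba = "\u0643\u064E\u062A\u064E\u0628\u064E"
# kaf + damma, ta + damma, ba: kutub, "books"
kutub = "\u0643\u064F\u062A\u064F\u0628"
# kaf, ta, ba with no diacritics: the form normally found in written text
bare = "\u0643\u062A\u0628"

# Both diacritized words reduce to the same undiacritized string.
same_bare_form = strip_diacritics(kataba) == strip_diacritics(kutub) == bare
```

An annotator working under the Basic strategy would see only the bare form, which is why the surrounding context or additional cues such as POS tags matter for choosing the intended diacritization.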
Publisher Statement
Published in Proceedings of the Second Workshop on Arabic Natural Language Processing, pages 80–88,
Beijing, China, July 26-31, 2015.