W15-3209.pdf (149.42 kB)
A Pilot Study on Arabic Multi-Genre Corpus Diacritization Annotation
journal contribution
posted on 2018-07-26, 00:00 authored by Houda Bouamor, Wajdi Zaghouani, Mona Diab, Ossama Obeid, Kemal Oflazer, Mahmoud Ghoneim, Abdelati Hawwari
Arabic script writing is typically underspecified
for short vowels and other markup,
referred to as diacritics. Apart from the
lexical ambiguity found in words, similar
to that exhibited in other languages, the
lack of diacritics in written Arabic script
adds another layer of ambiguity which is
an artifact of the orthography. Diacritization
of written text has a significant impact
on Arabic NLP applications. In this
paper, we present a pilot study on building
a diacritized multi-genre corpus in
Arabic. We annotate a sample of non-diacritized
words extracted from five text
genres. We explore different annotation
strategies: Basic, where we present only
the bare undiacritized forms to the annotators,
Intermediate (Basic forms + their POS
tags), and Advanced (automatically diacritized
words). We present the impact of
the annotation strategy on annotation quality.
Moreover, we study different diacritization
schemes in the process.