A Pilot Study on Arabic Multi-Genre Corpus Diacritization Annotation

Bouamor, Houda; Zaghouani, Wajdi; Diab, Mona; Obeid, Ossama; Oflazer, Kemal; Ghoneim, Mahmoud; Hawwari, Abdelati

doi:10.1184/R1/6373148.v1

W15-3209.pdf (149.42 kB)

journal contribution

posted on 2018-07-26, 00:00 authored by Houda Bouamor, Wajdi Zaghouani, Mona Diab, Ossama Obeid, Kemal Oflazer, Mahmoud Ghoneim, Abdelati Hawwari

Arabic script is typically underspecified for short vowels and other markup, referred to as diacritics. Apart from the lexical ambiguity found in words, similar to that exhibited in other languages, the lack of diacritics in written Arabic adds another layer of ambiguity that is an artifact of the orthography. Diacritization of written text has a significant impact on Arabic NLP applications. In this paper, we present a pilot study on building a diacritized multi-genre corpus in Arabic. We annotate a sample of undiacritized words extracted from five text genres. We explore three annotation strategies: Basic, where we present only the bare undiacritized forms to the annotators; Intermediate, where we present the Basic forms together with their POS tags; and Advanced, where we present automatically diacritized words. We report the impact of the annotation strategy on annotation quality. Moreover, we study different diacritization schemes in the process.
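To make the notion of a "bare undiacritized form" concrete, the following is a minimal Python sketch of how a diacritized word could be reduced to such a form. It is an illustration only, not the procedure used in the paper: the function name `strip_diacritics` and the chosen set of code points (U+064B to U+0652 plus the superscript alef U+0670) are assumptions made for this example.

```python
# Hypothetical helper for illustration; not taken from the paper.
# Assumed diacritic set: tanwin, short vowels, shadda, sukun
# (U+064B..U+0652) plus the superscript (dagger) alef U+0670.
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)} | {"\u0670"}


def strip_diacritics(text: str) -> str:
    """Return text with the assumed Arabic diacritic marks removed."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


if __name__ == "__main__":
    # kaf + fatha, ta + fatha, ba + fatha ("kataba", "he wrote")
    diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
    # Only the three base letters remain: kaf, ta, ba
    print(strip_diacritics(diacritized))
```

The same bare form can correspond to several diacritized readings, which is the orthographic ambiguity the abstract refers to.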

History

Publisher Statement

Published in Proceedings of the Second Workshop on Arabic Natural Language Processing, pages 80–88, Beijing, China, July 26-31, 2015.

Date

2018-07-26

Keywords

Arabic Diacritization

Licence

CC BY-NC-SA 4.0
