We present a method for generating Colloquial
Egyptian Arabic (CEA) from morphologically disambiguated
Modern Standard Arabic (MSA).
When used in POS tagging, this process improves
accuracy from 73.24% to 86.84% on unseen
CEA text, and reduces the percentage of out-of-vocabulary
words from 28.98% to 16.66%. The
process holds promise for any NLP task targeting
the dialectal varieties of Arabic; for example, this approach
may provide a cheap way to leverage MSA data
and morphological resources to create resources
for colloquial Arabic-to-English machine translation.
It can also considerably speed up the annotation
of Arabic dialects.
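
To make the out-of-vocabulary (OOV) figure concrete, the following is a minimal sketch of how a token-level OOV percentage could be computed before and after adding generated CEA forms to the known vocabulary. The function name `oov_rate` and all tokens are hypothetical placeholders invented for illustration; they are not the paper's data, and the paper's exact counting procedure may differ.

```python
def oov_rate(known_tokens, test_tokens):
    """Return the percentage of test tokens absent from the known vocabulary."""
    vocab = set(known_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocab)
    return 100.0 * unseen / len(test_tokens)


# Placeholder tokens (assumption): stand-ins for real MSA and CEA word forms.
msa_train = ["w1", "w2", "w3", "w4"]
cea_test = ["w1", "w5", "w2", "w6"]
generated_cea = ["w5"]  # forms produced by an MSA-to-CEA generation step

baseline = oov_rate(msa_train, cea_test)
augmented = oov_rate(msa_train + generated_cea, cea_test)
print(f"OOV before: {baseline:.2f}%  after: {augmented:.2f}%")
```

In this toy setting the rate drops because one previously unseen test token is covered by the generated forms, which is the same kind of effect the reported reduction measures.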
Publisher Statement
Published in Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 176–180,
Jeju, Republic of Korea, 8–14 July 2012.