Carnegie Mellon University
2022_LREC_DID_FINAL.pdf (280.4 kB)

Aggregating Hierarchical Dialectal Data for Arabic Dialect Classification

preprint
posted on 2022-07-13, 20:08 authored by Nurpeiis Baimukan, Houda Bouamor, Nizar Habash
Arabic is a collection of dialectal variants that are historically related but significantly different. These differences can be seen across regions, countries, and even cities within the same country. Previous work on Arabic dialect identification has focused mainly on specific dialect levels (region, country, province, or city) using level-specific resources, and different efforts have used different schemas and labels. In this paper, we present the first effort aiming at defining a standard unified three-level hierarchical schema (region-country-city) for dialectal Arabic classification. We map 29 different data sets to this unified schema, and use the common mapping to facilitate aggregating these data sets. We test the value of such aggregation by building language models and using them in dialect identification. We make our label mapping code and aggregated language models publicly available.
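
The label mapping described in the abstract can be pictured with a minimal sketch. This is not the authors' released code: the data set names, raw labels, region names, and the `DialectLabel`, `LABEL_MAP`, and `aggregate` identifiers are all hypothetical and chosen only for illustration. The sketch assumes each source label maps to a (region, country, city) triple in which the finer levels may be missing, and that aggregation at a given level simply groups sentences by the triple truncated to that level.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class DialectLabel(NamedTuple):
    """Unified three-level label; finer levels may be unknown (None)."""
    region: str
    country: Optional[str] = None
    city: Optional[str] = None


# Hypothetical mapping from (data set, raw label) to the unified schema.
# None of these names come from the paper; they only illustrate the idea.
LABEL_MAP = {
    ("corpus_a", "CAI"): DialectLabel("nile_basin", "eg", "cairo"),
    ("corpus_b", "Egypt"): DialectLabel("nile_basin", "eg"),
    ("corpus_b", "Morocco"): DialectLabel("maghreb", "ma"),
    ("corpus_c", "GLF"): DialectLabel("gulf"),
}

LEVEL_DEPTH = {"region": 1, "country": 2, "city": 3}


def aggregate(samples, level):
    """Group sentences by their unified label truncated to `level`.

    Samples whose source label is coarser than `level` are skipped,
    since they cannot be placed at that granularity.
    """
    depth = LEVEL_DEPTH[level]
    grouped = defaultdict(list)
    for dataset, raw_label, sentence in samples:
        key = LABEL_MAP[(dataset, raw_label)][:depth]
        if None in key:
            continue  # not labeled at this granularity
        grouped[key].append(sentence)
    return dict(grouped)


samples = [
    ("corpus_a", "CAI", "sentence 1"),
    ("corpus_b", "Egypt", "sentence 2"),
    ("corpus_b", "Morocco", "sentence 3"),
    ("corpus_c", "GLF", "sentence 4"),
]

by_region = aggregate(samples, "region")
by_country = aggregate(samples, "country")
by_city = aggregate(samples, "city")
```

Under these assumptions, a city-level sample also contributes to its country and region groups, while a region-only sample contributes to the region group alone. The per-group sentence lists could then serve as training text for level-specific language models.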

History

Date

2022-05-08
