A Human Judgment Corpus and a Metric for Arabic MT Evaluation
journal contribution, posted on 2014-10-01, authored by Houda Bouamor, Hanan Alshikhabobakr, Behrang Mohit, and Kemal Oflazer

We present a human judgments dataset
and an adapted metric for the evaluation of Arabic machine translation. Our medium-scale dataset is the first of its kind for Arabic with high annotation quality. We use the dataset to adapt the BLEU score to Arabic. Our score (AL-BLEU) gives partial credit for stem and morphological matching between hypothesis and reference words. We evaluate BLEU, METEOR, and AL-BLEU on our human judgments corpus and show that AL-BLEU has the highest correlation with human judgments. We release the dataset and software to the research community.
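The core idea of partial credit for stem and morphological matches can be sketched as a token-level scoring function. This is a minimal illustration only: the credit weights, the token representation, and the example tokens below are assumptions made for this sketch, not the actual values or tuning from the paper.

```python
# Illustrative sketch of partial-credit token matching in the spirit of
# AL-BLEU. The weights (w_stem, w_morph) and the (surface, stem, features)
# token representation are assumptions for illustration, not the paper's
# actual parameters.

def token_credit(hyp, ref, w_stem=0.8, w_morph=0.6):
    """Score a hypothesis token against a reference token.

    Each token is a (surface, stem, morph_features) triple.
    An exact surface match earns full credit; failing that, a shared
    stem earns w_stem, and matching morphological features earn w_morph.
    """
    hyp_surface, hyp_stem, hyp_morph = hyp
    ref_surface, ref_stem, ref_morph = ref
    if hyp_surface == ref_surface:
        return 1.0
    if hyp_stem == ref_stem:
        return w_stem
    if hyp_morph and hyp_morph == ref_morph:
        return w_morph
    return 0.0

# Hypothetical Buckwalter-style tokens sharing a stem but differing
# in surface form and morphological features:
hyp = ("ktbt", "ktb", "VERB+1S+PERF")
ref = ("ktbnA", "ktb", "VERB+1P+PERF")
print(token_credit(hyp, ref))  # shared stem -> 0.8
```

In a BLEU-style metric, these fractional credits would replace the binary match counts inside the n-gram precision computation, so that morphologically related word forms are not penalized as complete mismatches.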