<div>We present a human judgments dataset and an adapted metric for evaluating Arabic machine translation. Our medium-scale dataset is the first of its kind for Arabic with high annotation quality. We use the dataset to adapt the BLEU score for Arabic. Our score (AL-BLEU) gives partial credit for stem and morphological matches between hypothesis and reference words. We evaluate BLEU, METEOR, and AL-BLEU on our human judgments corpus and show that AL-BLEU has the highest correlation with human judgments. We are releasing the dataset and software to the research community.</div>
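The partial-credit idea behind AL-BLEU can be illustrated with a small token-level scoring sketch. This is not the authors' implementation: the weights, the `(surface, stem, morph_tags)` token representation, and the overlap formula are illustrative assumptions only.

```python
# Hedged sketch of partial-credit token matching in the spirit of AL-BLEU.
# The weights (w_stem, w_morph) and the token representation are
# hypothetical; the actual metric's parameters come from the paper.

def token_match(hyp, ref, w_stem=0.8, w_morph=0.6):
    """Score one hypothesis token against one reference token.

    Each token is a (surface, stem, morph_tags) triple, where
    morph_tags is a frozenset of morphological features (an assumed
    format). Returns 1.0 for an exact surface match, a partial credit
    for a stem match, a smaller credit proportional to morphological
    tag overlap, and 0.0 otherwise.
    """
    h_surf, h_stem, h_morph = hyp
    r_surf, r_stem, r_morph = ref
    if h_surf == r_surf:
        return 1.0                      # full credit: exact match
    if h_stem == r_stem:
        return w_stem                   # partial credit: same stem
    overlap = len(h_morph & r_morph)
    if overlap:                         # partial credit: shared morphology
        return w_morph * overlap / max(len(h_morph), len(r_morph))
    return 0.0
```

In a full metric these per-token scores would replace the binary match counts inside the modified n-gram precision of BLEU; here only the token-level scoring is sketched.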