We present a novel scheme to apply factored
phrase-based SMT to a language pair
with very disparate morphological structures.
Our approach relies on syntactic
analysis on the source side (English)
and then encodes a wide variety of local
and non-local syntactic structures as complex
structural tags which appear as additional
factors in the training data. On
the target side (Turkish), we only perform
morphological analysis and disambiguation
but treat the complete complex
morphological tag as a factor, instead of
separating morphemes. We incrementally
explore capturing various syntactic substructures
as complex tags on the English
side, and evaluate how our translations
improve in BLEU scores. Our maximal
set of source and target side transformations,
coupled with some additional
techniques, provides a 39% relative improvement
from a baseline of 17.08 to 23.78
BLEU, all averaged over 10 training and
test sets. Now that the syntactic analysis
on the English side is available, we
also experiment with more long-distance
constituent reordering to bring the English
constituent order close to Turkish, but find
that these transformations do not provide
any additional consistent tangible gains
when averaged over the 10 sets.
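
As a quick check of the headline figure, the relative improvement can be recomputed from the two averaged BLEU scores quoted above. This is a minimal sketch; the function and variable names are illustrative and do not come from the paper.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Return the gain of `improved` over `baseline` as a percentage of `baseline`."""
    return (improved - baseline) / baseline * 100.0


# Averaged BLEU scores as reported in the abstract (hypothetical variable names).
baseline_bleu = 17.08
best_bleu = 23.78

gain = relative_improvement(baseline_bleu, best_bleu)
print(f"absolute gain: {best_bleu - baseline_bleu:.2f} BLEU")
print(f"relative gain: {gain:.1f}%")
```

The absolute gain is 6.70 BLEU points, which is roughly 39% of the 17.08 baseline.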
Publisher Statement
Published in Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pages 454–464,
Uppsala, Sweden, 11-16 July 2010.