This paper presents preliminary results
from, and problems encountered in, developing
a statistical machine translation system
from English to Turkish. Starting with
a baseline word-based model trained on about
20K aligned sentences, we explore various
ways of exploiting morphological structure
to improve upon the baseline system.
As Turkish is a language with complex
agglutinative word structures, we experiment
with morphologically segmented
and disambiguated versions of the parallel
texts in order to also uncover relations between
morphemes and function words in
one language and morphemes and function
words in the other, in addition to relations
between open-class content words.
Morphological segmentation on the Turkish
side also conflates the statistics from
allomorphs so that sparseness can be alleviated
to a certain extent. We find
that this approach coupled with a simple
grouping of most frequent morphemes and
function words on both sides improves the
BLEU score from the baseline of 0.0752
to 0.0913 with the small training data. We
close with a discussion of why one should
not expect distortion parameters to model
word-local morpheme ordering and why a
new approach to handling complex morphotactics
is needed.
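
To illustrate how segmentation can conflate statistics from allomorphs, the following is a minimal sketch, not the system described in the paper. It assumes words have already been segmented with "+" as the morpheme boundary; the table ALLOMORPH_TO_TAG, the tag names (+PLU, +LOC), and the three-word corpus are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Hypothetical, hand-written table: surface allomorphs -> abstract morpheme tag.
# A real system would obtain these tags from a morphological analyzer.
ALLOMORPH_TO_TAG = {
    "lar": "+PLU", "ler": "+PLU",                                # plural
    "da": "+LOC", "de": "+LOC", "ta": "+LOC", "te": "+LOC",      # locative
}

def conflate(segmented_word):
    """Map each suffix allomorph to its abstract tag; keep the root unchanged."""
    root, *suffixes = segmented_word.split("+")
    return [root] + [ALLOMORPH_TO_TAG.get(s, "+" + s) for s in suffixes]

# Toy corpus of pre-segmented Turkish words (illustrative only).
corpus = ["ev+ler+de", "kitap+lar+da", "sokak+ta"]

# Counts over surface suffixes versus counts over abstract tags.
surface_counts = Counter(s for w in corpus for s in w.split("+")[1:])
tag_counts = Counter(t for w in corpus for t in conflate(w)[1:])

print(len(surface_counts), "surface suffix types")
print(len(tag_counts), "abstract morpheme types")
```

In this toy corpus the five distinct surface suffixes collapse into two abstract morphemes, so each abstract morpheme is observed more often than any single allomorph, which is the sense in which sparseness is reduced.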
Publisher Statement
Published in Proceedings of the Workshop on Statistical Machine Translation, pages 7–14,
New York City, June 2006.