Posted on 2002-03-01, 00:00. Authored by Jaime G. Carbonell, Steve Klein, David Miller, Mike Steinbaum, Tomer Grassiany, Jochen Frey
Context-Based Machine Translation™
(CBMT) is a new paradigm for corpus-based
translation that requires no parallel
text. Instead, CBMT relies on a lightweight
translation model utilizing a full-form
bilingual dictionary and a sophisticated
decoder using long-range context
via long n-grams and cascaded overlapping.
The translation process is enhanced
via in-language substitution of tokens and
phrases, both for source and target, when
top candidates cannot be confirmed or resolved
in decoding. Substitution utilizes a
synonym and near-synonym generator implemented
as a corpus-based unsupervised
learning process. Decoding requires a very
large target-language-only corpus, and
while substitution in target can be performed
using that same corpus, substitution
in source requires a separate (and
smaller) source monolingual corpus.
Spanish-to-English CBMT was tested on
Spanish newswire text, achieving a BLEU
score of 0.6462 in June 2006, the highest
BLEU reported for any language pair.
Further testing also shows that quality rises
above the reported score as the
target corpus size increases and as dictionary
coverage of source words and phrases
becomes more complete.
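
To make the idea of confirming dictionary candidates against a target-only corpus and joining them by overlap more concrete, here is a minimal toy sketch in Python. It is not the authors' decoder: the function names (`ngrams`, `window_candidates`, `chain`, `decode`), the tiny dictionary, and the two-sentence corpus are all hypothetical, and the sketch assumes monotone word order, one target word per source word, and n of at least 2. It omits the substitution step and any scoring.

```python
from itertools import product


def ngrams(tokens, n):
    """Return the set of n-grams (as tuples) found in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def window_candidates(source, dictionary, attested, n):
    """For each n-word source window, build word-by-word translations
    from the dictionary and keep only those attested in the target corpus."""
    result = []
    for i in range(len(source) - n + 1):
        options = [dictionary[word] for word in source[i:i + n]]
        result.append([c for c in product(*options) if c in attested])
    return result


def chain(candidates):
    """Join candidates from consecutive windows when the last n-1 words
    of the path so far equal the first n-1 words of the next candidate."""
    if not candidates:
        return []
    paths = [list(c) for c in candidates[0]]
    for window in candidates[1:]:
        paths = [
            path + [cand[-1]]
            for path in paths
            for cand in window
            if tuple(path[-(len(cand) - 1):]) == cand[:-1]
        ]
    return paths


def decode(source, dictionary, corpus, n=3):
    """Toy decoder (assumes n >= 2): confirm candidates, then chain overlaps."""
    attested = set()
    for sentence in corpus:
        attested |= ngrams(sentence.split(), n)
    return chain(window_candidates(source, dictionary, attested, n))


# Hypothetical full-form dictionary and target-language-only corpus.
toy_dictionary = {
    "el": ["the"],
    "banco": ["bank", "bench"],
    "abre": ["opens"],
    "hoy": ["today"],
}
toy_corpus = [
    "the bank opens today at nine",
    "the bench is old",
]

translations = decode("el banco abre hoy".split(), toy_dictionary, toy_corpus)
print(translations)  # [['the', 'bank', 'opens', 'today']]
```

In this toy run the ambiguous word "banco" is resolved to "bank" only because "the bank opens" and "bank opens today" both occur in the target corpus, while "the bench opens" does not. A larger target corpus attests more candidate n-grams, which is consistent with the abstract's observation that quality improves with corpus size.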