Carnegie Mellon University
Learning Multiple Non-Terminal Synchronous Grammar

thesis
posted on 2025-04-18, 19:05 authored by Andreas Zollmann

Recent work in machine translation has evolved from the traditional word- and phrase-based models to include hierarchical phrase-based and syntax-based models. These advances are motivated by the desire to integrate richer knowledge into the translation process in order to explicitly address limitations of the purely lexical phrase-based model.

Generalized phrases, as discussed in (Chiang, 2005), attempt to directly address the limitations of purely lexical phrases, and have shown significant improvements in translation quality by introducing constructs for sub-phrase representation. However, generalizations are represented by a single sub-phrase category (plus a glue rule for serial combination), which provides the ability (and the risk) of inserting any available sub-phrase into a larger phrase.
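
As an illustration of the single-category formalism described above, a hierarchical rule and the glue rules can be written as follows. This is a sketch in standard PSCFG notation; the example words are illustrative and are not taken from this record.

```latex
\begin{align*}
X &\rightarrow \langle\, \text{yu } X_1 \text{ you } X_2,\ \text{have } X_2 \text{ with } X_1 \,\rangle \\
S &\rightarrow \langle\, S_1\, X_2,\ S_1\, X_2 \,\rangle \\
S &\rightarrow \langle\, X_1,\ X_1 \,\rangle
\end{align*}
```

Because every gap carries the same label X, any available sub-phrase may fill any gap, which is the over-generalization risk noted above.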

The first contribution of this dissertation work is the grammar extraction method of syntax-augmented machine translation (SAMT), an extension to Chiang’s model that provides multiple generalization types based on the phrase-structure parse trees of the training target sentences. We report improvements over strong phrase-based as well as hierarchical phrase-based baselines for French-to-English, Chinese-to-English, and Urdu-to-English.

We then propose several improvements to hierarchical and syntax-augmented MT. We add a source-span variance model that estimates rule probabilities based on the number of source words spanned by the rule and its substituted child rules, introduce methods of combining hierarchical and syntax-based PSCFG models, and experiment with syntax-augmented MT variants based on source-side syntax as well as joint source and target syntax.

Syntax-based models such as SAMT typically rely on word alignments and parse trees of the training sentence pairs, which are assumed to be correct. In reality, these alignments and parses are not human-generated, but instead result from the most probable configuration of a stochastic model. We provide a method to induce grammars over hidden alignments and parses, approximated from N-best lists. We present results showing improvements for hierarchical phrase-based MT as well as SAMT when using the widened pipeline.

The SAMT model presupposes the availability of phrase-structure parse trees for the target training sentences. However, syntactic parsers are only available for a limited set of languages. We propose methods to label probabilistic synchronous context-free grammar (PSCFG) rules using only word tags, generated by either part-of-speech analysis or unsupervised word class induction. The proposals range from simple tag-combination schemes to a phrase clustering model that can incorporate an arbitrary number of features. Our models improve translation quality over Chiang’s hierarchical phrase-based MT model on the NIST large-resource Chinese-to-English translation task. These improvements persist when using automatically learned word tags, suggesting broad applicability of our technique across diverse language pairs for which syntactic resources are not available.
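
To make the idea of a simple tag-combination scheme concrete, here is a minimal sketch that labels a phrase by the word tags at its left and right boundaries. The function name `boundary_tag_label`, the `+` separator, and the single-word convention are assumptions made for illustration; the abstract does not specify the exact scheme used in the thesis.

```python
from typing import List


def boundary_tag_label(tags: List[str], start: int, end: int) -> str:
    """Return a nonterminal label for the phrase covering tags[start..end].

    Hypothetical scheme (an assumption, not confirmed by the abstract):
    a multi-word phrase is labeled with its first and last word tags
    joined by '+'; a single-word phrase is labeled with its own tag.
    Both indices are inclusive.
    """
    if not (0 <= start <= end < len(tags)):
        raise ValueError("phrase span is outside the sentence")
    if start == end:
        return tags[start]
    return f"{tags[start]}+{tags[end]}"


# Word tags for the toy target sentence "she reads the book".
sentence_tags = ["PRP", "VBZ", "DT", "NN"]

print(boundary_tag_label(sentence_tags, 2, 3))  # phrase "the book"
print(boundary_tag_label(sentence_tags, 1, 1))  # phrase "reads"
```

A label built this way needs only a tagger or induced word classes, not a parser, which is what makes such schemes usable for languages without syntactic resources.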

History

Date

2011-07-01

Degree Type

  • Dissertation

Thesis Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Stephan Vogel
