Carnegie Mellon University

Dual Subtitles as Parallel Corpora

journal contribution
posted on 2014-05-01, 00:00 authored by Shikun Zhang, Wang Ling, Chris Dyer

In this paper, we leverage the existence of dual subtitles as a source of parallel data. Dual subtitles present viewers with two languages simultaneously and are generally aligned at the segment level, which removes the need to perform this alignment automatically. This is desirable because the extracted parallel data does not contain the alignment errors present in previous work, which aligns different subtitle files for the same movie. We present a simple heuristic to detect and extract dual subtitles and show that more than 20 million sentence pairs can be extracted for the Mandarin-English language pair. We also show that extracting data from this source can be a viable solution for improving Machine Translation systems in the domain of subtitles.
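
The abstract does not spell out the detection heuristic, so the following is only an illustrative sketch of how such a check might look, not the authors' method. It assumes SubRip-style (.srt) input, treats a segment as dual when it contains at least one mostly-Chinese line and at least one mostly-Latin line, and flags a file as dual when a hypothetical fraction (`threshold`) of its segments qualify. The function names, the character ranges, and the threshold are all assumptions made for illustration.

```python
import re

# Assumption: Mandarin text is detected via the CJK Unified Ideographs block,
# English text via ASCII letters. Real subtitle files may need broader ranges.
CJK = re.compile(r"[\u4e00-\u9fff]")
LATIN = re.compile(r"[A-Za-z]")


def classify(line):
    """Return 'zh', 'en', or None for a line with no letters of either script."""
    cjk = len(CJK.findall(line))
    latin = len(LATIN.findall(line))
    if cjk == 0 and latin == 0:
        return None  # index lines, timestamps, punctuation-only lines
    return "zh" if cjk >= latin else "en"


def split_segments(srt_text):
    """Split SubRip text into segments separated by blank lines."""
    return [b for b in re.split(r"\n\s*\n", srt_text.strip()) if b.strip()]


def extract_pairs(srt_text):
    """Collect (Mandarin, English) pairs from segments that contain both."""
    pairs = []
    for block in split_segments(srt_text):
        lines = [l.strip() for l in block.splitlines() if l.strip()]
        text = [l for l in lines if "-->" not in l]  # drop the timestamp line
        zh = [l for l in text if classify(l) == "zh"]
        en = [l for l in text if classify(l) == "en"]
        if zh and en:
            pairs.append(("".join(zh), " ".join(en)))
    return pairs


def is_dual(srt_text, threshold=0.5):
    """Hypothetical file-level test: enough segments carry both languages."""
    segments = split_segments(srt_text)
    if not segments:
        return False
    return len(extract_pairs(srt_text)) / len(segments) >= threshold
```

Because both languages sit in the same timed segment, the pair comes out already aligned; no cross-file time matching is needed, which is the advantage the abstract describes.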

History

Publisher Statement

Copyright by the European Language Resources Association

Date

2014-05-01
