Carnegie Mellon University

Towards Integrated Acoustic Models for Speech Synthesis

thesis
posted on 2022-12-02, 21:21 authored by Prasanna Kumar Muthukumar

All Statistical Parametric Speech Synthesizers consist of a linear pipeline of components. Under this view, the synthesizer has a top-down structure: data fed into the synthesizer goes to the front-end, then to the prediction algorithm, then to waveform generation, and so on until the speech is finally constructed. Each component in this pipeline naïvely receives a stream of numbers from the preceding component and emits a stream of numbers for the next one in line, with little to no knowledge of what happens in the rest of the pipeline. In this thesis, I argue against this “Markovian” structure and instead propose an integrated structure, in which every component in the system influences, and is in turn influenced by, every other component. This thesis describes four sets of experiments that move towards this idea. The first uses lexical information to improve waveform generation algorithms. The second tries to increase the interaction between prediction algorithms and waveform generation. The third attempts to derive phonemes and phonetic information automatically from the speech rather than from the text. The last, and probably the hardest, describes an idea for an evaluation metric that pays attention to multiple components of the synthesizer rather than focusing on just a single one.

History

Date

2016-05-03

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Alan W Black
