Improving Prosody Through Analysis by Synthesis
Prosody and prosodic models in trainable speech synthesis systems are often trained on large corpora of automatically annotated data; however, these annotations are often incorrect. In practice, this problem has either been addressed through labor-intensive manual annotation or simply ignored. To overcome it and improve prosodic realization, an iterative model-based method is proposed for improving the linguistic structure, segmentation, and prosodic annotations that correspond to the delivery of each utterance, regularized across the data. In each iteration, the training utterances are resynthesized according to the existing symbolic annotation. Values of various features and subgraph structures are then "twiddled": each is perturbed subject to the features and constraints of the model. Twiddled utterances are evaluated with an objective function appropriate to the type of perturbation and compared with the unmodified, resynthesized utterance. The instance with the least error is assigned as the current annotation, and the entire process is repeated. At each iteration the model is re-estimated, and the distributions and annotations regularize across the corpus. As a result, the annotations have more accurate and effective distributions, which leads to improved control and expressiveness given the features of the model.
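The iterative perturb-evaluate-reassign loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the annotation features, the step size, and the toy objective function are all hypothetical stand-ins (a real system would compare each resynthesized candidate against the recorded utterance).

```python
def synthesis_error(annotation):
    """Hypothetical objective: distance between the utterance resynthesized
    from `annotation` and the original recording (lower is better).
    Toy stand-in: distance of each feature value from zero."""
    return sum(abs(v) for v in annotation.values())

def twiddle(annotation, step=0.5):
    """Generate candidates, each perturbing one annotation feature by a
    small step, standing in for model-constrained perturbations."""
    for key in annotation:
        for delta in (-step, +step):
            candidate = dict(annotation)
            candidate[key] = annotation[key] + delta
            yield candidate

def refine(annotation, iterations=10):
    """Keep the least-error candidate each round; stop when no twiddle
    improves on the current annotation."""
    best, best_err = annotation, synthesis_error(annotation)
    for _ in range(iterations):
        improved = False
        for candidate in twiddle(best):
            err = synthesis_error(candidate)
            if err < best_err:
                best, best_err = candidate, err
                improved = True
        if not improved:
            break
    return best, best_err

# Hypothetical symbolic annotation for one utterance.
ann = {"pitch_accent": 2.0, "boundary_tone": -1.5, "duration_scale": 0.75}
refined, err = refine(ann)
```

In the full method, each accepted candidate becomes the current annotation for its utterance, and the model itself is re-estimated over the whole corpus before the next round of twiddling.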