Carnegie Mellon University

De-Entanglement: A Framework towards building Ubiquitous speech technologies

thesis
posted on 2023-01-06, 21:34 authored by Sai Krishna Rallabandi

Speech-driven devices and interfaces such as Apple HomePod, Google Home, and Amazon Echo are becoming increasingly ubiquitous and have tremendous potential to affect our daily lives. However, the deep learning models underlying these applications face as-yet-unaddressed challenges such as scalability and explainability, along with concerns such as privacy and security.

In my dissertation, I propose a framework called De-Entanglement that treats linguistic concepts as first-class objects. De-Entanglement attempts to build speech technology using two core concepts, referred to as content and style. Specifically, content encompasses acoustic-phonetic information, while style encompasses paralinguistic information from the raw audio. I provide experiments that show how De-Entanglement can address three challenges in a holistic fashion:

  • (1) Scalability: How to build speech technologies for new languages and language phenomena such as code-switching? I demonstrate how De-Entanglement helps build more natural Text-to-Speech (TTS) voices. This part of the work has been deployed as an Android application in 13 Indian languages and has been assisting people since 2016.
  • (2) Flexibility: How to build models that can be manipulated to accomplish a variety of functions such as fine-tuning, meta-learning, augmentation, and self-training? I present experiments with two types of models. In the context of generative models, I present an approach showing that De-Entanglement allows explicit global and local control of synthetic voices. In the context of discriminative models, I present approaches that leverage style information to detect paralinguistic events in a speech utterance.
  • (3) Explainability: How to build technology that is reasonable to the stakeholders? I posit that explainable speech technologies should be characterized by two properties: (a) Reasonable Understanding of the internal mechanisms of the model and (b) Demonstrable Utility of the model for downstream applications. Using language identification and intent recognition from acoustics as the target applications, I demonstrate (a) how suitable priors can be incorporated into a model and (b) how such an approach leads to strong performance in low-resource and under-resourced scenarios.

Since these linguistic constructs (concepts) are shared across different tasks within and beyond speech processing, solutions designed using De-Entanglement hold promise to be applicable across different tasks. I present experiments to this end in both speech processing and broader Natural Language Processing.

History

Date

2022-04-23

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Alan W Black
