Carnegie Mellon University

Computational Techniques for Voice Intelligence: Deducing Psychological Factors from Human Voice

thesis
posted on 2025-04-17, 19:55 authored by Hira Dhamyal

Speech carries information about the speaker’s psychological traits, including their behavioral tendencies, leadership, emotions, and personality. The process of communication of these traits through speech can be thought of as a combination of an encoding process and a decoding process. The speaker, who expresses these traits, encodes them into low-dimensional characteristics of speech signals, and the listener, who perceives them, decodes the speech signal to make inferences about the state of the speaker. The encoding of psychological traits into speech is influenced by multiple factors, such as the context of the utterance, the speaker’s personality traits, the environment, and more. It is well known that psychological traits are encoded in different aspects of speech, i.e. not only in ‘what’ is said but also in ‘how’ it is said. Thus while decoding, the listener must consider not only the linguistic content, but also the acoustic, prosodic, and other non-linguistic cues in the speech.

In this thesis, we develop computational models that attempt to emulate the human decoding process for two types of psychological traits: emotion and personality.

This thesis is accordingly divided into two parts. In the first part, we focus on emotion. We study how emotions are encoded and decoded in speech, and how they are affected by factors such as context, the naturalness of the expression, i.e. whether it is real (spontaneous and involuntary) or enacted (prompted and voluntary), and other nuances of speech production. In our study of emotion and its representation in speech, we also take the lexical and phonotactic content of speech into consideration. A speech signal contains words and phonemes, each of which can be thought of as a stream or modality of information with its own characteristics that encode emotion, including intensity, cadence, rhythm, and more. We hypothesize that in order to decode emotion from speech, computational models must capture the characteristic variations within each modality. In this thesis, we objectively show how important this approach is for the computational analysis of emotion.

In devising better methodologies for emotion detection, we also focus on the intra-emotion range and absolute intensity (or degree) of emotional expression. Humans are able to decode emotions at very fine granularities. However, state-of-the-art techniques for automatic emotion detection (or decoding) work with predefined sets of discrete emotions, extended into a three-dimensional continuous space denoting the valence, arousal, and dominance of each discrete emotion. Any decoding technique that is learned from data is clearly restricted by the discretization of labels assigned to emotions by human annotators. In this thesis, we work on techniques that are agnostic to such restrictions. We hypothesize that for efficient decoding, it would be effective to utilize discrete and continuous information simultaneously, in a hierarchical framework.

Expanding this work, we propose a second approach, which is inspired by the fact that humans inherently use natural language to describe the emotions that they perceive. We contend that while the labels that humans assign to each emotion are restricted by the descriptors available in their language, the diversity of emotions can nevertheless be captured by the flexibility that natural language provides, namely by the affective language that is often casually used to describe an emotion. Such affective language often has measurable acoustic correlates. For example, describing an angry man as ‘shouting loudly’ characterizes the emotion by referring directly to the loudness or intensity of the speech. We show that when the labeling is done with natural language descriptions, guided by acoustic properties of speech, the computational process of emotion decoding is significantly improved. In an extension of this work, we add a learnable ‘prompt’, in the hope that the model automatically incorporates more textual information that helps it decode emotion better.
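As a rough illustration of labeling with natural language guided by acoustic properties, the sketch below turns a discrete emotion label and two acoustic measurements into a short description. It is not the method used in the thesis: the function name `describe_emotion`, the sentence template, and both thresholds are hypothetical choices made only for this example.

```python
def describe_emotion(label, loudness_db, pitch_hz,
                     loud_threshold_db=-15.0, high_pitch_hz=220.0):
    """Build a natural-language emotion description from acoustic cues.

    The thresholds and the sentence template are illustrative assumptions,
    not values taken from the thesis.
    """
    # Map each acoustic measurement to a coarse verbal descriptor.
    loudness_word = "loudly" if loudness_db >= loud_threshold_db else "softly"
    pitch_word = "high-pitched" if pitch_hz >= high_pitch_hz else "low-pitched"

    # Choose the indefinite article that matches the emotion label.
    article = "an" if label[:1].lower() in "aeiou" else "a"

    return (f"{article} {label} person speaking {loudness_word} "
            f"in a {pitch_word} voice")


print(describe_emotion("angry", loudness_db=-9.0, pitch_hz=250.0))
print(describe_emotion("sad", loudness_db=-30.0, pitch_hz=120.0))
```

Descriptions of this kind could then be paired with the audio in place of a bare class label; how the text and audio are actually combined is left open here.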

In the second part of this thesis, we focus on personality. We study how personality is encoded in speech. We specifically explore utterance-level voice signal characteristics such as pitch and loudness, along with many other utterance-level voice quality features that have been observed in the scientific literature to correlate with personality traits.
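To make the idea of utterance-level characteristics concrete, here is a minimal sketch that computes one loudness value and one pitch value for a whole utterance using only NumPy. The function name `utterance_features`, the RMS-in-decibels loudness measure, the autocorrelation pitch estimator, and the 75 to 400 Hz search range are all assumptions for illustration; the thesis may use different features and tools.

```python
import numpy as np


def utterance_features(signal, sample_rate, fmin=75.0, fmax=400.0):
    """Return rough utterance-level loudness (dB) and pitch (Hz).

    Loudness is the RMS energy in decibels relative to full scale; pitch is
    estimated from the strongest autocorrelation peak in [fmin, fmax].
    Both are simplified stand-ins chosen for this sketch.
    """
    signal = np.asarray(signal, dtype=np.float64)
    signal = signal - signal.mean()  # remove any DC offset

    # Loudness: root-mean-square amplitude on a decibel scale.
    rms = np.sqrt(np.mean(signal ** 2))
    loudness_db = 20.0 * np.log10(rms + 1e-12)

    # Pitch: lag of the largest autocorrelation value within the search range.
    corr = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    best_lag = lag_min + int(np.argmax(corr[lag_min:lag_max + 1]))
    pitch_hz = sample_rate / best_lag

    return {"loudness_db": loudness_db, "pitch_hz": pitch_hz}


# Synthetic test tone standing in for a voiced utterance (not real speech).
sr = 16000
t = np.arange(int(0.1 * sr)) / sr
tone = 0.5 * np.sin(2 * np.pi * 200.0 * t)
print(utterance_features(tone, sr))
```

A single estimate over the whole signal is only reasonable for short, steadily voiced segments; real utterances would normally be analyzed frame by frame and then summarized.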

In order to better understand personality decoding, we revisit the widely accepted OCEAN traits. The OCEAN traits form the five bases along which every individual is rated. They are the result of the psycholexical hypothesis, which assumes that the language and words we use to describe people reveal the underlying bases of personality. We hypothesize that these bases, or possibly different ones, would reveal themselves when the same language and words are analyzed with newer techniques, i.e. word representations learned from large language models. In this work, we show that the newer techniques reveal two as the most informative number of bases and five as the next most informative, and that these five are, on average, aligned with the OCEAN traits.
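One simple way to ask how many bases a set of word representations supports is to look at how much variance each principal direction explains. The sketch below does this with a plain singular value decomposition. It is a generic stand-in, not the analysis performed in the thesis: the function name `explained_variance_ratio` is hypothetical, and the embeddings are random synthetic data generated only so the example runs.

```python
import numpy as np


def explained_variance_ratio(embeddings):
    """Fraction of total variance captured by each principal direction."""
    x = np.asarray(embeddings, dtype=np.float64)
    x = x - x.mean(axis=0, keepdims=True)  # center each dimension
    singular_values = np.linalg.svd(x, compute_uv=False)
    variance = singular_values ** 2
    return variance / variance.sum()


# Synthetic placeholder "embeddings": 200 descriptor words in 16 dimensions,
# generated from 2 latent factors plus a little noise (not real data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 16))
embeddings = latent @ mixing + 0.01 * rng.normal(size=(200, 16))

ratio = explained_variance_ratio(embeddings)
print(np.round(ratio[:5], 4))
```

With real embeddings of personality descriptors, a sharp drop in this ratio after the first few components would suggest a small number of dominant bases; other criteria for choosing the number of bases are equally possible.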

In summary, this work fills the gap in our current understanding of computational ways to process psychological traits from speech signals. The potential uses of such technologies in applications that involve human-computer interaction are many. Such technologies can also aid in the assessment and monitoring of mental illnesses and psychological problems in humans, helping all involved (healthcare providers and affected people) in positive ways. They can also help predict the long-term susceptibility of individuals to specific types of work, social situations, and other factors. Looking into the future, we believe this work is important in the inevitable rise of speech-based Artificial Intelligence systems, which will carry their own emotions and personalities and also understand those of their human users.

History

Date

2025-01-15

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

  • Rita Singh
  • Bhiksha Raj
