Carnegie Mellon University

Audio Deepfake Detection Based on Differences in Human and Machine Generated Speech

posted on 2023-01-23, 20:54 authored by Yang Gao

Recent advances in deep learning have unfortunately improved the quality of Deepfakes – entirely synthetic multimedia that provide a pernicious means of committing a wide variety of fraudulent activities such as identity theft and spreading misinformation. Deepfake – a portmanteau of “deep learning” and “fake” – is a term that refers to fake media content that is generated or manipulated using deep learning and machine learning algorithms, with the intent to deceive the observer into accepting it as a representation of reality. Given the growing evidence of their misuse and their capacity to dilute authentic information, it is of utmost importance that we work towards developing automated systems that can reliably detect deepfakes, so that they can be taken out of circulation before any damage is done.

The term Deepfake includes the synthesis of fake data in all four digital modalities: audio, video, images and text. Deepfakes involving audio, specifically human speech, can be particularly dangerous because of the extensive biometric usage of speech in the world today. Speech systems, especially speaker identification and verification systems, are used to enhance the security of online and telephone-based access control to banking and many other portals. These systems can be attacked using spoofed audio. The related techniques include audio replay, synthetic speech generation and voice conversion, among others. Many of these techniques can be viewed as variants of voice disguise, meant to conceal the speaker’s identity by impersonating the target or appearing to be a different person. Such disguise-based techniques are also often encountered in voice-based crimes such as vishing, attempts to break into voice authentication systems and fraudulent calls. Voice disguise in itself poses a great threat to automated voice biometric systems, and creates a difficult challenge for forensic speech analysis. Deepfakes exacerbate these problems by enhancing the effectiveness of voice disguise. With the advancement of deep learning techniques, especially generative models such as generative adversarial networks and WaveNet models, the quality of synthetic speech is steadily becoming closer to that of natural speech.

The objective of this dissertation is to develop robust deepfake-speech detection algorithms that can capture the fundamental differences between fake and genuine speech, i.e., between machine-generated and human-generated speech. The algorithms developed must be trainable with limited training data and be adaptable to the latest generation techniques as they are introduced. To achieve this goal, we divide our research into two main tasks, each geared towards answering two fundamental questions as follows: 

1. Section I: What are the aspects of human speech that deepfake generation techniques cannot reproduce? 

  • Part 1: Unique to humans: What are the characteristics of human speech that are most indicative of the underlying bio-mechanical process of speech production in humans? 
  • Part 2: Unique to machines: What are the characteristics of machine-generated speech that are most indicative of the underlying algorithmic (computational) process of speech generation?

2. Section II: What kinds and categories of models and features are likely to be most adaptable to different deepfake generation mechanisms? 

  • Part 3: Robustness of detectors: How can we find and use the features that are specific to the human speech production process and least reproducible by machines to build robust deepfake detection techniques? 
  • Part 4: Adaptability of detectors: How can we develop deepfake detection techniques that can be rapidly adapted to new attacks? 

Accordingly, this dissertation has two sections and four parts as mentioned above. The first part (Chapter 2) addresses human-generated speech and tries to identify its unique characteristics from a voice-production perspective. The goal is to understand human voice production so that we can later identify the most human-centric features that are not included in the deepfake generative processes. The second part (Chapter 3) discusses machine speech generation mechanisms so that we can later find signatures that are unique to machine-generated speech. For this, we study the mechanisms of both state-of-the-art voice conversion/transformation systems and voice synthesis (text-to-speech) systems.

The results of the two sets of studies are finally combined to identify the aspects of machine-generated speech that are simply not consistent with the counterparts expected in human speech. Then, they are used in developing features and models for deepfake detection – a topic addressed by the last parts of this dissertation. 

The third and fourth parts are in Chapter 4 of this dissertation, which deals with developing features and models based on the observations from the first and second parts. Several features are proposed to improve the robustness of detectors for deepfake detection, and the adaptability of these detectors to newer, unseen attacks is also studied and discussed. In Chapter 5, we summarize the findings and contributions of this dissertation and discuss future directions for audio deepfake detection.




Degree Type

  • Dissertation


Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)


Advisor(s)

Rita Singh, Bhiksha Raj