Carnegie Mellon University

Exploration of Efficient Pre-training Techniques for Speech and Audio Large Language Models

thesis
posted on 2025-08-29, 20:40 authored by Eman Ansar
<p dir="ltr">Speech recognition has emerged as a cornerstone technology in the era of artificial intelligence, enabling human-computer interaction with remarkable precision. Recent advances in speech language models (LLMs), such as DeepSpeech [4] and Wav2Vec [10], have significantly enhanced applications across domains. However, the performance of these systems depends critically on pretraining methodologies and tokenization techniques, which remain underexplored compared to advances in natural language processing (NLP) with models like BERT [2] and GPT [9]. </p><p dir="ltr">This thesis addresses these challenges through a multi-phased approach. Initial work investigated the impact of tokenization techniques on speech LLMs, inspired by the success of tokenization in other sequence modeling tasks, but computational limitations constrained model training. Subsequent efforts shifted toward mitigating hallucination in LLMs by leveraging synthetic datasets and self-correcting mechanisms, based on an error-correcting data approach [11], though the complexity of data generation hindered scalability. The current focus explores innovative pretraining architectures, motivated by the limitations of autoencoders (AEs) in effectively compressing high-dimensional speech data. To assess the potential of Deep Compression Autoencoders (DC-AEs) for speech, we adapted the DC-AE architecture to LibriSpeech spectrograms (128×128 frames), achieving a compact 16×16 latent representation. Although computational limits prevented full training, reconstruction metrics (PSNR ≈ 28 dB) and subjective inspection indicated substantial preservation of crucial speech features such as phonemes and pitch contours, suggesting effective compression of speech spectrograms. </p><p dir="ltr">Building on insights from "Deep Compression Autoencoders" [5], this research evaluates the potential of Variational Autoencoders (VAEs) [7] and their latent space properties for robust pretraining. 
Furthermore, recognizing the scarcity of real-world labeled data for training models to handle speech errors, we generated a synthetic dataset of 10,000 samples. The dataset was created by introducing rule-based phoneme perturbations into clean TIMIT transcriptions, mapping them to spectrograms, and labeling each sample by the severity of the induced errors (minor, moderate, severe). A 3-class classifier trained on this synthetic data achieved a promising 78% accuracy on a held-out set in distinguishing these error levels, demonstrating the viability of synthetically generated data for error-corrected supervision in speech models. Preliminary tests on Gaussian mixture data revealed promising latent space representations with VAEs compared to AEs and Denoising Autoencoders (DAEs) [12]. By introducing noise manually or through VAE-based mechanisms, this thesis aims to address deficiencies in high-density data regions and enhance the generalization capabilities of speech recognition models. </p><p dir="ltr">Through these advancements, this research aspires to refine the efficiency and accuracy of speech LLMs, contributing to the evolution of speech recognition systems and enabling more sophisticated applications in human-computer interaction.</p>
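The 128×128 → 16×16 compression described in the abstract corresponds to an 8× spatial downsampling in each axis (a 64× reduction in grid size). The sketch below illustrates only that shape arithmetic and the PSNR metric using a trivial block-average "encoder" and nearest-neighbour "decoder" on a toy spectrogram; it is not the trained DC-AE, and all function names are illustrative.

```python
import numpy as np

def block_downsample(spec: np.ndarray, factor: int = 8) -> np.ndarray:
    """Average-pool a (H, W) spectrogram by `factor` along each axis."""
    h, w = spec.shape
    return spec.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def upsample(latent: np.ndarray, factor: int = 8) -> np.ndarray:
    """Nearest-neighbour upsampling back to the original resolution."""
    return np.repeat(np.repeat(latent, factor, axis=0), factor, axis=1)

def psnr(ref: np.ndarray, est: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB, as used to score reconstructions."""
    mse = np.mean((ref - est) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

# Toy smooth "spectrogram" standing in for a LibriSpeech log-mel frame block.
spec = np.abs(np.sin(np.linspace(0, 8, 128))[:, None]
              * np.cos(np.linspace(0, 4, 128))[None, :])

latent = block_downsample(spec)   # (16, 16) latent grid, 64x fewer cells
recon = upsample(latent)          # (128, 128) reconstruction
print(latent.shape, recon.shape, round(psnr(spec, recon), 1))
```

A trained DC-AE replaces the fixed pooling with learned strided convolutions, so its reconstructions (the abstract reports PSNR ≈ 28 dB) are far better than this pooling baseline.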
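The synthetic error dataset is built by rule-based phoneme perturbation with severity labels. A minimal sketch of that generation step is below; the confusion table, severity-to-error-count mapping, and toy transcription are illustrative assumptions, not the thesis's actual TIMIT rules.

```python
import random

# Hypothetical voicing-confusion pairs (illustrative, not the full TIMIT rule set).
CONFUSIONS = {"p": "b", "t": "d", "k": "g", "s": "z", "f": "v"}

def perturb(phones: list[str], n_errors: int, rng: random.Random) -> list[str]:
    """Substitute up to n_errors confusable phonemes using the table above."""
    out = list(phones)
    idxs = [i for i, p in enumerate(out) if p in CONFUSIONS]
    rng.shuffle(idxs)
    for i in idxs[:n_errors]:
        out[i] = CONFUSIONS[out[i]]
    return out

def make_sample(phones: list[str], rng: random.Random) -> tuple[list[str], str]:
    """Draw a severity class, inject that many errors, return (phones, label)."""
    label = rng.choice(["minor", "moderate", "severe"])
    n_errors = {"minor": 1, "moderate": 2, "severe": 4}[label]  # assumed mapping
    return perturb(phones, n_errors, rng), label

rng = random.Random(0)
clean = ["s", "p", "iy", "t", "sh"]  # toy phoneme transcription
sample, label = make_sample(clean, rng)
print(sample, label)
```

In the thesis pipeline, each perturbed transcription is additionally mapped to a spectrogram before the 3-class severity classifier is trained on it.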

History

Date

2025-05-11

Advisor(s)

Bhiksha Raj

Academic Program

  • Computer Science
