Towards Effective and Efficient Open Speech Foundation Models
Speech is a key modality for human-computer interaction, enabling a wide range of speech processing applications. Traditionally, these applications have relied on separate models for each task, limiting scalability and impeding cross-task knowledge sharing. Recently, speech foundation models (SFMs) have emerged as a unifying framework for speech-related tasks. We define SFMs as models trained on broad data that can be adapted to various speech processing tasks or languages. Common types of SFMs include self-supervised learning (SSL) speech representation models, task-specific SFMs, and general instruction-following SFMs. This thesis primarily focuses on the latter two categories, as SSL models cannot perform downstream tasks directly; they typically serve as feature extractors or tokenizers and require additional fine-tuning.
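For concreteness, the sketch below shows the typical SSL usage pattern mentioned above: a pretrained model acts as a frozen feature extractor whose outputs would feed a downstream task head. The specific checkpoint (facebook/wav2vec2-base) and the HuggingFace Transformers interface are illustrative choices, not components of this thesis.

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Illustrative example: a pretrained SSL model used purely as a frozen
# feature extractor (wav2vec 2.0 here; other SSL checkpoints work similarly).
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

waveform = torch.randn(16000)  # one second of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    # (1, num_frames, 768) frame-level representations; a downstream task
    # head would be fine-tuned on top of these features.
    features = model(**inputs).last_hidden_state
```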
Despite their impressive performance, most existing SFMs—primarily developed by large corporations—lack openness and reproducibility, hindering scientific transparency and slowing broader research progress. This thesis addresses these limitations by developing SFMs with full transparency, architectural innovations, and improved efficiency.
We begin with task-specific SFMs trained via large-scale supervised learning, following the paradigm of models like Whisper. We present the Open Whisper-style Speech Models (OWSM), a reproducible framework for large-scale speech model training using only publicly available data and open-source toolkits. To improve speech modeling capabilities, we propose novel encoder architectures, including Branchformer and E-Branchformer, which achieve state-of-the-art results across diverse speech tasks. Through architectural improvements, data scaling, and systematic data cleaning, the OWSM models match or exceed the performance of leading proprietary systems on several benchmarks.
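To illustrate the two-branch design at a high level, here is a minimal PyTorch sketch of a Branchformer-style block: a self-attention branch and a convolutional gating MLP (cgMLP) branch run in parallel and are merged by concatenation and a linear projection. Module sizes and details are simplified assumptions rather than the actual OWSM/ESPnet implementation; E-Branchformer further enhances the merge module (e.g., with a depthwise convolution) and adds feed-forward layers.

```python
import torch
import torch.nn as nn

class ConvGatingMLP(nn.Module):
    """cgMLP branch: channel projection with a convolutional spatial gate."""
    def __init__(self, d_model: int, d_hidden: int, kernel_size: int = 31):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden * 2)
        self.norm = nn.LayerNorm(d_hidden)
        # Depthwise convolution over time supplies local context for gating.
        self.conv = nn.Conv1d(d_hidden, d_hidden, kernel_size,
                              padding=kernel_size // 2, groups=d_hidden)
        self.down = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, T, D)
        a, b = self.up(x).chunk(2, dim=-1)
        gate = self.conv(self.norm(b).transpose(1, 2)).transpose(1, 2)
        return self.down(a * gate)

class BranchformerBlock(nn.Module):
    """Parallel self-attention and cgMLP branches, merged by concat + linear."""
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_cgmlp = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cgmlp = ConvGatingMLP(d_model, d_hidden=4 * d_model)
        self.merge = nn.Linear(2 * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, T, D)
        h = self.norm_attn(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        cgmlp_out = self.cgmlp(self.norm_cgmlp(x))
        # Residual connection around the merged two-branch output.
        return x + self.merge(torch.cat([attn_out, cgmlp_out], dim=-1))
```

A full encoder stacks many such blocks; the parallel structure lets the attention branch capture global dependencies while the cgMLP branch models local patterns.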
Building on this foundation, we extend to instruction-following SFMs capable of handling unseen tasks via natural language prompts. We introduce VoiceTextBlender, a spoken language model (SLM) that integrates speech and language understanding through a novel single-stage, joint speech-text supervised fine-tuning approach.
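To make the training recipe concrete, the heavily simplified sketch below shows one joint speech-text supervised fine-tuning step: speech-conditioned and text-only examples are optimized with the same next-token prediction loss in a single stage. The `slm` interface and batch field names are hypothetical placeholders, not the actual VoiceTextBlender code.

```python
import torch.nn.functional as F

def joint_sft_step(slm, batch, optimizer):
    """One update on a mixed batch of speech-conditioned and text-only examples.

    `slm` is a hypothetical wrapper that embeds an optional audio segment,
    prepends it to the text prompt, and returns next-token logits.
    """
    logits = slm(audio=batch.get("audio"), input_ids=batch["input_ids"])
    # Standard causal LM loss; prompt positions carry -100 labels so only
    # response tokens contribute to the loss.
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```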
To address the computational demands of SFMs, we further propose model compression techniques ranging from static pruning to dynamic architectures, significantly reducing inference costs while maintaining high performance.
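As a simple example of the static end of this spectrum, the sketch below applies generic L1 magnitude pruning to every linear layer using PyTorch's built-in pruning utilities. This is a standard baseline shown only for illustration, not the specific compression techniques developed in the thesis.

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

def magnitude_prune(model: nn.Module, amount: float = 0.5) -> nn.Module:
    """Zero out the smallest-magnitude weights in each Linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=amount)
            prune.remove(module, "weight")  # bake the sparsity into the weights
    return model
```

In contrast, dynamic approaches adjust the executed architecture at inference time rather than committing to a single fixed sparse network.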
Finally, this thesis offers new insights into the behavior of SFMs, including emergent capabilities and scaling trends, contributing to a deeper understanding of their design and potential. By tackling the key challenges of openness, efficiency, and architecture, this work aims to democratize access to advanced speech technologies and catalyze further innovation in the field.
History
Date
- 2025-04-21
Degree Type
- Dissertation
Thesis Department
- Electrical and Computer Engineering
Degree Name
- Doctor of Philosophy (PhD)