Carnegie Mellon University

Scalable Alignment of Large Language Models Towards Truth Seeking, Complex Reasoning, and Human Values

thesis
posted on 2025-05-02, 18:10, authored by Zhiqing Sun

The exponential advancement in Large Language Models (LLMs) and reasoning-powered AI agents, exemplified by GPT-4 and OpenAI Deep Research, has accelerated the timeline toward Artificial General Intelligence (AGI), with capabilities expanding at an unprecedented rate. As we stand at the threshold of potentially achieving AGI in the near future, the challenge of alignment—ensuring these systems remain truthful, capable of sophisticated reasoning, and aligned with human values—has become increasingly critical.

This thesis proposes novel methodologies for fundamental alignment challenges in systems approaching superhuman capabilities. Extending beyond conventional paradigms such as Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), we develop scalable alignment mechanisms through a Principle-Driven Alignment methodology. Implemented within a Reinforcement Learning from AI Feedback (RLAIF) framework, this approach yields significant improvements in maintaining system reliability as capabilities scale. To mitigate factual inconsistencies in generation, we introduce Recitation Augmentation and Factually Augmented RLHF, which perform robustly on both large language and multimodal models. Our Easy-to-Hard Generalization framework preserves alignment systematically by exploiting the insight that models can evaluate solutions more reliably than they can generate them, enabling supervision of complex reasoning tasks through reward models trained on simpler problems. Finally, we propose Lean-STaR, a framework that improves theorem-proving performance by guiding models to generate informal thoughts before formal proofs, demonstrating that Chain-of-Thought reasoning enhances autonomous decision-making while making the model's reasoning process more transparent.
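The easy-to-hard idea above can be illustrated with a toy sketch: a reward model is fit only on labeled solutions to easy problems, then used to *select* among candidate solutions to a harder, unseen problem (best-of-n selection), rather than generating solutions itself. Everything here — the toy arithmetic problems, the feature function, and the logistic reward model — is a hypothetical stand-in for the thesis's actual setup, not its implementation.

```python
import math
import random

def features(problem, solution):
    # Toy feature: does the proposed solution equal the true sum?
    # (Stands in for the learned representations of a real reward model.)
    return [1.0 if solution == sum(problem) else 0.0, 1.0]

def train_reward_model(easy_data, lr=0.5, epochs=200):
    """Fit a logistic-regression reward model on (problem, solution, label) triples."""
    w = [0.0, 0.0]
    for _ in range(epochs):
        for problem, solution, label in easy_data:
            x = features(problem, solution)
            z = sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - label  # gradient of log-loss w.r.t. z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
    return w

def score(w, problem, solution):
    x = features(problem, solution)
    return sum(wi * xi for wi, xi in zip(w, x))

def best_of_n(w, problem, candidates):
    """Best-of-n selection: the reward model evaluates rather than generates."""
    return max(candidates, key=lambda s: score(w, problem, s))

# "Easy" problems: sums of two small numbers, with correct/incorrect labels.
rng = random.Random(0)
easy = []
for _ in range(50):
    p = [rng.randint(1, 5), rng.randint(1, 5)]
    easy.append((p, sum(p), 1))                       # correct solution
    easy.append((p, sum(p) + rng.randint(1, 3), 0))   # incorrect solution

w = train_reward_model(easy)

# "Hard" problem: a longer sum the reward model never saw in training.
hard = [7, 11, 13, 19]
candidates = [40, sum(hard), 48, 55]
print(best_of_n(w, hard, candidates))  # prints 50, the correct sum
```

The key design point mirrors the generation–evaluation gap: the evaluator only needs to recognize a correct answer, which transfers from easy to hard problems more readily than the ability to produce one.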

This research contributes to a critical area of AI development by establishing rigorous frameworks for maintaining alignment as systems become increasingly capable. Our findings demonstrate the effectiveness of these approaches in creating AI systems that are aligned with fundamental human values while preserving performance reliability. These frameworks provide a foundation for scalable solutions that will shape the future development of advanced AI systems.

History

Date

2025-04-24

Degree Type

  • Dissertation

Department

  • Language Technologies Institute

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Yiming Yang
