Carnegie Mellon University

Learning from Failures - Methods for Improving Quality and Analysis in Test

Thesis posted on 2025-11-12, 21:19, authored by Christopher Nigh
<p dir="ltr">Technological advancements are constantly occurring in every corner of the integrated circuit world. Semiconductor manufacturing technologies continually drive toward smaller physical features and place those features more densely in integrated circuit designs, in turn exposing new and unexpected electrical phenomena. Circuit design continually strives for performance optimization, often by pushing chips to operate as close to their limits as possible. The end application for many produced chips has entered a new paradigm, where massive data centers operate as centralized compute factories, combining hundreds of thousands of chips to serve diverse workloads. The equipment and software tooling that supports the industry is itself continually progressing, increasing in complexity and requiring greater expertise to use effectively. With all of these advancements, the need to meaningfully connect the dots between the related and diverse domains of the semiconductor ecosystem grows all the more important, and all the more challenging.</p><p dir="ltr">In many respects, the field of test sits at the crossroads of these domains: connecting a virtual design to its physical implementation, logical models to electrical realizations, software tools to hardware chips. Its success depends on the ability to connect the dots between them, and as a result, test often bears the brunt of their increasing complexity. Advances in manufacturing and design lead to new physical defects that produce new circuit failure behaviors, while reduced margins raise the likelihood of those failures, all of which test is responsible for detecting. The growth of data center applications has increased the impact, visibility, and measurability of test escapes, shining a bright light on the weaknesses of current test methodology and applying significant pressure for improvement.
The complexity of roles in test all but requires specialization, a significant obstacle in a field that relies on connection and collaboration. This dissertation recognizes these challenges and, in the face of them, provides methods for meaningfully connecting the dots to aid two of the central objectives of test: detecting chips that fail, and learning from their failures. </p><p dir="ltr">When a defective chip passes through production tests only to fail in its end application, the test methodology itself has failed at the first of those objectives. Such failings have been highlighted increasingly in recent hyperscale data center reports identifying the impactful failures known as silent data corruptions. To help address these challenges, Pseudo-Exhaustive Physically-Aware Physical Region (PEPR) testing is developed, which connects the physical domain of circuit layout with the logical domain of test generation to exhaustively test the physical regions where defects may manifest. PEPR takes the stances that i) defects manifest physically within a circuit, and ii) defect behavior may be unknown; it therefore adopts a methodology that i) identifies regions of physically co-located signals that may influence and/or be influenced by a defect, and ii) exhaustively tests those signals to account for all possible defect behaviors. Experiments connecting circuit fault simulation with real silicon failure data demonstrate the efficacy of PEPR in accurately accounting for defect behavior, particularly in comparison to commonly used fault models. While conventional approaches to test pattern generation applied to PEPR may be costly, connecting test generation methodologies to modern distributed compute environments may reduce cost and improve feasibility.
</p><p dir="ltr">In pursuit of the second objective, a chip that fails a test is an opportunity for learning: about defects from the process, about weaknesses of the design, and/or about deficiencies of the test. Engineering experiments drive this learning, but they require collaboration between engineers with distinct specialties in the pre-silicon and post-silicon domains, both of which face increasing complexity and volume of work. To address complexity, a framework is developed that abstracts the steep learning curve of automated test equipment and electronic design automation testing tools behind familiar Python-based scripting. The framework also links the two domains, directly connecting data collection to data analysis and enabling use by non-experts. To address the volume of work, methods developed in this framework are deployed through Automated, on-ATE AI, a rule-based expert system for test failure debug that enables faster experiment execution, iteration, and information extraction with minimal human involvement. We hope that the proposed methods enable expert engineers to work across domains more easily to develop intuition, draw conclusions, and, most importantly, improve communication and collaboration with their colleagues and counterparts. </p><p dir="ltr">Test serves as a crucial point of connection between many domains in the field of semiconductors. With the increasing weight of domain-specific technological advancements, such connecting points are susceptible to strain, and if not sufficiently supported, risk breakage. Ultimately, the work described in this dissertation aims to bolster methods in test by leaning into its connective nature: viewing it not as a point of vulnerability, but rather as a source of strength and opportunity.</p>
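The pseudo-exhaustive idea behind PEPR can be illustrated with a minimal sketch: rather than applying all input combinations to an entire circuit, apply every combination only within each physical region of co-located signals, so that any defect behavior confined to a region is exercised. The region contents, signal names, and sizes below are purely illustrative, not taken from the dissertation's experiments.

```python
# Illustrative sketch of pseudo-exhaustive testing over physical regions.
# Signal names and region groupings are hypothetical.

from itertools import product

def exhaustive_patterns(region_signals):
    """Yield every 0/1 assignment to the signals in one physical region."""
    for bits in product([0, 1], repeat=len(region_signals)):
        yield dict(zip(region_signals, bits))

# Two hypothetical regions of physically co-located nets.
regions = [["net_a", "net_b"], ["net_c", "net_d", "net_e"]]

total = sum(len(list(exhaustive_patterns(r))) for r in regions)
print(total)  # 2^2 + 2^3 = 12 region-local patterns, versus 2^5 = 32 for all five signals at once
```

The cost saving is the point: exhaustiveness is paid per region (2^k for a k-signal region) rather than over the whole design, which is what makes accounting for unknown defect behaviors tractable.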
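The rule-based expert-system flow for failure debug can likewise be sketched in the Python-scripting style the framework targets. Everything here — the `FailureRecord` fields, the rules, and the suggested actions — is a hypothetical stand-in, not the actual Automated, on-ATE AI rule set.

```python
# Hypothetical sketch of a rule-based debug loop: each rule pairs a condition
# on a failure record with a suggested next experiment. All names and rules
# are illustrative.

from dataclasses import dataclass, field
from typing import Callable

@dataclass
class FailureRecord:
    """Minimal stand-in for data collected from one failing test."""
    test_name: str
    failing_cycles: list[int]
    vdd_margin_mv: int              # supply-voltage margin at failure
    notes: list[str] = field(default_factory=list)

# A rule is a (condition, action) pair.
Rule = tuple[Callable[[FailureRecord], bool], str]

RULES: list[Rule] = [
    (lambda r: r.vdd_margin_mv < 50, "rerun at nominal VDD to check marginality"),
    (lambda r: len(r.failing_cycles) == 1, "collect scan dump at the single failing cycle"),
    (lambda r: len(r.failing_cycles) > 100, "bin as gross failure; skip per-cycle diagnosis"),
]

def recommend(record: FailureRecord) -> list[str]:
    """Fire every matching rule and return the suggested experiments."""
    actions = [action for cond, action in RULES if cond(record)]
    record.notes.extend(actions)
    return actions

fail = FailureRecord("scan_stuck_at", failing_cycles=[1042], vdd_margin_mv=30)
print(recommend(fail))
# → ['rerun at nominal VDD to check marginality',
#    'collect scan dump at the single failing cycle']
```

Because each rule encodes one piece of debug intuition, the loop of execute, evaluate, and re-queue experiments can iterate on the tester with minimal human involvement, which is the scaling benefit the abstract describes.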

History

Date

2025-09-23

Degree Type

  • Dissertation

Thesis Department

  • Electrical and Computer Engineering

Degree Name

  • Doctor of Philosophy (PhD)

Advisor(s)

Shawn Blanton
