Post-Silicon Bug Localization
for Processors Using IFRA
Abstract
IFRA, an acronym for Instruction Footprint Recording and
Analysis, overcomes major challenges associated with a very
expensive step in post-silicon validation of processors—
pinpointing a bug location and the instruction sequence
that exposes the bug from a system failure, such as a crash.
Special on-chip recorders, inserted in a processor during
design, collect instruction footprints—special information
about flows of instructions, and what the instructions did
as they passed through various microarchitectural blocks
of the processor. The recording is done concurrently during the normal operation of the processor in a post-silicon
system validation setup. Upon detection of a system failure,
the recorded information is scanned out and analyzed off-line for bug localization. Special self-consistency-based program analysis techniques, together with the test-program
binary of the application executed during post-silicon validation, are used for this purpose. Major benefits of using IFRA
over traditional techniques for post-silicon bug localization
are (1) it does not require full system-level reproduction of
bugs, and (2) it does not require full system-level simulation.
Hence, it can overcome major hurdles that limit the scalability of traditional post-silicon validation methodologies.
Simulation results on a complex superscalar processor demonstrate that IFRA is effective in accurately localizing electrical bugs with 1% chip-level area impact.
1. Introduction
Post-silicon validation involves operating one or more
manufactured chips in actual application environments
to validate correct behavior across specified operating
conditions. According to recent industry reports,
post-silicon validation is becoming increasingly expensive. Intel reported a headcount ratio of 3:1 for design vs.
post-silicon validation.19 According to Abramovici et al.,1
post-silicon validation may consume 35% of average chip
development time. Yerramilli25 observes that post-silicon
validation costs are rising faster than design costs.
Loosely speaking, there are two types of bugs that design
and validation engineers worry about:
1. Bugs caused by interactions between the design
and physical effects, also called electrical bugs.10
Such bugs generally manifest themselves only under
certain operating conditions (temperature, voltage,
frequency). Examples include setup and hold time
problems.
2. Functional bugs, also called logic bugs, caused by design
errors.
106 Communications of the ACM | February 2010 | Vol. 53 | No. 2
Post-silicon validation involves four steps:
1. Detecting a problem by running a test program, such
as OS, games, or functional tests, until a system failure
occurs (e.g., system crash, segmentation fault, or
exceptions).
2. Localizing the problem to a small region from the system failure, e.g., a bug in an adder inside an ALU of a
complex processor. The stimulus that exposes the bug,
e.g., the particular 10 lines of code from some application, is also important.
3. Identifying the root cause of the problem. For example,
an electrical bug may be caused by power-supply noise
slowing down a circuit path resulting in an error at the
adder output.
4. Fixing or bypassing the problem by microcode patching,7
circuit editing,11 or, as a last resort, respinning using a new
mask.
Josephson9 points out that the second step, bug localization, dominates post-silicon validation effort and costs. Two
major factors that contribute to the high cost of traditional
post-silicon bug localization approaches are:
1. Failure reproduction, which involves returning the chip
to an error-free state and re-executing the failure-causing stimulus (including the test-program segment,
interrupts, and operating conditions) to reproduce the
same failure. Unfortunately, many electrical bugs are
hard to reproduce. The difficulty of bug reproduction
is exacerbated by the presence of asynchronous I/Os
and multiple clock domains.
2. System-level simulation for obtaining golden responses, i.e., correct signal values for every clock cycle
for the entire system (i.e., the chip and all the peripheral devices on the board) to compare against the
signal values produced by the chip being validated.
Running system-level simulation is typically 7–8 orders
of magnitude slower than actual silicon.
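To give a sense of what a slowdown of 7 to 8 orders of magnitude means in practice, the following minimal sketch converts one second of execution on silicon into simulation wall-clock time. The function name simulation_days and the one-second workload are illustrative assumptions, not figures from the cited reports; only the 7 to 8 orders-of-magnitude range comes from the text above.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def simulation_days(silicon_seconds: float, slowdown_orders: int) -> float:
    """Wall-clock days needed to simulate `silicon_seconds` of real execution,
    assuming simulation is 10**slowdown_orders times slower than silicon."""
    return silicon_seconds * 10 ** slowdown_orders / SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical workload: one second of execution on the real chip.
    for orders in (7, 8):
        days = simulation_days(1.0, orders)
        print(f"10^{orders} slowdown: about {days:,.0f} days per silicon second")
```

Under these assumptions, a single second of silicon execution corresponds to roughly 116 days of simulation at 7 orders of magnitude and roughly 1,157 days (a little over 3 years) at 8 orders of magnitude, which is why obtaining golden responses for long test programs is impractical.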
Due to these factors, localizing a functional bug typically takes hours
to days, whereas localizing an electrical bug takes days
to weeks and requires more expensive equipment.10
A previous version of this paper appeared in the
Proceedings of the 45th ACM-IEEE Design Automation
Conference (2008, Anaheim, CA).