units), there are 200 different microarchitectural blocks
(excluding array structures and arithmetic units since
errors inside those structures are immediately detected
and localized using parity and/or residue codes, as discussed in Section 2.2). Each block has an average size
equivalent to 10K 2-input NAND gates. Seven benchmarks from SPECint2000 (bzip2, gcc, gap, gzip, mcf,
parser, vortex) were chosen as validation test programs
as they represent a variety of workloads. Each recorder
was sized to have 1024 entries.
All bugs were modeled as single bit-flips at flip-flops to
target hard-to-repeat electrical bugs. This is an effective
model because electrical bugs eventually manifest themselves as incorrect values arriving at flip-flops for certain
input combinations and operating conditions [15].
Each error was injected into one of 1191 flip-flops [Park and
Mitra17]. No errors were injected inside array structures
since they have built-in parity for error detection.
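The single bit-flip model described above can be illustrated with a minimal sketch. It assumes the flip-flop state is represented as a plain list of 0/1 values; the function name `inject_bit_flip`, the fixed random seed, and the randomly generated state are hypothetical and serve only as illustration, not as the injection infrastructure used in the experiments.

```python
import random

NUM_FLIP_FLOPS = 1191  # number of candidate flip-flops stated in the text


def inject_bit_flip(flip_flops, index):
    """Return a copy of the flip-flop state with the bit at `index` inverted."""
    flipped = list(flip_flops)
    flipped[index] ^= 1  # single bit-flip: 0 -> 1 or 1 -> 0
    return flipped


# Hypothetical setup: a random state vector and a randomly chosen target.
rng = random.Random(0)  # fixed seed (assumption) so the run is repeatable
state = [rng.randint(0, 1) for _ in range(NUM_FLIP_FLOPS)]
target = rng.randrange(NUM_FLIP_FLOPS)

faulty = inject_bit_flip(state, target)

# Exactly one flip-flop differs between the fault-free and faulty states.
differing = [i for i, (a, b) in enumerate(zip(state, faulty)) if a != b]
```

Only one flip-flop value changes per run, which matches the single bit-flip assumption; array structures are deliberately left out of the candidate set.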
Upon error injection, the following scenarios are
possible:
1. The error vanishes without any effect at the system
level or produces an incorrect program output without any post-trigger firing. This case is related to the
coverage of validation test programs and post-triggers,
and is not the focus of this paper.
2. Failure manifestation with short error latency, where
recorders successfully capture the history from error
injection to failure manifestation (including situations
where recording is stopped/paused upon activation of
soft post-triggers).
3. Failure manifestation with long error latency, where
1024-entry recorders fail to capture the history from
error injection to failure (including soft triggers).
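The three scenarios above can be sketched as a simple classification. This is a simplification under stated assumptions: the error latency is expressed as a single count of recorder entries written between error injection and the point where recording stopped, and that count is compared against the 1024-entry capacity. The names `Outcome`, `classify_run`, and `entries_since_injection` are hypothetical.

```python
from enum import Enum

RECORDER_ENTRIES = 1024  # recorder size stated in the text


class Outcome(Enum):
    NO_TRIGGER = 1     # Case 1: error vanishes, or wrong output with no post-trigger
    SHORT_LATENCY = 2  # Case 2: recorders still hold the history since injection
    LONG_LATENCY = 3   # Case 3: history since injection was overwritten


def classify_run(post_trigger_fired, entries_since_injection):
    """Classify one error injection run into Case 1, 2, or 3.

    `entries_since_injection` is an assumed measure: the number of recorder
    entries written between error injection and the end of recording.
    """
    if not post_trigger_fired:
        return Outcome.NO_TRIGGER
    if entries_since_injection <= RECORDER_ENTRIES:
        return Outcome.SHORT_LATENCY
    return Outcome.LONG_LATENCY
```

For example, `classify_run(True, 300)` yields `Outcome.SHORT_LATENCY`, while `classify_run(True, 5000)` yields `Outcome.LONG_LATENCY`.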
Out of 100,000 error injection runs, 800
resulted in Case 2 or Case 3. Figure 9 presents results from
these two cases. The “exactly located” category represents
the cases in which IFRA returned a single and correct
location–time pair (as defined in Section 1). The “candidate
located” category represents the cases in which
IFRA returned multiple location–time pairs (called candidates) out of over 200,000 possible pairs (1 out of 200
microarchitectural blocks and 1 out of 1,000 cycles), and
at least 1 pair was fully correct in both location and in
time. The “completely missed” category represents the
cases where none of the returned pairs were correct, even
if either the location or the time alone was correct. In addition, we pessimistically report all errors that resulted in Case 3 as
“completely missed.” All error injections were performed
after a million cycles from the beginning of the program
in order to demonstrate that there is no need to keep
track of footprints from the beginning.
Figure 9. IFRA bug localization summary: correct localization (96%), exact localization (78%), avg. 6 candidates out of 200,000 (22%), complete miss (4%).
112 Communications of the ACM | February 2010 | Vol. 53 | No. 2
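The three result categories defined above can be sketched as follows. The sketch assumes each location–time pair is a `(block, cycle)` tuple and that the returned candidates form a set; the function name `classify_localization` and the `case3` flag are hypothetical. The search-space size follows the arithmetic given in the text (200 blocks times 1,000 cycles).

```python
SEARCH_SPACE = 200 * 1000  # 200 blocks x 1,000 cycles = 200,000 possible pairs


def classify_localization(candidates, true_pair, case3=False):
    """Classify one run given the returned (block, cycle) candidate pairs.

    `case3` marks long-latency runs, which are pessimistically counted as
    completely missed regardless of what was returned.
    """
    if case3:
        return "completely missed"
    if true_pair not in candidates:
        # A pair matching only the location or only the time does not count.
        return "completely missed"
    if len(candidates) == 1:
        return "exactly located"
    return "candidate located"


# Hypothetical runs with the injected error in block 17 at cycle 250.
truth = (17, 250)
exact = classify_localization({(17, 250)}, truth)
multi = classify_localization({(17, 250), (3, 250), (17, 251)}, truth)
wrong_time = classify_localization({(17, 251)}, truth)
```

Note that a candidate is counted as correct only when both the block and the cycle match, so a result with the right block but the wrong cycle falls under “completely missed.”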
It is clear from Figure 9 that a large percentage of bugs
were uniquely located to the correct location–time pair, while
very few bugs (4%) were completely missed, demonstrating the
effectiveness of IFRA.
5. Conclusion
IFRA targets the problem of post-silicon bug localization in
a system setup, which is a major challenge in processor post-silicon design validation. There are two major novelties of
IFRA:
1. High-level abstraction for bug localization using
low-cost hardware recorders that record semantic
information about instruction data and control flows
concurrently in a system setup.
2. Special techniques, based on self-consistency, to analyze the recorded data for localization after failure
detection.
IFRA overcomes major post-silicon bug localization
challenges:
1. It helps bridge a major gap between system-level and
circuit-level debug.
2. Failure reproduction is not required.
3. Self-consistency checks associated with the analysis
techniques minimize the need for full system-level
simulation.
IFRA creates several interesting research directions:
1. Automated construction of the post-analysis decision
diagram for a given microarchitecture.
2. Sensitivity analysis and characterization of the interrelationships between post-analysis techniques, architectural features, error detection mechanisms, recorder
sizes, and bug types.
3. Application to homogeneous/heterogeneous multi- and many-core systems, and system-on-chips (SoCs)
consisting of nonprocessor designs.
Acknowledgment
The authors thank A. Bracy, B. Gottlieb, N. Hakim,
D. Josephson, P. Patra, J. Stinson, and H. Wang of Intel
Corporation, O. Mutlu and S. Blanton of Carnegie
Mellon University, T. Hong of Stanford University, and
E. Rentschler of AMD for helpful discussions and advice.
This research is supported in part by the Semiconductor
Research Corporation and the National Science
Foundation. Sung-Boem Park is also partially supported by
Samsung Scholarship, formerly the Samsung Lee Kun Hee
Scholarship Foundation.