gcc and binutils, demonstrating the simplicity of porting a
compiler to Native Client.
Profiling and Debugging: Native Client’s open source release
includes a simple profiling framework to capture a complete
call trace with minimal performance overhead. This support
is based on gcc’s -finstrument-functions code generation option combined with the rdtsc timing instruction.
This profiler is portable, implemented entirely as untrusted
code. In our experience, optimized builds profiled in this framework have performance somewhere between -O0 and -O2
builds. Optionally, the application programmer can annotate the profiler output with methods similar to printf,
with output appearing in the trace rather than stdout.
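As a rough illustration of the mechanism described above, the following minimal sketch shows how gcc's -finstrument-functions hooks can be combined with a timestamp counter read. It is not the Native Client profiler itself: the trace_event record, the buffer size, and the trace_record and trace_dump helpers are hypothetical names chosen for this example. The hook names __cyg_profile_func_enter and __cyg_profile_func_exit are the ones gcc calls on function entry and exit when a program is compiled with -finstrument-functions. On non-x86 targets the sketch falls back to clock(), which is an assumption made only to keep the example portable.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>
#define READ_TICKS() ((uint64_t)__rdtsc())   /* rdtsc timing instruction */
#else
#include <time.h>
#define READ_TICKS() ((uint64_t)clock())     /* fallback: not rdtsc */
#endif

/* Keep the profiler's own functions out of the trace. */
#define NO_INSTR __attribute__((no_instrument_function))

/* Hypothetical trace record; not Native Client's actual trace format. */
typedef struct {
    void *fn;         /* address of the instrumented function */
    void *call_site;  /* address the function was called from */
    uint64_t ticks;   /* timestamp taken at entry or exit */
    int is_enter;     /* 1 on function entry, 0 on function exit */
} trace_event;

enum { TRACE_CAPACITY = 4096 };
static trace_event trace_buf[TRACE_CAPACITY];
static size_t trace_len;

/* Append one event; events beyond the buffer capacity are dropped. */
NO_INSTR static void trace_record(void *fn, void *call_site, int is_enter) {
    if (trace_len < TRACE_CAPACITY) {
        trace_event *e = &trace_buf[trace_len++];
        e->fn = fn;
        e->call_site = call_site;
        e->ticks = READ_TICKS();
        e->is_enter = is_enter;
    }
}

/* Called by gcc-generated code at every function entry
   when the program is built with -finstrument-functions. */
NO_INSTR void __cyg_profile_func_enter(void *fn, void *call_site) {
    trace_record(fn, call_site, 1);
}

/* Called by gcc-generated code at every function exit. */
NO_INSTR void __cyg_profile_func_exit(void *fn, void *call_site) {
    trace_record(fn, call_site, 0);
}

/* Write the recorded call trace, one event per line. */
NO_INSTR void trace_dump(FILE *out) {
    for (size_t i = 0; i < trace_len; i++) {
        fprintf(out, "%s %p from %p at %llu\n",
                trace_buf[i].is_enter ? "enter" : "exit",
                trace_buf[i].fn, trace_buf[i].call_site,
                (unsigned long long)trace_buf[i].ticks);
    }
}
```

Compiling an application together with this file using gcc -finstrument-functions should cause every instrumented function to record an entry and an exit event; calling trace_dump(stderr) before exit would then print the trace. Because the hooks are ordinary user-level code, a profiler of this shape needs no privileged support.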
Our release also includes a modified version of gdb on
Linux for Native Client debugging. The debugger recognizes the different addressing domains used by trusted
and untrusted code, and independent symbol tables for
both domains. Even with this support, the additional complexities of Native Client can interfere with debugging. As
such, we maintain a set of libraries to facilitate building
both standalone and Native Client versions of a project,
and commonly debug the standalone version first.
Performance measurements in this section are made without the Native Client outer sandbox. The outer sandbox
implementations are platform-dependent, and generally
use standard kernel facilities (e.g. system call ACLs on
Windows, user IDs on Linux) with inherently small incremental overhead.
4.1. SPEC2000
A primary goal of Native Client is to deliver substantially all
of the performance of native code execution. NaCl module
performance is impacted by alignment constraints, extra
instructions for indirect control flow transfers, and the
incremental cost of NaCl communication abstractions.
We first consider the overhead of making native code
side effect free. To isolate the impact of the NaCl binary
constraints (Table 1), we built the SPEC2000 CPU benchmarks using the NaCl compiler, and linked to run as a
standard Linux binary. The worst case for NaCl overhead is
CPU bound applications, as they have the highest density
of alignment and sandboxing overhead. Figure 3 shows the
overhead of NaCl compilation for a set of benchmarks from
SPEC2000. The worst case performance overhead is crafty
at about 12%, with an average of about 5% across all benchmarks. Hardware performance counter measurements
indicate that the largest slowdowns are due to instruction
cache misses. For crafty, the instruction fetch unit is stalled
during 83% of cycles for the NaCl build, compared to 49%
for the default build. Gcc and vortex are also significantly
impacted by instruction cache misses.
As our current alignment implementation is conservative,
aligning some instructions that are not indirect control flow
targets, we hope to make incremental code size improvements as we refine our implementation. “NaCl32” measurements use statically linked binaries, 32-byte alignment, and
the nacljmp pseudo-instruction for indirect control flow
transfers. To isolate the impact of the indirect control flow
sequence, Figure 3 also shows “align32” results for static
linking and 32-byte alignment only. These comparisons
make it clear that alignment is a factor in some cases where
overhead is significant. Impact from static linking and sandboxing instruction overhead is small by comparison.
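To make the two measured costs concrete, the sketch below models them in C under stated assumptions. It assumes that the indirect control flow sequence forces a jump target onto a 32-byte boundary by clearing its low five bits, and that alignment is achieved by padding code out to the next 32-byte block. The function names mask_indirect_target and padding_to_next_block are hypothetical and exist only for this illustration; the real sequence is emitted as machine instructions by the NaCl toolchain, not as C.

```c
#include <stdint.h>

enum { NACL_BLOCK_SIZE = 32 };  /* 32-byte instruction blocks */

/* Model of the masking step assumed for an indirect jump:
   clear the low five bits so the target is 32-byte aligned. */
static uint32_t mask_indirect_target(uint32_t target) {
    return target & ~(uint32_t)(NACL_BLOCK_SIZE - 1);
}

/* Bytes of padding needed so that code placed at `offset`
   starts on the next 32-byte boundary (0 if already aligned). */
static uint32_t padding_to_next_block(uint32_t offset) {
    return (NACL_BLOCK_SIZE - offset % NACL_BLOCK_SIZE) % NACL_BLOCK_SIZE;
}
```

In this model the masking step is a single extra operation per indirect transfer, whereas padding can add up to 31 bytes per aligned target. That is consistent with the observation above that alignment, through code size growth and instruction cache pressure, matters more than the sandboxing instructions themselves.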
The impact of alignment is not consistent across the
benchmark suite. In some cases, alignment appears to
improve performance, and in others it seems to make
things worse. We hypothesize that alignment of branch
targets to 32-byte boundaries sometimes interacts favorably with caches, instruction prefetch buffers, and other
facets of processor microarchitecture. These effects are
curious but not large enough to justify further investigation. In cases where alignment makes performance
Figure 3. SPEC2000 performance. “Align32” results are for binaries with aligned 32-byte instruction blocks. “NaCl32” results are for NaCl
binaries. Performance for both is presented relative to standard compilation with static linking. Axis label: slowdown versus static.