involve two branches. Optimizing this has been the focus of a huge amount of research over the past 40 years (most of the techniques apply to Smalltalk and similar languages, predating JavaScript). These techniques involve dynamically recompiling the code based on type information gained at runtime; this ensures that the case involving the most common types contains no branches.
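As a rough sketch of the idea (a Smalltalk-style low-bit tag encoding is assumed here, not the layout of any particular JavaScript engine), the recompiled code for an addition site that has only ever seen integers might look like this:

    #include <stdint.h>
    #include <stdlib.h>

    /* Hypothetical encoding: low bit 1 marks a small integer stored in
     * the upper bits; low bit 0 marks a pointer. */
    typedef uintptr_t value_t;

    #define IS_INT(v)   ((v) & 1)
    #define TO_INT(v)   ((intptr_t)(v) >> 1)
    #define FROM_INT(i) ((value_t)(((uintptr_t)(i) << 1) | 1))

    /* Slow path: the full semantics of JavaScript +, elided here. */
    static value_t generic_add(value_t a, value_t b) { abort(); }

    /* What a JIT might emit after observing only int + int at this
     * site: one cheap, highly predictable guard, then a straight-line
     * add (the overflow check uses a GCC/Clang builtin). */
    static value_t add_observed_int_int(value_t a, value_t b)
    {
        intptr_t r;
        if (IS_INT(a) && IS_INT(b) &&
            !__builtin_add_overflow(TO_INT(a), TO_INT(b), &r)) {
            return FROM_INT(r);
        }
        return generic_add(a, b);   /* types changed: deoptimize */
    }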
It is worth noting that the SPARC architecture has a tagged integer type, and add and subtract instructions designed specifically for dynamic languages. A SPARC tagged integer is 30 bits, with the lowest two bits in a 32-bit word reserved for type information. The tagged operations set a condition code (or trap) if either of the operands has a nonzero tag or if the result overflows.
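Emulated in C, the check performed by SPARC's taddcc is roughly the following (the real instruction performs the add and both tests in a single operation, setting the overflow condition code, and the trapping variant removes even the branch from the fast path):

    #include <stdbool.h>
    #include <stdint.h>

    /* SPARC encoding: a 30-bit integer occupies the top bits of a word
     * and its 2-bit tag is 0; pointers carry a nonzero tag. */
    bool tagged_add(uint32_t a, uint32_t b, uint32_t *result)
    {
        int32_t r;
        bool overflow =
            __builtin_add_overflow((int32_t)a, (int32_t)b, &r);
        if (((a | b) & 0x3) != 0 || overflow) {
            return false;   /* condition code set: take the slow path */
        }
        *result = (uint32_t)r;
        return true;
    }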
These instructions are not widely used, for several reasons. They use the opposite definition of tag values from most 32-bit implementations (it is simpler to use a 0 tag for pointers, as it allows the pointer value to be used directly). On 64-bit architectures, it is more common to use NaN (not a number) representations to differentiate object pointers from numeric values. This is possible because most 64-bit architectures have a “memory hole” (a range in the middle of the virtual address space that cannot be mapped), which conveniently lines up with the pointer interpretation of certain NaN values. This allows generated code to assume that values are floating point and either branch or trap for NaN values.
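A minimal sketch of such a NaN-boxing scheme, assuming 48-bit user pointers and one particular (invented) NaN bit pattern as the pointer tag:

    #include <stdint.h>
    #include <string.h>

    typedef uint64_t value_t;

    #define PAYLOAD 0x0000ffffffffffffULL  /* low 48 bits: the pointer */
    #define PTR_TAG 0xfff9000000000000ULL  /* an assumed NaN-space tag */

    /* Any bit pattern outside the NaN space is an ordinary double. */
    static value_t box_double(double d) {
        value_t v;
        memcpy(&v, &d, sizeof v);          /* bit-copy, no type punning UB */
        return v;
    }

    static value_t box_pointer(void *p) {
        return PTR_TAG | ((uintptr_t)p & PAYLOAD);
    }

    /* The check generated code performs: assume double, and branch
     * (or trap) only when the value falls in the pointer-tagged range. */
    static int is_pointer(value_t v) {
        return (v & ~PAYLOAD) == PTR_TAG;
    }

    static void *unbox_pointer(value_t v) {
        return (void *)(uintptr_t)(v & PAYLOAD);
    }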
In spite of the popularity of JavaScript as a general-purpose programming language, most “general-purpose” processors still require the compiler to perform complex contortions to generate even moderately efficient code. Importantly, all of the techniques that do generate good code from JavaScript require a virtual machine that performs dynamic recompilation. If the most efficient way of running general-purpose code on a processor is to implement an emulator for a different (imaginary) architecture, then it is difficult to argue that the underlying substrate is truly general purpose.
Locality of Reference
One feature shared by most modern CPUs is their large caches. In the case of Itanium and POWER, this trend has gone beyond the realm of propriety, but even commodity x86 chips now have more cache than a 386 or 486 had RAM.
Cache, unlike scratchpad memory, is (as its name implies) hidden from the software and exists to speed up access to main memory. This ability to stay hidden is possible only if the software has a predictable access pattern. The strategy most commonly used for CPU caches optimizes for locality of reference. The caches are split into lines of typically 64 or 128 bytes. Loading a single byte from main memory will fill an entire cache line, making accesses to adjacent memory cheap. Often, caches will fill more than one adjacent line on a miss, further optimizing for this case.
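The effect is easy to see in code. Assuming 64-byte lines, the two traversals below touch exactly the same bytes, but the first uses all eight doubles from every line it loads, while the second uses only one:

    #include <stddef.h>

    #define N 1024
    static double m[N][N];

    /* Sequential: each 64-byte line is loaded once and then supplies
     * 8 consecutive doubles, so most accesses are cache hits. */
    double sum_row_major(void) {
        double s = 0;
        for (size_t i = 0; i < N; i++)
            for (size_t j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Strided by an entire row (8 KB): every access lands on a new
     * line, and a column pass touches 1024 lines (64 KB), more than a
     * typical L1 holds, so the neighboring doubles each miss brings in
     * are usually evicted before they are reused. */
    double sum_col_major(void) {
        double s = 0;
        for (size_t j = 0; j < N; j++)
            for (size_t i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }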
This cache design is a result of the working-set hypothesis, which proposes that each phase of computation accesses a subset of the total memory available to a program, and that this set changes relatively slowly.2 GPUs, in contrast, are optimized for a very different set of algorithms. The memory controller on an Nvidia Tesla GPU, for example, has a small number of fixed access patterns (including some recursive Z-shapes, commonly used for storing volumetric data); it fetches data in those patterns within a texture and streams it to the processing elements. Modern Intel CPUs can also identify some regular access patterns and stream memory efficiently, but with lower throughput.
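The “recursive Z-shapes” are Morton (Z-order) layouts. As an illustration of why they suit volumetric data (this is the standard bit-interleaving construction, not Nvidia's actual hardware mapping), a 3D Morton index keeps texels that are close in x, y, and z close in memory:

    #include <stdint.h>

    /* Spread the low 10 bits of x so that two zero bits separate each
     * original bit (standard bit-interleaving trick). */
    static uint32_t part1by2(uint32_t x) {
        x &= 0x000003ff;                    /* 10 bits per coordinate */
        x = (x | (x << 16)) & 0xff0000ff;
        x = (x | (x << 8))  & 0x0300f00f;
        x = (x | (x << 4))  & 0x030c30c3;
        x = (x | (x << 2))  & 0x09249249;
        return x;
    }

    /* Z-order index for a 3D texel: neighbors along any axis map to
     * nearby addresses, giving the fetch pattern 3D locality. */
    uint32_t morton3d(uint32_t x, uint32_t y, uint32_t z) {
        return part1by2(x) | (part1by2(y) << 1) | (part1by2(z) << 2);
    }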
Jae Min Kim et al. showed that some programs run faster on the slow cores in ARM's big.LITTLE architecture. Their explanation was that the slow Cortex-A7 cores have single-cycle access to their level-1 caches, whereas the fast Cortex-A15 cores have a four-cycle penalty. This means that if the working-set hypothesis holds (the working set fits into the L1 cache, and performance is dominated by memory access), then the A7 will be able to saturate its pipeline, whereas the A15 will spend most of its time waiting for data.
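One way to observe the effect Kim et al. describe is a dependent-load loop over a buffer that fits in L1: every iteration pays the full load-to-use latency, so a core with a single-cycle L1 sustains roughly one iteration per cycle, while one with a four-cycle L1 stalls. A minimal sketch:

    #include <stddef.h>

    enum { SLOTS = 1024 };      /* 1024 * 8 bytes = 8 KB: fits either L1 */
    static void *ring[SLOTS];

    void init_ring(void) {
        for (size_t i = 0; i < SLOTS; i++)
            ring[i] = &ring[(i + 1) % SLOTS];  /* each slot points on */
    }

    /* Each load's address depends on the previous load's result, so no
     * out-of-order machinery can hide the L1 hit latency: roughly one
     * cycle per step on a single-cycle L1 (Cortex-A7), roughly four on
     * a four-cycle L1 (Cortex-A15). */
    void *chase(long iters) {
        void **p = (void **)ring[0];
        while (iters-- > 0)
            p = (void **)*p;
        return p;   /* returning p keeps the loop from being elided */
    }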
This highlights the fact that, even within commodity microprocessors from the same manufacturer and generation, different cache topologies can bias performance in favor of specific categories of algorithms.