Eye-of-a-Needle Aphorism

Introduced at a time of hardware scarcity almost 70 years ago, the von Neumann apparatus of stored program and program counter forced the threading of instructions through a metaphoric eye of a needle. Coupling of mathematical induction and the (serial) ISE abstraction was engineered to provide this threading, as discussed throughout the article. See especially the description of how variable X is used in the pseudo-code of the serial iterative algorithm in the exchange problem; also the first-in-first-out queue data structure in the serial breadth-first search; and the serial merging algorithm in which two elements are compared at a time, one from each of two sorted input arrays. As eye-of-a-needle threading is already second nature for many programmers, it has come to be associated with ease of programming.
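To make the threading concrete, here is a minimal C sketch of the serial exchange algorithm referred to above (the names A, B, X, and exchange_serial are illustrative, not the article's exact pseudo-code). The single temporary X is declared once and reused in every iteration, so all n exchanges must pass, one at a time, through that one variable: the eye of the needle.

#include <stddef.h>

/* Serial exchange of the contents of arrays A and B of length n.
 * One temporary, X, is shared by the entire loop, forcing the
 * exchanges into a strict one-at-a-time order. */
void exchange_serial(int *A, int *B, size_t n)
{
    int X;                        /* the single eye-of-a-needle temporary */
    for (size_t i = 0; i < n; i++) {
        X    = A[i];
        A[i] = B[i];
        B[i] = X;
    }
}

The FIFO queue in serial breadth-first search and the two-at-a-time comparisons in serial merging play the same single-point-of-threading role.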
Threading through the eye of a needle is an aphorism for extreme difficulty, even impossibility, in the broader culture, including in the texts of three major religions. The XMT extension to the von Neumann apparatus (noted in the “The XMT Processor” sidebar) exploits today’s relative abundance of hardware resources to free computing from the constraint of threading through the original apparatus. Coupling mathematical induction and the ICE abstraction explored here is engineered to capitalize on this freedom for ease of parallel programming and improved machine and application performance.
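For contrast, a hedged sketch of the ICE-style counterpart, again in plain C (the array of temporaries X and the function name are illustrative; ICE pseudo-code would use a pardo loop rather than C): once each index gets its own temporary, no iteration depends on any other, so the abstraction may declare all n exchanges executable concurrently in a single step.

#include <stdlib.h>

/* ICE-style exchange: one temporary X[i] per index removes every
 * dependence between iterations. Under the ICE abstraction the loop
 * below would be indicated as a single parallel step (a pardo);
 * plain C still executes it serially, and an XMT/XMTC spawn or any
 * parallel-for construct could execute the iterations concurrently. */
void exchange_ice_style(int *A, int *B, size_t n)
{
    int *X = malloc(n * sizeof *X);   /* one temporary per iteration */
    if (X == NULL)
        return;                       /* allocation failed; nothing exchanged */
    for (size_t i = 0; i < n; i++) {  /* conceptually: for i pardo */
        X[i] = A[i];
        A[i] = B[i];
        B[i] = X[i];
    }
    free(X);
}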
XMT teachability (see Torbert et al.23). Highlights include evidence of 100X speedups on general-purpose applications on a simulator of 1,000 on-chip processors,13 speedups ranging from 15X to 22X for irregular problems (such as Quicksort, breadth-first search on graphs, and finding the longest path in a directed acyclic graph), and speedups of 35X–45X for regular programs (such as matrix multiplication and convolution) on the 64-processor XMT prototype versus the best serial code on XMT.28
In 2009, Caragea et al.8 demonstrated nearly 10X average performance improvement potential relative to Intel Core 2 Duo for a 64-processor XMT chip using the same silicon area as a single core. In 2010, Caragea et al.7 demonstrated that, using the same silicon area as a modern graphics processing unit (GPU), the XMT design achieves an average speedup of 6X relative to the GPU for irregular applications and falls only slightly behind on regular ones. All GPU code was written and optimized by researchers and programmers unrelated to the XMT project.
Figure 4. Left side: FPGA board (the size of a car license plate) with three FPGA chips, generously donated by Xilinx (A, B: Virtex-4 LX200; C: Virtex-4 FX100). Right side: 10mm x 10mm chip using IBM flip-chip technology. (Board callouts: chips A, B, C; DDR2 memory; PCI bus.)
ming difficulties have failed all general-purpose parallel systems to date
by limiting their use. In contrast, XMT
frees its programmers from doing all
the steps, in line with the ICE/PRAM
abstraction.
The XMT software environment (a 2010 release of the XMTC compiler and a cycle-accurate simulator of XMT) is available as a free download from the XMT home page and from sourceforge.net, along with extensive documentation, and can be installed on any standard desktop computing platform. Teaching materials covering a class-tested programming methodology in which students are taught only parallel algorithms are also available from the XMT Web pages.
Most CS programs today graduate
students to a job market certain to be
dominated by parallelism but without
the preparation they need. The level of
awareness of parallelism required by
the ICE/PRAM abstraction is so basic that it is necessary for all other current approaches. As XMT is also buildable,
the XMT approach is sufficient for programming a real machine. I therefore
propose basing the introduction of
the new generation of CS students to
parallelism on the workflow presented
here, at least until CS generally converges on a many-core platform.
Related efforts. Related efforts toward parallelism come in several flavors; for example, Valiant’s Multi-BSP
bridging model for multi-core computing24 appears closest to the XMT
focus on abstraction. The main difference, however, is that XMT aims
to preempt known shortcomings in
existing machines by showing how to
build machines differently, while the
modeling in Valiant24 aims to improve
understanding of existing machines.
These prescriptive versus descriptive objectives are not the only difference. Valiant24 modeled relatively low-level parameters of certain multi-core architectures, making that work closer to Vishkin et al.27 than to this article. Unlike both sources, the “one-liner” ICE abstraction is driven by simplicity. Parallel languages (such as CUDA, MPI, and OpenMP) tend to be different from computational models, as they often do not involve performance modeling. They require a level of detail that distances them farther from simple