ment in absolute performance than the expected speedup
predicted by Moore’s law over our development time period, and because we are currently at the cusp of simulating timescales of great biological significance. We expect
Anton to run simulations over 1000 times faster than was
possible when we began this project. Assuming that transistor densities continue to double every 18 months and
that these increases translate into proportionally faster
processors and communication links, one would expect
approximately a tenfold improvement in commodity solutions over the five-year development time of our machine
(from conceptualization to bring-up). We therefore expect
that a specialized solution will be able to access biologically critical millisecond timescales significantly sooner than
commodity hardware.
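The tenfold figure follows directly from the assumed doubling rate: with transistor densities doubling every 18 months over a 60-month development period,

\[ 2^{60/18} = 2^{10/3} \approx 10. \]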
To simulate a millisecond within a couple of months,
we must complete a time step every few microseconds, or
every few thousand clock ticks. The sequential dependence
of successive time steps in an MD simulation makes speculation across time steps extremely difficult. Fortunately,
specialization offers unique opportunities to accelerate an
individual time step using a combination of architectural
features that reduce both computational latency and communication latency.
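To put rough numbers on this budget (the 2.5 fs time step is an assumed, typical MD value; the text above quotes only the resulting per-step budget): simulating one millisecond requires

\[ \frac{10^{-3}\ \text{s}}{2.5\times 10^{-15}\ \text{s/step}} = 4\times 10^{11}\ \text{steps}, \]

and completing them in about two months (roughly 5 × 10^6 s of wall-clock time) leaves on the order of 13 µs per time step, or roughly 5,000 ticks of a 400 MHz clock.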
For example, we reduced computational latency
by designing:
• Dedicated, specialized hardware datapaths and control logic to evaluate the range-limited interactions and to perform charge spreading and force interpolation (a scalar reference sketch for these interactions follows this list). In addition to packing much more computational logic on a chip than is typical of general-purpose architectures, these pipelines use customized precision for each operation.
• Specialized, yet programmable, processors to compute bond forces and the FFT and to perform integration. The instruction set architecture (ISA) of these processors is tailored to the calculations they perform. Their programmability provides flexibility to accommodate various force fields and integration algorithms.
• Dedicated support in the memory subsystem to accumulate forces for each particle.
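As a concrete point of reference for the range-limited interactions evaluated by the first of these datapaths, the following is a minimal scalar sketch in C of a cutoff-limited electrostatic plus Lennard-Jones pair interaction. It is illustrative only: the double-precision arithmetic, parameter names, and units (kcal/mol, angstroms, elementary charges) are our assumptions, and the hardware pipelines use customized per-operation precision and hardware-tailored functional forms rather than this code.

#include <math.h>

/* Minimal scalar reference (illustrative only) for a range-limited pair
 * interaction: Coulomb plus Lennard-Jones, evaluated only when the pair
 * distance is within the cutoff.  Adds the force on particle i due to
 * particle j into f[3]; returns 1 if the pair was within range, else 0. */
int pair_force(const double ri[3], const double rj[3],
               double qi, double qj,           /* partial charges        */
               double epsilon, double sigma,   /* Lennard-Jones params   */
               double cutoff, double f[3])
{
    const double k_e = 332.0636;  /* Coulomb constant in kcal*A/(mol*e^2) */
    double dr[3], r2 = 0.0;

    for (int d = 0; d < 3; ++d) {
        dr[d] = ri[d] - rj[d];
        r2 += dr[d] * dr[d];
    }
    if (r2 > cutoff * cutoff)
        return 0;                 /* outside the range limit: no work done */

    double inv_r2 = 1.0 / r2;
    double inv_r  = sqrt(inv_r2);
    double sr2    = sigma * sigma * inv_r2;
    double sr6    = sr2 * sr2 * sr2;

    /* Scalar coefficients c such that the force vector is c * dr. */
    double coul = k_e * qi * qj * inv_r * inv_r2;            /* ~ q_i q_j / r^3 */
    double lj   = 24.0 * epsilon * (2.0 * sr6 * sr6 - sr6) * inv_r2;

    for (int d = 0; d < 3; ++d)
        f[d] += (coul + lj) * dr[d];
    return 1;
}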
We reduced communication latency by designing:
• A low-latency, high-bandwidth network, both within an ASIC and between ASICs, that includes specialized routing support for common MD communication patterns such as multicast and compressed transfers of sparse data structures.
• Support for choreographed “push”-based communication. Producers send results to consumers without the consumers having to request the data beforehand (a schematic sketch follows the list).
• A set of autonomous direct memory access (DMA) engines that offload communication tasks from the computational units, allowing greater overlap of communication and computation.
• Admission control features that prioritize packets carrying certain algorithm-specific data types.
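To make the push model and the role of the DMA engines concrete, here is a schematic C sketch built on a hypothetical dma_push_async/dma_wait interface (not Anton's actual programming interface): the producer hands each completed buffer to a DMA engine and immediately starts computing into the other buffer, so consumers never issue requests and communication overlaps computation.

#include <stddef.h>

typedef struct { float force[3]; } result_t;

/* Hypothetical DMA-engine interface, for illustration only: start an
 * asynchronous push of len bytes to dest_node, and later block until
 * the push identified by tag has completed. */
extern void dma_push_async(int dest_node, const void *src, size_t len, int tag);
extern void dma_wait(int tag);

/* Producer-side computation for one batch of results (stub). */
extern void compute_batch(result_t *out, int n);

void produce_and_push(result_t bufs[2][256], int nbatches, int consumer_node)
{
    int cur = 0;
    for (int b = 0; b < nbatches; ++b) {
        compute_batch(bufs[cur], 256);              /* fill the current buffer    */
        dma_push_async(consumer_node, bufs[cur],    /* push it unprompted; the    */
                       256 * sizeof(result_t), b);  /* consumer never asks        */
        if (b > 0)
            dma_wait(b - 1);                        /* other buffer is free again */
        cur ^= 1;                                   /* compute into the other one */
    }
    dma_wait(nbatches - 1);                         /* drain the final push       */
}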
We balance our design very differently from a general-purpose supercomputer architecture. Relative to other
high-performance computing applications, MD demands a great deal of communication and computation but surprisingly little
memory. The entire architectural state of an MD simulation
of 25,000 particles, for example, is just 1.6 MB, or 3.2 KB per
node in a 512-node system. We exploit this property by using only SRAMs and small L1 caches on our ASIC, with all
code and data fitting on-chip in normal operation. Rather
than spending silicon area on large caches and aggressive
memory hierarchies, we instead dedicate it to communication and computation.
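These figures amount to roughly 64 bytes of architectural state per particle; one way to account for that (our illustrative breakdown, not one given here) is double-precision position and velocity vectors plus charge, type, and index fields:

\[ \frac{1.6\ \text{MB}}{25{,}000\ \text{particles}} \approx 64\ \text{B/particle} = 24\ \text{B (position)} + 24\ \text{B (velocity)} + 16\ \text{B (charge, type, index)}. \]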
It is serendipitous that the most computationally intensive parts of MD—in particular, the electrostatic interactions—are also the most well established and unlikely to
change as force field models evolve, making them particularly amenable to hardware acceleration. Dramatically accelerating MD simulation, however, requires that we accelerate more than just an “inner loop.”
Calculation of electrostatic and van der Waals forces accounts for roughly 90% of the computational time for a
representative MD simulation on a single general-purpose
processor. Amdahl’s law states that no matter how much we
accelerate this calculation, the remaining computations,
left unaccelerated, would limit our maximum speedup to
a factor of 10. Hence, we dedicated a significant fraction
of silicon area to accelerating other tasks, such as bond
force computation, constraint computation, and velocity
and position updates, incorporating programmability as
appropriate to accommodate a variety of force fields and
integration methods.
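In the standard form of Amdahl's law, with p ≈ 0.9 the fraction of run time spent on these forces and s the speedup applied to that fraction, the overall speedup S is bounded even as s grows without limit:

\[ S = \frac{1}{(1-p) + p/s} \;\le\; \frac{1}{1-p} = \frac{1}{1-0.9} = 10. \]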
4. SYSTEM ARCHITECTURE
The building block of the system is a node, depicted in
Figure 2. Each node comprises an MD-specific ASIC, attached DRAM, and six ports to the system-wide interconnection network. Each ASIC has four major subsystems, which
are described briefly in this section. The nodes, which are
logically identical, are connected in a three-dimensional
torus topology (which maps naturally to the periodic
boundary conditions frequently used in MD simulations).
The initial version of Anton will be a 512-node torus with
eight nodes in each dimension, but our architecture also
supports larger and smaller toroidal configurations. The
ASICs are clocked at a modest 400 MHz, with the exception
of one double-clocked component in the high-throughput
interaction subsystem (HTIS), discussed in the following
section.
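As an illustration of the wrap-around connectivity of the 512-node, 8 × 8 × 8 torus described above (the coordinate-to-index mapping is ours, not necessarily Anton's), the following C sketch lists a node's six neighbors, one per port, with each axis wrapping around just as the periodic boundary conditions of the simulated system do.

#include <stdio.h>

#define DIM 8   /* eight nodes per dimension -> 512 nodes total */

/* Flatten torus coordinates into a node index, wrapping each axis. */
static int node_id(int x, int y, int z)
{
    x = (x % DIM + DIM) % DIM;   /* handles negative offsets */
    y = (y % DIM + DIM) % DIM;
    z = (z % DIM + DIM) % DIM;
    return (z * DIM + y) * DIM + x;
}

int main(void)
{
    int x = 0, y = 3, z = 7;     /* an arbitrary node */
    printf("+x neighbor: %d\n", node_id(x + 1, y, z));
    printf("-x neighbor: %d\n", node_id(x - 1, y, z));  /* wraps to x = 7 */
    printf("+y neighbor: %d\n", node_id(x, y + 1, z));
    printf("-y neighbor: %d\n", node_id(x, y - 1, z));
    printf("+z neighbor: %d\n", node_id(x, y, z + 1));  /* wraps to z = 0 */
    printf("-z neighbor: %d\n", node_id(x, y, z - 1));
    return 0;
}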
4.1. High-throughput interaction subsystem
The HTIS calculates range-limited interactions and
performs charge spreading and force interpolation. The
HTIS, whose internal structure is shown in Figure 3, applies
massive parallelism to these operations, which constitute
the bulk of the calculation in MD. It provides tremendous
arithmetic throughput using an array of 32 pairwise point interaction modules (PPIMs) (Figure 3), each of which includes
a force calculation pipeline that runs at 800 MHz and is capable of computing the combined electrostatic and van der