4.2. Flexible subsystem
The flexible subsystem controls the ASIC and handles all other computations, including the bond force calculations, the
FFT, and integration. Figure 5 shows the components of the
flexible subsystem. Four identical processing slices form the
core of the flexible subsystem. Each slice comprises a general-purpose core with its caches, a remote access unit (RAU)
that performs autonomous data transfers, and two geometry
cores (GCs), which are programmable cores that perform
most of the flexible subsystem’s computation. The RAU is a
programmable data transfer engine that enables the flexible
subsystem to participate in “push” communication, both offloading messages sent from the processor cores and tracking incoming messages to determine when work is ready to
be done. Each GC is a dual-issue, statically scheduled, 4-way
SIMD processor with pipelined multiply accumulate support and instruction set extensions to support common MD
calculations. Other components of the flexible subsystem
include a correction pipeline, which computes force correction terms; a racetrack, which serves as a local, internal
interconnect for the flexible subsystem components; and a
ring interface unit, which allows the flexible subsystem components to transfer packets to and from the communication
subsystem. More detail about the flexible subsystem is given
in a second paper at this year’s HPCA conference. 12
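The RAU's role of tracking incoming "push" messages to decide when work is ready can be illustrated with a simple counter model. This is only a minimal sketch under stated assumptions: the class `ArrivalTracker`, its methods, and the per-work-item message counts are hypothetical names introduced for illustration, and the text above does not describe how the RAU actually implements this tracking.

```python
from dataclasses import dataclass, field


@dataclass
class ArrivalTracker:
    """Hypothetical model of counting pushed messages per work item."""

    expected: dict = field(default_factory=dict)  # work_id -> messages required
    received: dict = field(default_factory=dict)  # work_id -> messages seen so far

    def expect(self, work_id, count):
        """Declare how many pushed messages a work item needs before it can run."""
        self.expected[work_id] = count
        self.received[work_id] = 0

    def on_message(self, work_id):
        """Record one pushed message; return True once the work item is ready."""
        self.received[work_id] += 1
        return self.is_ready(work_id)

    def is_ready(self, work_id):
        """A work item is ready when all of its expected messages have arrived."""
        return self.received[work_id] >= self.expected[work_id]


# Usage: a work item that waits for three pushed messages (assumed count).
tracker = ArrivalTracker()
tracker.expect("bond_forces", 3)
for _ in range(3):
    ready = tracker.on_message("bond_forces")
```

In this model the receiver never requests data; senders push it, and the tracker only has to count arrivals, which is the property that lets the processor cores avoid polling for each message.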
Figure 5: Flexible subsystem. It is a collection of four identical processing slices (one of which is indicated by a box at the left) and a correction pipeline unit. The processing slices communicate with each other and with the correction pipeline via the racetrack. The various components communicate with the intra-chip communication ring via the ring interface unit shown at the top of the figure.
[Figure 5 diagram labels: four processing slices, each containing a GP core (0 to 3), an RAU (0 to 3), and two GCs (0 to 7); racetrack stations connecting the slices and the correction pipeline to the racetrack; a ring interface unit connecting to the intra-chip ring network.]
4.3. Communication subsystem
The communication subsystem provides high-speed, low-latency communication both between ASICs and among
the subsystems within an ASIC. Between chips, each torus
link provides 5.3 GB/s full-duplex communication with
a hop latency around 50 ns. Within a chip, two 256-bit,
400 MHz communication rings link all subsystems and the
six inter-chip torus ports. The communication subsystem
supports efficient multicast, provides flow control, and
provides class-based admission control with rate metering. The communication subsystem also allows access to
an external host computer system for input and output of
simulation data.
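As a rough worked example, the raw peak bandwidth of the on-chip rings follows from the ring width and clock rate quoted above. The sketch below assumes one full-width transfer per clock cycle and takes 1 GB as 10^9 bytes; both are assumptions made for illustration, and the result is an upper bound that ignores packet headers, flow control, and contention. The function name is hypothetical.

```python
def ring_bandwidth_gb_per_s(width_bits, clock_hz):
    """Raw peak bandwidth of one ring in GB/s (1 GB = 1e9 bytes).

    Assumes one full-width transfer per clock cycle, which is an
    idealization and not a figure stated in the text.
    """
    bytes_per_cycle = width_bits / 8
    return bytes_per_cycle * clock_hz / 1e9


per_ring = ring_bandwidth_gb_per_s(256, 400e6)  # 12.8 GB/s per ring
both_rings = 2 * per_ring                       # 25.6 GB/s for the two rings
print(f"per ring: {per_ring:.1f} GB/s, both rings: {both_rings:.1f} GB/s")
```

Under these assumptions each ring carries more than twice the 5.3 GB/s of a single torus link, although the rings are shared by all subsystems and all six torus ports.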
4.4. Memory subsystem
The memory subsystem provides access to the ASIC’s attached DRAM. In addition to basic memory read/write access, the memory subsystem supports accumulation and
synchronization. Special memory write operations numerically add incoming write data to the contents of the memory location specified in the operation. These operations
implement force, energy, potential, and spread charge accumulations, reducing the computation and communication load on the flexible subsystem. By taking advantage of
the attached DRAM, Anton will be able to simulate chemical systems with billions of atoms.
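The accumulating write described above can be modeled in a few lines. This is a minimal software sketch, not a description of the hardware: the class `AccumulatingMemory`, its method names, and the use of floating-point cells are assumptions made for illustration.

```python
class AccumulatingMemory:
    """Toy model of a memory that supports plain and accumulating writes."""

    def __init__(self, size):
        # Floating-point cells are an assumption; the text does not
        # specify the number format used by the memory subsystem.
        self.cells = [0.0] * size

    def write(self, addr, value):
        """Ordinary write: overwrite the stored value."""
        self.cells[addr] = value

    def accumulate(self, addr, value):
        """Accumulating write: add the incoming value to the stored value."""
        self.cells[addr] += value

    def read(self, addr):
        """Ordinary read: return the stored value."""
        return self.cells[addr]


# Usage: three force contributions for the same atom arrive as separate
# accumulating writes; the memory holds their sum (values are made up).
mem = AccumulatingMemory(size=8)
for contribution in (0.25, -1.0, 2.5):
    mem.accumulate(3, contribution)
total_force = mem.read(3)  # 1.75
```

Because the addition happens at the memory, each sender issues a single write and no unit has to read the old value, add to it, and write it back, which is the reduction in computation and communication load that the text attributes to these operations.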
5. Performance and Accuracy Measurements
In this section, we show that the performance of Anton
significantly exceeds that of other MD platforms, and that
Anton is capable of performing simulations of high numerical accuracy. Because we do not yet have a working
512-node segment, performance estimates for our machine come from our performance simulator. The cycle fidelity of this simulator varies across components, but we
expect overall fidelity better than ±20%.
5.1. Performance comparison
We compare the performance of various MD platforms in
terms of simulation rate (nanoseconds of simulated time
per day of execution) on a particular chemical system. In
this section and in Section 5.2, we use a system with 23,558
atoms in a cubic box measuring 62.2 Å on a side. This system represents dihydrofolate reductase (DHFR), a protein
targeted by various cancer drugs, surrounded by water.
The highest-performing MD codes achieve a simulation
rate of a few nanoseconds per day for DHFR on a single
state-of-the-art commodity processor core. 8 Existing multiprocessor machines with high-performance interconnects achieve simulation rates up to a few hundred nanoseconds per day using many hundreds or thousands of
processor cores. 2, 3, 5
We expect a 512-node Anton system to achieve a simulation rate of approximately 14,500 nanoseconds per day for
DHFR, enabling a millisecond simulation in just over two
months. While the performance of general-purpose machines will undoubtedly continue to improve, Anton’s performance advantage over other MD platforms significantly
exceeds the speedup predicted by Moore’s law over the
next few years. A more detailed performance comparison
of Anton and other MD platforms is given in the proceedings of last year’s ISCA conference. 20
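The "just over two months" figure can be checked with simple arithmetic: one millisecond is 1,000,000 nanoseconds, and the projected rate is 14,500 nanoseconds per day. The short sketch below reproduces that calculation; the function name is hypothetical, and the rate is the simulator-based estimate quoted above, not a measurement.

```python
NS_PER_MS = 1_000_000  # nanoseconds in one millisecond


def days_to_simulate(target_ns, rate_ns_per_day):
    """Wall-clock days needed to simulate target_ns at a constant rate."""
    return target_ns / rate_ns_per_day


days = days_to_simulate(NS_PER_MS, 14_500)
print(f"{days:.1f} days")  # about 69 days, i.e. just over two months
```

The same function can be applied to any other platform's simulation rate to compare how long a millisecond-scale simulation would take there.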