Figure 2: Anton processing node. The HTIS performs the most
demanding calculations in an MD simulation. The flexible subsystem
performs the remaining MD calculations, coordinates MD time step
activity, and manages housekeeping tasks.
[Diagram labels: host computer; host interface; six routers; six torus links (+X, −X, +Y, −Y, +Z, −Z); flexible subsystem; high-throughput interaction subsystem (HTIS); two memory controllers, each with DRAM; intra-chip ring network.]
Waals interactions between a pair of atoms at every cycle. This
26-stage pipeline (Figure 4) includes adders, multipliers, function evaluation units, and other specialized datapath elements.
Inside this pipeline, we use customized numerical precisions:
functional unit width varies across the different pipeline stages
but still produces a sufficiently accurate 32-bit result.
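The effect of per-stage precision can be illustrated with a small fixed-point sketch. Everything here is an assumption made for illustration: the helper names `quantize` and `pipeline_r2`, the two-stage split, and the bit widths are hypothetical and are not taken from the Anton design. The sketch only shows that stages may carry different numbers of fractional bits while the final error stays bounded by the sum of the per-stage rounding errors.

```python
def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (fixed-point rounding).

    The rounding error is at most half a unit in the last place,
    i.e. 2**-(frac_bits + 1).
    """
    scale = 1 << frac_bits
    return round(x * scale) / scale


def pipeline_r2(d, stage_bits=(10, 16)):
    """Squared length of d computed in two stages of different width.

    stage_bits is a hypothetical (squarer_bits, adder_bits) pair: the
    squares are kept at squarer_bits fractional bits, and the sum is
    then rounded to adder_bits fractional bits.
    """
    sq_bits, sum_bits = stage_bits
    squares = [quantize(c * c, sq_bits) for c in d]
    return quantize(sum(squares), sum_bits)
```

With these widths, each of the three squares contributes at most 2^-11 of error and the final rounding at most 2^-17, so a designer can choose the narrowest widths whose combined bound still meets the accuracy target for the result.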
Figure 3: High-throughput interaction subsystem. The HTIS
comprises an array of 32 PPIMs and an embedded control processor
to coordinate the distribution of particles to the PPIM array.
[Diagram labels: processing node; from intra-chip ring network; particle memory; particle preprocessing; particle distribution logic; 32-PPIM array; interaction control block processor; force reduction logic; to intra-chip ring network.]
In order to keep the pipelines busy with useful computation, the remainder of the HTIS must determine pairs of
atoms that need to interact, feed them to the pipelines,
and aggregate the pipelines’ outputs. This proves a formidable challenge given communication bandwidth limitations between ASICs, between the HTIS and other subsystems on the same ASIC, and between pipelines within
the HTIS. We address this problem using an architecture
tailored for direct product selection reduction operations
(DPSRs), which take two sets of points and perform computation proportional to the product of the set sizes but
only require input and output volume proportional to the
sum of their sizes. The HTIS considers interactions between all atoms in a region called the tower and all atoms
in a region called the plate. Each atom in the tower is assigned to one PPIM, while each atom in the plate streams
by all the PPIMs. Eight match units in each PPIM perform
several tests, including a low-precision distance check,
to determine which pairs of plate and tower particles are
fed to the force calculation pipeline. Because the HTIS is
a streaming architecture, with no feedback in its computational path, it is simple to scale the PPIM array to any
number of PPIMs. The HTIS also includes an interaction
control block processor, which controls the flow of data
through the HTIS. More detail about the HTIS and about
DPSR operations can be found in the proceedings of this
year’s HPCA conference. 13
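The tower/plate data flow described above can be sketched in software. This is a minimal illustration under stated assumptions, not the hardware algorithm: the round-robin assignment of tower atoms to PPIMs, the integer (fixed-point) coordinates, the bit-dropping form of the low-precision distance check, and the placeholder `toy_force` are all hypothetical choices, as are the function names. The input is the two atom lists (volume proportional to the sum of their sizes), while the match tests range over every tower/plate combination (work proportional to the product).

```python
def coarse_match(t, p, cutoff, drop_bits=3):
    """Low-precision distance test on integer coordinates (assumed form).

    Dropping the low bits of each |difference| can only shrink the
    estimated distance, so a pair rejected here is truly out of range.
    """
    est = sum((abs(p[k] - t[k]) >> drop_bits) ** 2 for k in range(3))
    return (est << (2 * drop_bits)) <= cutoff * cutoff


def toy_force(r2):
    """Placeholder scalar (force / distance); not a real force field."""
    return 1.0 / r2


def dpsr_interact(tower, plate, cutoff, n_ppims=32):
    """Return (tower_forces, plate_forces, pairs_sent_to_pipeline)."""
    # Each tower atom is assigned to exactly one PPIM (round-robin here).
    ppims = [[] for _ in range(n_ppims)]
    for i in range(len(tower)):
        ppims[i % n_ppims].append(i)

    tower_f = [[0.0, 0.0, 0.0] for _ in tower]
    plate_f = [[0.0, 0.0, 0.0] for _ in plate]
    pairs = 0
    # Each plate atom streams past every PPIM.
    for j, p in enumerate(plate):
        for resident in ppims:
            for i in resident:
                t = tower[i]
                if not coarse_match(t, p, cutoff):
                    continue  # filtered out by the match stage
                d = [p[k] - t[k] for k in range(3)]
                r2 = sum(c * c for c in d)
                if r2 == 0 or r2 > cutoff * cutoff:
                    continue  # exact check fails
                pairs += 1
                f = toy_force(r2)
                for k in range(3):
                    # Equal and opposite contributions, accumulated
                    # separately for the tower and plate atoms.
                    tower_f[i][k] -= f * d[k]
                    plate_f[j][k] += f * d[k]
    return tower_f, plate_f, pairs
```

Because no PPIM depends on the output of another, changing `n_ppims` in this sketch alters only how tower atoms are distributed, not the result, which mirrors the scaling argument made in the text.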
The PPIMs are the most hard-wired component of our
architecture, reflecting the fact that they handle the most
computationally intensive parts of the MD calculation.
That said, even the PPIMs include programmability where
we anticipate potential future changes to force fields. For
instance, the functional forms for van der Waals and
electrostatic interactions are specified using SRAM lookup tables, whose contents are determined at runtime.
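The idea of specifying a functional form through table contents can be sketched as follows. This is an assumption-laden illustration: the uniform binning in r², the piecewise-linear interpolation, the toy functional forms, and the names `build_table` and `evaluate` are hypothetical and do not describe the actual table format used in the PPIMs. The point is only that the evaluator logic stays fixed while the table loaded at run time determines which function it computes.

```python
import math


def build_table(func, r2_max, n_bins):
    """Tabulate func on uniform r^2 bins as (value_at_bin_start, slope)."""
    width = r2_max / n_bins
    table = []
    for b in range(n_bins):
        lo = b * width
        f_lo = func(lo)
        f_hi = func(lo + width)
        table.append((f_lo, (f_hi - f_lo) / width))
    return table, width


def evaluate(table, width, r2):
    """Fixed evaluator: look up the bin for r2 and interpolate linearly."""
    b = min(int(r2 / width), len(table) - 1)
    f_lo, slope = table[b]
    return f_lo + slope * (r2 - b * width)


# Two toy functional forms; swapping tables changes the function
# computed without touching evaluate().
gaussian_table, w = build_table(lambda r2: math.exp(-r2), 4.0, 256)
softened_table, _ = build_table(lambda r2: 1.0 / math.sqrt(r2 + 1.0), 4.0, 256)
```

Calling `evaluate(gaussian_table, w, r2)` and `evaluate(softened_table, w, r2)` runs the same code path on different table contents, which is the software analogue of reloading the SRAM tables to accommodate a changed force field.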
Figure 4: PPIM detail. This figure gives a sense of the numerical
calculation units in a PPIM. The top portion of the figure shows the
match units and particle memories. The lower portion shows the
general structure of the force calculation pipelines.
[Diagram labels: tower particles; plate particles; plate particle position and parameter FIFO; tower particle position and parameter RAM; plate and tower particle match units; pair queue and select; particle distance calculations (r2); electrostatic function evaluator; combining rule calculations; van der Waals function evaluator; adder; multiplier; force (x, y, z); potential energy; tower and plate force reduction; tower forces; plate forces.]