facilities and is widely used in performance-critical environments today.
The author was involved with the creation of the NUMA facilities in Linux
and is most familiar with those.
Solaris also has somewhat comparable features,[a] but the number of systems deployed is orders of magnitude
less. Work is under way to add support
to other Unix-like operating systems,
but that support so far has been mostly confined to operating-system tuning parameters for placing memory
accesses. Microsoft Windows also has
a developed NUMA subsystem that
allows placing memory structures
effectively, but the software is used
mostly for enterprise applications
rather than high-performance computing. Requirements on memory-access speeds for enterprise-class applications are frequently more relaxed
than in high-performance computing,
meaning that less effort is spent on
NUMA memory handling in Windows
compared with Linux.
How Operating Systems Handle NUMA Memory
There are several broad categories in
which modern production operating
systems allow for the management
of NUMA: accepting the performance
mismatch, hardware memory striping, heuristic memory placement,
static NUMA configurations, and
application-controlled NUMA placement.
Ignore the difference. Since NUMA
placement is a best-effort approach,
one option is simply to ignore the possible performance benefit and just
treat all memory as if no performance
differences exist. This means the operating system is not aware of memory
nodes. The system is functional, but
performance varies depending on
how memory happens to be allocated.
The smaller the differences between
local and remote accesses, the more
viable this option becomes.
This approach allows software and the operating system to run unmodified. Frequently, this is the initial approach.
[a] For details, see http://docs.oracle.com/2239/madv.so.1-1/index.html.
Proper placement of data will increase the overall bandwidth and reduce the latency to memory.
As the trend toward improving system performance by bringing memory
even nearer to processor cores continues, NUMA will play an increasingly important role in system performance. Modern processors have
multiple memory ports, and the latency of access to memory varies depending even on the position of the core on the die relative to the controller. Future generations of processors
will have increasing differences in
performance as more cores on chip
necessitate more sophisticated caching. As the access properties of these
different kinds of memory continue
to diverge, new functionality may be
needed in operating systems to allow
for good performance.
NUMA systems today are mostly
encountered on multisocket systems.
A typical high-end business-class server today comes with two sockets and
will therefore have two NUMA nodes.
Latency for a memory access (random access) is about 100 ns. Access to
memory on a remote node adds another 50% to that number.
Applications can require complex logic to handle memory with diverging performance characteristics. If a developer requires explicit control of the placement of memory for performance reasons, some operating systems provide APIs for this (for example, Linux, Solaris, and Microsoft Windows provide system calls for NUMA). However, various heuristics have been developed in the operating systems that manage memory access to allow applications to transparently utilize the NUMA characteristics of the underlying hardware.
A NUMA system classifies memory
into NUMA nodes (what Solaris calls
locality groups). All memory available
in one node has the same access characteristics for a particular processor.
Nodes have an affinity to processors
and to devices. These are the devices
that can use memory on a NUMA node
with the best performance since they
are locally attached. Memory is called
node local if it was allocated from the
NUMA node that is best for the processor. For example, the NUMA system shown in Figure 1 has one node belonging to each socket, with four cores per socket.
The process of assigning memory
from the NUMA nodes available in
the system is called NUMA placement.
As placement influences only performance and not the correctness of the
code, heuristic approaches can yield
acceptable performance. In the special case of noncache-coherent NUMA
systems, this may not be true since
writes may not arrive in the proper
sequence in memory. However, noncache-coherent NUMA systems have
multiple challenges when attempting
to code for them. We restrict ourselves here to the common cache-coherent NUMA systems.
The focus in these discussions will be mostly on Linux since it is an operating system with refined NUMA facilities.
Figure 1. A system with two NUMA nodes and eight processors.