move only pages of a process that are referenced by that process alone (otherwise, the user could interfere with the performance optimization of processes owned by other users). Only root has the capability to move all pages of a process.
It can be difficult to ensure all pages are local to a process since some
text segments are heavily shared and
there can be only one page backing an
address of a text segment. This is particularly an issue with the C library or
other heavily shared libraries.
Linux has a migratepages command-line tool to move pages manually by specifying a pid, as well as the source and destination nodes. The memory of the process is scanned for pages currently allocated on the source node, and those are moved to the destination node.
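As a sketch of how the tool is invoked (the pid 4321 and the node numbers below are placeholder assumptions, not values from the text), migratepages takes the pid followed by the source and destination node lists:

```shell
#!/bin/sh
# Sketch of a migratepages invocation. migratepages(8) ships with the
# numactl package and takes: pid, from-nodes, to-nodes.
# The pid (4321) and node numbers here are placeholder assumptions.
PID=4321
FROM_NODES=0
TO_NODES=1

# Build the command; on a real NUMA system you would run it as root
# (or as the owner of the process, subject to the sharing restriction
# described above) and then inspect /proc/$PID/numa_maps to confirm
# that pages moved off node $FROM_NODES.
CMD="migratepages $PID $FROM_NODES $TO_NODES"
echo "$CMD"
```

Note that for a non-root caller, only pages not shared with other users' processes are eligible to move, so shared library text may stay where it is.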
NUMA scheduling. The Linux
scheduler had no notion of the page
placement of memory in a process
until Linux 3.8. Decisions about migrating processes were based on an
estimate of the cache hotness of a
process’s memory. If the Linux scheduler moved the execution of a process
to a different NUMA node, then the
performance of that process could
be significantly impacted because its
memory now would require access
via the cross-connect. Once that move was complete, the scheduler would estimate that the process's memory was cache hot on the remote node and leave the process there as long as possible. As
a result, administrators who wanted
the best performance felt it best not to
let the Linux scheduler interfere with
memory placement. Processes were
often pinned to a specific set of processors using taskset, or the system
was partitioned using the cpusets
feature to isolate applications to stay
within the NUMA node boundaries.
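A minimal sketch of that pinning approach follows; the CPU list 0-7 is an assumption about which processors belong to node 0 (on a real system, the authoritative list is in /sys/devices/system/node/node0/cpulist), and ./myapp is a placeholder for the application:

```shell
#!/bin/sh
# Pin a workload to the CPUs of a single NUMA node with taskset so the
# scheduler cannot move it across node boundaries. The CPU list below
# is an assumed topology; the real node-0 list is in
# /sys/devices/system/node/node0/cpulist.
NODE0_CPUS="0-7"

# taskset -c restricts the launched command to the given CPUs.
# "./myapp" stands in for the application to be pinned.
PIN_CMD="taskset -c $NODE0_CPUS ./myapp"
echo "$PIN_CMD"
```

The cpusets alternative mentioned above achieves a similar effect at the system level by partitioning CPUs and memory nodes into named sets and assigning process groups to them.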
In Linux 3.8 the first steps were made to address this situation by merging a framework that will enable the scheduler at some point to consider the page placement and perhaps automatically migrate pages from remote nodes to the local node. However, a significant development effort is still needed, and the existing approaches do not always enhance the performance of a given computing load. This was the state of affairs earlier this year; for more recent information, see the Linux kernel mailing list (http://vger.kernel.org) or articles from Linux Weekly News (http://lwn.net; for example, http://lwn.net/Arti-).
NUMA support has been around for
a while in various operating systems.
NUMA support in Linux has been
available since early 2000 and is being
continually refined. Frequently kernel
NUMA support will optimize process
execution without the need for user
intervention, and in most use cases an
operating system can simply be run
on a NUMA system, providing decent
performance for typical applications.
Special NUMA configuration through tools and kernel settings comes into play when the heuristics provided by the operating system do not provide satisfactory application performance to the end user. This is
typically the case in high-performance
computing, high-frequency trading,
and for real-time applications, but recently these issues have become more
significant for regular enterprise-class
applications. Traditionally, NUMA
support required special knowledge
about the application and hardware
for proper tuning using the knobs
provided by the operating systems.
Recent developments, especially around the Linux NUMA scheduler, point to operating systems eventually being able to balance a NUMA application load properly on their own over time.
The use of NUMA optimizations needs to be guided by the performance increase that is possible. The larger the difference between local and remote memory access, the greater the benefit arising from NUMA placement. NUMA latency differences matter only when memory is actually accessed: if the application does not rely on frequent memory accesses (because, for example, the processor caches absorb most of the memory operations), then NUMA optimizations will have no effect. Likewise, for I/O-bound applications the bottleneck is typically the device and not memory access.
An understanding of the characteristics of the hardware and software is
required in order to optimize applications using NUMA.
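One rough way to estimate the potential gain is the ratio between the remote and local distances reported by the firmware's SLIT table, which `numactl --hardware` prints; the distances 10 and 21 below are typical illustrative values, not measurements from the text:

```shell
#!/bin/sh
# Back-of-envelope check of how much NUMA placement can matter.
# In the SLIT distance table, local access is normalized to 10;
# a common remote value on two-socket systems is around 20-21.
# These numbers are illustrative assumptions; read the real table
# with `numactl --hardware`.
LOCAL_DISTANCE=10
REMOTE_DISTANCE=21

# The ratio bounds how much slower remote accesses are relative to
# local ones; the closer it is to 1.0, the less NUMA tuning can help.
RATIO=$(awk "BEGIN { printf \"%.1f\", $REMOTE_DISTANCE / $LOCAL_DISTANCE }")
echo "remote/local distance ratio: $RATIO"
```

If the ratio is near 1.0, or the workload rarely leaves the processor caches, effort spent on NUMA placement is unlikely to pay off.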
Christoph Lameter specializes in high-performance computing and high-frequency trading technologies. As an operating-system designer and developer, he has been developing memory management technologies for Linux to enhance performance and reduce latencies. He is fond of new technologies and new ways of thinking that disrupt existing industries and cause new development communities to emerge.
© 2013 ACM 0001-0782/13/09 $15.00