practice
DOI: 10.1145/1516046.1516060
Article development led by queue.acm.org

Network Front-End Processors, Yet Again

By Mike O'Dell

The history of NFE processors sheds light on the trade-offs involved in designing network stack software.
“This time for sure, Rocky!”
—Bullwinkle J. Moose
The history of the network front-end (NFE)
processor, best known as a TCP offload engine
(or TOE), extends back to the Arpanet Interface
Message Processor (IMP) and possibly before. The notion
is beguilingly simple: partition the work of executing
communications protocols from the work of executing
the applications that require the services of those
protocols. That way, the applications and the network
machinery can achieve maximum performance
and efficiency, possibly taking advantage of special
hardware performance assistance. While this looks
utterly compelling on the whiteboard, architectural
and implementation realities intrude,
often with considerable force.
This article will not attempt to discern whether the NFE is a heavenly gift
or a manifestation of evil incarnate.
Rather, it will trace the evolution of the NFE, starting from a pure host-based implementation of the network stack and then moving
the stack progressively farther from that starting point, observing the issues that
arise at each step. The goal is to offer insight into the
trade-offs that influence the location
choice for network stack software in a
larger systems context. As such, it is an
attempt to prevent old mistakes from
being reinvented while harvesting as
much clean grain as possible.
As a starting point, consider the canonical structure of a common workstation or server before the advent of
multicore processors. Ignoring the
provenance of the operating-system
code, this model springs directly from
the quintessential early to mid-1980s
computer science department machine: the DEC VAX 11/780, with a 10-Mbit
Ethernet interface capable of single-cycle direct memory access (DMA)
and attached to a relatively slow 16-bit
bus (the DEC Unibus).
Since there is only one processor,
the network stack vies for the attention of the CPU with everything else
running on the machine, albeit probably with the aid of a software priority
mechanism that makes the network
code “more equal than others.”
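In the BSD-derived stacks typical of that era, that priority mechanism was the interrupt-priority-level discipline: device interrupts preempt everything, protocol processing runs at a software-interrupt level just below them, and ordinary processes run below that. The following is a condensed sketch of the idiom, modeled on the 4.xBSD IP input loop; splimp/splx, IF_DEQUEUE, ipintrq, and schednetisr are the classic BSD names, but this is an excerpt-style sketch rather than a compilable module.

    /*
     * Software-interrupt half of packet input, in the classic
     * 4.xBSD style. It runs below device-interrupt priority but
     * above normal process priority, which is what makes network
     * code "more equal" than ordinary work without letting it
     * monopolize the CPU.
     */
    void
    ipintr(void)
    {
        struct mbuf *m;
        int s;

        for (;;) {
            s = splimp();            /* block device interrupts...   */
            IF_DEQUEUE(&ipintrq, m); /* ...while touching the queue  */
            splx(s);                 /* restore previous priority    */
            if (m == NULL)
                return;              /* input queue drained          */
            ip_input(m);             /* protocol processing proper   */
        }
    }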
When a packet arrives, the Ethernet
interface validates the Ethernet frame
cyclic redundancy check (CRC) and
then uses DMA to transfer the packet
into buffers used by the network code
for protocol processing. The DMA
transfers require only one local bus
cycle for each 16-bit word, and on the
VAX 11/780 the processor controller
for the Unibus buffers 16-bit words
into a single 32-bit transfer into main
memory.
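The driver's share of this work is correspondingly small. A condensed receive-interrupt sketch in the same 4.xBSD style shows the shape of it; IF_ENQUEUE, ipintrq, and schednetisr are the classic names, while the status test and the helper that wraps the DMA buffer in an mbuf are illustrative inventions, not code from a real driver.

    /*
     * Receive interrupt, reduced to the essentials. The interface
     * has already validated the Ethernet CRC and DMAed the frame
     * into a driver buffer; the handler only wraps the buffer and
     * hands it to the protocol layer via the software interrupt.
     */
    void
    exintr(int unit)
    {
        struct mbuf *m;

        if (rx_status(unit) & RX_CRC_ERROR) { /* bad frame: drop it */
            rx_recycle_buffer(unit);
            return;
        }
        m = rx_buffer_to_mbuf(unit); /* wrap DMA buffer, no copy yet */
        IF_ENQUEUE(&ipintrq, m);     /* queue for protocol code      */
        schednetisr(NETISR_IP);      /* request the soft interrupt   */
    }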
The TCP checksum is then calculated by the network code, the protocol
state machinery conducts its business,
and the TCP payload data is copied into
“socket buffers” to await consumption by the application.
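The checksum in question is the standard Internet checksum of RFC 1071: a one's-complement sum of the data taken as 16-bit words. A minimal illustrative version follows; production stacks unroll the loop and often fold the sum into the copy to the socket buffers, and the function name here is mine.

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Internet checksum (RFC 1071): one's-complement sum of 16-bit
     * words, with a trailing odd byte padded and carries folded
     * back into the low 16 bits.
     */
    uint16_t
    internet_checksum(const void *buf, size_t len)
    {
        const uint8_t *p = buf;
        uint32_t sum = 0;

        while (len > 1) {            /* sum 16-bit words             */
            sum += (uint32_t)p[0] << 8 | p[1];
            p += 2;
            len -= 2;
        }
        if (len == 1)                /* pad a trailing odd byte      */
            sum += (uint32_t)p[0] << 8;

        while (sum >> 16)            /* fold carries back in         */
            sum = (sum & 0xffff) + (sum >> 16);

        return (uint16_t)~sum;       /* one's complement of the sum  */
    }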