practice

DOi: 10.1145/1516046.1516060

Article development led by queue.acm.org

The history of NFE processors sheds light on
the trade-offs involved in designing network
stack software.

By mike O’DeLL
network
front-end
Processors,
yet again
“This time for sure, Rocky!”
—Bullwinkle J. Moose
thE hIstory oF
the network front-end (NFE)
processor, best known as a TCP offload engine

(or TOE), extends back to the Arpanet interface message processor and possibly before. The notion is beguilingly simple: partition the work of executing communications protocols from the work of executing the applications that require the services of those protocols. That way, the applications and the network machinery can achieve maximum performance and efficiency, possibly taking advantage of special hardware performance assistance. While this looks utterly compelling on the whiteboard, architectural

and implementation realities intrude, often with considerable force.

This article will not attempt to discern whether the NFE is a heavenly gift or a manifestation of evil incarnate. Rather, it will follow its evolution starting from a pure host-based implementation of a network stack and then moving the network stack farther from that initial position, observing the issues that arise. The goal is to offer insight into the trade-offs that influence the location choice for network stack software in a larger systems context. As such, it is an attempt to prevent old mistakes from being reinvented while harvesting as much clean grain as possible.

As a starting point, consider the canonical structure of a common workstation or server before the advent of multicore processors. Ignoring the provenance of the operating-system code, this model springs directly from the quintessential early to mid-1980s computer science department computer, the DEC VAX 11/780 with a 10Mb Ethernet interface with single-cycle direct memory access (DMA) ability and connected to a relatively slow 16-bit bus (the DEC Unibus).

Since there is only one processor, the network stack vies for the attention of the CPU with everything else running on the machine, albeit probably with the aid of a software priority mechanism that makes the network code “more equal than others.”

When a packet arrives, the Ethernet interface validates the Ethernet frame cyclic redundancy check (CRC) and then uses DMA to transfer the packet into buffers used by the network code for protocol processing. The DMA transfers require only one local bus cycle for each16-bit word, and on the VAX 11/780 the processor controller for the Unibus buffers 16-bit words into a single 32-bit transfer into main memory.

The TCP checksum is then calculated by the network code, the protocol state machinery conducts its business, and the TCP payload data is copied into “socket buffers” to await consumption

References:

http://queue.acm.org

Archives