DOI: 10.1145/1516046.1516060
Article development led by queue.acm.org
(or TOE), extends back to the Arpanet interface message processor and possibly before. The notion is beguilingly simple: partition the work of executing communications protocols from the work of executing the applications that require the services of those protocols. That way, the applications and the network machinery can achieve maximum performance and efficiency, possibly taking advantage of special hardware performance assistance. While this looks utterly compelling on the whiteboard, architectural and implementation realities intrude, often with considerable force.
This article will not attempt to discern whether the NFE is a heavenly gift or a manifestation of evil incarnate. Rather, it will follow its evolution starting from a pure host-based implementation of a network stack and then moving the network stack farther from that initial position, observing the issues that arise. The goal is to offer insight into the trade-offs that influence the location choice for network stack software in a larger systems context. As such, it is an attempt to prevent old mistakes from being reinvented while harvesting as much clean grain as possible.
As a starting point, consider the canonical structure of a common workstation or server before the advent of multicore processors. Ignoring the provenance of the operating-system code, this model springs directly from the quintessential early to mid-1980s computer science department computer: the DEC VAX 11/780 with a 10Mb/s Ethernet interface capable of single-cycle direct memory access (DMA), connected to a relatively slow 16-bit bus (the DEC Unibus).
Since there is only one processor, the network stack vies for the attention of the CPU with everything else running on the machine, albeit probably with the aid of a software priority mechanism that makes the network code “more equal than others.”
When a packet arrives, the Ethernet interface validates the Ethernet frame cyclic redundancy check (CRC) and then uses DMA to transfer the packet into buffers used by the network code for protocol processing. The DMA transfers require only one local bus cycle for each 16-bit word, and on the VAX 11/780 the processor controller for the Unibus buffers pairs of 16-bit words into a single 32-bit transfer into main memory.
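To make the arithmetic of that transfer path concrete, here is a minimal Python sketch of the word pairing and the resulting cycle counts. The function names, the low-word-first packing order (chosen to match the VAX's little-endian layout), and the zero padding of an odd trailing word are assumptions made for illustration; they do not describe the actual Unibus adapter logic.

```python
def pack_words(words16):
    """Pair 16-bit words into 32-bit longwords, low word first.

    Assumption: an odd trailing word is padded with a zero high half.
    """
    longwords = []
    for i in range(0, len(words16), 2):
        low = words16[i] & 0xFFFF
        high = (words16[i + 1] & 0xFFFF) if i + 1 < len(words16) else 0
        longwords.append((high << 16) | low)
    return longwords


def transfer_counts(packet_bytes):
    """Return (16-bit bus cycles, 32-bit memory transfers) for one packet."""
    bus_cycles = (packet_bytes + 1) // 2      # one bus cycle per 16-bit word
    memory_writes = (bus_cycles + 1) // 2     # two words per 32-bit transfer
    return bus_cycles, memory_writes


if __name__ == "__main__":
    print([hex(v) for v in pack_words([0x1234, 0x5678])])
    print(transfer_counts(1500))
```

Under these assumptions, a 1,500-byte packet costs 750 cycles on the 16-bit bus but only 375 transfers into main memory.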
The TCP checksum is then calculated by the network code, the protocol state machinery conducts its business, and the TCP payload data is copied into “socket buffers” to await consumption
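The checksum step can be sketched as follows. This is a minimal Python illustration of the 16-bit ones' complement Internet checksum described in RFC 1071, not the code of any particular stack; the function name is hypothetical, and a real TCP implementation also sums a pseudo-header of IP addresses, protocol number, and segment length, which is omitted here.

```python
def internet_checksum(data: bytes) -> int:
    """Ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                        # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # big-endian (network order) word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back into the low 16 bits
    return ~total & 0xFFFF


if __name__ == "__main__":
    sample = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
    csum = internet_checksum(sample)
    print(hex(csum))
    # A receiver that sums the data together with the checksum gets zero.
    print(internet_checksum(sample + csum.to_bytes(2, "big")))
```

On the single-processor host described here, this loop touches every byte of the payload on the main CPU, which is one reason checksum calculation is a frequent candidate for hardware assistance.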