cards developed efficient DMA hardware, some combined the TCP checksum generation with the copy operation, reducing the pass count to three. This clearly reduced CPU use for a given amount of TCP throughput and started the march to "protocol assist" services performed by network interfaces. ("If a little help is good, a lot of help should be better!") Adapting the network stack code to exploit this new checksum capability was not trivial, but the handwriting on the wall made it clear that such evolution was likely to continue. Significant redesign of the network code had to be done to allow functions to move between hardware and software with greater ease in the future. This was genuine architectural progress, although it did not happen overnight.
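To make concrete what was being folded into the copy loop (and later into the NIC), here is the Internet checksum of RFC 1071, the computation TCP and IP have always required. The function below is an illustrative sketch, not any particular stack's code; done in software it is one full pass over every payload byte, which is exactly why combining it with the copy was a win.

    #include <stddef.h>
    #include <stdint.h>

    /* RFC 1071 Internet checksum: sum the data as 16-bit words,
     * fold the carries back in, and return the one's complement.
     * In software this touches every byte, a whole extra pass. */
    uint16_t inet_checksum(const void *data, size_t len)
    {
        const uint16_t *p = data;
        uint32_t sum = 0;

        while (len > 1) {                 /* sum 16-bit words */
            sum += *p++;
            len -= 2;
        }
        if (len == 1)                     /* trailing odd byte */
            sum += *(const uint8_t *)p;

        while (sum >> 16)                 /* fold carries */
            sum = (sum & 0xffff) + (sum >> 16);

        return (uint16_t)~sum;
    }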
A Success Disaster
With the explosion of the Web, performance demands on network servers skyrocketed. Processors and network interfaces were getting faster, and memory bandwidth strangulation was being solved. Gigabit Ethernet quickly became commonplace on server motherboards (and gamer desktop motherboards!). By this time, the cost of all those data copies was clearly unacceptable. Simply halving the number of copies would come close to doubling the sustainable transaction rate for many Web workloads.
This gave rise to the Holy Grail of what became known as zero-copy TCP. The idea was that programs written to exploit this new capability could have data delivered right into application buffers without any intervening copies (ignoring the possible exception of one efficient DMA transfer from the hardware). Clearly this would require some cooperation (or at least reduced antagonism) from designers of Ethernet interface hardware, but a working solution would win many hearts and minds.
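One mechanism in this spirit that did reach mainstream systems, on the transmit side at least, is sendfile(2), which hands the kernel responsibility for moving file data to a socket with no detour through user-space buffers. The sketch below is Linux-flavored and illustrative only: error handling is abbreviated, and the connected socket and open file are assumed to be set up elsewhere.

    #include <sys/sendfile.h>
    #include <sys/types.h>
    #include <unistd.h>

    /* Transmit `length` bytes of an open file over a connected socket
     * without copying the data through user space; the kernel moves
     * file pages toward the interface directly. (Illustrative sketch.) */
    int send_file_zero_copy(int sock_fd, int file_fd, size_t length)
    {
        off_t offset = 0;                  /* sendfile() advances this */

        while ((size_t)offset < length) {
            ssize_t n = sendfile(sock_fd, file_fd, &offset,
                                 length - (size_t)offset);
            if (n <= 0)
                return -1;                 /* real code would check errno */
        }
        return 0;
    }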
The step from a zero-copy TCP network stack to a full-blown TCP offload engine looks pretty obvious at this point. It seems even more attractive given that many PC-based platforms were slow to exploit the multiprocessor abilities the PC was developing. (Whether it is multiple chips or multiple cores on one chip is largely irrelevant.)
The ability to add a fast processor that can be applied entirely to protocol processing is certainly an attractive idea. It is, however, much more difficult to do in real life than it first appears on the whiteboard.
Simply moving data directly off the network wire into application buffers is not sufficient. The delivery of packets must be coordinated with all the other things the application is doing and all the other operating-system machinery behind the scenes. As a result, the network protocol stack interacts with the rest of the operating system in exquisitely delicate ways. Truth be told, this coordination machinery is the lion's share of the code in most stack implementations. The actual TCP state machine fits on a half page, once divorced of all the glue and scaffolding needed to integrate it with the rest of the system environment. It is precisely this subtle and complex control coupling that makes it surprisingly difficult to fully isolate a network protocol stack from its host operating system. There are multiple reasons why this interaction is such a rich breeding ground for implementation bugs, but one vast category is "abstraction mismatch."
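To put that "half page" in perspective, the complete state vocabulary of TCP, straight from RFC 793, along with a few representative transitions, fits in the sketch below. The enum names and the tcp_next() helper are illustrative inventions, not code from any real stack; everything the sketch leaves out (buffering, timers, socket plumbing, process wakeups) is precisely the coordination machinery in question.

    /* The complete set of TCP connection states defined by RFC 793. */
    enum tcp_state {
        TCP_CLOSED, TCP_LISTEN, TCP_SYN_SENT, TCP_SYN_RECEIVED,
        TCP_ESTABLISHED, TCP_FIN_WAIT_1, TCP_FIN_WAIT_2,
        TCP_CLOSE_WAIT, TCP_CLOSING, TCP_LAST_ACK, TCP_TIME_WAIT
    };

    enum tcp_event {
        EV_PASSIVE_OPEN, EV_RCV_SYN, EV_RCV_ACK_OF_SYN, EV_RCV_FIN
    };

    /* A few representative transitions; the full RFC 793 table is
     * not much longer than this. */
    enum tcp_state tcp_next(enum tcp_state s, enum tcp_event e)
    {
        switch (s) {
        case TCP_CLOSED:       return e == EV_PASSIVE_OPEN   ? TCP_LISTEN       : s;
        case TCP_LISTEN:       return e == EV_RCV_SYN        ? TCP_SYN_RECEIVED : s;
        case TCP_SYN_RECEIVED: return e == EV_RCV_ACK_OF_SYN ? TCP_ESTABLISHED  : s;
        case TCP_ESTABLISHED:  return e == EV_RCV_FIN        ? TCP_CLOSE_WAIT   : s;
        default:               return s;  /* remaining cases elided */
        }
    }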
Because communications protocols inherently deal with multiple communicating entities, some assumptions must be made about the behavior of those entities. The degree to which those assumptions match between a host system and protocol code determines how difficult it will be to map to existing semantics and how much new structure and machinery will be required. When networking first went into Berkeley Unix, subtleties on both sides required considerable effort to reconcile. There was a strong desire to make network connections appear to be natural extensions of existing Unix machinery: file descriptors, pipes, and the other ideas that make Unix conceptually compact. But because of radical differences in behavior, especially delay, it is impossible to make reading 1,000 bytes from a round-the-world network connection indistinguishable from reading those same 1,000 bytes from a file on a local file system. Networks have new behaviors that require new interfaces to capture and manage, but those new interfaces must make sense with exist-