ing interfaces. This was difficult work,
and the modifications left few pieces of
the system untouched; a few changed
in profound ways.
The fundamental capabilities provided by a network protocol stack are
data transfer, multiplexing, flow control, and error management. All of
these functions are required for the
coordinated delivery of data between
endpoints across the Internet. Indeed, that is the purpose of all the structure in the packet headers: to carry the control and coordination information as well as the payload data.
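To make that concrete, consider the layout of a TCP header: nearly every field exists to coordinate the two endpoints rather than to carry user data. The sketch below is illustrative only; the field names are informal, and real code must deal with byte order, bit packing, and options.

```c
#include <stdint.h>

/* Illustrative sketch of a TCP header (after RFC 793), not production code.
 * Apart from the payload that follows it, every field is there for
 * coordination: multiplexing, ordered delivery, flow control, and
 * error detection. */
struct tcp_header {
    uint16_t src_port;     /* multiplexing: sending endpoint */
    uint16_t dst_port;     /* multiplexing: receiving endpoint */
    uint32_t seq_num;      /* ordering and loss detection */
    uint32_t ack_num;      /* cumulative acknowledgment */
    uint8_t  data_offset;  /* header length, in the upper 4 bits */
    uint8_t  flags;        /* SYN, ACK, FIN, RST, ...: connection control */
    uint16_t window;       /* flow control: receiver's available buffer */
    uint16_t checksum;     /* error management */
    uint16_t urgent_ptr;   /* rarely used out-of-band signal */
    /* options, then payload, follow */
};
```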
The critical observation is that the
exact same operations are required
to coordinate the interaction of a network protocol stack and the host operating system within a single system.
When all the code is in the same place
(that is, running on the same processor), this signaling is easily done with
simple procedure calls. If, however,
the network protocol stack executes
on a remote processor such as a TOE,
this signaling must be done with an explicit protocol carried across whatever
connects the front-end processor to
the host operating system. This protocol is called a host-front-end protocol (HFEP).
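To give a flavor of what such a protocol must carry, here is a purely hypothetical HFEP message layout; the operation codes and field names are invented for illustration and are not drawn from any real product.

```c
#include <stdint.h>

/* Hypothetical HFEP message header (illustration only). Note that the
 * message types mirror the capabilities of the protocol being offloaded:
 * connection management, data transfer, flow control, and error
 * reporting all reappear between host and front end. */
enum hfep_op {
    HFEP_CONN_OPEN,      /* ask the front end to open a connection */
    HFEP_CONN_OPENED,    /* front end reports success or failure */
    HFEP_CONN_CLOSE,
    HFEP_DATA_TX,        /* host-to-front-end payload */
    HFEP_DATA_RX,        /* front-end-to-host payload */
    HFEP_CREDIT_UPDATE,  /* flow control across the interconnect */
    HFEP_ERROR           /* error management, again */
};

struct hfep_msg {
    uint16_t op;         /* one of enum hfep_op */
    uint16_t flags;
    uint32_t conn_id;    /* multiplexing: which offloaded connection */
    uint32_t length;     /* bytes of payload following this header */
    uint32_t credit;     /* buffer space being granted, if any */
    /* payload follows */
};
```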
Designing an HFEP is not trivial,
especially if the goal is that it be materially simpler than the protocol being
offloaded to the remote processor. Historically, the HFEP has been the Achilles’ heel of NFE processors. The HFEP
ends up being asymptotically as complex as the “primary” protocol being
offloaded, so there is very little to gain
in offloading it. In addition, the HFEP
must be implemented twice: once in
the host and once in the front-end processor, each one of those being a different host platform as far as the HFEP
is concerned. Two implementations,
two integrations with host operating
systems—this means twice as many
sources of subtle race conditions,
deadlocks, buffer starvations, and other nasty bugs. This cost requires a huge
payoff to cover it.
But Wait a Minute…
About now some readers may be eager
to throw a penalty flag for “
unconvincing hand waving” because even in the
base case, there is a protocol between
the Ethernet interface and the host
computer device driver. “Doesn’t that
count?” you rightfully ask. Yes, indeed,
it does.
There is a long history of peripheral
chips being designed with absolutely
dreadful interfaces. Such chips have
been known to make device-driver writers contemplate slow, painful violence
if they ever meet the chip designer in a
dark alley. The very early Ethernet chips
from one famous semiconductor company were absolute masterpieces of
egregious overdesign. Not only did they
contain many complex functions of dubious utility, but also the functions that
were genuinely required suffered from
the same virulent infestation of bugs
that plagued the useless bits. Tom Lyon
wrote a famous Usenix paper in 1985,
“All the Chips that Fit,” delivering an
epic rant on this expansive topic. (It
should be required reading for anyone
contemplating hardware design.)
If the goal is efficiency and performance of network code, all of the
“mini-protocols” in the entire network
protocol subsystem must be examined
carefully. Both internal complexity and
integration complexity can be serious
bottlenecks. Ultimately, the question is this: how hard is it to glue this piece onto the
other pieces it must interact with frequently? If it is very difficult, it is likely
not fast (in an absolute sense), nor is it
likely robust from a bug standpoint.
Remember that the protocol state machines are generally not the principal
source of complexity or performance
issues. One extra data copy can make
a huge difference in the maximum
achievable performance. Therefore,
implementations must focus on avoiding data motion: put the data where it goes
the first time it is touched, then leave
it alone. If some other operation on
packet payload is required, such as
checksum computation, bury it inside
an unavoidable operation such as the
single transfer into memory. In line
with those suggestions, streamline the
operating-system interface to maximize concurrency. Once all those issues have been addressed aggressively,
there’s not a lot of work left to avoid.
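As an illustration of burying the checksum in the unavoidable transfer, here is a minimal sketch of the classic copy-and-checksum idea: the Internet one's-complement sum is folded into the loop that moves the payload. Real implementations must also handle alignment, odd lengths, and wider accumulation, all omitted here.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes from src to dst and compute the Internet checksum of the
 * data in the same pass, so the payload is touched exactly once.
 * Simplifying assumptions: len is even and both pointers are aligned. */
static uint16_t copy_and_checksum(void *dst, const void *src, size_t len)
{
    const uint16_t *s = src;
    uint16_t *d = dst;
    uint32_t sum = 0;

    for (size_t i = 0; i < len / 2; i++) {
        d[i] = s[i];   /* the unavoidable data motion ... */
        sum += s[i];   /* ... with the checksum buried inside it */
    }

    /* fold the carries back into 16 bits and complement */
    while (sum >> 16)
        sum = (sum & 0xffff) + (sum >> 16);
    return (uint16_t)~sum;
}
```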
What Does All This Mean for NFEs?
Many times, but not every time, an NFE
is likely to be an overly complex solution to the wrong part of the problem.
It is possibly an expedient short-term
measure (and there’s certainly a place
in the world for those), but as a long-term architectural approach, the commoditization of processor cores makes
specialized hardware very difficult to
justify.
Lacking NFEs, what is required for
maximizing host-based network performance? Here are some guidelines:
• Wire interfaces should be designed
to be fast and brilliantly simple. Do the
bit-speed work and then get the data
into memory as quickly as possible, doing any additional work such as checksums that can readily be buried in the
unavoidable transfer. Streamline the
device as seen by the driver so as to
avoid playing “Twenty Questions” with
the hardware to determine what just
happened.
• Interconnects should have sufficient capacity to carry the network
traffic without strangling other I/O operations. From the standpoint of a network interface, PCI Express appears
to have adequate performance for
10Gbps Ethernet, as does HyperTransport 3.0.
• The system must have sufficient
memory bandwidth to get the network
payload in and out without strangling
the rest of the system, especially the
processors. Historically, the PC platform has been chronically starved for
memory bandwidth.
• Processors should have enough cores to exploit that memory bandwidth.
• Network protocol stacks should be
designed to maximize parallelism and
minimize blocking, while never copying data.
• A set of network APIs should be
designed to maximize performance
as opposed to mandatory similarity
with existing system calls. Backward
compatibility is important to support,
but some applications may wish to pay
more to get more.
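As a purely hypothetical sketch of that last point, an interface built for throughput rather than for resemblance to read(2) might register application buffers up front and harvest completions in batches; every name and type below is invented for illustration.

```c
#include <stdint.h>

/* Hypothetical zero-copy receive interface (illustration only). The
 * application registers buffers once; the stack delivers payload directly
 * into them and reports completions in batches. */
struct net_buf {
    void    *base;    /* application-owned, pre-registered memory */
    uint32_t len;
    uint32_t id;      /* echoed back in completions */
};

struct net_completion {
    uint32_t buf_id;  /* which registered buffer was filled */
    uint32_t bytes;   /* how much payload landed in it */
    int32_t  status;  /* 0 on success, negative error otherwise */
};

/* Register receive buffers with an endpoint handle. */
int net_register_bufs(int handle, struct net_buf *bufs, unsigned count);

/* Harvest up to max completions without blocking; returns the number
 * harvested, so one call can cover many packets' worth of data. */
int net_poll_completions(int handle, struct net_completion *out, unsigned max);
```

The point of this shape is that data placement and event notification are decoupled: the stack can put the payload where it goes the first time it is touched, and the application amortizes each kernel crossing over many packets.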
Historical Perspective
NFEs have been rediscovered in at
least four or five different periods. In
the spirit of full and fair disclosure, I
must admit to having directly contributed to two of those efforts and having
purchased and integrated yet another.
So why does this idea keep recurring if
it turns out to be much more difficult
than it first appears?