ing interfaces. This was difficult work, and the modifications left few pieces of the system untouched; a few changed in profound ways.

The fundamental capabilities provided by a network protocol stack are data transfer, multiplexing, flow control, and error management. All of these functions are required for the coordinated delivery of data between endpoints across the Internet. Indeed, the purpose of all the structure in the packet headers: to carry the control coordination information, as well as the payload data.

The critical observation is that the exact same operations are required to coordinate the interaction of a network protocol stack and the host operating system within a single system. When all the code is in the same place (that is, running on the same processor), this signaling is easily done with simple procedure calls. If, however, the network protocol stack executes on a remote processor such as a TOE, this signaling must be done with an explicit protocol carried across whatever connects the front-end processor to the host operating system. This protocol is called a host-front end protocol (HFEP).

Designing an HFEP is not trivial, especially if the goal is that it be materially simpler than the protocol being offloaded to the remote processor. Historically, the HFEP has been the Achilles’ heel of NFE processors. The HFEP ends up being asymptotically as complex as the “primary” protocol being offloaded, so there is very little to gain in offloading it. In addition, the HFEP must be implemented twice: once in the host and once in the front-end processor, each one of those being a different host platform as far as the HFEP is concerned. Two implementations, two integrations with host operating systems—this means twice as many sources of subtle race conditions, deadlocks, buffer starvations, and other nasty bugs. This cost requires a huge payoff to cover it.

But Wait a minute…

About now some readers may be eager to throw a penalty flag for “ unconvincing hand waving” because even in the base case, there is a protocol between the Ethernet interface and the host

computer device driver. “Doesn’t that count?” you rightfully ask. Yes, indeed, it does.

There is a long history of peripheral chips being designed with absolutely dreadful interfaces. Such chips have been known to make device-driver writers contemplate slow, painful violence if they ever meet the chip designer in a dark alley. The very early Ethernet chips from one famous semiconductor company were absolute masterpieces of egregious overdesign. Not only did they contain many complex functions of dubious utility, but also the functions that were genuinely required suffered from the same virulent infestation of bugs that plagued the useless bits. Tom Lyon wrote a famous Usenix paper in 1985, “All the Chips that Fit,” delivering an epic rant on this expansive topic. (It should be required reading for anyone contemplating hardware design.)

If the goal is efficiency and performance of network code, all of the “mini-protocols” in the entire network protocol subsystem must be examined carefully. Both internal complexity and integration complexity can be serious bottlenecks. Ultimately, the question is how hard is it to glue this piece onto the other pieces it must interact with frequently? If it is very difficult, it is likely not fast (in an absolute sense), nor is it likely robust from a bug standpoint.

Remember the protocol state machines are generally not the principal source of complexity or performance issues. One extra data copy can make a huge difference in the maximum achievable performance. Therefore, implementations must focus on avoiding data motion: put it where it goes the first time it is touched, then leave it alone. If some other operation on packet payload is required, such as checksum computation, bury it inside an unavoidable operation such as the single transfer into memory. In line with those suggestions, streamline the operating-system interface to maximize concurrency. Once all those issues have been addressed aggressively, there’s not a lot of work left to avoid.

 

What Does all this mean for nfes? Many times, but not every time, an NFE is likely to be an overly complex solution to the wrong part of the problem. It is possibly an expedient short-term

measure (and there’s certainly a place in the world for those), but as a long-term architectural approach, the com-moditization of processor cores makes specialized hardware very difficult to justify.

Lacking NFEs, what is required for maximizing host-based network performance? Here are some guidelines:

Wire interfaces should be designed to be fast and brilliantly simple. Do the bit-speed work and then get the data into memory as quickly as possible, doing any additional work such as checksums that can readily be buried in the unavoidable transfer. Streamline the device as seen by the driver so as to avoid playing “Twenty Questions” with the hardware to determine what just happened.

Interconnects should have sufficient capacity to carry the network traffic without strangling other I/O operations. From the standpoint of a network interface, PCI Express appears to have adequate performance for 10Gbps Ethernet as does Hyper Transport 3.0.

The system must have sufficient memory bandwidth to get the network payload in and out without strangling the rest of the system, especially the processors. Historically, the PC platform has been chronically starved for memory bandwidth.

Processors should have enough cores able to exploit the sufficient memory bandwidth.

Network protocol stacks should be designed to maximize parallelism and minimize blocking, while never copying data.

A set of network APIs should be designed to maximize performance as opposed to mandatory similarity with existing system calls. Backward compatibility is important to support, but some applications may wish to pay more to get more.

 

historical Perspective NFEs have been rediscovered in at least four or five different periods. In the spirit of full and fair disclosure, I must admit to having directly contributed to two of those efforts and having purchased and integrated yet another. So why does this idea keep recurring if it turns out to be much more difficult than it first appears?

References:

Archives