the networking community, resulting in many algorithms
for flow control and congestion control. TCP/IP, for
example, uses acknowledgments from the receiver to pace
the sender, opening and closing a window of unacknowledged packets that is a measure of the bandwidth-delay
product. If a packet loss occurs, TCP/IP assumes it is congestion and closes the window. Otherwise, it continues
trying to open the window to discover new bandwidth as
it becomes available.
Figure 3 shows how TCP/IP attempts to discover the
correct window size for a path through the network. The
line indicates what is available, and significantly, this
changes with time, as competing connections come and
go, and capacities change with route changes. When new
capacity becomes available, the protocol tries to discover
it by pushing more packets into the network until losses
indicate that too much capacity is used; in that case
the protocol quickly reduces the window size to protect
the network from overuse. Over time, the “sawtooth”
reflected in this figure results as the algorithm attempts to
learn the network capacity.
A major “physics” challenge for TCP/IP is that it is
learning on a round-trip timescale and is thus affected by
distance. Some new approaches based on periodic router
estimates of available capacity are not subject to round-trip time variation and may be better in achieving high
throughputs with high bandwidth-delay paths.
IMPLICATIONS FOR DIS TRIBUTED S YS TEMS
Many modern distributed systems are built as if all
network locations are roughly equivalent. As we have
seen, even if there is connectivity, delay can affect
FIGURE
3
TCP/IP Attempts to Discover the
Available Network Capacity
window
bottleneck
bandwidth
time
some applications and protocols more than others. In a
request/response type of IPC, such as a remote procedure
call, remote copies of data can greatly delay application
execution, since the procedure call is blocked waiting on
the response. Early Web applications were slow because
the original HTTP opened a new TCP/IP connection for
each fetched object, meaning that the new connection’s
estimate of the bandwidth-delay was almost always an
underestimate. Newer HTTPs exhibit persistent learning
of bandwidth-delay estimates and perform much better.
The implication for distributed systems is that one
size does not fit all. For example, use of a centralized data
store will create large numbers of hosts that cannot possibly perform well if they are distant from the data store. In
some cases, where replicas of data or services are viable,
data can be cached and made local to applications. This,
for example, is the logical role of a Web-caching system.
In other cases, however, such as stock exchanges, the data
is live and latency characteristics in such circumstances
have significant financial implications, so caching is not
effective for applications such as computerized trading.
While in principle, distributed systems might be built
that take this latency into account, in practice, it has
proven easier to move the processing close to the market.
RULES OF THUMB TO HOLD YOUR O WN WI TH PH YSICS
Here are a few suggestions that may help software developers adapt to the laws of physics.
Bandwidth helps latency, but not propagation delay.
If a distributed application can move fewer, larger messages, then this can help the application as the total
cost in delay is reduced since fewer round-trip delays are
introduced. The effects of bandwidth are quickly lost for
large distances and small data objects. Noise can also be
a big issue for increasingly more common wireless links,
where shorter packets suffer a lower per-packet risk of bit
errors. The lesson for the application software designer
is to think carefully about a design’s assumptions about
latency. Assume large latencies, make it work under those
circumstances, and take advantage of lower latencies
when they are available. For example, use a Web-embed-ded caching scheme to ensure the application is responsive when latencies are long, but no cache when it’s not
necessary.
Spend available resources (such as throughput and
storage capacity) to save precious ones, such as response
time. This may be the most important of these rules. An
example is the use of caches, including preemptive caching of data. In principle, caches can be replicated local to