in a DMA burst (128B). When transmitting, the large RCB
memory enables Sora software to first write the generated
samples onto the RCB, and then trigger transmission with
another command to the RCB. This functionality provides
flexibility to the Sora software for precalculating and storing several waveforms before actually transmitting them,
while allowing precise control of the timing of the waveform
transmission.
While implementing Sora, we encountered a consistency
issue in the interaction between DMA operations and the
CPU cache system. When a DMA operation modifies a memory location that has been cached in the L2 cache, it does not
invalidate the corresponding cache entry. When the CPU
reads that location, it can therefore read an incorrect value
from the cache.
We solve this problem with a smart-fetch strategy, which lets
Sora maintain cache coherency with DMA memory without the drastic throughput loss that disabling cached
accesses would incur. Sora organizes DMA memory into small slots,
whose size is a multiple of a cache line. Each slot begins with
a descriptor that contains a flag. The RCB sets the flag after it
writes a full slot of data, and clears it after the CPU processes
all data in the slot. When the CPU moves to a new slot, it first
reads its descriptor, causing a whole cache line to be filled.
If the flag is set, the data just fetched is valid and the CPU
can continue processing the data. Otherwise, the RCB has
not yet updated this slot with new data, so the CPU explicitly
flushes the cache line and reads the same location again.
This next read refills the cache line, loading the most recent
data from memory.
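The polling loop described above can be sketched in C as follows. This is a minimal illustration, not the Sora driver code: the names `slot_t`, `smart_fetch`, and `FLUSH_LINE`, the 64-byte line size, and the one-line slot layout are all assumptions made for the example. On x86 with SSE2 the flush uses the `_mm_clflush` intrinsic; on other targets the sketch falls back to a compiler barrier only.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#define FLUSH_LINE(p) _mm_clflush((const void *)(p))
#else
/* Assumption: no flush intrinsic on this target, so only stop compiler reordering. */
#define FLUSH_LINE(p) __asm__ __volatile__("" ::: "memory")
#endif

#define CACHE_LINE 64 /* assumed cache line size in bytes */

/* Hypothetical slot layout: descriptor flag and data share one cache line. */
typedef struct {
    _Alignas(CACHE_LINE) volatile uint32_t flag; /* nonzero = slot holds fresh data */
    uint8_t data[CACHE_LINE - sizeof(uint32_t)];
} slot_t;

/* Poll one slot. Returns its data once the flag is set, or NULL after max_tries. */
static const uint8_t *smart_fetch(slot_t *slot, int max_tries)
{
    for (int i = 0; i < max_tries; i++) {
        if (slot->flag)        /* reading the descriptor fills the whole line */
            return slot->data; /* flag set: the line just fetched is valid */
        FLUSH_LINE(slot);      /* stale line: evict it so the next read hits memory */
    }
    return NULL;
}
```

A caller would invoke `smart_fetch` each time it advances to a new slot and process the returned data only when the result is non-NULL.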
Table 1 summarizes the RCB throughput results, which
agree with the hardware specifications. To precisely measure PCIe latency, we instruct the RCB to read a memory
address in host memory, and measure the time interval
between issuing the request and receiving the response in
hardware. Since each read involves a round trip operation,
we use half of the measured time to estimate the one-way
delay. This one-way delay is 360 ns with a worst-case variation of 4 ns.
5.2. Software
The Sora software is written in C, with some assembly for
performance-critical processing. The entire Sora software
is implemented on Windows XP as a network device driver
and it exposes a virtual Ethernet interface to the upper TCP/IP
stack. Since any software radio implemented on Sora can
appear as a normal network device, all existing network
applications can run unmodified on it.
Table 1. DMA throughput performance of the RCB.
Mode       Rx (Gbps)   Tx (Gbps)
PCIe-x4    6.71        6.55
PCIe-x8    12.8        12.3
PHY Processing Library: In the Sora PHY processing library,
we extensively exploit the use of look-up tables (LUTs) and
SIMD instructions to optimize the performance of PHY
algorithms. We have been able to rewrite more than half of
the PHY algorithms with LUTs. Some LUTs are straightforward precalculations, others require more sophisticated
implementations to keep the LUT size small. For the soft-demapper example mentioned earlier, we can greatly reduce
the LUT size (e.g., 1.5KB for the 802.11a/g 54Mbps modulation) by exploiting the symmetry of the algorithm. In our
Soft WiFi implementation described below, the overall size
of the LUTs is around 200KB for 802.11a/g and 310KB for
802.11b, both of which fit comfortably within the L2 caches
of commodity CPUs.
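To illustrate how symmetry can shrink a LUT, the sketch below tabulates a toy even-symmetric soft-bit function over 8-bit samples. It is not the Sora soft demapper: the function `soft_from_magnitude`, the constant `THRESHOLD`, and the clamp range are hypothetical stand-ins. Because the value depends only on the magnitude of the input, 129 entries suffice instead of 256.

```c
#include <stdint.h>
#include <stdlib.h>

#define HALF_ENTRIES 129 /* magnitudes 0..128 of an 8-bit signed sample */
#define THRESHOLD 64     /* hypothetical decision boundary */

static int8_t half_lut[HALF_ENTRIES];

/* Toy soft value as a function of |x|, clamped to [-7, 7]. */
static int8_t soft_from_magnitude(int mag)
{
    int v = (THRESHOLD - mag) / 8;
    if (v > 7) v = 7;
    if (v < -7) v = -7;
    return (int8_t)v;
}

/* Reference path: compute the value directly for every sample. */
static int8_t soft_bit_direct(int8_t x)
{
    return soft_from_magnitude(abs((int)x));
}

/* Precompute only the non-negative half of the input range. */
static void build_half_lut(void)
{
    for (int mag = 0; mag < HALF_ENTRIES; mag++)
        half_lut[mag] = soft_from_magnitude(mag);
}

/* LUT path: fold the sign away, then do a single table read. */
static int8_t soft_bit_lut(int8_t x)
{
    return half_lut[abs((int)x)];
}
```

After calling `build_half_lut` once, `soft_bit_lut` returns the same value as `soft_bit_direct` for every 8-bit input while touching roughly half the memory a full table would need.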
We also heavily use SIMD instructions in coding Sora
software. We currently use the SSE2 instruction set designed
for Intel CPUs. Since the SSE registers are 128 bits wide while
most PHY algorithms require only 8-bit or 16-bit fixed-point
operations, one SSE instruction can perform 8 or 16 simultaneous calculations. SSE also has rich instruction support for
flexible data permutations, and most PHY algorithms, e.g.,
FFT, FIR Filter and Viterbi, can fit naturally into this SIMD
model. For example, the Sora Viterbi decoder uses only 40
cycles to compute the branch metric and select the shortest
path for each input. As a result, our Viterbi implementation
can handle 802.11a/g at the 54Mbps modulation with only
one 2.66 GHz CPU core, whereas previous implementations
relied on hardware implementations. Note that other GPP
architectures, like AMD and PowerPC, have very similar
SIMD models and instruction sets, and we expect that our
optimization techniques will directly apply to these other
GPP architectures as well.
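As a small illustration of the SIMD model, the sketch below performs one add-compare-select step, the core operation of a Viterbi decoder, on 16 states at once using 8-bit metrics. It is not the Sora decoder: the function name `acs16` and the data layout are assumptions for this example. The SSE2 path uses the `_mm_adds_epu8` and `_mm_min_epu8` intrinsics, and a scalar fallback with the same behavior is included for targets without SSE2.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * For each of 16 states, add a branch metric to each of two candidate
 * path metrics (saturating at 255) and keep the smaller sum.
 */
static void acs16(const uint8_t pm0[16], const uint8_t bm0[16],
                  const uint8_t pm1[16], const uint8_t bm1[16],
                  uint8_t out[16])
{
#if defined(__SSE2__)
    /* Sixteen saturating 8-bit additions per instruction. */
    __m128i a = _mm_adds_epu8(_mm_loadu_si128((const __m128i *)pm0),
                              _mm_loadu_si128((const __m128i *)bm0));
    __m128i b = _mm_adds_epu8(_mm_loadu_si128((const __m128i *)pm1),
                              _mm_loadu_si128((const __m128i *)bm1));
    /* Sixteen unsigned minimum selections per instruction. */
    _mm_storeu_si128((__m128i *)out, _mm_min_epu8(a, b));
#else
    /* Scalar fallback with identical results. */
    for (int i = 0; i < 16; i++) {
        unsigned a = (unsigned)pm0[i] + bm0[i];
        unsigned b = (unsigned)pm1[i] + bm1[i];
        if (a > 255) a = 255;
        if (b > 255) b = 255;
        out[i] = (uint8_t)(a < b ? a : b);
    }
#endif
}
```

The SSE2 path replaces 16 scalar loop iterations with five 128-bit operations plus one store, which is the kind of saving the text attributes to SIMD.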
Table 2 summarizes some key PHY processing algorithms we have implemented in Sora, together with the
optimization techniques we have applied. The table also
compares the performance of a conventional software
implementation (e.g., a direct translation from a hardware
implementation) and the Sora implementation with the
LUT and SIMD optimizations.
Lightweight, Synchronized FIFOs: Sora allows different
PHY processing blocks to streamline across multiple cores,
and we have implemented a lightweight, synchronized FIFO
to connect these blocks with low contention overhead. The
idea is to augment each data slot in the FIFO with a header
that indicates whether the slot is empty or not. We pad each
data slot to be a multiple of a cache line. Thus, the consumer is always chasing the producer in the circular buffer
for filled slots. If the speed of the producer and consumer
is the same and the two pointers are separated by a particular offset (e.g., two cache lines in the Intel architecture),
no cache miss will occur during synchronized streaming
since the local cache will prefetch the following slots before
the actual access. If the producer and the consumer have
different processing speeds, e.g., the consumer is faster than
the producer, then eventually the consumer will wait for the
producer to release a slot. In this case, each time the producer writes to a slot, the write will cause a cache miss at
the consumer. But the producer will not suffer a miss since
the next free slot will be prefetched into its local cache.
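The slot-header idea can be sketched as a single-producer, single-consumer ring in C11. This is a minimal illustration under stated assumptions, not the Sora FIFO: the names `fifo_t`, `fifo_put`, and `fifo_get`, the 64-byte line size, and the eight-slot depth are hypothetical. Each slot carries its own full/empty flag, so the two sides never share a counter.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CACHE_LINE 64 /* assumed cache line size in bytes */
#define NUM_SLOTS 8   /* hypothetical ring depth */
#define PAYLOAD (CACHE_LINE - sizeof(atomic_int))

/* Each slot is padded to exactly one cache line. */
typedef struct {
    _Alignas(CACHE_LINE) atomic_int full; /* header: 0 = empty, 1 = filled */
    uint8_t payload[PAYLOAD];
} fifo_slot_t;

typedef struct {
    fifo_slot_t slots[NUM_SLOTS];
    _Alignas(CACHE_LINE) unsigned head; /* touched only by the producer */
    _Alignas(CACHE_LINE) unsigned tail; /* touched only by the consumer */
} fifo_t;

static void fifo_init(fifo_t *f)
{
    for (int i = 0; i < NUM_SLOTS; i++)
        atomic_init(&f->slots[i].full, 0);
    f->head = 0;
    f->tail = 0;
}

/* Producer: returns false if the next slot has not been released yet. */
static bool fifo_put(fifo_t *f, const uint8_t *src)
{
    fifo_slot_t *s = &f->slots[f->head];
    if (atomic_load_explicit(&s->full, memory_order_acquire))
        return false;
    memcpy(s->payload, src, PAYLOAD);
    atomic_store_explicit(&s->full, 1, memory_order_release); /* publish */
    f->head = (f->head + 1) % NUM_SLOTS;
    return true;
}

/* Consumer: returns false if the next slot has not been filled yet. */
static bool fifo_get(fifo_t *f, uint8_t *dst)
{
    fifo_slot_t *s = &f->slots[f->tail];
    if (!atomic_load_explicit(&s->full, memory_order_acquire))
        return false;
    memcpy(dst, s->payload, PAYLOAD);
    atomic_store_explicit(&s->full, 0, memory_order_release); /* release slot */
    f->tail = (f->tail + 1) % NUM_SLOTS;
    return true;
}
```

Keeping `head` and `tail` private to one side each, on separate cache lines, means the only shared state is the per-slot header, which is what limits contention to the slot currently being handed over.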
Fortunately, such cache misses experienced by the consumer will not cause significant impact on the overall performance of the streamline processing since the consumer