As we detail later in Section 5.2.1, more than half of the common PHY algorithms can indeed be rewritten with LUTs, each with a speedup from 1.5× to 50×. Since the size of each LUT is sufficiently small, the sum of all LUTs in a processing path can easily fit in the L2 caches of contemporary GPP cores. With core dedication (Section 4.3), the possibility of cache collisions is very small. As a result, these LUTs are almost always in caches during PHY processing.
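As a concrete sketch of the pattern (with a hypothetical per-byte operation standing in for a real PHY computation; none of these names are Sora's), a 256-entry table built once at initialization turns the hot loop into cache-resident loads:

#include <stdint.h>
#include <stddef.h>

/* Hypothetical per-byte PHY operation used only for illustration. */
static uint8_t phy_op(uint8_t x)
{
    uint8_t r = (uint8_t)((x << 3) | (x >> 5));   /* rotate left by 3 */
    return r ^ 0x5A;                              /* fixed bit mapping */
}

/* 256-entry LUT: small enough to stay resident in the L2 cache. */
static uint8_t lut[256];

static void lut_init(void)
{
    for (int i = 0; i < 256; i++)
        lut[i] = phy_op((uint8_t)i);
}

/* Hot path: one table load per byte instead of recomputation. */
static void process_block(uint8_t *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = lut[buf[i]];
}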
To accelerate PHY processing with data-level parallelism, Sora heavily uses the SIMD extensions in modern GPPs,
such as SSE, 3DNow! and AltiVec. Although these extensions
were designed for multimedia and graphics applications,
they also match the needs of wireless signal processing very
well because many PHY algorithms have fixed computation
structures that can easily map to large vector operations.
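For instance, a fixed-structure operation over blocks of 16-bit samples maps directly onto SSE2 intrinsics, eight samples per instruction. The sketch below (the function name and 16-byte alignment assumptions are ours, not Sora's code) adds two sample streams with saturation:

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

/* Saturating addition of two 16-bit sample streams; buffers are assumed
 * 16-byte aligned and n a multiple of 8 (illustrative assumptions). */
static void add_samples_sse2(const int16_t *a, const int16_t *b,
                             int16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        __m128i va = _mm_load_si128((const __m128i *)(a + i));
        __m128i vb = _mm_load_si128((const __m128i *)(b + i));
        __m128i vs = _mm_adds_epi16(va, vb);      /* 8 saturating adds */
        _mm_store_si128((__m128i *)(out + i), vs);
    }
}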
4.2. Multi-core streamline processing
Even with the above optimizations, a single CPU core may
not have sufficient capacity to meet the processing requirements of high-speed wireless communication technologies.
As a result, Sora must be able to use more than one core in
a multi-core CPU for PHY processing. This multi-core technique should also be scalable because the signal processing
algorithms may become increasingly complex as wireless technologies progress.
As discussed in Section 2, PHY processing typically contains several functional blocks in a pipeline. These blocks
differ in processing speed and in input/output data rates
and units. A block is only ready to execute when it has sufficient input data from the previous block. Therefore, a key
issue is how to schedule a functional block on multiple cores
when it is ready.
Sora chooses a static scheduling scheme. This decision
is based on the observation that the schedule of each block
in a PHY processing pipeline is actually static: the processing pattern of previous blocks can determine whether a subsequent block is ready or not. Sora can thus partition the
whole PHY processing pipeline into several sub-pipelines
and statically assign them to different cores. Within one
sub-pipeline, when a block has accumulated enough data
for the next block to be ready, it explicitly schedules the next
block. Adjacent sub-pipelines are still connected with a synchronized FIFO (SFIFO), but the number of SFIFOs and their
overhead are greatly reduced.
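One way such a synchronized FIFO between two dedicated cores could look is a single-producer/single-consumer ring buffer; the C11 sketch below uses assumed names and sizes and is not Sora's actual SFIFO implementation:

#include <stdatomic.h>
#include <stdint.h>

#define SFIFO_SLOTS 256u             /* power of two; illustrative size */

/* One producer (upstream sub-pipeline) and one consumer (downstream
 * sub-pipeline), each pinned to its own core. */
typedef struct {
    _Atomic uint32_t head;           /* advanced by the consumer */
    _Atomic uint32_t tail;           /* advanced by the producer */
    void *slots[SFIFO_SLOTS];
} sfifo_t;

static int sfifo_push(sfifo_t *q, void *block)
{
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_relaxed);
    uint32_t h = atomic_load_explicit(&q->head, memory_order_acquire);
    if (t - h == SFIFO_SLOTS)
        return 0;                    /* full: producer spins or yields */
    q->slots[t % SFIFO_SLOTS] = block;
    atomic_store_explicit(&q->tail, t + 1, memory_order_release);
    return 1;
}

static void *sfifo_pop(sfifo_t *q)
{
    uint32_t h = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t t = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (h == t)
        return NULL;                 /* empty: consumer waits */
    void *block = q->slots[h % SFIFO_SLOTS];
    atomic_store_explicit(&q->head, h + 1, memory_order_release);
    return block;
}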
4.3. Real-time support
SDR processing is a time-critical task that requires strict guarantees of computational resources and hard real-time deadlines. As an alternative to relying upon the full generality of real-time operating systems, we can achieve real-time guarantees by simply dedicating cores to SDR processing in a multi-core system. Sufficient computational resources can thus be guaranteed without being affected by other concurrent tasks in the system.
This approach is particularly plausible for SDR. First, wireless communication often requires its PHY to constantly monitor the channel for incoming signals, so the PHY processing may need to be active all the time. It is much better to always schedule this task on the same core to minimize overhead such as cache misses or TLB flushes. Second, previous work on multi-core OSes also suggests that isolating applications onto different cores may yield better performance than symmetric scheduling, since the effective use of cache resources and the reduction in locks can outweigh the cost of dedicating cores.9 Moreover, a core dedication mechanism is much easier to implement than a real-time scheduler, sometimes even without modifying an OS kernel. For example, we can simply raise the priority of a kernel thread so that it is pinned on a core and runs exclusively until termination (Section 5.2.3).
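For illustration only, the following user-mode sketch conveys the general idea with Win32 calls: it pins the calling thread to a single core and raises its priority so that other work is scheduled elsewhere. Sora's actual mechanism applies this to a kernel thread, so the function below is an assumed analogue, not Sora's implementation.

#include <windows.h>

/* User-mode approximation of core dedication (illustrative sketch):
 * pin the calling thread to one core and raise its priority. */
static BOOL dedicate_current_thread_to_core(DWORD core_index)
{
    HANDLE thread = GetCurrentThread();
    DWORD_PTR mask = (DWORD_PTR)1 << core_index;

    if (SetThreadAffinityMask(thread, mask) == 0)
        return FALSE;                    /* failed to pin the thread */
    return SetThreadPriority(thread, THREAD_PRIORITY_TIME_CRITICAL);
}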
5. Implementation
5.1. Hardware
We have designed and implemented the Sora RCB as shown in Figure 4. It contains a Virtex-5 FPGA, a PCIe ×8 interface, and 256 MB of DDR2 SDRAM. The RCB can connect to various RF front-ends. In our experimental prototype, we use a third-party RF front-end that is capable of transmitting and receiving a 20 MHz channel at 2.4 or 5 GHz.
Figure 5 illustrates the logical components of the Sora
hardware platform. The DMA and PCIe controllers interface with the host and transfer digital samples between the
RCB and PC memory. Sora software sends commands and
reads RCB states through RCB registers. The RCB uses its
onboard SDRAM as well as small FIFOs on the FPGA chip
to bridge data streams between the CPU and RF front-end.
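As a rough sketch of this register interface (all offsets, names, and the mapping setup below are invented for illustration and are not the real RCB register map):

#include <stdint.h>

/* Hypothetical memory-mapped RCB registers, reached over PCIe. */
enum {
    RCB_REG_CTRL   = 0x00,   /* commands: reset, start TX/RX            */
    RCB_REG_STATUS = 0x04,   /* RCB state bits read back by the host    */
    RCB_REG_DMA_LO = 0x08,   /* physical address of the host DMA buffer */
    RCB_REG_DMA_HI = 0x0C,
};

static volatile uint32_t *rcb_regs;  /* mapped PCIe BAR, set up elsewhere */

static void rcb_write(uint32_t off, uint32_t val)
{
    rcb_regs[off / 4] = val;          /* MMIO write to the RCB */
}

static uint32_t rcb_read(uint32_t off)
{
    return rcb_regs[off / 4];         /* MMIO read of RCB state */
}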
When receiving, digital signal samples are buffered in
on-chip FIFOs and delivered into PC memory when they fit
Figure 4. Sora radio control board.
Figure 5. Hardware architecture of RCB and RF: PCIe bus, DMA and PCIe controllers, FPGA with FIFOs and registers, SDRAM controller with DDR SDRAM, and the RF front-end (RF controller, A/D, D/A, RF circuit, antenna).