video frame into 16 × 16 macro-blocks and encodes each one
separately. Each block goes through five major functions:
IME finds the closest match for an image block in a
previous reference image. While it is one of the most
compute-intensive parts of the encoder, the basic algorithm
lends itself well to data-parallel architectures. On our base
CMP, IME takes up 56% of the total encoder execution time
and 52% of total energy.
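The block-matching idea behind IME can be sketched as an exhaustive search that scores every candidate position with a sum of absolute differences (SAD). This is only a minimal illustration, not the encoder's actual search: the function names, the search range, the use of SAD as the cost, and the random test frame are all assumptions made for the example.

```python
import random

def sad(cur, ref, y, x):
    """Sum of absolute differences between cur and the same-size window of ref at (y, x)."""
    return sum(
        abs(cur[r][c] - ref[y + r][x + c])
        for r in range(len(cur))
        for c in range(len(cur[0]))
    )

def full_search_ime(cur, ref, top, left, search_range=4):
    """Return (best_sad, (dy, dx)) over integer offsets within +/- search_range.

    (top, left) is the block's position in the current frame; candidates that
    fall outside the reference frame are skipped. search_range is an assumed
    parameter, not a value taken from the text.
    """
    rows, cols = len(cur), len(cur[0])
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + rows > height or x + cols > width:
                continue
            cost = sad(cur, ref, y, x)
            if best is None or cost < best[0]:
                best = (cost, (dy, dx))
    return best

# Hypothetical test data: a random 48 x 48 reference frame, and a 16 x 16
# "current" block copied from it at a known offset from position (10, 12).
rng = random.Random(0)
ref_frame = [[rng.randrange(256) for _ in range(48)] for _ in range(48)]
cur_block = [row[9:25] for row in ref_frame[12:28]]
result = full_search_ime(cur_block, ref_frame, top=10, left=12)
print(result)
```

Every candidate offset is scored independently of the others, which is why this kind of search maps naturally onto data-parallel hardware.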
2.4. Current H.264 implementations
The computationally intensive H.264 encoding algorithm
poses a challenge for general-purpose processors, and is
typically implemented as an ASIC. For example, T. C. Chen
et al. implement a full-system H.264 encoder and demonstrate that real-time HD H.264 encoding is possible in hardware at relatively low power and area cost.[2]
H.264 software optimizations exist, particularly for
motion estimation, which takes most of the encoding time.
For example, sparse search techniques speed up IME and
FME by up to 10×.[14, 25] Combining aggressive
algorithmic modifications with multiple cores and SSE
extensions leads to highly optimized H.264 encoders on
Intel processors.[3, 12]
3. EXPERIMENTAL METHODOLOGY
To understand what is needed to gain ASIC-level efficiency,
we use existing H.264 partitioning techniques, and modify
the H.264 encoder reference code JM 8.6[11] to remove dependencies and allow mapping of the five major algorithmic
blocks to the four-stage macro-block (MB) pipeline shown
in Figure 1. This mapping exploits task-level parallelism at
the macro-block level and significantly reduces the inter-processor communication bandwidth requirements by
sharing data between pipeline stages.
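The macro-block pipeline described above can be illustrated with a small scheduling sketch. It assumes an idealized pipeline in which every stage takes exactly one time slot and macro-blocks advance in lockstep; the function name and the dictionary layout are hypothetical and are not taken from the reference code.

```python
STAGES = ["IME", "FME", "IP", "EC"]  # the four pipeline stages named in Figure 1

def pipeline_schedule(num_mbs, stages=STAGES):
    """List, per time slot, which macro-block index each stage works on.

    Stage s handles macro-block (t - s) at time slot t; None marks an idle
    stage while the pipeline fills or drains. Equal stage latency is an
    assumption of this sketch.
    """
    slots = []
    for t in range(num_mbs + len(stages) - 1):
        slot = {}
        for s, name in enumerate(stages):
            mb = t - s
            slot[name] = mb if 0 <= mb < num_mbs else None
        slots.append(slot)
    return slots

if __name__ == "__main__":
    for t, slot in enumerate(pipeline_schedule(4)):
        print(t, slot)
```

Once the pipeline is full, all four stages are busy on four different macro-blocks at the same time, which is the task-level parallelism the mapping exploits.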
Table 1. Intel's optimized H.264 encoder versus a 720p HD ASIC.

                         FPS   Area (mm²)   Energy/frame (mJ)
Intel (720 × 480 SD)      30      122              742
Intel (1280 × 720 HD)     11      122             2023
ASIC                      30        8                4

The second row gives Intel's SD data scaled to HD. ASIC data is scaled from 180nm down
to 90nm.
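As a rough reading of Table 1, the gaps between the HD software encoder and the ASIC can be computed directly from the table's values. The sketch below does arithmetic on those numbers only; the dictionary layout and variable names are illustrative.

```python
# Values copied from Table 1: (frames per second, area in mm^2, energy per frame in mJ).
table1 = {
    "Intel SD": (30, 122, 742),
    "Intel HD": (11, 122, 2023),
    "ASIC": (30, 8, 4),
}

hd_fps, hd_area, hd_energy = table1["Intel HD"]
asic_fps, asic_area, asic_energy = table1["ASIC"]

area_ratio = hd_area / asic_area        # how much more area the Intel chip uses
energy_ratio = hd_energy / asic_energy  # how much more energy per frame it spends
fps_ratio = asic_fps / hd_fps           # how much faster the ASIC encodes HD

print(f"area: {area_ratio:.2f}x, energy/frame: {energy_ratio:.2f}x, fps: {fps_ratio:.2f}x")
```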
Figure 1. Four-stage macroblock partition of H.264. (a) Data flow between stages. (b) How the pipeline works on different macroblocks. IP includes DCT + Quant. EC is CABAC.
[Figure not reproduced. Panel (a): data flow among the IME, FME, IP, and EC stages, with reads and writes to main memory. Panel (b): pipeline schedule for macroblocks MB0 to MB3 across the four stages.]