That is why NVIDIA and Itseez decided to create a Tegra-optimized
version of OpenCV. This work benefited from three major optimization
opportunities: code vectorization with
NEON, multithreading with the Intel
TBB (Threading Building Blocks) library, and GPGPU with GLSL.
Taking advantage of the NEON
instruction set was the most attractive of the three choices. Figure 6
compares the performance of original and NEON-optimized versions of
OpenCV. In general, NEON favors basic arithmetic operations with simple, regular memory-access patterns. Image-processing primitives usually satisfy those requirements, which makes them almost ideal candidates for acceleration with NEON vector operations. Because those primitives often lie on the critical path of high-level computer-vision workflows, NEON instructions can significantly accelerate OpenCV routines.
Multithreading on up to four symmetric CPUs can help at a higher level.
TBB and other threading technologies enable application developers to
get the parallel-processing advantage
of multiple CPU cores. At the application level, independent activities can be distributed among different cores, and the operating system takes care of load balancing. This approach is consistent with the general OpenCV strategy for multithreading, which is to parallelize the whole algorithmic pipeline; on a mobile platform, however, we often also have to speed up primitive functions.
One approach is to split low-level functions into several smaller subtasks, which produces faster results. A popular technique is to split an input image into several horizontal stripes and process them simultaneously.
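The stripe technique can be sketched with plain std::thread (the Tegra port relies on TBB; this simplified illustration uses a binary threshold as a stand-in per-pixel kernel):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Process rows [rowBegin, rowEnd) of an 8-bit image stored row-major.
static void processStripe(std::vector<uint8_t>& img, int width,
                          int rowBegin, int rowEnd, uint8_t thresh) {
    for (int y = rowBegin; y < rowEnd; ++y)
        for (int x = 0; x < width; ++x) {
            uint8_t& p = img[(size_t)y * width + x];
            p = p > thresh ? 255 : 0;
        }
}

// Split the image into horizontal stripes, one per hardware thread.
void thresholdParallel(std::vector<uint8_t>& img, int width, int height,
                       uint8_t thresh) {
    unsigned nThreads = std::max(1u, std::thread::hardware_concurrency());
    int rowsPerStripe = (height + (int)nThreads - 1) / (int)nThreads;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nThreads; ++t) {
        int r0 = (int)t * rowsPerStripe;
        int r1 = std::min(height, r0 + rowsPerStripe);
        if (r0 >= r1) break;
        workers.emplace_back(processStripe, std::ref(img), width, r0, r1, thresh);
    }
    for (auto& w : workers) w.join();
}
```

Because the stripes touch disjoint rows, no synchronization is needed inside the kernel itself.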
An alternative approach is to create a background thread and retrieve the result later, while the main program works on other parts of the problem. For example, in the video-stabilization application a special class returns an asynchronously calculated result from the previous iteration. Multithreading caps the speedup at the number of cores, which on the most advanced current mobile platforms is four, while a single NEON instruction can operate on 16 elements. Of course, both of these technologies can be combined. If the algorithm is constrained by the speed of memory access, however, multithreading may not provide the expected performance improvement. For example, the NEON version of cv::resize does not gain from adding new threads, because a single thread already fully consumes the memory-bus capacity.
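The background-thread pattern described above, in which each call returns the asynchronously computed result of the previous iteration, might be modeled like this (a hypothetical sketch using std::async, not the actual class from the video-stabilization application):

```cpp
#include <future>

// Pipeline stage that hides latency: submit() launches work for the current
// frame and returns the result computed for the PREVIOUS frame.
class AsyncStage {
    std::future<int> pending_;
public:
    int submit(int frame) {
        // Collect the previous iteration's result (-1 means "no result yet").
        int prev = pending_.valid() ? pending_.get() : -1;
        // Kick off work for the current frame on a background thread;
        // squaring the frame value is a stand-in for real processing.
        pending_ = std::async(std::launch::async,
                              [frame] { return frame * frame; });
        return prev;
    }
    // Drain the last in-flight result when the stream ends.
    int finish() { return pending_.valid() ? pending_.get() : -1; }
};
```

The main loop thus overlaps the expensive computation for frame N with its own work on frame N+1, at the cost of one frame of latency.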
Figure 7 shows example speedups of
some filters and geometric transformations from the OpenCV library.
An additional benefit of using the
GPU is that at full speed it runs at a
lower average power than the CPU.
On mobile devices this is especially
important, since one of the main usability factors for consumers is how
long the battery lasts on a charge. We
measured the average power and time
elapsed to perform 10,000 iterations
of some optimized C++ functions,
compared with the same functions
written in GLSL. Since these functions are both faster on the GPU and the GPU runs at lower peak power, the result is significant energy savings (see the accompanying table): we measured savings of 3 to 15 times when porting these functions to the GPU.
[Figure 6 omitted: performance improvement with NEON on Tegra 3. Time in ms, Tegra CPU vs. Tegra NEON; speedups of 1.6x to 23x across Canny, median blur, optical flow, morphology, color conversion, Gaussian blur, FAST detector, Sobel, pyrDown, and image resize.]
[Figure 7 omitted: performance improvement with GLSL on Tegra 3. Time in ms, Tegra CPU vs. Tegra GPU; speedups of 2.4x to 14x across median blur, planar warper, warpPerspective, cylindrical warper, blur3x3, and warpAffine.]