an API called jSIMD for mapping
Java code to SIMD instructions using
vectorized data of various data types.
JNI (Java Native Interface) is a feature provided within Java
to allow access to native code. It is
used in the back end as a bridge between the Java-familiar API seen by
the programmer and the SIMD mappings compiled as native code. Using
an image-processing program as an
example, we observed a 34% speedup
over traditional Java code. Earlier tests
on simpler and purely mathematical
constructs have yielded speedups of
two to three times.[11]
An overview of the API is shown in
the accompanying figure. Once a transaction of operations on vectors is built
up, the user code tells the API to initiate the desired operations. The API
identifies the available operating system and SIMD units in order to decide
when to execute API calls in Java and
when to pass the calls to the dynamic
libraries with SIMD mappings for parallel execution. If parallel execution
is possible, the API makes JNI calls to
dynamic libraries, which carry out
SIMD instructions on data in the Java
memory space. For a different target
architecture, the SIMD native library
can be recompiled with gcc or replaced
with a prepackaged binary. Generic source
code has been used to facilitate simple
cross-compilation.
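
The flow described above can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the jSIMD implementation: the class DispatchSketch, its methods add and execute, and the library name jsimd_sketch are all hypothetical. The sketch queues element-wise additions as a transaction, probes the platform once, and falls back to plain Java whenever the native library cannot be loaded.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a deferred vector transaction with backend dispatch. */
final class DispatchSketch {
    /** One queued element-wise operation: dst[i] = a[i] + b[i]. */
    static final class AddOp {
        final float[] a, b, dst;
        AddOp(float[] a, float[] b, float[] dst) { this.a = a; this.b = b; this.dst = dst; }
    }

    final List<AddOp> pending = new ArrayList<>();
    final boolean nativeAvailable;

    DispatchSketch(boolean nativeAvailable) { this.nativeAvailable = nativeAvailable; }

    /** Builds a dispatcher after probing the platform. */
    static DispatchSketch create() { return new DispatchSketch(probeNative()); }

    /** Checks the CPU architecture, then tries to load a native library. */
    static boolean probeNative() {
        String arch = System.getProperty("os.arch", "").toLowerCase();
        boolean x86 = arch.contains("86") || arch.contains("amd64");
        if (!x86) return false;
        try {
            System.loadLibrary("jsimd_sketch"); // hypothetical name, assumed for illustration
            return true;
        } catch (UnsatisfiedLinkError | SecurityException e) {
            return false; // no native backend: stay in Java
        }
    }

    /** Queues an addition; nothing is computed until execute() is called. */
    DispatchSketch add(float[] a, float[] b, float[] dst) {
        if (a.length != b.length || a.length != dst.length) {
            throw new IllegalArgumentException("array lengths must match");
        }
        pending.add(new AddOp(a, b, dst));
        return this;
    }

    /** Runs every queued operation on the chosen backend; returns how many ran. */
    int execute() {
        int count = pending.size();
        for (AddOp op : pending) {
            if (nativeAvailable) {
                nativeAdd(op.a, op.b, op.dst); // JNI call into the native library
            } else {
                javaAdd(op.a, op.b, op.dst);
            }
        }
        pending.clear();
        return count;
    }

    /** Hypothetical native entry point; only reachable if the library loaded. */
    static native void nativeAdd(float[] a, float[] b, float[] dst);

    /** Pure-Java fallback. */
    static void javaAdd(float[] a, float[] b, float[] dst) {
        for (int i = 0; i < dst.length; i++) {
            dst[i] = a[i] + b[i];
        }
    }

    public static void main(String[] args) {
        float[] a = {1f, 2f, 3f}, b = {4f, 5f, 6f}, dst = new float[3];
        DispatchSketch tx = DispatchSketch.create();
        int ran = tx.add(a, b, dst).execute();
        System.out.println(ran + " operation(s) run, native backend: " + tx.nativeAvailable);
    }
}
```

Deferring the work until execute() gives a dispatcher one place to choose the backend for a whole batch, which is one plausible reason to build up a transaction before crossing the JNI boundary.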
Motivating Example
Consider a motivating example that
uses the jSIMD API to obtain a speedup
over an out-of-the-box Java solution.
Alpha blending, used for creating transitions in video processing or blending
two individual images, is one example
of an algorithm that can be moderately
accelerated through the use of SIMD
instructions. There are many such parallel data-processing applications that
are easy to write using the SIMD paradigm. Examples of SIMD tasks that are
inherently parallel include 3D graphics, real-time physics, video transcoding, encryption, and scientific applications. Selected versions of these
applications are usually supported by
custom native code in the virtual machine, whereas our solution gives the
programmer the ability to express any
algorithm, not just the ones built into
the interpreter.
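
For reference, the alpha-blending kernel itself is a per-element weighted sum, out = alpha * fg + (1 - alpha) * bg. The sketch below is a plain-Java baseline, not code from jSIMD; the class and method names are hypothetical, and representing channels as float arrays is an assumption made for simplicity. The second method groups the loop four elements at a time only to show the shape of the work that a 128-bit SSE register could handle as one group of four 32-bit floats.

```java
/** Hypothetical plain-Java alpha blending over float channel values. */
final class AlphaBlendSketch {
    /** Scalar baseline: out[i] = alpha * fg[i] + (1 - alpha) * bg[i]. */
    static void blend(float[] fg, float[] bg, float alpha, float[] out) {
        checkLengths(fg, bg, out);
        float inv = 1.0f - alpha;
        for (int i = 0; i < out.length; i++) {
            out[i] = alpha * fg[i] + inv * bg[i];
        }
    }

    /** Same arithmetic, grouped in fours to mirror a four-lane SIMD layout. */
    static void blendByFour(float[] fg, float[] bg, float alpha, float[] out) {
        checkLengths(fg, bg, out);
        float inv = 1.0f - alpha;
        int limit = out.length - (out.length % 4);
        int i = 0;
        for (; i < limit; i += 4) {
            // These four independent statements are what one SIMD group would cover.
            out[i]     = alpha * fg[i]     + inv * bg[i];
            out[i + 1] = alpha * fg[i + 1] + inv * bg[i + 1];
            out[i + 2] = alpha * fg[i + 2] + inv * bg[i + 2];
            out[i + 3] = alpha * fg[i + 3] + inv * bg[i + 3];
        }
        for (; i < out.length; i++) { // leftover elements when length is not a multiple of 4
            out[i] = alpha * fg[i] + inv * bg[i];
        }
    }

    static void checkLengths(float[] fg, float[] bg, float[] out) {
        if (fg.length != bg.length || fg.length != out.length) {
            throw new IllegalArgumentException("array lengths must match");
        }
    }

    public static void main(String[] args) {
        float[] fg = {1f, 0f, 0.5f, 1f, 0.2f};
        float[] bg = {0f, 1f, 0.5f, 0f, 0.2f};
        float[] out = new float[fg.length];
        blend(fg, bg, 0.25f, out);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

Each output element depends only on the inputs at the same index, which is why the loop can be split into independent groups without changing the result.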

Future generations of processors may include GPUs on the die, but until that is the case for existing infrastructures, SIMD is a low-hanging fruit, not fully utilized for getting more computations per core.

Execution profiles were obtained
using Intel’s VTune Performance Analyzer,[7]
which can be used to profile and
analyze various executable applications. We used it to observe the number and types of SSE calls performed
by the JVM alone. An alpha-blending
program was executed on several
standard-size images (640 × 480 to 1920
× 1080 pixels), with 1,000 samples for each
test, on an Intel Core 2 Duo
E6600 with 2GB of DDR2 RAM running
Windows XP Pro SP3. Using jSIMD resulted in an average speedup of 34% and,
as expected, a large number of SSE calls.
The out-of-the-box Java solution executed
no SIMD instructions, whereas with the
jSIMD API the number of retired SIMD
instructions was in the millions,
saving several milliseconds per frame.
For video transcoding this is a significant performance improvement. The
linear relationship between retired
SIMD instructions and pixel count
means that the API works well at large
and small scales.
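
A measurement of this kind (many samples per test, then an average) can be approximated with a small timing harness. The sketch below is an assumption-laden illustration rather than the procedure used with VTune: the class and method names are hypothetical, it times wall-clock duration with System.nanoTime instead of counting retired instructions, and it reads "a speedup of N%" as baseline time divided by optimized time, minus one.

```java
/** Hypothetical harness: average time per run and percentage speedup. */
final class FrameTimingSketch {
    /** Runs the task the given number of times and returns the mean duration in milliseconds. */
    static double averageMillis(Runnable task, int samples) {
        if (samples <= 0) {
            throw new IllegalArgumentException("samples must be positive");
        }
        long totalNanos = 0;
        for (int i = 0; i < samples; i++) {
            long start = System.nanoTime();
            task.run();
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1e6 / samples;
    }

    /** Speedup in percent, assuming the definition (baseline / optimized - 1) * 100. */
    static double speedupPercent(double baselineMillis, double optimizedMillis) {
        if (optimizedMillis <= 0) {
            throw new IllegalArgumentException("optimized time must be positive");
        }
        return (baselineMillis / optimizedMillis - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        // Placeholder workload standing in for one blended frame (640 x 480 values).
        float[] frame = new float[640 * 480];
        Runnable work = () -> {
            for (int i = 0; i < frame.length; i++) {
                frame[i] = frame[i] * 0.5f + 0.25f;
            }
        };
        double avg = averageMillis(work, 1000);
        System.out.println("average ms per run: " + avg);
    }
}
```

Early iterations include JIT compilation time, so a more careful harness would discard warm-up runs before averaging.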
These observations show that exposing SIMD intrinsics improves
execution time by causing more SIMD
instructions to be executed. The current jSIMD implementation yielded a
speedup below the anticipated level,
which was based on a maximum of four concurrent operations within SSE for
the data types and processor that we
used. The speedup is still significant,
considering that no changes to the
underlying system architecture were
needed and that the changes to the
user code were relatively simple and
natural. Because the garbage collector makes it impossible to guarantee that arrays remain pinned in the
JVM,[9] memory copies occur occasionally, as
confirmed through analysis.
Making It Happen
Some of the problems that arose during
the development of the jSIMD API were
dependencies between SIMD code and
regular Java code, and multiple instantiations of the API. The integrity of the
vector registers during program execution is another area of concern. We
found that even though Java does make
SIMD calls on its own, the JVM will not
interrupt the JNI call to our API, and
therefore it will not replace any of the
contents of the SSE registers on the fly.