The use of SIMD registers can be inefficient unless data transfer between
memory and the SIMD unit is reduced.
Looking at lists of SIMD operations as
transactions allows for further analysis, weighing the performance gain
versus the overhead cost. One drawback to our approach is that interlacing SSE calls with regular Java code
may cause thrashing of register files.
Our current solution requires the programmer to write all SSE code in one
continuous block so that the JVM does
not need to execute while the JNI call
is performed.
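The trade-off between performance gain and overhead cost can be illustrated with a simple break-even estimate. The sketch below is not taken from the API described here; the class, its method names, and the per-operation timings are hypothetical, and the model assumes one fixed cost per native call plus a constant per-operation saving.

```java
// Hypothetical cost model: all names and numbers are illustrative assumptions,
// not measurements or part of the API discussed in the article.
final class TransactionCostModel {

    /** True when the estimated time saved by SIMD exceeds the fixed cost of one native call. */
    static boolean worthOffloading(int opCount, double scalarNsPerOp,
                                   double simdNsPerOp, double callOverheadNs) {
        double gain = opCount * (scalarNsPerOp - simdNsPerOp);
        return gain > callOverheadNs;
    }

    /** Smallest operation count at which a transaction pays off, or -1 if it never does. */
    static int breakEvenOps(double scalarNsPerOp, double simdNsPerOp, double callOverheadNs) {
        double savingPerOp = scalarNsPerOp - simdNsPerOp;
        if (savingPerOp <= 0) {
            return -1; // SIMD is not faster per operation, so batching cannot help
        }
        return (int) Math.floor(callOverheadNs / savingPerOp) + 1;
    }
}
```

Under this model, larger transactions amortize the call overhead over more operations, which is the motivation for grouping SSE work into one continuous block.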
When the programmer calls the API to
perform a sequence of SIMD operations,
the API packages the operations into a
transaction using a simple sequential
list-scheduling algorithm and then
passes all of the instructions and
data by reference to the C program,
which executes the SIMD instructions.
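The packaging step described above might be sketched as follows. This is a minimal illustration rather than the authors' implementation: `SimdTransaction`, its methods, and the `runBatch` stand-in are hypothetical names. In a real system `runBatch` would be a `native` method whose C side uses SSE instructions; it is written in plain Java here so that the sketch runs on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: class, method, and field names are illustrative only.
final class SimdTransaction {
    enum Kind { ADD, MUL }

    /** One queued element-wise operation: dst[i] = a[i] (op) b[i]. */
    private static final class Step {
        final Kind kind;
        final float[] a, b, dst;

        Step(Kind kind, float[] a, float[] b, float[] dst) {
            if (a.length != b.length || a.length != dst.length) {
                throw new IllegalArgumentException("operand lengths differ");
            }
            this.kind = kind; this.a = a; this.b = b; this.dst = dst;
        }
    }

    private final List<Step> steps = new ArrayList<>(); // kept in program order
    private int backendCalls = 0;

    SimdTransaction add(float[] a, float[] b, float[] dst) {
        steps.add(new Step(Kind.ADD, a, b, dst));
        return this;
    }

    SimdTransaction mul(float[] a, float[] b, float[] dst) {
        steps.add(new Step(Kind.MUL, a, b, dst));
        return this;
    }

    /** Hands the whole queue to the backend in a single call, then empties it. */
    void execute() {
        if (steps.isEmpty()) return;
        runBatch(steps);
        steps.clear();
    }

    int backendCalls() { return backendCalls; }

    // Stand-in for the native side (assumption): arrays are passed by reference,
    // and the steps are executed sequentially in the order they were queued.
    private void runBatch(List<Step> batch) {
        backendCalls++;
        for (Step s : batch) {
            for (int i = 0; i < s.dst.length; i++) {
                s.dst[i] = (s.kind == Kind.ADD) ? s.a[i] + s.b[i] : s.a[i] * s.b[i];
            }
        }
    }
}
```

A caller would queue work with something like `tx.add(a, b, sum).mul(sum, a, out)` and then call `tx.execute()` once, so the boundary between Java and native code is crossed a single time for the whole sequence.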
Dependencies with regular Java code,
such as casting before an API execute
statement, must occur outside of a
transaction unless they are done using
the API. Dependency and anti-dependency resolutions will further improve
execution time and utilization.
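To make the two kinds of dependency concrete, the following sketch classifies the relationship between two queued operations by their read and write sets. It is an illustration under stated assumptions: the `Op` type and the buffer names are hypothetical, and the output-dependency check is added here for completeness rather than taken from the text.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: Op and the buffer names are illustrative only.
final class DependencyCheck {

    /** An operation that reads the buffers in srcs and writes the buffer dst. */
    static final class Op {
        final String dst;
        final Set<String> srcs;

        Op(String dst, String... srcs) {
            this.dst = dst;
            this.srcs = new HashSet<>(Arrays.asList(srcs));
        }
    }

    /** True (read-after-write) dependency: the later op reads what the earlier op writes. */
    static boolean trueDependency(Op earlier, Op later) {
        return later.srcs.contains(earlier.dst);
    }

    /** Anti (write-after-read) dependency: the later op overwrites what the earlier op reads. */
    static boolean antiDependency(Op earlier, Op later) {
        return earlier.srcs.contains(later.dst);
    }

    /** Output (write-after-write) dependency: both ops write the same buffer. */
    static boolean outputDependency(Op earlier, Op later) {
        return earlier.dst.equals(later.dst);
    }

    /** Two ops may be reordered only when none of the three dependencies holds. */
    static boolean canReorder(Op earlier, Op later) {
        return !trueDependency(earlier, later)
            && !antiDependency(earlier, later)
            && !outputDependency(earlier, later);
    }
}
```

A scheduler that can tell these cases apart is free to reorder or overlap independent operations within a transaction, and an anti-dependency can in principle be removed by writing to a fresh buffer.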
Looking to the Future with SIMD
Interpreted languages can expose vector functionality to the programmer,
and the results will be faster, smaller,
and simpler code as demonstrated by a
practical application of this approach
using Java. Furthermore, better SIMD
utilization within cloud-computing
infrastructures has the potential to reduce costs significantly.
Improving the scheduling algorithm within individual transactions
is a future direction that will indeed
increase performance and throughput. Another clear next step is to take
advantage of multiple cores at the
same time in a real cloud-computing
infrastructure.
Our results can be generalized and
included in many virtual machines. For
example, Flash would clearly benefit
from further manual SIMD intervention
by the developer for ActionScript
computational segments. PHP and
JavaScript can also derive benefits from
such an approach in order to increase
the speed of Web applications. More
generally, if you create a virtual
machine, you should allow explicit access
to generic SIMD instructions. Since
you have paid for the SIMD unit inside
your server, you might as well let your
programmers use it. Although this
work is still in progress, we are
confident that it will be widely adopted for
interpreted languages.
Related articles
on queue.acm.org
GPUs: A Closer Look
Kayvon Fatahalian, Mike Houston
http://queue.acm.org/detail.cfm?id=1365498
Scalable Parallel Programming with CUDA
John Nickolls, Ian Buck,
Michael Garland, and Kevin Skadron
http://queue.acm.org/detail.cfm?id=1365500
Data-Parallel Computing
Chas. Boyd
http://queue.acm.org/detail.cfm?id=1365499
Jonathan Parri (jparri@uottawa.ca) is a Ph.D. candidate
at the University of Ottawa and a senior member of the
Computer Architecture Research Group. His current
research focuses on design space exploration in the
hardware/software domain, targeting embedded and
traditional systems.
Daniel Shapiro (dshap092@uottawa.ca) is a Ph.D.
candidate at the University of Ottawa and a senior
member of the Computer Architecture Research Group.
His research interests include custom processor design,
instruction-set extensions, IT security, and biomedical
engineering.
Miodrag Bolic (mbolic@site.uottawa.ca) is an associate
professor at the School of Information Technology and
Engineering, University of Ottawa, where he also serves
as director of the Computer Architecture Research Group.
His research interests include computer architectures,
radio frequency identification, and biomedical signal
processing.
Voicu Groza (groza@site.uottawa.ca) works in the
School of Information Technology and Engineering
at the University of Ottawa, where he is co-director
of the Computer Architecture Research Group. His
research interests include hardware/software co-design,
biomedical instrumentation and measurement, and
reconfigurable computing.
© 2011 ACM 0001-0782/11/04 $10.00