4. IMPLEMENTATION
There are three substantial components of our implementation. We first modified GHC itself to add support for
SSE instructions. This required modifying GHC’s register
allocator to support overlapping register classes, which was
necessary for SSE vectors to be stored in registers.
We then added support for fully unboxed primitive SIMD
vector types and primitive operations over these types to
GHC’s dialect of Haskell. The STG and Cmm intermediate
languages, as well as GHC’s LLVM code generator, were
also extended to support compiling the new Haskell
SIMD primitives. Boxed wrappers for the unboxed primitives, together with the MultiType type class and its associated Multi
type, complete the high-level support for working directly
with basic SIMD data types. Because the SIMD support we
added to GHC utilizes the LLVM back-end, it should be
relatively straightforward to adapt our modifications for
other CPU architectures, although at this time only x86-64
is supported.
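The interface described above can be pictured with a small boxed model. In GHC itself, Multi Double wraps the unboxed DoubleX2# primitive and the lane-wise operations compile, via the LLVM back-end, to single SSE instructions; the sketch below merely simulates the two lanes with a strict pair of Doubles, and the method names are illustrative rather than GHC’s actual API.

```haskell
{-# LANGUAGE TypeFamilies #-}

-- A boxed model of the Multi data family: one Multi Double holds
-- two scalar lanes. (The real type wraps unboxed DoubleX2#.)
data family Multi a

data instance Multi Double = MultiDouble !Double !Double
  deriving (Eq, Show)

-- Illustrative stand-in for the MultiType class: broadcast and a
-- lane-wise (vertical) addition.
class MultiType a where
  multireplicate :: a -> Multi a                    -- broadcast a scalar to every lane
  multiplus      :: Multi a -> Multi a -> Multi a   -- lane-wise addition

instance MultiType Double where
  multireplicate x = MultiDouble x x
  multiplus (MultiDouble a b) (MultiDouble c d) =
    MultiDouble (a + c) (b + d)
```

For example, `multiplus (multireplicate 1) (MultiDouble 2 3)` evaluates to `MultiDouble 3.0 4.0`; in the real implementation that addition is one SSE `addpd` instruction.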
Second, we implemented generalized stream fusion in a
modified version of the vector library10 for computing with
efficient unboxed vectors in Haskell. We replaced the existing stream fusion implementation with an implementation that uses the Bundle representation and extended the
existing API with functions such as mfold′ and mzipWith
that enable using SIMD operations on the contents of vectors. The examples in this paper are somewhat simplified
from the actual implementations. For example, the actual
implementations are written in monadic form and involve
type class constraints that we have elided. Vectors whose
scalar elements can be accessed in SIMD-sized groups, that
is, vectors whose scalar elements are laid out consecutively
in memory, are actually represented using a PackedVector
type class. These details do not affect the essential design
choices we have described, and the functions used in all
examples are simply type-specialized instances of the true
implementations.
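The semantics of a function like mzipWith can be sketched with a list-based model: a vertical (Multi) operation is applied to SIMD-width groups of elements, and a scalar fallback handles the leftover tail. In the real library the groups are read directly from memory via PackedVector; the helper names below are ours, not the library’s.

```haskell
simdWidth :: Int
simdWidth = 2  -- lanes per Multi Double with SSE

-- Model of mzipWith: vf consumes one full SIMD-width group from
-- each input; sf handles whatever scalars remain at the end.
mzipWithModel :: ([Double] -> [Double] -> [Double])  -- vertical op on a full group
              -> (Double -> Double -> Double)        -- scalar op for the tail
              -> [Double] -> [Double] -> [Double]
mzipWithModel vf sf xs ys
  | length xg == simdWidth && length yg == simdWidth
  = vf xg yg ++ mzipWithModel vf sf xs' ys'
  | otherwise
  = zipWith sf xs ys
  where
    (xg, xs') = splitAt simdWidth xs
    (yg, ys') = splitAt simdWidth ys
```

With `vf = zipWith (+)` and `sf = (+)` the model computes exactly what `zipWith (+)` computes, which is the point: the SIMD version is observably the same function, just evaluated two lanes at a time.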
Third, we modified the DPH libraries to take advantage
of our new vector library. The DPH libraries are built on
top of the stream representation from a previous version of
the vector library, so we first updated DPH to use our bundle representation instead. We next re-implemented the
primitive wide-vector operations in DPH in terms of our
new SIMD operations on bundles. While we provided SIMD
implementations only for operations on double-precision
floating-point values, this part of the implementation was
quite small, consisting of approximately 20 lines of code
not counting #ifdefs. Further extending SIMD support in
DPH will be easy now that it is based on bundles rather
than streams.
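The shape of such a wide-vector operation, for instance a SIMD sum like msum, can be sketched as follows. This is a model, not the DPH code: it keeps a two-lane accumulator, performs one vertical addition per pair of elements (one `addpd` per step in compiled code), then reduces the lanes horizontally and folds in any leftover scalar.

```haskell
{-# LANGUAGE BangPatterns #-}

-- List-based model of a SIMD sum over double-precision values.
msumModel :: [Double] -> Double
msumModel = go 0 0
  where
    go !a !b (x:y:rest) = go (a + x) (b + y) rest  -- one vertical add per pair
    go !a !b leftover   = a + b + sum leftover     -- horizontal reduce, then tail
```

Note that the two-lane accumulation reassociates the sum, so SIMD results can differ from sequential ones in the last bits of floating-point precision; this is inherent to vectorized reductions, not to our model.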
Our support for SSE and AVX instructions is part of
the standard GHC distribution, and our modifications to
the vector and DPH libraries are available in a public git
repository.
5. EVALUATION
Our original goal in modifying GHC and the vector
library was to make efficient use of SSE instructions
from high-level Haskell code.
The programmer has to use mmap, which is a bit inconvenient. However, in separate work,2,13 the Data Parallel
Haskell project has shown how to automatically vectorize
programs; the target there was turning nested data paral-
lelism into flat data parallelism, but it turns out that we
can use the same technology to turn element-wise data
parallelism into SIMD multi-style data parallelism. Putting
together DPH and the ideas of this paper gives the best
of both worlds: programmers can write data parallel pro-
grams without considering SIMD, and the compiler will
automatically exploit the vector instructions if they are
present. Better still, DPH allows us to take advantage of
multiple cores as well as the SIMD units in each core.
We updated DPH to use our modified vector library.
Because DPH programs are vectorized by the compiler so
that all scalar operations are turned into operations over
wide vectors, by implementing these wide vector operations using our new SIMD functions like msum, programs
written using DPH automatically and transparently
take advantage of SSE instructions—no code changes are
required of the programmer. The full version of the paper
includes benchmarks for our modified implementation
of DPH.
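The kind of code a DPH user writes is a whole-vector combinator pipeline with no SIMD in sight, such as a dot product. The version below is written against ordinary lists so it runs anywhere; in DPH the same shape is expressed over parallel arrays (with sumP and zipWithP), and vectorization plus the bundle-level SIMD operations turn it into an SSE loop.

```haskell
-- Dot product in the combinator style DPH programs use; the
-- compiler, not the programmer, introduces the SIMD operations.
dotp :: [Double] -> [Double] -> Double
dotp xs ys = sum (zipWith (*) xs ys)
```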
3.4. How general is generalized stream fusion?
We do not mean to suggest that the representations we
have chosen for our Bundle data type are complete in any
sense except that they allow us to take advantage of bulk
memory operations and SIMD instructions, which was our
original goal. Generalized stream fusion is not “general”
because we have finally hit upon the full set of representations one could possibly ever need, but because the framework we have put forth admits multiple new, specialized
representations. The key features of generalized stream
fusion are (1) the ability to add new specialized stream representations, notably without requiring the library writer
to rewrite the entire library; (2) leveraging the compiler
to statically eliminate all intermediate Bundle structures
and leave behind the single representation that is actually necessary to perform the desired computation; and
(3) not requiring the end user to know about the details of
Bundles, or even that they exist.
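A two-representation miniature makes the second feature concrete: the same sequence is carried both as an element stream (here, a list) and as a known size. Producers keep every representation consistent; a consumer touches only the one it needs, and inlining then discards the rest. Field and function names below are ours, not the vector library’s.

```haskell
-- Miniature Bundle: several views of one sequence. The real Bundle
-- also carries monadic, chunked, and Multi-aware stream views.
data Bundle a = Bundle
  { elems :: [a]   -- element-at-a-time view, as in classic stream fusion
  , size  :: Int   -- exact size, used when allocating a result vector
  }

bfromList :: [a] -> Bundle a
bfromList xs = Bundle xs (length xs)

-- Every representation is transformed in lockstep.
bmap :: (a -> b) -> Bundle a -> Bundle b
bmap f (Bundle xs n) = Bundle (map f xs) n  -- size is preserved untouched

-- A consumer that uses only the size view; the elements are never
-- forced, and after inlining the elems field simply disappears.
blength :: Bundle a -> Int
blength = size
```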
Generalized stream fusion provides a representation, together with algebraic laws for rewriting operations over that representation, whose usefulness extends beyond Haskell.
Although we have implemented generalized stream fusion
as a library, it could also be incorporated into a compiler
as an intermediate language. This was not necessary in
our implementation because GHC’s generic optimizer is
powerful enough to eliminate all intermediate structures
created by generalized stream fusion. In other words,
GHC is such a good partial evaluator that we were able to
build generalized stream fusion as a library rather than
incorporating it into the compiler itself. Writing high-level code without paying an abstraction tax is desirable
in any language, and compilers other than GHC could also
avoid this tax by using the ideas we outline in this paper,
although perhaps only by paying a substantial one-time
implementation cost.
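The mechanism GHC offers here is worth making concrete: a library states an algebraic law as a rewrite rule, and the simplifier applies it at compile time, which, combined with aggressive inlining, is what lets stream fusion live in a library. The example below is the classic map/map law, a textbook illustration rather than a rule from our implementation.

```haskell
-- A local copy of map so the rule is ours to state; the phased
-- NOINLINE keeps GHC from inlining it before the rule can fire.
myMap :: (a -> b) -> [a] -> [b]
myMap = map
{-# NOINLINE [1] myMap #-}

-- Algebraic law: two traversals fuse into one.
{-# RULES
"myMap/myMap" forall f g xs. myMap f (myMap g xs) = myMap (f . g) xs
  #-}
```

When compiled with optimization, `myMap f (myMap g xs)` is rewritten to a single traversal; the observable behavior is unchanged either way, which is exactly the contract such laws must satisfy.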