Not only can the client of our modified vector library
write programs in terms of boxed values and directly compose
vector operations, rather than manually fusing them,
without paying an abstraction penalty, but he or she can also
transparently benefit from low-level prefetch “magic” baked
into the library. Of course, the same prefetch magic could
be expressed manually in the C version. However, when we
originally wrote the C implementation of dot product using
SSE intrinsics, we did not know about prefetching. We suspect that many C programmers are in the same state of ignorance. In Haskell, this knowledge is embedded in a library,
and clients benefit from it automatically.
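As a small illustration (this is ordinary vector-library code, not the internals of our modified library), dot product can be written as a direct composition of collective operations over ordinary Double values, and stream fusion removes the intermediate vector produced by zipWith:

import qualified Data.Vector.Unboxed as U

-- Compositional dot product: sum of the element-wise products. Fusion
-- ensures the vector built by zipWith is never materialized.
dotp :: U.Vector Double -> U.Vector Double -> Double
dotp xs ys = U.sum (U.zipWith (*) xs ys)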
6. RELATED WORK
Wadler [16] introduced the problem of deforestation, that is,
of eliminating intermediate structures in programs written as compositions of list-transforming functions. A great
deal of follow-on work [4, 7, 8, 12, 14, 15] attempted to improve the
ability of compilers to automate deforestation through program transformations. Each of these approaches to fusion
has severe limitations. For example, Gill et al. [4] cannot fuse
left folds, such as the one that arises in sum, or zipWith, and
Takano and Meijer [14] cannot handle nested computations
such as mapping a function over concatenated lists. Our
work is based on the stream fusion framework described by
Coutts et al. [3], which can fuse all of these use cases and more.
The vector library uses stream fusion to fuse operations on
vectors rather than lists, but the principles are the same.
7. CONCLUSION
Generalized stream fusion is a strict improvement on stream
fusion: by recasting stream fusion to operate on bundles of
streams, it allows each vector operation, or class of operations, to
use a stream representation tailored to its particular pattern of computation. Though we focused on leveraging SSE
instructions in this article, our implementation also adds
support for efficient use of bulk memory operations in vector operations. As part of our work, we added support for
low-level SSE instructions to GHC and incorporated generalized stream fusion into the vector library. Using our
modified library, programmers can write compositional,
high-level programs for manipulating vectors without loss
of efficiency. Benchmarks show that these programs can
perform competitively with hand-written C.
Although we implemented generalized stream fusion in a
Haskell library, the bundled stream representation could be
used as an intermediate language in another compiler. Vector
operations would no longer be first class in such a formulation,
but it would allow a language to take advantage of fusion without
requiring the general-purpose optimizations, present in GHC, that eliminate the intermediate
structures produced by generalized stream fusion.
Unrolling further would put too much register pressure on the register allocator. Adding multiple
Leaps of different sizes would also be possible. MultisC consumers may choose not to use the Leap stepping function,
in which case loops will not be unrolled:
data Leap a = Leap a a a a                  -- four values bundled together

data MultisC a where
  MultisC :: (s → Step s (Leap (Multi a)))  -- leap stepper: four Multis per step (unrolled loop)
          → (s → Step s (Multi a))          -- multi stepper: one SIMD value per step
          → (s → Step s a)                  -- scalar stepper: one element per step
          → s                               -- stream state
          → MultisC a
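To make the consumer side of this interface concrete, the sketch below shows one way a fold might drive a MultisC. The name foldMultisC, and the fall-through convention in which leftover Multis and then leftover scalars are drained from the state at which the wider stepper reported Done, are our illustration rather than the library's actual API; Step is the usual stream fusion stepper type.

data Step s a = Yield a s | Skip s | Done

-- A 4-way unrolled fold: consume Leaps while the leap stepper produces
-- them, then any remaining Multis, then any remaining scalar elements.
foldMultisC :: (b -> Multi a -> b)   -- combine one SIMD value into the accumulator
            -> (b -> a -> b)         -- combine one scalar element into the accumulator
            -> b                     -- initial accumulator
            -> MultisC a
            -> b
foldMultisC fm f1 z (MultisC stepLeap stepMulti step1 s0) = goLeap z s0
  where
    goLeap acc s = case stepLeap s of
      Yield (Leap m1 m2 m3 m4) s' -> goLeap (fm (fm (fm (fm acc m1) m2) m3) m4) s'
      Skip s'                     -> goLeap acc s'
      Done                        -> goMulti acc s
    goMulti acc s = case stepMulti s of
      Yield m s' -> goMulti (fm acc m) s'
      Skip s'    -> goMulti acc s'
      Done       -> goScalar acc s
    goScalar acc s = case step1 s of
      Yield x s' -> goScalar (f1 acc x) s'
      Skip s'    -> goScalar acc s'
      Done       -> acc

A real consumer would also force its accumulator at each step; strictness annotations are elided here for clarity.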
Prefetch instructions on Intel processors allow the program to give the CPU a hint about memory access patterns,
telling it to prefetch memory that the program plans to use
in the future. In our library, these prefetch hints are implemented using prefetch primitives that we added to GHC.
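As a rough sketch of what issuing such a hint looks like at the source level, the following wrapper uses the State#-threaded prefetch primops available in current GHC; the name prefetchAt is ours, and this is not the library's actual code.

{-# LANGUAGE MagicHash, UnboxedTuples #-}
import GHC.Exts (ByteArray#, Int#, prefetchByteArray3#)
import GHC.IO   (IO (..))

-- Hint that the cache line at the given byte offset of the array will be
-- needed soon, with the highest temporal-locality level; on x86 this
-- typically compiles to a prefetcht0 instruction like those in Figure 4.
prefetchAt :: ByteArray# -> Int# -> IO ()
prefetchAt arr off =
  IO (\s -> case prefetchByteArray3# arr off s of
              s' -> (# s', () #))

A streaming loop would call something like this at a fixed byte distance ahead of the element it is about to read.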
When converting a Vector to a MultisC, we know exactly
what memory access pattern will be used—each element
of the vector will be accessed in linear order. The function
that performs this conversion, stream, takes advantage
of this knowledge by executing prefetch instructions as it
yields each Leap. Only consumers using Leaps will compile to loops containing prefetch instructions, and stream
will only add prefetch instructions for vectors whose size is
above a fixed threshold (currently 8192 elements), because
for shorter vectors the extra instruction dispatch overhead
is not amortized by the increase in memory throughput.
A prefetch distance of 128 · 12 bytes, based on the line fill buffer size of 128 bytes, was chosen empirically. Loop unrolling and prefetching produce the inner loop for our Haskell
implementation of ddotp shown in Figure 4.a
References
1. Chakravarty, M.M.T., Keller, G., Peyton Jones, S., Marlow, S. Associated types with class. In Proceedings of the 32nd ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '05 (New York, NY, USA, 2005). ACM, 1–13.
2. Chakravarty, M.M.T., Leshchinskiy, R., Peyton Jones, S., Keller, G., Marlow, S. Data parallel Haskell: A status report. In Proceedings of the 2007 Workshop on Declarative Aspects of Multicore Programming, DAMP '07
.LBB4_12:
prefetcht0 1600(%rsi,%rdx)
movupd 64(%rsi,%rdx), %xmm3
prefetcht0 1600(%rdi,%rdx)
movupd 80(%rsi,%rdx), %xmm0
movupd 80(%rdi,%rdx), %xmm2
mulpd %xmm0, %xmm2
movupd 64(%rdi,%rdx), %xmm0
mulpd %xmm3, %xmm0
addpd %xmm1, %xmm0
addpd %xmm2, %xmm0
movupd 96(%rsi,%rdx), %xmm3
movupd 96(%rdi,%rdx), %xmm1
movupd 112(%rsi,%rdx), %xmm4
movupd 112(%rdi,%rdx), %xmm2
mulpd %xmm4, %xmm2
mulpd %xmm3, %xmm1
addq $64, %rdx
leaq 8(%rax), %rcx
addq $16, %rax
addpd %xmm0, %xmm1
cmpq %r9, %rax
addpd %xmm2, %xmm1
movq %rcx, %rax
jle .LBB4_12
Figure 4. Inner loop of Haskell ddotp function.
a The prefetch constant 1600 in the listing is 128 · 12 + 64, since the loop
index in the generated assembly is offset by −64 bytes.