values and combines the yielded values with the accumulator z using the function f (the call to seq guarantees that the
accumulator is strictly evaluated):
foldl′s :: (a → b → a) → a → Stream b → a
foldl′s f z (Stream step s) = loop z s
where
loop z s = z ‘seq’
case step s of
Yield x s′ → loop (f z x) s′
Skip s′ → loop z s′
Done → z
However, in zipWiths there is no loop—the two input
streams are consumed until either both streams yield
a value, in which case a value is yielded on the output
stream, or until one of the input streams is done producing values. The internal state of the stream associated
with zip Withs contains the state of the two input streams
and a one-item buffer for the value produced by the first
input stream:
zip Withs :: (a → b → c) → Stream a → Stream
b → Stream c
zip Withs f (Stream stepa sa na) (Stream stepb sb nb) =
Stream step (sa, sb, Nothing) (min na nb)
where
step (sa, sb, Nothing) =
case stepa sa of
Yield x sa′ → Skip (sa′, sb, Just x)
Skip sa′ → Skip (sa′, sb, Nothing)
Done → Done
step (sa, sb, Just x) =
case stepb sb of
Yield y sb′ → Yield (f x y) (sa, sb′, Nothing)
Skip sb′ → Skip (sa, sb′, Just x)
Done → Done
Given these definitions, GHC’s call-pattern specialization in concert with the standard inliner suffice to transform
dotp into a single loop that does not produce an intermediate structure. If there is any doubt that this results in efficient machine code, we give the actual assembly language
inner loop output by GHC using the LLVM back end. Stream
fusion preserves the ability to write compositionally without
sacrificing performance:
.LBB2_3:
movsd (%rcx), %xmm0
mulsd (%rdx), %xmm0
addsd %xmm0, %xmm1
addq $8, %rcx
addq $8, %rdx
decq %rax
jne .LBB2_ 3
2. 2. Stream fusion inefficiencies
Though stream fusion does well for the examples we have
shown, it still does not produce efficient implementations
for many other operations. In particular, the inadequacy of
the single-value-at-a-time nature of streams becomes par-
ticularly problematic when attempting to opportunistically
utilize the SIMD instructions available on many current
architectures, for example, SSE on x86 and NEON on ARM.
These instructions operate in parallel on data values that
contains two (or four or eight, depending on the hardware
architecture) floating point numbers at a time. To avoid nota-
tional confusion, we call these multi-values, or sometimes
just multis.
To enable sum to use SIMD instructions, we would like a
stream representation that yields multi-values (rather than
scalars), with perhaps a bit of scalar “dribble” at the end of
the stream when the number of scalar values is not divisible
by the size of a multi.
Although a stream of scalar values is useless for SIMD
computation, a stream of multi-values is not quite right
either, because of the “dribble” problem. Perhaps, we could
get away with a stream that yielded either a scalar or a multi
at each step, but this would force all scalar-only operations
to handle an extra case, complicating the implementations
of all operations and making them less efficient. There is a
better way!
3. GENERALIZED STREAM FUSION
We have seen that different stream operations work best
with different stream representations. In this section, we
describe how to incorporate multiple stream representations into the stream fusion framework, elaborate on the
details of a representation that enables SIMD computation with vectors, and show how to use our framework to
transparently take advantage of SIMD instructions in Data
Parallel Haskell programs.
The idea underlying generalized stream fusion is straightforward but its effects are wide-ranging: instead of transforming a function over vectors into a function over streams,
transform it into a function over a bundle of streams. A bundle is a collection of streams, each semantically identical but
with a different cost model, allowing each stream operation
to choose the most advantageous stream representation in
the bundle. We give a simplified version of the Bundle data
type here:
data Bundle a = Bundle
{sSize :: Size
, sElems :: Stream a
, sChunks :: Stream (Chunk a)
, sMultis :: Multis a
}
The sElems field of the Bundle data type contains the
familiar stream of scalar values that we saw in Section 2.
The stream of Chunks contained in the sChunks field of
the record enables the efficient use of bulk memory opera-
tions, like vector append, which we do not describe here. We
next describe the representation contained in the sMultis
field of the record, which enables the efficient use of SSE
instructions.