This implementation strategy works nicely for folds.
However, if we try to implement the SIMD equivalent of
zip Withs, we hit a roadblock. A SIMD version of zip Withs
requires that at each step either both of its input streams
yield a Multi or they both yield a scalar—if one stream were
to yield a scalar while the other yielded a Multi, we would
have to somehow buffer the components of the Multi. And if
one stream yielded only scalars while the other yielded only
Multis, we would be hard-pressed to cope.
Instead of a stream representation where the producer
chooses what is yielded, let us instead choose a representation where the stream consumer is in control:
data MultisC a where
MultisC :: (s → Step s (Multi a))
→ (s → Step s a)
→ MultisC a
The idea is for a MultisC a to be able to yield either a value
of type Multi a or a value of type a—the stream consumer
chooses, which by calling one of the two step functions.
Note that the existential state is quantified over both step
functions, meaning that the same state can be used to yield
either a single scalar or a Multi. If there is not a full Multi available, the first step function will return Done. The remaining
scalars will then be yielded by the second step function. This
representation allows us to implement a SIMD version of
Regrettably, a MultisC still is not quite what we need.
Consider appending two vectors of Doubles, each of
which contains 41 elements. We cannot assume that the
two vectors being appended are laid out consecutively
in memory, so even though the stream that results from
appending them together will contain 82 scalars, this
stream is forced to yield a scalar in the middle of the
stream. One might imagine an implementation that buffers and shifts partial Multi values, but this leads to very
inefficient code. The alternative is for appends to produce
a stream in which either a scalar or a Multi is yielded at
3. 1. A stream representation fit for SIMD computation
Modifying the stream fusion framework to accommodate
SIMD operations opens up the possibility of dramatically
increased performance for a wide range of numerical algorithms but requires a more thoughtful choice of representation. We focus on SIMD computations using 128-bit wide
vectors and SSE instructions on x86/x64 since that is what our
current implementation supports, although the approach
Our implementation represents SIMD values using the
type family Multi. We have chosen the name to avoid confusion with the Vector type, which represents arrays of arbitrary extent. In contrast, a value of type Multi a is a short
vector containing a fixed number of elements—known as
its multiplicity—of type a. On a given platform, Multi a has
a multiplicity that is appropriate for the platform’s SIMD
instructions. For example, on x86, a Multi Double, will have
multiplicity 2 since SSE instructions operate on 128-bit wide
vectors, whereas a Multi Float will have multiplicity 4. Multi
is implemented as an associated type1 in the Multi Type type
class; their definitions are shown in Figure 1. MultiType
includes various operations over Multi values, such as replicating a scalar across a Multi and folding a function over
the scalar elements of a Multi. These operations are defined
in terms of new primitives we added to GHC that compile
directly to SSE instructions.
Given a value of type Vector Double, how can we operate
on it efficiently using SSE instructions within the generalized stream fusion framework? An obvious first attempt is
to include a stream of Multi Doubles in the stream bundle.
However, this representation is insufficient for a vector with
an odd number of elements since we will have one Double
not belonging to a Multi at the end—the “dribble” mentioned earlier. Let us instead try this instead: a stream that
can contain either a scalar or a Multi. We call this stream a
MultisP because the producer chooses what will be yielded
at each step:
data Either a b = Left a | Right b
type MultisP a = Stream (Either a (Multi a) )
Now we can implement summation using SIMD operations. Our strategy is to use two accumulating parameters,
one for the sum of the Multi values yielded by the stream and
one for the sum of the scalar values. Note that (+) is overloaded: we use SIMD (+) to add summ and y, and scalar (+)
to add sum1 and x:
msumPs :: (Num a, Num (Multi a) ) ⇒ MultisP a → a
msumPs (Stream step s _) = loop 0.0 0.0 s
loop summ sum1 s =
case step s of
Yield (Left x) s′ → loop summ (sum1 + x) s′
Yield (Right y) s′ → loop (summ + y) sum1 s′
Skip s′ → loop summ sum1 s′
Done → multifold (+) sum1 summ
When the stream is done yielding values, we call the multifold member of the Multi Type type class to fold the addition
operator over the components of the Multi.
class Multi Type a where
data Multi a -- Associated type
-- The number of elements of type a in a Multi a.
multiplicity :: Multi a Int
-- A Multi a containing the values 0, 1, ...,
-- Replicate a scalar across a Multi a.
multireplicate :: a Multi a
→ →→ →
-- Map a function over the elements of a Multi a.
multimap :: (a a) Multi a Multi a
-- Fold a function over the elements of a Multi a.
multifold :: (b a b) b Multi a b
-- Zip two Multi a’s with a function.
multizipWith :: (a a a)
Multi a Multi a Multi a
Figure 1. The Multi Type type class and its associated type, Multi.