[Haskell-cafe] Bounded parMap
akborder at gmail.com
Thu Sep 10 05:24:38 EDT 2009
Lately I've been working on a parallel parser whose core is something like this:
pMap parse blocks
where pMap is a map-like function sparking independent computations
over the elements of the "blocks" list.
The actual implementation used Control.Parallel.Strategies:
map parse blocks `using` parListChunk n rnf
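For reference, here is a self-contained version of that fragment, assuming the parallel package is installed. In newer versions of the strategies API the `rnf` strategy is spelled `rdeepseq`; the `parse` function below is just a hypothetical stand-in for the real parser (any expensive pure function will do).

```haskell
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Hypothetical stand-in for the real parser: an arbitrary pure,
-- fully-evaluatable function over a block of input.
parse :: String -> Int
parse = length . words

-- Evaluate the mapped list in parallel, sparking one computation
-- per chunk of n elements and forcing each result to normal form.
parseAll :: Int -> [String] -> [Int]
parseAll n blocks = map parse blocks `using` parListChunk n rdeepseq
```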
In a recent post, however, Edward Kmett noted that such code has poor
cache behaviour, because the "blocks" list is traversed independently
by the map and by the sparks. It may therefore happen that the first
spark only starts to be evaluated when the map is already at the end of
the list. Moreover, all the sparks are generated almost instantaneously,
even though only a few of them can be evaluated concurrently (I'm
assuming here that the length of the list is greater than the number of
available native threads, which seems reasonable).
To solve this issue I imagined an alternative "par" primitive that
sparks a new computation only when a native thread is free to evaluate
it, and blocks otherwise. It turns out something like this can be
implemented in a library; have a look at the BoundedParMap module here:
The code uses unsafe operations, so I'm not sure it can be completely
trusted. However, given that the "boundedParMap" function
deterministically combines the results of multiple applications of a
pure function, it should remain referentially transparent.
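The BoundedParMap module itself isn't reproduced here, but the blocking idea can be sketched in plain IO with a counting semaphore, using only base. Note this sketch still forks all threads eagerly and only bounds how many evaluations run at once (it doesn't delay sparking the way the proposed primitive would); `boundedParMapIO` and the choice of `evaluate` are my own, not the posted code, and results are only forced to WHNF.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.QSem (newQSem, signalQSem, waitQSem)
import Control.Exception (evaluate)

-- Hypothetical bounded parallel map: at most n applications of f are
-- evaluated concurrently; every worker must acquire a semaphore slot
-- before it starts evaluating, and releases it when done.
boundedParMapIO :: Int -> (a -> b) -> [a] -> IO [b]
boundedParMapIO n f xs = do
  sem <- newQSem n
  vars <- mapM (fork sem) xs
  mapM takeMVar vars            -- collect results in the original order
  where
    fork sem x = do
      v <- newEmptyMVar
      _ <- forkIO $ do
        waitQSem sem            -- block until an evaluation slot is free
        r <- evaluate (f x)     -- force the result (to WHNF)
        putMVar v r
        signalQSem sem          -- release the slot
      return v
```

Since `f` is pure and the results are read back in list order, the output is deterministic regardless of scheduling, which is the property that makes an unsafe wrapper around such a function plausibly referentially transparent.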
To measure performance I made two versions of my parser (Makefile.hs):
one keeping Control.Parallel.Strategies as before, and the other using
BoundedParMap. I built them (ghc -O2 -threaded; ghc 6.11.20090907) and
ran them on a quad-core system (+RTS -N4) over a list generating 48
sparks.
Running each implementation 10 times I got the following run times:
Control.Parallel.Strategies: 1.328 +- 0.1
BoundedParMap: 1.095 +- 0.12
BoundedParMap seems to be faster, especially considering that the
total number of sparks (48) is quite small. Do you think the
difference is actually due to better caching (both in the list
traversal and because the runtime system deals with fewer sparks at
any time), or to something else I'm missing?