[Haskell-cafe] Parallel Haskell stories

Fri Mar 11 00:04:31 CET 2011

On Wed, Mar 9, 2011 at 11:18 AM, Simon Peyton-Jones
<simonpj at microsoft.com> wrote:

>        could you contribute a little vignette or story
>        about using Haskell in a *parallel/concurrent* application
>        that I could use to illustrate my talk?

The Cryptographic Protocol Shapes Analyzer (CPSA) attempts to
enumerate all essentially different executions possible for a
cryptographic protocol.  At each step in the analysis, it infers what
else must have happened based on an incomplete description of an
execution.  The steps form a derivation tree.  If CPSA terminates on
its input, each leaf of the derivation tree is a complete description
of some executions, and the leaves collectively enumerate all possible
executions.

It is crucial for performance reasons that CPSA does not compute a
tree, but instead computes a directed acrylic graph (DAG).  To do so,
it must identify pairs of nodes in the derivation that are isomorphic.
 For large problems, the isomorphism check dominates all other
computation during the analysis.

So how does CPSA use parallelism?  CPSA expands the derivation DAG in
a breadth-first order.  For each node at the current level, the
program performs the isomorphism check on all nodes higher up in the
DAG in one thread.  All threads then join, and in the main thread, the
intra-level isomorphism checks are performed.  It's that simple.

At the code level, all that is done is replace one map call with a
parallelized version of map that I saw in one of your papers:

#if defined HAVE_PAR
parMap :: (a -> b) -> [a] -> [b]
parMap _ [] = []
parMap f (x:xs) =
    par y (pseq ys (y:ys))
    where
      y = f x
      ys = parMap f xs

Performance varies depending on the shape of the DAG.  A wide bushy
DAG makes better use of more processors.  I have found one can often
get a speedup of three on a quad processor machine.  Of course
performance also varies depending on the version of GHC runtime in
use.  A speedup of three may not sound like much, but on some
problems, CPSA takes twenty-four hours to run on a single core.

Parallelizing CPSA was not as simple as I made it sound.  I had the
restructure the sequential description of the algorithm to expose the
parallelism to the runtime, and the algorithm became less obvious.  Of
course, this cost is well worth it.

John