[Haskell-cafe] Haskell Parallel Digest 4
Nicolas Wu
nick at well-typed.com
Fri Jul 22 18:08:19 CEST 2011
Parallel Haskell Digest
=======================
Edition 4
2011-07-22
http://www.well-typed.com/blog/55
Hello Haskellers!
It's time for the fourth edition of the Parallel Haskell Digest,
bringing you a summary of the news and discussions of parallelism and
concurrency in Haskell.
The digest is made possible by the [Parallel GHC Project][0].
News
----------------------------------------------------------------------
* The Monad.Reader: special issue on parallelism and concurrency
In the words of [The Monad Reader][6] itself:
> Whether you're an established academic or have only just
> started learning Haskell, if you have something to say, please
> consider writing an article for The Monad.Reader! Issue 19
> will be a special issue focusing on articles related to
> parallelism and concurrency, construed broadly. The submission
> deadline for Issue 19 will be: Tuesday, August 16.
* Threadscope Implementor's Summit
The Threadscope implementor's summit was held this month at
Microsoft Research, Cambridge. The summit brought together
developers who are currently working with Threadscope, whether
that be hacking on generating the events that are emitted by GHC
for analysis in Threadscope, using the event trace that is
produced for detailed profiling information, or working on
improving Threadscope itself to provide better tools for parallel
profile analysis.
The meeting was full of ideas, and covered topics such as: adding
extensions to the current eventlog format to enable additional
information to be tagged to events; improving the visualisation of
information in Threadscope; formalising the transitions of thread
states into a finite state machine; and matching up executed code
with corresponding source locations. With all this food for
thought, we should expect plenty of interesting work in this area.
Word of the Month
----------------------------------------------------------------------
In this issue we have two words of the month: *par* and *pseq*.
Haskell provides two annotations, `par` and `pseq`, that allow the
programmer to give hints to the compiler about where there are
opportunities to exploit parallelism. While these annotations are not
typically used directly by programmers, it is useful to understand
them because they are the underlying mechanism for higher level tools
such as "parallel strategies" and the Eval monad.
The two combinators have the following signatures:
    par  :: a -> b -> b
    pseq :: a -> b -> b
While their signatures are the same, they are used to annotate
different things. The `par` combinator hints to the Haskell
implementation that it might be beneficial to evaluate the first
argument in parallel. However, since Haskell does not impose an
evaluation order, we also need `pseq`, which instructs the compiler to
ensure that its first argument is evaluated before the second.
Let's take a look at an example inspired by [Parallel Performance
Tuning for Haskell][24] by Jones, Marlow, and Singh, which illustrates
this nicely. Suppose you're interested in the sum of two expensive
computations. To keep things simple, we'll use a naive implementation
of `fib` (the point here isn't to have an efficient computation, I'm
trying to show an *expensive* one):
    fib :: Int -> Int
    fib 0 = 0
    fib 1 = 1
    fib n = fib (n-1) + fib (n-2)
For a second expensive computation, we'll calculate the negafibonacci
number, which works on negative numbers:
    negafib :: Int -> Int
    negafib 0    = 0
    negafib (-1) = 1
    negafib n    = negafib (n+2) - negafib (n+1)
The sum of these two can be calculated by the following sequential
function:
    sumfib :: Int -> Int
    sumfib n = x + y
      where
        x = fib n
        y = negafib (-n)
There's obvious room for improvement here when we have two cores: we
simply calculate the expensive computations on separate cores.
Annotating the code above is a fairly simple process. We first use
`par` to indicate that `x` can be evaluated in parallel with the rest
of the computation. Second, we ensure that `y` gets computed before
`x + y` by annotating with `pseq`. The result is as follows:
    import Control.Parallel (par, pseq)

    psumfib :: Int -> Int
    psumfib n = x `par` (y `pseq` x + y)
      where
        x = fib n
        y = negafib (-n)
We can write a simple program that outputs the result of running this
computation with the following `main` function:
    main :: IO ()
    main = putStrLn . show . sumfib $ 37
We should hope for the parallel version to work twice as fast, since
the two expensive functions should take about the same time to
compute. Here's the output of compiling and running the sequential
version of the program:
    $ ghc -rtsopts Main.hs
    [1 of 1] Compiling Main             ( Main.hs, Main.o )
    Linking Main ...
    $ time ./Main
    real    0m6.113s
    user    0m6.090s
    sys     0m0.010s
Now replacing `sumfib` with `psumfib` produces the following results:
    $ ghc -rtsopts -threaded Main.hs
    [1 of 1] Compiling Main             ( Main.hs, Main.o )
    Linking Main ...
    $ time ./Main +RTS -N2
    real    0m3.402s
    user    0m6.660s
    sys     0m0.040s
The real time drops from roughly 6.1 to 3.4 seconds: a speedup of
about 1.8x on two cores. This is obviously a very trivial example, but
the point is that annotations provide a powerful way of expressing
parallel algorithms. It's also interesting to note that, for this
simple program, the parallel version run on a single core performs
just as well as the version compiled without threading.
While annotations are a simple mechanism for expressing where
parallelism might be exploited in a program, beware that there are a
number of pitfalls to using this technique: all that glitters is not
gold! The main difficulty in using `par` and `pseq` directly is that
you really need to have a clear understanding of evaluation order. In
particular, unless you understand what laziness is doing to the
evaluation order, you might find that the computations you spark off
with `par` do not run when you expected them to.
Then there are all the general difficulties that you face with
parallel programming, like getting the right granularity of work to do
in parallel. Using `par` is quite lightweight, so it can be used to
exploit reasonably fine-grained parallelism, but it is certainly not
free. Finally, parallel performance suffers when there is too much
garbage collection, so keeping this to a minimum, either by using more
efficient data structures or by increasing the available memory,
becomes an important factor.
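To make the first pitfall concrete, here is a small sketch of our own
(not taken from the paper above) of how a spark can go to waste:
nothing here forces `y` before the addition, so the parent thread is
free to start evaluating `x` itself, and the spark created by `par`
may simply fizzle rather than run on another core.

    import Control.Parallel (par)

    -- Without a pseq to order the evaluation, the parent thread may
    -- demand x first, and the spark for x fizzles instead of running
    -- in parallel.
    badsumfib :: Int -> Int
    badsumfib n = x `par` (x + y)
      where
        x = fib n
        y = negafib (-n)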
Nevertheless, it's well worth having a play with `par` and `pseq`. The
next step after that is to look at parallel strategies. Strategies is
a layer of abstraction on top of `par` and `pseq`. You might like to
read [Seq no more: Better Strategies for Parallel Haskell][1] by
Simon Marlow et al., which describes Strategies and the Eval monad.
It's all available in the [parallel library][2] on Hackage. More
recently, the `Par` monad has also been introduced as yet another way
of describing parallel evaluations. These key topics will no doubt
feature in a future word of the month, so stay tuned!
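In the meantime, as a taster, here is the example from above rewritten
with the Eval monad from `Control.Parallel.Strategies`. This is only a
minimal sketch: `rpar` sparks its argument for parallel evaluation,
while `rseq` evaluates its argument to weak head normal form before
continuing, playing roughly the roles of `par` and `pseq`.

    import Control.Parallel.Strategies (runEval, rpar, rseq)

    -- Sketch: the psumfib example expressed with the Eval monad.
    esumfib :: Int -> Int
    esumfib n = runEval $ do
      x <- rpar (fib n)
      y <- rseq (negafib (-n))
      return (x + y)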
Parallel GHC Project Update
----------------------------------------------------------------------
The Parallel GHC Project is an MSR-funded project to push the
real-world use of parallel Haskell. The aim is to demonstrate that
parallel Haskell can be employed successfully in industrial projects.
This month we're having a guest column from the team over at Los
Alamos National Laboratory, one of the partners involved in the
project (you can see the full details in report LA-UR 11-0341). They
have been working on writing Monte Carlo physics simulations in
Haskell, which has given them high levels of parallelism, along with
useful tools for abstraction. So, without further ado, over to Michael
Buksas from LANL:
Our goal is to build highly efficient Monte Carlo physics simulations
using parallel Haskell. We're focusing on SMP performance through some
combination of explicit threading and pure parallel annotations.
The Monte Carlo approach involves randomly sampling the space of
solutions to generate data which contributes to the solution. For
these physical problems, our samples are the tracks of particles as
they move through space, interacting with a physical material as they
go. Data collected from each particle trajectory is then combined into
information needed to compute the solution. For example, the detailed
information about the particle's interaction with the material is
collected into a collective effect on the material properties.
To date, we have a code base which includes two approaches to the
problem. One is a specific and parallel-tuned application code
targeting relativistic neutrino transport in stellar atmospheres. The
other is building a more general environment for creating specific
applications, such as this one.
We recently presented to our colleagues in LANL some preliminary
results on the parallel performance of the targeted application code.
To give a sense of the approach to parallelization in this code,
consider these high-level functions from an earlier serial version:
    main :: IO ()
    main = do
      (n, rest) <- parseCL
      let tally = runMany infMesh simpleMat n
      writeTally "tally" tally

    runMany :: Mesh -> Material -> Word32 -> RNG -> Tally
    runMany msh mat ntot rng = let
        ps      = genParticles ntot msh rng
        tallies = map (runParticle msh mat) $ ps
      in foldl' merge emptyTally tallies
And consider the following changes for the parallel version:
    main :: IO ()
    main = do
      (n, sz) <- parseCL
      let tally = feed infMesh simpleMat n sz prand
      writeTally "tally" tally

    feed :: Mesh -> Material -> Word32 -> Word32 -> RNG -> Tally
    feed msh mat ntot chunkSz rng
      | ntot <= chunkSz = runMany msh mat ntot rng
      | otherwise       = t `par` (ts `pseq` (merge t ts))
      where
        t        = runMany msh mat chunkSz g1
        ts       = feed msh mat (ntot - chunkSz) chunkSz g2
        (g1, g2) = split rng
We've wrapped the function `runMany` in `feed`, which partitions the
collection of generated particles into groups of size `chunkSz` and
issues each group to `runMany` in parallel.
With this simple change, we are seeing upwards of 80% utilization of
up to 8 cores, for a performance improvement greater than a factor of 6. We
believe that performance can be further improved with different
strategies for breaking down the work, and looking for additional
parallelization opportunities in the collection of results.
Our other branch of development is focused on finding useful
abstractions and high-level functions to support programming a variety
of Monte Carlo problems of this kind. We have identified a few such
useful abstractions, and implemented them as type classes and type
families.
For example, `Space` is a general term for the physical space and
imposed symmetries in which we can perform a simulation. We express
this as follows:
    class Space s where
      type Position s  :: *
      type Direction s :: *
      stream    :: s -> Distance -> s
      position  :: s -> Position s
      direction :: s -> Direction s
      make      :: Position s -> Direction s -> s
and implement specific spaces, such as one with the symmetry of the
unit sphere:
    instance Space Spherical1D where
      type Position Spherical1D  = Radius
      type Direction Spherical1D = Normalized Vector2
      stream (Vector2 x y) (Distance d) = Vector2 (x+d) y
      position s  = Radius $ vmag s
      direction s = normalize s
      make (Radius pos) dir = pos *| (normalized_value dir)
This allows the specific space data types to be used in a variety of
contexts. Using ordinary parametric polymorphism is also effective:
    -- | Stream a single particle:
    stream :: (p -> (e, p)) -- ^ Function to produce each step. Comes from a model.
           -> (e -> Bool)   -- ^ Check for terminal events to stop streaming.
           -> p             -- ^ Initial particle.
           -> [(e, p)]      -- ^ Resulting list of events and particle states.
    stream stepper continue p = next p
      where
        next p =
          let (e, p') = stepper p
          in  (e, p') : if continue e then next p' else []
The above is our high-level routine for generating a history from a
single particle, recorded as a list of (event, particle) pairs, where
the event and particle data types are provided by each problem.
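To make the shape of the result concrete, here is a small,
hypothetical usage of `stream`: the `Toy` and `Event` types below are
invented purely for illustration, and real applications supply their
own particle and event types.

    data Event = Step | Absorbed deriving Show
    data Toy   = Toy Int         deriving Show

    -- One step of a toy model: the particle counts down until it is
    -- absorbed.
    toyStep :: Toy -> (Event, Toy)
    toyStep (Toy n)
      | n <= 1    = (Absorbed, Toy 0)
      | otherwise = (Step, Toy (n - 1))

    keepGoing :: Event -> Bool
    keepGoing Step     = True
    keepGoing Absorbed = False

    history :: [(Event, Toy)]
    history = stream toyStep keepGoing (Toy 3)
    -- [(Step,Toy 2),(Step,Toy 1),(Absorbed,Toy 0)]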
Blogs, Papers, and Packages
----------------------------------------------------------------------
* [Fun with parallel monad comprehensions (19 July 2011)][3]
This blog post by Tomas Petricek featured in The Monad Reader 18,
and covers some of the interesting things that can be achieved
with monad comprehensions when viewed from a parallel perspective.
Along the way, he deals with examples such as the parallel
composition of parsers.
* [Parallelizing a nonogram solver (05 July 2011)][4]
Jasper Van der Jeugt detailed his implementation of a parallel
nonogram solver. Nonograms also go by the name of Paint Sudoku:
the aim is to colour in a grid where a list of numbers is given
for each row and column and these numbers indicate consecutive
runs of filled-in squares in the corresponding row or column. For
large puzzles (20x20 grids), Jasper reports that on a dual-core
machine his parallel algorithm reduces execution time by 37.9%
compared to its sequential counterpart.
* [The IVar monad (29 June)][5]
Edward Z. Yang has written a series of posts discussing IVars, which
are immutable variables that are written once and read many times
(these are particularly handy for communicating results from a child
process to its parent). Edward's post outlines the difficulties
involved in defining a monad for IVars.
* [Map reduce as a monad (05 July 2011)][7]
Julian Porter wrote an article for The Monad Reader 18 about how
MapReduce could be expressed as a monad. The MapReduce framework
finds its roots in functional programming, and this is an
interesting take on the problem.
Mailing list discussions
---------------------------------------------------------------------
* [NVIDIA's CUDA and Haskell (5 July 2011)][11]
Vasili Galchin was wondering whether or not there had been any
efforts to build bridges between NVIDIA's CUDA and Haskell. Don
Stewart was quick to respond with a number of links to active work
in the area:
* [Direct access to CUDA][20]
* [CUDA in Haskell][21]
* [Direct access to OpenCL][22]
* [The accelerate package][23]
Trevor McDonell noted that the accelerate package was best
accessed from the source repository on github, and that the CUDA
bindings hadn't yet been tested or updated for the latest toolkit
release.
* [Unbelievable parallel speedup (3 June 2011)][12]
While reading Simon Marlow's [tutorial][19] on parallel and
concurrent programming, John Ramsdell reported some remarkable
(slightly superlinear!) performance gains for one of his programs.
Thomas Schilling guessed that this was due to the large variance
in the figures reported, but went on to describe how it might be
possible to observe such performance boosts due to reduced local
cache misses when using several cores. Without more information
about the program in question, it's difficult to do any kind of
diagnosis, but nevertheless, it's great to hear about good results
from a happy Haskeller!
* [Automatic Reference Counting (2 July 2011)][13]
After hearing about the new static analysis tools in Clang that
do automatic reference counting (ARC), Thomas Davie was
wondering if some compiler gurus might be able to comment on the
applicability of this kind of analysis to Haskell, as an
alternative to garbage collection. This led to an enlightening
discussion about reference counting versus garbage collection.
* [Haskell on NUMA (16 June 2011)][14]
Michael Lesniak was wondering what the state of parallel
performance of Haskell on Non-Uniform Memory Access (NUMA)
machines was like, since he's having problems and can't find much
useful information online. Nobody seems to have answered this one;
are there any suggestions?
* [Parallel compilation and execution? (26 May 2011)][15]
Michael Rice was trying to figure out how to compile and run a
simple program that outputs the result of a parallel fibonacci
algorithm. After a quick reminder to use `pseq` rather than `seq`
to force sequential evaluation, Daniel Fischer suggested that
recompilation might be required, and that passing `-fforce-recomp`
would be a good way to ensure that this occurred.
Michael was also keen to know whether Control.Parallel was
comparable to OpenMP. Alex Mason gave a detailed reply and gave an
example of parallel mergesort as a means of comparison.
* [parMap doesn't work fine (12 May 2011)][16]
After just starting out with parallel computations in Haskell,
Grigory Sarnitskiy ran into trouble making `parMap` work with lazy
structures. To resolve these issues, Brandon Moore pointed to
using `rdeepseq`, and Maciej Piechotka suggested `deepseq`.
* [efficient parallel foldMap for lists/sequences (17 June 2011)][17]
Sebastian Fischer re-posted his question about efficient parallel
`foldMap` for lists to the parallel mailing list. In essence he
was seeking an efficient implementation of `foldMap`, where a
function is mapped over each element of a list and the results are
combined into a single value. Johannes Waldmann advised against using
ordinary lists,
and mentioned that he was using `Data.Vector` instead.
Additionally, he recommended switching to a sequential fold once a
parallel fold had been used to a certain depth. Christopher Brown
further confirmed that it was a good idea to spark off
computations when the granularity is high enough to make it
worthwhile, and also mentioned that it was best to spark
computations that were evaluated to normal form. A rough sketch along
these lines appears just after this list.
* [Wanted: parallel Haskell tutorial/talk/demonstration in Leipzig,
Germany, October 7 (8 July 2011)][18]
Johannes Waldmann is looking for volunteers who might be able to
present at their local Haskell Workshop, and welcomes submissions
on parallel and distributed computing using Haskell. The
submission deadline is 20 August.
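Here is the rough sketch promised in the `foldMap` item above: a
depth-bounded parallel fold in the spirit of that advice. This is our
own illustration rather than code from the thread, and it uses plain
lists for brevity even though the discussion recommends `Data.Vector`.
It sparks the two halves of the fold in parallel down to a fixed
depth, forces sparked results to normal form, and falls back to a
sequential fold below that depth.

    import Control.DeepSeq (NFData, force)
    import Control.Parallel (par, pseq)
    import Data.Monoid (Monoid, mappend, mconcat)

    -- Spark the two halves of the fold in parallel down to the given
    -- depth; below that, fold sequentially. force ensures that the
    -- sparked results are evaluated to normal form.
    parFoldMap :: (NFData m, Monoid m) => Int -> (a -> m) -> [a] -> m
    parFoldMap depth f xs
      | depth <= 0 || length xs < 2 = mconcat (map f xs)
      | otherwise = l `par` (r `pseq` (l `mappend` r))
      where
        (ys, zs) = splitAt (length xs `div` 2) xs
        l = force (parFoldMap (depth - 1) f ys)
        r = force (parFoldMap (depth - 1) f zs)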
Stack Overflow
----------------------------------------------------------------------
* [How to write nested loop problem using parallel strategies in Haskell][8]
* [How to measure sequential and parallel runtimes of Haskell program][9]
* [Poor performance / lockup with STM][10]
Help and Feedback
----------------------------------------------------------------------
If you'd like to make an announcement in the next Haskell Parallel
Digest, then get in touch with me, Nicolas Wu, at
<parallel at well-typed.com>. Please feel free to leave any comments and
feedback!
[0]: http://www.haskell.org/haskellwiki/Parallel_GHC_Project
[1]: http://research.microsoft.com/apps/pubs/default.aspx?id=138042
[2]: http://hackage.haskell.org/package/parallel
[3]: http://tomasp.net/blog/comprefun.aspx
[4]: http://jaspervdj.be/posts/2011-07-05-parallelizing-a-nonogram-solver.html
[5]: http://blog.ezyang.com/2011/06/the-iva-monad/
[6]: http://themonadreader.wordpress.com/2011/07/11/call-for-copy-issue-19-special-issue-on-parallelism-and-concurrency/
[7]: http://themonadreader.files.wordpress.com/2011/07/issue18.pdf
[8]: http://stackoverflow.com/questions/6444716/how-to-write-nested-loop-problem-using-parallel-strategies-in-haskell
[9]: http://stackoverflow.com/questions/6623316/how-to-measure-sequential-and-parallel-runtimes-of-haskell-program
[10]: http://stackoverflow.com/questions/6439925/poor-performance-lockup-with-stm
[11]: http://www.haskell.org/pipermail/haskell-cafe/2011-July/093751.html
[12]: http://www.haskell.org/pipermail/haskell-cafe/2011-June/092661.html
[13]: http://www.haskell.org/pipermail/haskell-cafe/2011-July/093689.html
[14]: http://www.haskell.org/pipermail/haskell-cafe/2011-June/093231.html
[15]: http://www.haskell.org/pipermail/haskell-cafe/2011-May/092389.html
[16]: http://www.haskell.org/pipermail/haskell-cafe/2011-May/091843.html
[17]: http://groups.google.com/group/parallel-haskell/t/b34d796e6672fd6b
[18]: http://groups.google.com/group/parallel-haskell/t/f5f0946d0780b59b
[19]: http://community.haskell.org/~simonmar/par-tutorial.pdf
[20]: http://hackage.haskell.org/package/cuda
[21]: http://hackage.haskell.org/package/language-c-quote
[22]: http://hackage.haskell.org/package/OpenCLRaw
[23]: http://hackage.haskell.org/package/accelerate
[24]: http://community.haskell.org/~simonmar/papers/threadscope.pdf