[Haskell-cafe] Fast parallel binary-trees for the shootout: Control.Parallel.Strategies FTW!

Don Stewart dons at galois.com
Mon Sep 1 18:18:13 EDT 2008


The Computer Language Benchmarks Game recently got a quad core 64 bit
machine. Initial results just porting the single-threaded Haskell
benchmarks to 64 bit look rather pleasing,

    http://shootout.alioth.debian.org/u64q/benchmark.php?test=all&lang=all

Note that C++, D, Clean et al aren't ported to the 64 bit machine yet
(and perhaps some implementations won't survive the 64 bit switch).

Now, none of the Haskell programs are annotated for parallelism yet,
despite that quad core sitting there ready to go. I've a conjecture that 
fast, smp capable, concise code, of the kind GHC produces, could
dominate the shootout for a while to come, if we can effectively utilise
those cores. This is a chance to make an impression!

So, the first parallel benchmark has been submitted by Tom Davies and
me, a parallel implementation of the binary-trees benchmark, using the
Control.Parallel.Strategies lib to set up some top level parallel tasks.

Initial results look promising, with the old entry:

* Single core, no heap hints: 26.9s

    $ ghc -O2 -fasm Original.hs --make
    $ time ./Original 20
    ./Original 20  26.79s user 0.16s system 100% cpu 26.933 total

    Looking at the GC stats (-sstderr) we see its spending too much time
    doing GC, 57.8%, so we can improve that by setting a default heap size:

* Single core, with heap hints: 13.7s

    $ time ./Original 20 -A350M
    ...
    ./Original 20 +RTS -A350M  13.54s user 0.27s system 100% cpu 13.791 total

    Halves runtime in comparison to the current entry, and reduces GC to 3.8%.

* Now, the parallel version, changing one line to use a parMap strategy,
and using the N+1 capabilities rule,
    
    $ ghc -O2 -fasm Parallel2.hs --make -threaded
    $ time ./Parallel2 20 +RTS -N5 -A350M
    ...
    ./Parallel2 20 +RTS -N5 -A350M  15.38s user 1.57s system 305% cpu 5.684 total

So 305% productive (not too bad for this memory-heavy program), and a
good speedup.

Observations:

    * check GC stats with -sstderr, then use -AxxxM to set a default heap size
    * parallelise the top level control using speculative strategies
    * Use -N+1 capabilities.

-- Don
-------------- next part --------------
{-# OPTIONS -fbang-patterns -funbox-strict-fields #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- Contributed by Don Stewart
--

import System
import Data.Bits
import Text.Printf

--
-- an artificially strict tree.
--
-- normally you would ensure the branches are lazy, but this benchmark
-- requires strict allocation.
--
data Tree = Nil | Node !Int !Tree !Tree

minN = 4

io s n t = printf "%s of depth %d\t check: %d\n" s n t

main = do
    n <- getArgs >>= readIO . head
    let maxN     = max (minN + 2) n
        stretchN = maxN + 1

    -- stretch memory tree
    let c = check (make 0 stretchN)
    io "stretch tree" stretchN c

    -- allocate a long lived tree
    let !long    = make 0 maxN

    -- allocate, walk, and deallocate many bottom-up binary trees
    let vs = depth minN maxN
    mapM_ (\((m,d,i)) -> io (show m ++ "\t trees") d i) vs

    -- confirm the the long-lived binary tree still exists
    io "long lived tree" maxN (check long)

-- generate many trees
depth :: Int -> Int -> [(Int,Int,Int)]
depth d m
    | d <= m    = (2*n,d,sumT d n 0) : depth (d+2) m
    | otherwise = []
  where n = 1 `shiftL` (m - d + minN)

-- allocate and check lots of trees
sumT :: Int -> Int -> Int -> Int
sumT d 0 t = t
sumT  d i t = sumT d (i-1) (t + a + b)
  where a = check (make i    d)
        b = check (make (-i) d)

-- traverse the tree, counting up the nodes
check :: Tree -> Int
check Nil          = 0
check (Node i l r) = i + check l - check r

-- build a tree
make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (i2-1) d2) (make i2 d2)
  where i2 = 2*i; d2 = d-1
-------------- next part --------------
{-# OPTIONS -fbang-patterns -funbox-strict-fields #-}
--
-- The Computer Language Shootout
-- http://shootout.alioth.debian.org/
--
-- Contributed by Don Stewart and Thomas Davie
--
-- This implementation uses a parallel strategy to exploit the quad core machine.
-- For more information about Haskell parallel strategies, see,
--
--  http://www.macs.hw.ac.uk/~dsg/gph/papers/html/Strategies/strategies.html
--

import System
import Data.Bits
import Text.Printf
import Control.Parallel.Strategies
import Control.Parallel

--
-- an artificially strict tree.
--
-- normally you would ensure the branches are lazy, but this benchmark
-- requires strict allocation.
--
data Tree = Nil | Node !Int !Tree !Tree

minN = 4

io s n t = printf "%s of depth %d\t check: %d\n" s n t

main = do
    n <- getArgs >>= readIO . head
    let maxN     = max (minN + 2) n
        stretchN = maxN + 1

    -- stretch memory tree
    let c = check (make 0 stretchN)
    io "stretch tree" stretchN c

    -- allocate a long lived tree
    let !long    = make 0 maxN

    -- allocate, walk, and deallocate many bottom-up binary trees
    let vs = (parMap rnf) (depth' maxN) [minN,minN+2..maxN]
    mapM_ (\((m,d,i)) -> io (show m ++ "\t trees") d i) vs

    -- confirm the the long-lived binary tree still exists
    io "long lived tree" maxN (check long)

-- generate many trees
depth' :: Int -> Int -> (Int,Int,Int)
depth' m d =
  (2*n,d,sumT d n 0)
  where
    n = 1 `shiftL` (m - d + minN)

-- allocate and check lots of trees
sumT :: Int -> Int -> Int -> Int
sumT d 0 t = t
sumT  d i t = sumT d (i-1) (t + a + b)
  where a = check (make i    d)
        b = check (make (-i) d)

-- traverse the tree, counting up the nodes
check :: Tree -> Int
check Nil          = 0
check (Node i l r) = i + check l - check r

-- build a tree
make :: Int -> Int -> Tree
make i 0 = Node i Nil Nil
make i d = Node i (make (i2-1) d2) (make i2 d2)
  where i2 = 2*i; d2 = d-1


More information about the Haskell-Cafe mailing list