[Haskell-cafe] Problem parallelising simple algorithm
Jens Blanck
jens.blanck at gmail.com
Thu Aug 20 14:51:16 UTC 2015
So, I tried to use the Par monad instead of the Eval monad (only splitting
the work into two parts, see below).
I now get (with 1 and 4 cores):
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.994s ( 0.998s elapsed)
GC time 0.048s ( 0.051s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 1.042s ( 1.050s elapsed)
------------
TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 1.083s ( 0.922s elapsed)
GC time 0.116s ( 0.038s elapsed)
EXIT time 0.001s ( 0.000s elapsed)
Total time 1.200s ( 0.961s elapsed)
So, some slight gain, but not a lot. Admittedly, the call to fromRational
seems to take a good chunk of time.
Any comments?
Jens
---------------
import Control.Monad.Par hiding (parMap)
import Data.Ratio
pqCombine :: (Integer, Integer) -> (Integer, Integer) -> (Integer, Integer)
pqCombine (pl, ql) (pr, qr) = (pl*qr+pr, ql*qr)
pq :: (Integer, Integer) -> (Integer, Integer)
pq (a, b)
| d > 5 = let m = (a+b+1) `div` 2
pql = pq (a, m)
pqr = pq (m, b)
in pqCombine pql pqr
| otherwise = (sum $ scanl1 (*) [b,b-1..a+1], product [a+1..b])
where d = b - a
main = print . flip seq () . (\(p,q) -> fromRational (p%q)) . runPar $ do
i <- new
j <- new
fork (put i (pq (0,160000)))
fork (put j (pq (160000,320000)))
a <- get i
b <- get j
return (pqCombine a b)
On Fri, 14 Aug 2015 at 10:31 Jens Blanck <jens.blanck at gmail.com> wrote:
> The following code simply computes the Euler constant by binary
> splitting. I'm struggling to get any speed-up on multi-core.
>
> Why do I get a fair number of fizzled?
> Why don't I get any real speed-up even if it claims to run tasks in
> parallel?
>
> Could I have exhausted the integer arithmetic unit(s) on my chip
> (i7-4790)?
> How would I verify that?
>
> The following is the code and typical runs on 1 and 4 cores respectively.
>
> ---------
>
> import Control.Parallel.Strategies
> import Control.DeepSeq
>
> import Data.Ratio
>
> divConq :: (NFData b) => (a -> b)
> -> a
> -> (a -> Bool)
> -> (b -> b -> b)
> -> (a -> Maybe (a,a))
> -> b
> divConq f arg threshold conquer divide = go arg
> where
> go arg =
> case divide arg of
> Nothing -> f arg
> Just (l0,r0) -> conquer l1 r1 `using` strat
> where
> l1 = go l0
> r1 = go r0
> strat x = do r l1; r r1; return x
> where r | threshold arg = rdeepseq
> | otherwise = rpar
>
> pqCombine :: (Integer, Integer) -> (Integer, Integer) -> (Integer, Integer)
> pqCombine (pl, ql) (pr, qr) = (pl*qr+pr, ql*qr)
>
> pq :: (Integer, Integer) -> (Integer, Integer)
> pq (a, b) = (\t -> (sum t, last t)) $ scanl1 (*) [b,b-1..a+1]
>
> euler :: Integer -> Rational
> euler n =
> let (p,q) = divConq pq
> (0,n)
> (\(a,b) -> b-a < 10000)
> pqCombine
> (\(a,b) -> if b-a > 5
> then let m = (a+b+1) `div` 2 in Just
> ((a,m), (m, b))
> else Nothing)
> in p%q
>
> main = print $ euler 320000 `seq` ()
>
> ----------
>
> > ./BinSplit +RTS -s -N1
> ()
> 178,375,880 bytes allocated in the heap
> 2,452,040 bytes copied during GC
> 3,222,696 bytes maximum residency (7 sample(s))
> 883,040 bytes maximum slop
> 11 MB total memory in use (2 MB lost due to fragmentation)
>
> Tot time (elapsed) Avg pause Max
> pause
> Gen 0 333 colls, 0 par 0.004s 0.002s 0.0000s
> 0.0000s
> Gen 1 7 colls, 0 par 0.001s 0.001s 0.0001s
> 0.0003s
>
> TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)
>
> SPARKS: 126 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 126 fizzled)
>
> INIT time 0.001s ( 0.000s elapsed)
> MUT time 0.928s ( 0.936s elapsed)
> GC time 0.005s ( 0.003s elapsed)
> EXIT time 0.001s ( 0.000s elapsed)
> Total time 0.935s ( 0.939s elapsed)
>
> Alloc rate 192,215,387 bytes per MUT second
>
> Productivity 99.4% of total user, 98.9% of total elapsed
>
> gc_alloc_block_sync: 0
> whitehole_spin: 0
> gen[0].sync: 0
> gen[1].sync: 0
>
>
>
> > ./BinSplit +RTS -s -N4
> ()
> 178,727,480 bytes allocated in the heap
> 3,506,488 bytes copied during GC
> 3,650,032 bytes maximum residency (7 sample(s))
> 934,976 bytes maximum slop
> 12 MB total memory in use (1 MB lost due to fragmentation)
>
> Tot time (elapsed) Avg pause Max
> pause
> Gen 0 141 colls, 141 par 0.009s 0.002s 0.0000s
> 0.0001s
> Gen 1 7 colls, 6 par 0.003s 0.001s 0.0001s
> 0.0001s
>
> Parallel GC work balance: 38.80% (serial 0%, perfect 100%)
>
> TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)
>
> SPARKS: 126 (12 converted, 0 overflowed, 0 dud, 0 GC'd, 114 fizzled)
>
> INIT time 0.002s ( 0.002s elapsed)
> MUT time 2.104s ( 0.946s elapsed)
> GC time 0.012s ( 0.003s elapsed)
> EXIT time 0.001s ( 0.000s elapsed)
> Total time 2.119s ( 0.951s elapsed)
>
> Alloc rate 84,946,520 bytes per MUT second
>
> Productivity 99.3% of total user, 221.3% of total elapsed
>
> gc_alloc_block_sync: 600
> whitehole_spin: 0
> gen[0].sync: 9
> gen[1].sync: 1073
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20150820/9f3acbd2/attachment.html>
More information about the Haskell-Cafe
mailing list