SMP parallelism gains inferior than expected

Tue Mar 1 12:13:54 CET 2011

On 24/02/2011 13:26, José Pedro Magalhães wrote:
> (Forwarding to haskell-cafe)
>
> Hi,
>
> I have a program that computes a matrix of Floats of m rows by n
> columns. Computing each Float is relatively expensive. Each line is
> completely independent of the others, so I thought I'd try some simple
> SMP parallelism on this code:
>
> myFun :: FilePath -> IO ()
> myFun fp =
>    do fs <- readDataDir fp
>       let process f = readFile' f >>= parse
>           printLine = putStrLn . foldr (\a b -> show a ++ "\t" ++ b) ""
>           runDiff l = [ [ diff x y | y <- l ]
>                       | (x,i) <- zip l (map getId fs), myFilter i ]
>       ps <- mapM process fs
>       sequence_ [ printLine x | x <- runDiff ps _`using` parList rdeepseq_ ]
>
> So, I'm using parList to evaluate the rows in parallel, and fully
> evaluating each row. Here are the timings on a Dual Quad Core AMD 2378
> @2.4 GHz, ghc-6.12.3, parallel-2.2.0.1:
>
> -N      time (ms)
> none    1m50
> 2       1m33
> 3       1m35
> 4       1m22
> 5       1m11
> 6       1m06
> 7       1m45
>
> The increase at 7 is justified by the fact that there were two other
> processes running. I don't know how to justify the small increase at N3,
> though, but that doesn't matter too much. The problem is that I am not
> getting the gains I expected (halving at N2, a third at N3, etc.). Is
> this the best one can achieve with this implicit parallelism, or am I
> doing something wrong? In particular, is the way I'm printing the
> results at the end destroying potential parallel gains?
>
> Any insights on this are appreciated.

I haven't investigated your program in detail, but here's how I would go 
about analysing it:

   - first compile with -eventlog, run with +RTS -ls, and analyse
     it with ThreadScope.  This will tell you whether the program
     is actually using more than one core, and for what portion of
     its runtime.  Often you find that it is sequential for a large
     fraction of the execution time, such as reading in the input,
     or writing the output.

   - run with +RTS -s, and check the SPARKS.  Make sure you aren't
     generating too many (more than 4096 will be discarded) or too
     few (not enough parallelism).

   - generating large amounts of live data in the heap will tend to
     hurt parallelism, because GC doesn't parallelise well, especially
     when the heap data is linear (i.e. lists rather than trees).
     Look for a lot of time being spent in GC, and a figure for
     "balance" significantly less than the -N value.

Cheers,
	Simon