[Haskell-cafe] Parallel parsing & multicore
Anakim Border
akborder at gmail.com
Wed Sep 9 09:35:32 EDT 2009
Hi,
I've put together a parser that works like this:
- the full input is read into a strict ByteString;
- this string is split into a number of lines;
- each line is fed (independently) to a parser based on Parsec (version 3).
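For reference, the pipeline above could be sketched roughly like this (the record type and the per-line grammar are placeholders of mine, since the post does not show the actual parser):

```haskell
module Main where

import qualified Data.ByteString.Char8 as B
import Data.Either (rights)
import Text.Parsec (char, digit, many1, parse)
import Text.Parsec.ByteString (Parser)  -- Parsec 3, strict ByteString stream

-- Placeholder for the poster's datatype; the real one is not shown.
data Entry = Entry Int Int deriving Show

-- Placeholder per-line grammar: two comma-separated integers.
entry :: Parser Entry
entry = Entry <$> (read <$> many1 digit) <* char ','
              <*> (read <$> many1 digit)

main :: IO ()
main = do
  contents <- B.readFile "input.txt"          -- full input as a strict ByteString
  let ls      = B.lines contents              -- split into lines
      results = map (parse entry "<line>") ls -- each line parsed independently
  print (length (rights results))
```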
Running the serial version of the parser (compiled with -O2
optimizations) on a rather large file, I get the following:
$ ./Parser +RTS -s
2,175,464,100 bytes allocated in the heap
49,293,632 bytes copied during GC
5,297,560 bytes maximum residency (6 sample(s))
1,406,800 bytes maximum slop
16 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 4140 collections, 0 parallel, 0.39s, 0.42s elapsed
Generation 1: 6 collections, 0 parallel, 0.05s, 0.07s elapsed
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.93s ( 1.99s elapsed)
GC time 0.44s ( 0.48s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.37s ( 2.47s elapsed)
%GC time 18.7% (19.5% elapsed)
Alloc rate 1,127,336,508 bytes per MUT second
Productivity 81.2% of total user, 78.1% of total elapsed
The profile report:
COST CENTRE MODULE %time %alloc
>>=_a6OH Text.Parsec.Prim 20.6 13.0
tokenPrimEx Text.Parsec.Prim 16.5 20.8
fmap_a6Qj Text.Parsec.Prim 15.5 3.6
parserBind Text.Parsec.Prim 12.4 9.3
mplus_a6KS Text.Parsec.Prim 10.3 10.6
notFollowedBy Text.Parsec.Combinator 3.1 4.4
[...]
shows that most of the execution time is spent in the parser itself
(not in the read and line-split operations), as expected.
Given that each line is parsed independently, I used the following code:
map parse lines `using` parListChunk 250 rnf
to spark a separate computation for each group of 250 lines. I made
the datatype produced by the parser an instance of NFData, so that the
"rnf" strategy forces a complete evaluation of each result.
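Spelled out, the setup looks something like the sketch below (again with a placeholder result type standing in for the real datatype; note that in current versions of the parallel and deepseq packages the deep-evaluation strategy is called rdeepseq, whereas the older API the post uses exported rnf directly as a Strategy):

```haskell
import Control.DeepSeq (NFData (..))
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Placeholder for the datatype produced by the parser.
data Entry = Entry Int Int

-- NFData instance so the strategy can force each result completely.
instance NFData Entry where
  rnf (Entry a b) = rnf a `seq` rnf b

-- Parse all lines, sparking one computation per chunk of 250 lines.
-- The first argument stands in for the poster's actual per-line parser.
parseAll :: (a -> Entry) -> [a] -> [Entry]
parseAll parseLine ls = map parseLine ls `using` parListChunk 250 rdeepseq
```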
The threaded version running on 2 cores is moderately faster than the
serial one:
$ ./Parser +RTS -s -N2
2,377,165,256 bytes allocated in the heap
36,320,944 bytes copied during GC
6,020,720 bytes maximum residency (6 sample(s))
6,933,928 bytes maximum slop
21 MB total memory in use (0 MB lost due to fragmentation)
Generation 0: 2410 collections, 0 parallel, 0.33s, 0.34s elapsed
Generation 1: 6 collections, 4 parallel, 0.06s, 0.05s elapsed
Parallel GC work balance: 1.83 (2314641 / 1265968, ideal 2)
Task 0 (worker) : MUT time: 2.43s ( 1.19s elapsed)
GC time: 0.02s ( 0.02s elapsed)
Task 1 (worker) : MUT time: 2.15s ( 1.19s elapsed)
GC time: 0.29s ( 0.30s elapsed)
Task 2 (worker) : MUT time: 2.37s ( 1.19s elapsed)
GC time: 0.07s ( 0.08s elapsed)
Task 3 (worker) : MUT time: 2.45s ( 1.19s elapsed)
GC time: 0.00s ( 0.00s elapsed)
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.06s ( 1.19s elapsed)
GC time 0.39s ( 0.39s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.45s ( 1.58s elapsed)
%GC time 15.7% (24.9% elapsed)
Alloc rate 1,151,990,234 bytes per MUT second
Productivity 84.2% of total user, 130.2% of total elapsed
The speedup is smaller than I expected, given that each unit of work
(250 input lines) is completely independent of the others.
Changing the size of each work unit did not help; garbage collection
times are small enough that increasing the minimum heap size did not
produce any speedup either.
Is there anything else I can do to understand why the parallel map
does not provide a significant speedup?
Thanks,
AB