[Haskell-cafe] Parallel parsing & multicore

Anakim Border akborder at gmail.com
Wed Sep 9 09:35:32 EDT 2009


Hi,

I've put together a parser that works like this:

- the full input is read into a strict ByteString;
- this string is split into lines;
- each line is fed, independently, to a parser based on Parsec (version 3).
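The setup can be sketched roughly as follows; the `record` parser and its result type are hypothetical stand-ins for the real code, and in the actual program the input would come from B.readFile rather than an inline sample:

```haskell
import qualified Data.ByteString.Char8 as B
import Text.Parsec
import Text.Parsec.ByteString (Parser)

-- Hypothetical per-line parser: a word, a space, and an integer.
record :: Parser (String, Int)
record = do
  name <- many1 letter
  _    <- space
  n    <- many1 digit
  return (name, read n)

-- Split the strict ByteString into lines and parse each one independently.
parseAll :: B.ByteString -> [Either ParseError (String, Int)]
parseAll = map (parse record "") . B.lines

main :: IO ()
main = print (parseAll (B.pack "foo 1\nbar 2"))
```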


Running the serial version of the parser (compiled with -O2
optimizations) on a rather large file, I get the following:

$ ./Parser +RTS -s
   2,175,464,100 bytes allocated in the heap
      49,293,632 bytes copied during GC
       5,297,560 bytes maximum residency (6 sample(s))
       1,406,800 bytes maximum slop
              16 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  4140 collections,     0 parallel,  0.39s,  0.42s elapsed
  Generation 1:     6 collections,     0 parallel,  0.05s,  0.07s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    1.93s  (  1.99s elapsed)
  GC    time    0.44s  (  0.48s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    2.37s  (  2.47s elapsed)

  %GC time      18.7%  (19.5% elapsed)

  Alloc rate    1,127,336,508 bytes per MUT second

  Productivity  81.2% of total user, 78.1% of total elapsed


The profile report:

COST CENTRE                    MODULE               %time %alloc

>>=_a6OH                       Text.Parsec.Prim      20.6   13.0
tokenPrimEx                    Text.Parsec.Prim      16.5   20.8
fmap_a6Qj                      Text.Parsec.Prim      15.5    3.6
parserBind                     Text.Parsec.Prim      12.4    9.3
mplus_a6KS                     Text.Parsec.Prim      10.3   10.6
notFollowedBy                  Text.Parsec.Combinator   3.1    4.4
[...]

shows that, as expected, most of the execution time is spent in the
parser itself (not in the read and line-split operations).


Given that each line is parsed independently, I used the following code:

map parse lines `using` parListChunk 250 rnf

to spark a separate computation for each group of 250 lines. I made
the datatype used by the parser an instance of NFData, so that the
"rnf" strategy should force a complete evaluation.
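Concretely, that looks something like the sketch below, written against the current Control.Parallel.Strategies API, where `rdeepseq` plays the role of the older `rnf` strategy; `Record` and `parseOne` are simplified stand-ins for the real Parsec parser:

```haskell
import Control.DeepSeq (NFData (..), deepseq)
import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Simplified stand-in for the parser's result type.
data Record = Record String Int deriving (Eq, Show)

-- NFData instance so the deep-evaluation strategy can force
-- results completely inside each spark.
instance NFData Record where
  rnf (Record s n) = s `deepseq` n `deepseq` ()

-- Stand-in for the real per-line parser: "word <space> integer".
parseOne :: String -> Record
parseOne l = case words l of
  [w, n] -> Record w (read n)
  _      -> error ("bad line: " ++ l)

-- Spark one computation per chunk of 250 lines; rdeepseq forces
-- each Record to normal form within its spark.
parseMany :: [String] -> [Record]
parseMany ls = map parseOne ls `using` parListChunk 250 rdeepseq

main :: IO ()
main = print (parseMany ["foo 1", "bar 2"])
```

Compiled with -threaded and run with +RTS -N2, as in the measurements below.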


The threaded version running on 2 cores is moderately faster than the
serial one:

$ ./Parser +RTS -s -N2
   2,377,165,256 bytes allocated in the heap
      36,320,944 bytes copied during GC
       6,020,720 bytes maximum residency (6 sample(s))
       6,933,928 bytes maximum slop
              21 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  2410 collections,     0 parallel,  0.33s,  0.34s elapsed
  Generation 1:     6 collections,     4 parallel,  0.06s,  0.05s elapsed

  Parallel GC work balance: 1.83 (2314641 / 1265968, ideal 2)

  Task  0 (worker) :  MUT time:   2.43s  (  1.19s elapsed)
                      GC  time:   0.02s  (  0.02s elapsed)

  Task  1 (worker) :  MUT time:   2.15s  (  1.19s elapsed)
                      GC  time:   0.29s  (  0.30s elapsed)

  Task  2 (worker) :  MUT time:   2.37s  (  1.19s elapsed)
                      GC  time:   0.07s  (  0.08s elapsed)

  Task  3 (worker) :  MUT time:   2.45s  (  1.19s elapsed)
                      GC  time:   0.00s  (  0.00s elapsed)

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    2.06s  (  1.19s elapsed)
  GC    time    0.39s  (  0.39s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    2.45s  (  1.58s elapsed)

  %GC time      15.7%  (24.9% elapsed)

  Alloc rate    1,151,990,234 bytes per MUT second

  Productivity  84.2% of total user, 130.2% of total elapsed


The speedup is smaller than what I was expecting given that each unit
of work (250 input lines) is completely independent from the others.
Changing the size of each work unit did not help; garbage collection
times are small enough that increasing the minimum heap size did not
produce any speedup either.

Is there anything else I can do to understand why the parallel map
does not provide a significant speedup?


Thanks,
  AB

