[Haskell-beginners] Increasing capabilities dramatically increases execution time

Thomas Koster tkoster at gmail.com
Fri Jan 22 06:33:33 UTC 2016


Hi friends,

I have encountered a situation in a concurrent program where I see an
unexpected, dramatic increase in execution time when I add
capabilities. On a multi-core CPU, "-N1" is fastest by an order of
magnitude, and the program slows further with each additional
capability (while still using fewer capabilities than real cores, of
course).

My question is, why is this so and what can I do to improve this?

Some details:

I have a shared data structure which I call a "store", which is
basically a strict HashMap String Value. This structure is shared
between Haskell threads using an IORef/MVar pair:

import Control.Concurrent.MVar (MVar, modifyMVar)
import Data.HashMap.Strict (HashMap)
import Data.IORef (IORef, readIORef, writeIORef)

type Key = String

data Store = Store (IORef (HashMap Key Value)) (MVar ())

Focus on the IORef/MVar pair - the HashMap itself is not important. My
intention is that read-only transactions can take the pure value
straight from the IORef without blocking writers or other readers,
whereas mutating transactions (those that will update the IORef when
they complete) are serialized using the MVar. Alternatively, you can
regard the read-only transaction as working with a private snapshot of
the store that is discarded after it completes.

It is my hope that this technique will allow my program to exploit a
multi-core CPU by running several read-only transactions and at most
one mutating transaction in parallel. I also assume that this
technique is marginally more efficient than STM for this use case,
especially under write-heavy loads, where I expect STM to waste some
time on retries (I have not tested this).
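
For comparison, the STM alternative I have in mind would look roughly
like this (an untested sketch with a single TVar; the names are mine):

import Control.Concurrent.STM

data StoreSTM = StoreSTM (TVar (HashMap Key Value))

-- Readers never retry: readTVarIO takes a snapshot without running
-- a full transaction.
withStoreSTM :: StoreSTM -> (HashMap Key Value -> a) -> IO a
withStoreSTM (StoreSTM tvar) f = fmap f (readTVarIO tvar)

-- Concurrent writers conflict on the single TVar and retry, which is
-- where I assume the wasted work would come from.
modifyStoreSTM :: StoreSTM -> (HashMap Key Value -> (HashMap Key Value, a)) -> IO a
modifyStoreSTM (StoreSTM tvar) f = atomically $ do
  store <- readTVar tvar
  let (store', x) = f store
  writeTVar tvar $! store'  -- force to WHNF before committing
  return x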

-- | Execute a read-only transaction that returns a value.
withStore :: Store -> (HashMap Key Value -> a) -> IO a
withStore (Store ioref _) f = do
  store <- readIORef ioref          -- take a snapshot; never blocks
  return (f store)

-- | Execute a transaction that updates the store and returns a value.
modifyStore :: Store -> (HashMap Key Value -> (HashMap Key Value, a)) -> IO a
modifyStore (Store ioref mvar) f =
  modifyMVar mvar $ \ z -> do       -- hold the () token, serializing writers
    store <- readIORef ioref
    let (store', x) = f store
    store' `seq` writeIORef ioref store'  -- force to WHNF before publishing
    return (z, x)                   -- put the token back, return the result
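
For example (the key "k" and the update function f here are only
illustrative; HM is Data.HashMap.Strict):

  n <- withStore store HM.size                        -- never blocks
  modifyStore store $ \ m -> (HM.adjust f "k" m, ())  -- serialized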

Stop me right here if this is a silly way to do this.

I have a test that forks 4 threads, each of which increments a counter
in the store 100000 times, with the expectation that the final answer
is 400000. I use the "async" package for this. This test is not meant
to be pathological: I expect simple operations like incrementing
counters and concatenating text to be typical transactions.

  threads <- replicateM 4 $ async $
               replicateM_ 100000 $
                 modifyStore store (...increment a counter...)
  forM_ threads wait
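
Spelled out in full, the test is essentially the following (with a
plain counter increment standing in for the real EDSL interpreter, and
assuming Value is Int for this cut-down version):

import Control.Concurrent.Async (async, wait)
import Control.Concurrent.MVar (newMVar)
import Control.Monad (forM_, replicateM, replicateM_)
import qualified Data.HashMap.Strict as HM
import Data.IORef (newIORef)

main :: IO ()
main = do
  ioref <- newIORef (HM.singleton "counter" 0)
  mvar  <- newMVar ()
  let store = Store ioref mvar
  threads <- replicateM 4 $ async $
               replicateM_ 100000 $
                 modifyStore store $ \ m ->
                   (HM.adjust (+ 1) "counter" m, ())
  forM_ threads wait
  withStore store (HM.lookup "counter") >>= print  -- expect Just 400000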

In this test, while any thread is busy modifying the store, the other
three are blocked on the empty MVar, irrespective of how many
capabilities I have. Therefore, I expect the execution time with -N4
to be similar to -N1. I expect the only difference to be attributable
to the runtime's scheduling overheads, which I assume are relatively
insignificant. Instead, I see a dramatic increase in execution time
with increasing capabilities (typical measurements below).

By the way, the reason I assume scheduling overheads are "relatively
insignificant" compared to incrementing the counter is that in my
program the counter is incremented by interpreting an imperative
EDSL, which I assume is relatively expensive compared to e.g. "succ",
but perhaps I am mistaken. In a way, my whole question probably
centres around this assumption.

I would be grateful if anyone can illuminate the reason for this
dramatic increase in execution time when I increase the number of
capabilities, and for any hints on how I can mitigate it.

I am using GHC 7.10.2 and compile with -O -threaded. All library
versions are as at Stackage LTS 3.2.

A typical measurement, with -N1:

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       516 colls,     0 par    0.000s   0.003s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.000s   0.001s     0.0003s    0.0003s

  TASKS: 4 (1 bound, 3 peak workers (3 total), using -N1)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.001s elapsed)
  MUT     time    0.136s  (  0.139s elapsed)
  GC      time    0.000s  (  0.003s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    0.136s  (  0.144s elapsed)

With -N2:

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       334 colls,   334 par    0.012s   0.006s     0.0000s    0.0001s
  Gen  1         2 colls,     1 par    0.000s   0.000s     0.0002s    0.0003s

  Parallel GC work balance: 39.33% (serial 0%, perfect 100%)

  TASKS: 6 (1 bound, 5 peak workers (5 total), using -N2)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.002s elapsed)
  MUT     time    2.032s  (  2.456s elapsed)
  GC      time    0.012s  (  0.007s elapsed)
  EXIT    time    0.000s  (  0.000s elapsed)
  Total   time    2.044s  (  2.465s elapsed)

With -N4:

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0       133 colls,   133 par    0.032s   0.005s     0.0000s    0.0001s
  Gen  1         2 colls,     1 par    0.000s   0.001s     0.0003s    0.0003s

  Parallel GC work balance: 41.33% (serial 0%, perfect 100%)

  TASKS: 10 (1 bound, 9 peak workers (9 total), using -N4)

  SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time    3.516s  (  4.431s elapsed)
  GC      time    0.032s  (  0.005s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    3.548s  (  4.439s elapsed)

Thanks,
Thomas Koster

