[Haskell-cafe] Right approach to profiling and optimizing a concurrent data structure?

Wed Jan 8 01:49:52 UTC 2014

First, be aware of https://ghc.haskell.org/trac/ghc/ticket/8453, which
causes programs compiled with -threaded and -prof to occasionally die with
an assertion failure (there are a few other, possibly related, tickets
about rts problems with -threaded and non-vanilla ways).

Next, define what you mean by "faster": more throughput?  Lower latency?
 Something else?

One approach is to build with profiling and try to optimize the functions
exposed by your API.  You could do this on one core.  The optimizations
you'd get from this would be generally useful, but they wouldn't be
optimizations to reduce contention.

To look into contention issues, I think the best way is to build with the
eventlog enabled and use threadscope.  This will show pretty clearly where
threads are blocked, for how long, etc.  I've also had success with timing
actions within my test executable and adding that information to the
eventlog with Debug.Trace.traceEventIO.  Then you can see that information
within threadscope, or grep it out of the eventlog for extra processing
(min/max/mean, that sort of thing).

Running with -N1 can be faster because there essentially is no contention:
only a single Haskell thread will be executing at any given time.  If -N1
is markedly faster than -N2 (as in, the runtime is longer to complete the
same amount of work), I would try debugging with Threadscope first.

One example of a test driver I used is
https://github.com/JohnLato/kickchan/blob/master/bench/bench_t3.hs .
(KickChan is similar to a bounded Chan, but heavily biased towards fast
writes)

I'd appreciate any further suggestions also.

John L.

On Tue, Jan 7, 2014 at 5:19 PM, Brandon Simmons <brandon.m.simmons at gmail.com
> wrote:

> Happy New Year, all,
>
> I started what I thought would be a pretty straightforward project to
> implement a concurrent queue (with semantics like Chan) which I hoped would
> be faster, but the process of trying to measure and test performance has
> been super frustrating.
>
> I started with a really big criterion benchmark suite that ran through a
> bunch of Chan-like implementations as well as comparing different var
> primitives; I was compiling that with `-O2  -threaded` and running with
> +RTS -N (as that seemed realistic, and results were very consistent).
>
> Short version: at some point I realized I had (in my cabal config) enabled
> executable-profiling, which when disabled completely changed all timing and
> actually *hurt* performance. Then after a lot more head-banging I realized
> that +RTS -N seems to run on only one core when compiled with -prof (I
> didn't see that documented anywhere) although I could *force* the -prof
> version to use more with -N2, and so apparently for my tests[1], running on
> a single core just *happened* to be faster (I can see why it might; I
> probably can't expect a speedup when I'm just measuring throughput).
>
> I'd be interested in any comments on above, but mostly I'm trying to
> understand what my approach should be at this point; should I be
> benchmarking on 1 core and trying to maximize throughput? Should I also
> profile on just 1 core? How should I benchmark the effects of lots of
> contention and interpret the results? How can I avoid benchmarking
> arbitrary decisions of the thread scheduler, while still having my
> benchmarks be realistic? Are there any RTS flags or compile-time settings
> that I should *definitely* have on?
>
> Thanks for any clarity on this,
> Brandon
> http://brandon.si
>
>
> [1] Here's the test I used while most of the forehead-bloodying occurred,
> here using `Control.Concurrent.Chan`; for no combination of
> readers/writers/messages could I manage to get this going as fast on 2
> cores as on the single-core bound -prof version
>
> runC :: Int -> Int -> Int -> IO ()
> runC writers readers n = do
>   let nNice = n - rem n (lcm writers readers)
>       perReader = nNice `quot` readers
>       perWriter = (nNice `quot` writers)
>   vs <- replicateM readers newEmptyMVar
>   c <- C.newChan
>   let doRead = replicateM_ perReader $ theRead
>       theRead = C.readChan c
>       doWrite = replicateM_ perWriter $ theWrite
>       theWrite = C.writeChan c (1 :: Int)
>   mapM_ (\v-> forkIO (doRead >> putMVar v ())) vs
>   replicateM writers $ forkIO $ doWrite
>   mapM_ takeMVar vs -- await readers
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20140107/0721b881/attachment.html>