[Haskell-cafe] Right approach to profiling and optimizing a concurrent data structure?
Brandon Simmons
brandon.m.simmons at gmail.com
Wed Jan 8 01:19:14 UTC 2014
Happy New Year, all,
I started what I thought would be a pretty straightforward project: implementing
a concurrent queue (with semantics like Chan) that I hoped would be faster than
Chan itself. But the process of trying to measure and test its performance has
been super frustrating.
I started with a really big criterion benchmark suite that ran through a bunch
of Chan-like implementations and also compared different var primitives. I was
compiling with `-O2 -threaded` and running with +RTS -N (that seemed realistic,
and the results were very consistent).
Short version: at some point I realized I had enabled executable-profiling in my
cabal config; disabling it completely changed all the timings and actually
*hurt* performance. Then, after a lot more head-banging, I realized that
+RTS -N seems to run on only one core when the program is compiled with -prof
(I didn't see that documented anywhere), although I could *force* the -prof
version to use more cores with -N2. So apparently, for my tests[1], running on
a single core just *happened* to be faster (I can see why it might; I probably
can't expect a speedup when I'm just measuring throughput).
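(For what it's worth, one way to sanity-check how many cores the RTS actually
ended up with is to ask it at runtime; a minimal sketch using only `base`:)

```haskell
import Control.Concurrent (getNumCapabilities)

main :: IO ()
main = do
    -- getNumCapabilities reports the number of Haskell execution
    -- contexts the RTS is currently using, i.e. the effective
    -- value of +RTS -N for this run.
    caps <- getNumCapabilities
    putStrLn ("Running on " ++ show caps ++ " capabilities")
```

Printing this at the start of each benchmark run would have caught the
one-core-under--prof surprise immediately.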
I'd be interested in any comments on the above, but mostly I'm trying to
understand what my approach should be at this point: should I be benchmarking
on one core and trying to maximize throughput? Should I also profile on just
one core? How should I benchmark the effects of heavy contention, and how
should I interpret the results? How can I avoid benchmarking arbitrary
decisions of the thread scheduler while still keeping my benchmarks realistic?
Are there any RTS flags or compile-time settings that I should *definitely*
have on?
Thanks for any clarity on this,
Brandon
http://brandon.si
[1] Here's the test I used while most of the forehead-bloodying occurred, here
using `Control.Concurrent.Chan`; for no combination of readers/writers/messages
could I get it to run as fast on two cores as the single-core-bound -prof
version did:
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import qualified Control.Concurrent.Chan as C
import Control.Monad (replicateM, replicateM_)

runC :: Int -> Int -> Int -> IO ()
runC writers readers n = do
    -- round n down so the messages divide evenly among readers and writers
    let nNice     = n - rem n (lcm writers readers)
        perReader = nNice `quot` readers
        perWriter = nNice `quot` writers
    vs <- replicateM readers newEmptyMVar
    c  <- C.newChan
    let doRead   = replicateM_ perReader theRead
        theRead  = C.readChan c
        doWrite  = replicateM_ perWriter theWrite
        theWrite = C.writeChan c (1 :: Int)
    mapM_ (\v -> forkIO (doRead >> putMVar v ())) vs
    replicateM_ writers $ forkIO doWrite
    mapM_ takeMVar vs -- await readers
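(To get a throughput number with criterion out of the loop, a crude wall-clock
harness is enough; a sketch using `Data.Time` from the time package, which
ships with GHC — the label and message count here are arbitrary:)

```haskell
import Data.Time.Clock (diffUTCTime, getCurrentTime)

-- | Run an IO action and print its wall-clock elapsed time.
timeIt :: String -> IO () -> IO ()
timeIt label act = do
    start <- getCurrentTime
    act
    end <- getCurrentTime
    -- diffUTCTime yields the elapsed time in seconds
    putStrLn (label ++ ": " ++ show (diffUTCTime end start))
```

Usage would look something like `timeIt "Chan 2w/2r" (runC 2 2 1000000)`;
dividing the message count by the elapsed seconds gives a rough
messages-per-second throughput figure.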