[Haskell-cafe] Re: Optimizing a high-traffic network architecture
Simon Marlow
simonmar at microsoft.com
Thu Jan 5 05:01:03 EST 2006
Bulat Ziganshin wrote:
> Hello Simon,
>
> Thursday, December 15, 2005, 4:53:27 PM, you wrote:
>
> SM> The 3k threads are still GC'd, but they are not actually *copied* during
> SM> GC.
>
> SM> It'll increase the memory overhead per thread from 2k (1k * 2 for
> SM> copying) to 4k (4k block, no overhead for copying).
>
> Simon, why not include this in the "base package"? Either change
> something so that threads with 1k stacks are not copied during GC, or
> at least increase the default stack size. This would improve the
> performance of other heavily threaded programs, and the memory cost
> does not seem that great.
Because it doesn't always improve things. This is a slightly modified
version of the "cheap concurrency" benchmark from the shootout, first
without tweaking -k:
> ./threads003 10000 +RTS -sstderr
./threads003 10000 +RTS -sstderr
93,908,920 bytes allocated in the heap
159,724,208 bytes copied during GC (scavenged)
1,559,376 bytes copied during GC (not scavenged)
10,415,848 bytes maximum residency (4 sample(s))
177 collections in generation 0 ( 1.05s)
4 collections in generation 1 ( 0.02s)
21 Mb total memory in use
INIT time 0.00s ( 0.00s elapsed)
MUT time 1.28s ( 1.28s elapsed)
GC time 1.06s ( 1.09s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.35s ( 2.37s elapsed)
%GC time 45.3% (45.9% elapsed)
Alloc rate 73,149,011 bytes per MUT second
Productivity 54.7% of total user, 54.1% of total elapsed
and now tweaking -k (using -k6k, because this is a 64-bit machine and
storage manager blocks are 8k):
> ./threads003 10000 +RTS -sstderr -k6k
./threads003 10000 +RTS -sstderr -k6k
168,837,736 bytes allocated in the heap
109,203,160 bytes copied during GC (scavenged)
1,497,728 bytes copied during GC (not scavenged)
71,180,464 bytes maximum residency (2 sample(s))
156 collections in generation 0 ( 1.06s)
2 collections in generation 1 ( 0.01s)
86 Mb total memory in use
INIT time 0.00s ( 0.00s elapsed)
MUT time 2.48s ( 2.58s elapsed)
GC time 1.08s ( 1.08s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.56s ( 3.65s elapsed)
%GC time 30.3% (29.5% elapsed)
Alloc rate 68,007,748 bytes per MUT second
Productivity 69.7% of total user, 67.9% of total elapsed
My hypothesis is that when we give each thread its own memory block, all
the thread stacks occupy the same cache lines and we end up with a lot
more cache misses (notice it's the MUT time that increased, not the GC
time).
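To put numbers on that, here is a minimal sketch that only computes the ratios between the two reports above. The Run record and its field names are made up for illustration and are not part of the test program; the timings are copied from the -sstderr output.

```haskell
-- Timings (in seconds) copied from the two -sstderr reports above.
-- The Run type and its field names are hypothetical, for illustration only.
data Run = Run { label :: String, mutTime :: Double, gcTime :: Double }

defaultK, sixK :: Run
defaultK = Run "default -k" 1.28 1.06
sixK     = Run "-k6k"       2.48 1.08

-- Ratio of one timing field between the -k6k run and the default run.
ratio :: (Run -> Double) -> Double
ratio field = field sixK / field defaultK

main :: IO ()
main = do
  putStrLn $ "MUT time ratio: " ++ show (ratio mutTime)
  putStrLn $ "GC time ratio:  " ++ show (ratio gcTime)
```

MUT time nearly doubles (2.48s against 1.28s) while GC time is essentially unchanged, which is why the slowdown looks like a mutator-side effect rather than extra GC work.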
Test program attached, if anyone's interested in digging further.
Cheers,
Simon
-------------- next part --------------
-- $Id: message-ghc-2.code,v 1.3 2005/09/17 04:36:26 bfulgham Exp $
-- The Great Computer Language Shootout
-- http://shootout.alioth.debian.org/
-- Contributed by Einar Karttunen
-- Modified by Simon Marlow
-- This is the shootout "cheap concurrency" benchmark, modified
-- slightly. The modification, marked (***) below, adds more concurrency
-- so that a speedup on multiple processors is possible.
-- Creates 500 threads arranged in a sequence where each takes a value
-- from the left, adds 1, and passes it to the right (via MVars).
-- N more threads pump zeros in at the left. A sub-thread
-- takes N values from the right and sums them.
--
import Control.Concurrent
import Control.Monad
import System.Environment (getArgs)

-- One link in the chain: take a value from the left, add 1, pass it right.
thread :: MVar Int -> MVar Int -> IO ()
thread inp out = do x <- takeMVar inp; putMVar out $! x+1; thread inp out

-- Fork a link that reads from cur; return the MVar it writes to.
spawn cur _ = do next <- newEmptyMVar
                 forkIO $ thread cur next
                 return next

main = do n <- getArgs >>= readIO . head
          s <- newEmptyMVar
          e <- foldM spawn s [1..500]
          f <- newEmptyMVar
          forkIO $ replicateM n (takeMVar e) >>= putMVar f . sum
          replicateM n (forkIO $ putMVar s 0)
          -- *** replicateM n (putMVar s 0)
          takeMVar f
-- vim: ts=4 ft=haskell
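For anyone who wants to check what the benchmark computes before digging into the timings, here is a scaled-down sketch of the same chain. It assumes a modern GHC; runChain, link and the small sizes (5 links, 10 values) are names and numbers chosen for illustration, not part of the attached program. Every zero picks up 1 at each link, so the sum should be links * n.

```haskell
import Control.Concurrent
import Control.Monad

-- One link: take from the left, add 1, pass to the right, forever.
link :: MVar Int -> MVar Int -> IO ()
link inp out = forever $ do
  x <- takeMVar inp
  putMVar out $! x + 1

-- Build a chain of `links` threads, pump `n` zeros in from one thread
-- each (as in the *** modification), and sum what comes out the far end.
runChain :: Int -> Int -> IO Int
runChain links n = do
  start <- newEmptyMVar
  let spawn cur _ = do
        next <- newEmptyMVar
        _ <- forkIO (link cur next)
        return next
  end <- foldM spawn start [1 .. links]
  replicateM_ n (forkIO (putMVar start 0))
  sum <$> replicateM n (takeMVar end)

main :: IO ()
main = runChain 5 10 >>= print   -- prints 50
```

Unlike the attached program, this version prints the sum, so a wrong chain length or a lost value shows up immediately.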