[Haskell-cafe] multithreading speedup
Fawzi Mohamed
fmohamed at mac.com
Sat Apr 14 06:41:36 EDT 2007
Thanks again for the answers, Stefan.
On Apr 14, 2007, at 1:41 AM, Stefan O'Rear wrote:
> On Sat, Apr 14, 2007 at 01:31:58AM +0200, Fawzi Mohamed wrote:
>>
>> On Apr 14, 2007, at 12:33 AM, Stefan O'Rear wrote:
>>
>>> On Sat, Apr 14, 2007 at 12:27:10AM +0200, Fawzi Mohamed wrote:
>>>> I was trying to speed up a program that I wrote and so I thought
>>>> about using multiple threads.
>>>> I have a program that is quite easy to parallelize, and I did the following
>>>>
>>>> do
>>>>   subRes <- MVar.newMVar []
>>>>   putStrLn "starting threads"
>>>>   subV <- flip mapM [0 .. (nThreads - 1)] $
>>>>     ( \i -> do
>>>>         subR <- MVar.newEmptyMVar
>>>>         let writeRes r = do { MVar.putMVar subR r }
>>>>         forkOS (do
>
> Getting rid of that forkOS will help a LOT - forkOS is very expensive
> and gives no benefit. It exists only for the benefit of FFI libs like
> OpenGL that use thread local state.
>
> Plain forkIO uses a thread pool in the runtime; it is even more
> parallel than forkOS, and extremely cheap.
Yes, indeed: using forkIO (which I had removed only while trying to
find a way to make my program faster) one gets basically the same
results as with forkOS.
thread(IO)
3.10user 0.02system 0:01.68elapsed 184%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (0major+1037minor)pagefaults 0swaps
> The fact that you are using forkOS is a dead giveaway of a
> documentation bug somewhere. Please report it, so nobody else makes
> this mistake!
No, actually I had used forkIO before, but seeing no speedup I also
tried forkOS, hoping that it would speed things up...
>
>>>>           let r = eval (startData !! i)
>>>>           writeRes $! r
>>>>           putStr $ "writtenRes")
>>>>         return subR
>>>>     )
>>>>   putStrLn "started threads"
>>>>   toFold <- mapM MVar.takeMVar subV
>>>>   putStrLn "about to fold"
>>>>   return $ foldl1 mergeRes toFold
>>>>
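(For reference, here is a self-contained version of that pattern,
with forkIO and one MVar per element; eval, mergeRes, startData and
nThreads are just stand-in definitions here, since the real ones
aren't in the snippet, and it is meant to be built with -threaded and
run with +RTS -N2 or however many cores are available.)

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar

-- Stand-ins for the definitions that aren't shown in the snippet.
eval :: Int -> Int
eval x = sum [1 .. 1000000 + x]

mergeRes :: Int -> Int -> Int
mergeRes = (+)

main :: IO ()
main = do
  let nThreads  = 4 :: Int
      startData = [0 .. nThreads - 1]
  putStrLn "starting threads"
  subV <- flip mapM [0 .. nThreads - 1] $ \i -> do
    subR <- newEmptyMVar
    _ <- forkIO $ do
      let r = eval (startData !! i)
      putMVar subR $! r        -- force the result before handing it back
      putStr "writtenRes"
    return subR
  putStrLn "started threads"
  toFold <- mapM takeMVar subV
  putStrLn "about to fold"
  print (foldl1 mergeRes toFold)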
>> [...] Actually my code should be equivalent to parMap rwhnf, but I
>> get the following results:
>>
>> parMap
>> 3.63user 0.02system 0:01.97elapsed 185%CPU (0avgtext+0avgdata
>> 0maxresident)k
>> 0inputs+0outputs (0major+1039minor)pagefaults 0swaps
>>
>> threads(OS)
>> 3.14user 0.02system 0:01.68elapsed 187%CPU (0avgtext+0avgdata
>> 0maxresident)k
>> 0inputs+0outputs (0major+1041minor)pagefaults 0swaps
>
> Those look equally parallel to me :)
Exactly the same code, and parMap is 17% slower with respect to
creating threads manually for each element in the list and waiting
for them all to finish...
I can understand that maybe this way one is guaranteed that if the
original program terminates then the par version terminates too (one
can spark bottom), but it is just less efficient when one knows that
the sparked results are needed and that the calling thread can do
part of the calculation together with them.
Actually a parMap that evaluates the last element of the list in the
calling thread, instead of sparking it, should be very close to the
example with explicit threads (if the workload is more or less
balanced).
Sure enough, using

import Control.Parallel (par)

mParMap :: (a -> b) -> [a] -> [b]
mParMap f (l0:l1:lr) =
  let n0 = f l0
      nr = mParMap f (l1:lr)
  in n0 `par` (nr `seq` (n0:nr))
mParMap f (l0:[]) =
  let n0 = f l0
  in n0 `seq` (n0:[])
mParMap f [] = []

I get timings similar to the thread example.
mParMap
3.15user 0.02system 0:01.72elapsed 184%CPU (0avgtext+0avgdata
0maxresident)k
0inputs+0outputs (0major+1040minor)pagefaults 0swaps
Actually I think that 99% of the time, when one wants a parallel map
one wants something like this, not the parallel map in Strategies;
maybe there should be a parList' defined this way and a
corresponding parMap'.
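Just to make the suggestion concrete, this is roughly the kind of
thing I mean, written directly with par/seq and ignoring the Strategy
argument that the library's parList takes, so the names parList' and
parMap' and the exact shape here are only a sketch:

import Control.Parallel (par)

-- Spark every element except the last; the last element (and the
-- list spine) is evaluated in the calling thread.
parList' :: [a] -> ()
parList' []     = ()
parList' [x]    = x `seq` ()
parList' (x:xs) = x `par` parList' xs

-- The corresponding map: it only returns once the spine has been
-- walked, the sparks created, and the last element forced.
parMap' :: (a -> b) -> [a] -> [b]
parMap' f xs = let ys = map f xs in parList' ys `seq` ys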
>> I suppose that it is because I have a thread for each element in the
>> list plus a main thread vs just one thread per element in the list,
>> but I am not sure; if someone has some ideas...
>>
>> With threads (now that I managed to get some speedup) I can use a
>> workers/queue approach and have better load balancing.
>
> Threads are supposed to be cheap. If implementing a thread pool by
> hand ever gives a speedup, it's a bug.
But making my own worker threads, when I know the number of workers,
will be more efficient. My program is extremely parallel, but putting
a par everywhere would be very costly in memory and would probably
break the program up in the wrong places. I know where I should spark
threads so that I get a few high-level tasks and coarse-grained
parallelism, and if I know the number of workers I can do this much
more efficiently by hand.
Furthermore I can put the tasks in a queue in order of decreasing
cost and get rather good load balancing without having to think too
much about a static distribution.
So having a thread pool by hand does make sense; doing it with OS
threads and trying to beat the GHC runtime does not.
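For what it's worth, the kind of hand-rolled pool I have in mind is
roughly the following (just a sketch: workerPool and its types are my
own names, and the task list is assumed to be already sorted by
decreasing cost):

import Control.Concurrent (forkIO)
import Control.Concurrent.MVar

-- nWorkers threads pull tasks from a shared queue (an MVar holding
-- the remaining task list) until it is empty, each pushing its
-- results onto a shared list; the caller blocks until every worker
-- has finished.  The order of the results is not preserved.
workerPool :: Int -> [IO r] -> IO [r]
workerPool nWorkers tasks = do
  queue   <- newMVar tasks        -- tasks assumed sorted by decreasing cost
  results <- newMVar []
  dones   <- mapM (const newEmptyMVar) [1 .. nWorkers]
  let worker done = do
        next <- modifyMVar queue $ \ts ->
          case ts of
            []      -> return ([], Nothing)
            (t:ts') -> return (ts', Just t)
        case next of
          Nothing -> putMVar done ()   -- queue empty: this worker is done
          Just t  -> do
            r <- t
            modifyMVar_ results (return . (r :))
            worker done
  mapM_ (forkIO . worker) dones
  mapM_ takeMVar dones               -- wait for all the workers
  takeMVar results

The per-element work from the example above would then be submitted
as something like [ Control.Exception.evaluate (eval d) | d <-
startData ], so that the evaluation itself happens in the workers.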
>> By the way, is there a way to know how many processors are available
>> to the program (to make the strategy or thread control depend on it)?
Can anyone answer this?
thanks
Fawzi