[Haskell-cafe] multithreading speedup

Sat Apr 14 06:41:36 EDT 2007

Thanks again to the answers Stefan,

Il giorno Apr 14, 2007, alle ore 1:41 AM, Stefan O'Rear ha scritto:

> On Sat, Apr 14, 2007 at 01:31:58AM +0200, Fawzi Mohamed wrote:
>>
>> Il giorno Apr 14, 2007, alle ore 12:33 AM, Stefan O'Rear ha scritto:
>>
>>> On Sat, Apr 14, 2007 at 12:27:10AM +0200, Fawzi Mohamed wrote:
>>>> I was trying to speed up a program that I wrote and so I thought
>>>> about using multiple threads.
>>>> I  have a quite easy parallel program and I did the following
>>>>
>>>> do
>>>>      subRes <- MVar.newMVar []
>>>>      putStrLn "starting threads"
>>>>      subV <- flip mapM [0 .. (nThreads - 1)] $
>>>>              ( \i -> do
>>>>                  subR <- MVar.newEmptyMVar
>>>>                  let writeRes r = do { MVar.putMVar subR r }
>>>>                  forkOS (do
>
> Getting rid of that forkOS will help a LOT - forkOS is very expensive
> and gives no benefit.  It exists only for the benefit of FFI libs like
> OpenGL that use thread local state.
>
> Plain forkIO uses a thread pool in the runtime; it is even more
> parallel than forkOS, and extremely cheap.

Yes indeed using forkIO (that I had removed only trying to find a way  
to make my program faster) one gets basically the same results as  
with forkOS.

thread(IO)
3.10user 0.02system 0:01.68elapsed 184%CPU (0avgtext+0avgdata  
0maxresident)k
0inputs+0outputs (0major+1037minor)pagefaults 0swaps

> The fact that you are using forkOS is a dead giveaway of a
> documentation bug somewhere.  Please report it, so nobody else makes
> this mistake!

No actually I had used forkIO before, but seeing no speedup I tried  
also forkOS hoping that using it would speed things up...

>
>>>>                           let r=eval (startData !! i)
>>>>                           writeRes $! r
>>>>                           putStr $ "writtenRes")
>>>>                  return subR
>>>>              )
>>>>      putStrLn "started threads"
>>>>      toFold <- mapM MVar.takeMVar subV
>>>>      putStrLn "about to fold"
>>>>      return $ foldl1 mergeRes toFold
>>>>
>> [...]Actually my code should be equivalent to parMap rwhnf, but I  
>> get the
>> following results:
>>
>> parMap
>> 3.63user 0.02system 0:01.97elapsed 185%CPU (0avgtext+0avgdata
>> 0maxresident)k
>> 0inputs+0outputs (0major+1039minor)pagefaults 0swaps
>>
>> threads(OS)
>> 3.14user 0.02system 0:01.68elapsed 187%CPU (0avgtext+0avgdata
>> 0maxresident)k
>> 0inputs+0outputs (0major+1041minor)pagefaults 0swaps
>
> Those look equally parallel to me :)

exactly the same code, and parmap is 17% slower with respect to  
creating threads manually for each element in the list and waiting  
for them all to finish...
I can understand that maybe this way one is guaranteed that if the  
original program ends than the par version will end too (one can  
spark bottom), but it is just less efficient if it is known that one  
needs the data of the threads to partially calculate it together with  
them.

actually a parMap that executes the last element of the list before  
going on to evaluate it, should be very close to the example with  
explicit threads (if the workload is more or less balanced).
Sure enough using

mParMap ::  (a->b) -> [a] -> [b]
mParMap f (l0:l1:lr) =
     let n0 = f l0
         nr = mParMap f (l1:lr)
     in n0 `par` (nr `seq` (n0:nr))
mParMap f (l0:[]) =
     let n0 = f l0
     in n0 `seq` n0:[]
mParMap f [] = []

I get timings similar to the thread example.

mParMap
3.15user 0.02system 0:01.72elapsed 184%CPU (0avgtext+0avgdata  
0maxresident)k
0inputs+0outputs (0major+1040minor)pagefaults 0swaps

Actually I think that 99% of the time when one wants a parallel map   
he wants something like this, not like the parallel map in  
strategies, maybe there should be a parList' defined this way and a  
corresponding parMap'

>> I suppose that it is because I have a thread for each element in the
>> list plus a main thread vs just one thread per element in the list,
>> but I am not sure, if someone has some ideas...
>>
>> With threads (now the I managed to have some speedup) I can use a
>> workers/queue approach and have a better load balancing.
>
> Threads are supposed to be cheap.  If implementing a thread pool by
> hand ever gives a speedup, it's a bug.

but making my worker threads if I know the number of worker will be  
more efficient, my program is extremely parallel, but putting a par  
everywhere would be very memory costly and would probably break the  
program in the wrong places, I know where I should spark threads so  
that I have few high-level tasks and coarse grain parallelism, and if  
I know the number of workers I can do much more efficiently by hand.
Furthermore I can put the tasks in a queue in order of decreasing  
cost and get a rather good load balancing without having to think too  
much about a static distribution.
So having a thread pool by hand does make sense, doing it with OS  
threads and trying to beat the GHC runtime does not.

>> by the way is there a way to know how many processors are available
>> to the program (to make the strategy or thread control depend on it)?

anyone can answer to this?

thanks
Fawzi