[Haskell-cafe] SMP parallelism increasing GC time dramatically

Mon Oct 10 17:22:04 CEST 2011

On 08/10/2011 01:47, austin seipp wrote:
> It's GHC, and partly the OS scheduler in some sense. Oversaturating,
> i.e. using an -N option>  your number of logical cores (including
> hyperthreads) will slow down your program typically. This isn't
> uncommon, and is well known - GHC's lightweight threads have an M:N
> threading model, but for good performance, you typically want the OS
> threads and cores to have a 1:1 correspondence. Creating a massive
> amount of OS threads will cause much context switching between them,
> which is going to slow things down. When GHC needs to synchronize OS
> threads in order to do a GC, there are going to be a lot of threads
> being context swapped in/out in order to achieve this (because the GC
> must halt all mutator threads to do its thing.)
>
> Furthermore, oversaturation isn't the only problem - having the same
> number of threads as cores will mean *some* thread is going to get
> de-scheduled. Many times this means that the thread in which GHC does
> GC will get descheduled by the OS. A corollary of this descheduling
> phenomenon is that even using the same # of OS threads as you have
> cores could result in -worse- performance than N-1 OS threads. This
> was mitigated a bit in 7.2.1 I think. Linux was affected much more
> drastically than others (OS X and Solaris have vastly different
> schedulers, and as a result the performance wouldn't just tank - it
> would actually get better IIRC, it just wouldn't scale as well at that
> point.) At the very least, on Linux, using an -N option equivalent to
> your number of logical cores should not drastically slow things down
> anymore - but it won't make them faster either. This is the "dreaded
> last core slowdown" bug that's been known about for a while, and as a
> result, you typically only see parallel speedup on Linux up to N-1
> threads, where N = the number of cores you have.

Incidentally, I don't think that's true any more with recent versions of 
GHC and Linux, I typically see speedup increasing all the way to the 
total number of cores, although sometimes the speedup when adding the 
last core is less.  Take a look at the graphs in our recent papers for 
some concrete results.

> As a result, with dual-core only (and no hyperthreading,) on Linux,
> you're very unlikely to be able to get parallel speedup in any case.
> There's work to fix this in the garbage collector among other things,
> but it's not clear if it's going into GHC just yet.

It probably depends on how much other activity is happening on the 
system.  I get pretty good speedups for most benchmarks I try on my 
dual-core laptop running either Linux or Windows.  Typically with 
Windows I have to wait a while after booting for all the background 
activity to die down, before I can use both cores reliably.

Cheers,
	Simon

> On Fri, Oct 7, 2011 at 2:31 PM, Oliver Batchelor<saulzar at gmail.com>  wrote:
>> I'm not sure if this is at all related, but if I run a small Repa program
>> with more threads than I have cores/CPUs then it gets drastically slower, I
>> have a dual core laptop - and -N2 makes my small program take approximately
>> 0.6 of the time. Increasing to -N4 and we're running about 2x the time, -N8
>> and it's taking 20x or more. I guess this is probably more down to the
>> design of Repa rather than GHC itself?
>> Oliver
>>
>> On Sat, Oct 8, 2011 at 1:21 AM, Tom Thorne<thomas.thorne21 at gmail.com>
>> wrote:
>>>
>>> I have made a dummy program that seems to exhibit the same GC
>>> slowdown behavior, minus the segmentation faults. Compiling with -threaded
>>> and running with -N12 I get very bad performance (3x slower than -N1),
>>> running with -N12 -qg it runs approximately 3 times faster than -N1. I don't
>>> know if I should submit this as a bug or not? I'd certainly like to know why
>>> this is happening!
>>> import Numeric.LinearAlgebra
>>> import Numeric.GSL.Special.Gamma
>>> import Control.Parallel.Strategies
>>> import Control.Monad
>>> import Data.IORef
>>> import Data.Random
>>> import Data.Random.Source.PureMT
>>> import Debug.Trace
>>> --
>>> subsets s n = (subsets_stream s) !! n
>>> subsets_stream [] = [[]] : repeat []
>>> subsets_stream (x:xs) =
>>> let r = subsets_stream xs
>>>     s = map (map (x:)) r
>>> in [[]] : zipWith (++) s (tail r)
>>> testfun :: Matrix Double ->  Int ->  [Int] ->  Double
>>> testfun x k cs = lngamma (det (c+u))
>>> where
>>> (m,c) = meanCov xx
>>> m' = fromRows [m]
>>> u = (trans m')<>  m'
>>> xx = fromColumns ( [(toColumns x)!!i] ++ [(toColumns x)!!j] ++ [(toColumns
>>> x)!!k] )
>>> i = cs !! 0
>>> j = cs !! 1
>>>
>>> test :: Matrix Double ->  Int ->  Double
>>> test x i = sum p
>>> where
>>> p = parMap (rdeepseq) (testfun x (i+1)) (subsets [0..i] 2)
>>>
>>>
>>> ranMatrix :: Int ->  RVar (Matrix Double)
>>> ranMatrix n = do
>>> xs<- mapM (\_ ->  mapM (\_ ->  uniform 0 1.0) [1..n]) [1..n]
>>> return (fromLists xs)
>>>
>>> loop :: Int ->  Double ->  Int ->  RVar Double
>>> loop n s i = traceShow i $ do
>>> x<- ranMatrix n
>>> let r = sum $ parMap (rdeepseq) (test x) [2..(n-2)]
>>> return (r+s)
>>> main = do
>>> let n = 100
>>> let iter = 5
>>> rng<- newPureMT
>>> rngr<- newIORef rng
>>> p<- runRVar (foldM (loop n) 0.0 [1..iter]) rngr
>>> print p
>>> I have also found that the segmentation faults in my code disappear if I
>>> switch from Control.Parallel to Control.Monad.Par, which is quite strange. I
>>> get slightly better performance with Control.Parallel when it completes
>>> without a seg. fault, and the frequency with which it does so seems to
>>> depend on the number of sparks that are being created.
>>> On Thu, Oct 6, 2011 at 1:56 PM, Tom Thorne<thomas.thorne21 at gmail.com>
>>> wrote:
>>>>
>>>> I'm trying to narrow it down so that I can submit a meaningful bug
>>>> report, and it seems to be something to do with switching off parallel GC
>>>> using -qg, whilst also passing -Nx.
>>>> Are there any known issues with this that people are aware of? At the
>>>> moment I am using the latest haskell platform release on arch.
>>>> I'd like to give 7.2 a try in case that fixes it, but I'm using rather a
>>>> lot of libraries (hmatrix, fclabels, random fu) and I don't know how to
>>>> install them for multiple ghc versions
>>>> On Wed, Oct 5, 2011 at 10:43 PM, Johan Tibell<johan.tibell at gmail.com>
>>>> wrote:
>>>>>
>>>>> On Wed, Oct 5, 2011 at 2:37 PM, Tom Thorne<thomas.thorne21 at gmail.com>
>>>>> wrote:
>>>>>>
>>>>>> The only problem is that now I am getting random occasional
>>>>>> segmentation faults that I was not been getting before, and once got a
>>>>>> message saying:
>>>>>> Main: schedule: re-entered unsafely
>>>>>> Perhaps a 'foreign import unsafe' should be 'safe'?
>>>>>> I think this may be something to do with creating a lot of sparks
>>>>>> though, since this occurs whether I have the parallel GC on or not.
>>>>>
>>>>> Unless you (or some library you're using) is doing what the error
>>>>> message says then you should file a GHC bug here:
>>>>>
>>>>> http://hackage.haskell.org/trac/ghc/
>>>>>
>>>>> -- Johan
>>>>>
>>>>
>>>
>>>
>>> _______________________________________________
>>> Haskell-Cafe mailing list
>>> Haskell-Cafe at haskell.org
>>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>>>
>>
>>
>> _______________________________________________
>> Haskell-Cafe mailing list
>> Haskell-Cafe at haskell.org
>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>>
>>
>
>
>