[Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?

Thu Jul 30 05:31:38 UTC 2020

Thanks Ryan, and I'm honored to get Simon's attention.

I did have some worry about package tskiplist, that its github 
repository seems withdrawn, I emailed the maintainer Peter Robinson 
lately but have gotten no response by far. What particularly worrying me 
is the 1st sentence of the Readme has changed from 1.0.0 to 1.0.1 (which 
is current) as:

 > - This package provides an implementation of a skip list in STM.

 >+ This package provides a proof-of-concept implementation of a skip 
list in STM

This has to mean something but I can't figure out yet.

Dear Peter Robinson, I hope you can see this message and get in the loop 
of discussion.

Despite that, I don't think overhead of TVar itself the most serious 
issue in my situation, as before GC engagement, there are as many TVars 
being allocated and updated without stuck at business progressing. And 
now I realize what presuring GC in my situation is not only the large 
number of pointers (TVars), and at the same time, they form many 
circular structures, that might be nightmare for a GC. As I model my 
data after graph model, in my test workload, there are many FeatureSet 
instances each being an entity/node object, then there are many Feature 
instances per FeatureSet object, each Feature instance being an unary 
relationship/edge object, with a reference attribute (via TVar) pointing 
to the FeatureSet object it belongs to, circular structures form because 
I maintain an index at each FeatureSet object, sorted by weight etc., 
but ultimately pointing back (via TVar) to all Feature objects belonging 
to the set.

I'm still curious why the new non-moving GC in 8.10.1 still don't get 
obvious business progressing in my situation. I tested it on my Mac 
yesterday and there I don't know how to see how CPU time is distributed 
over threads within a process, I'll further test it with some Linux 
boxes to try understand it better.

Best regards,

Compl

On 2020/7/30 上午10:05, Ryan Yates wrote:
> Simon, I certainly want to help get to the bottom of the performance 
> issue at hand :D.  Sorry if my reply was misleading.  The constant 
> factor overhead of pushing `TVar`s into the internal structure may be 
> pressuring unacceptable GC behavior to happen sooner.  My impression 
> was that given the same size problem performance loss shifted from 
> synchronization to GC.
>
> Compl, I'm not aware of mutable heap objects being problematic in 
> particular for GHC's GC.  There are lots of special cases to handle 
> them of course.  I have successfully written Haskell programs that get 
> good performance from the GC with the dominant fraction of heap 
> objects being mutable.  I looked a little more at `TSkipList` and one 
> tricky aspect of an STM based skip list is how to manage randomness.  
> In `TSkipList`'s code there is the following comment:
>
> -- | Returns a randomly chosen level. Used for inserting new elements. 
> /O(1)./
> -- For performance reasons, this function uses 'unsafePerformIO' to 
> access the
> -- random number generator. (It would be possible to store the random 
> number
> -- generator in a 'TVar' and thus be able to access it safely from 
> within the
> -- STM monad. This, however, might cause high contention among threads.)
> chooseLevel  ::  TSkipList  <http://hackage.haskell.org/package/tskiplist-1.0.1/docs/src/Control.Concurrent.STM.TSkipList.Internal.html#TSkipList>  k  <http://hackage.haskell.org/package/tskiplist-1.0.1/docs/src/Control.Concurrent.STM.TSkipList.Internal.html#local-6989586621679028835>  a  <http://hackage.haskell.org/package/tskiplist-1.0.1/docs/src/Control.Concurrent.STM.TSkipList.Internal.html#local-6989586621679028836>  ->  Int
>
> This level is chosen on insertion to determine the height of the 
> node.  When writing my own STM skiplist I found that the details in 
> unsafely accessing randomness had a significant impact on 
> performance.  We went with an unboxed array of PCG states that had an 
> entry for each capability giving constant memory overhead in the 
> number of capabilities.  `TSkipList` uses `newStdGen` which involves 
> allocation and synchronization.
>
> Again, I'm not pointing this out to say that this is the entirety of 
> the issue you are encountering, rather, I do think the `TSkipList` 
> library could be improved to allocate much less.  Others can speak to 
> how to tell where the time is going in GC (my knowledge of this is 
> likely out of date).
>
> Ryan
>
>
> On Wed, Jul 29, 2020 at 4:57 PM Simon Peyton Jones 
> <simonpj at microsoft.com <mailto:simonpj at microsoft.com>> wrote:
>
>     Compl’s problem is (apparently) that execution becomes dominated
>     by GC.  That doesn’t sound like a constant-factor overhead from
>     TVars, no matter how efficient (or otherwise) they are.  It sounds
>     more like a space leak to me; perhaps you need some strict
>     evaluation or something.
>
>     My point is only: before re-engineering STM it would make sense to
>     get a much more detailed insight into what is actually happening,
>     and where the space and time is going.  We have tools to do this
>     (heap profiling, Threadscope, …) but I know they need some skill
>     and insight to use well.  But we don’t have nearly enough insight
>     to draw meaningful conclusions yet.
>
>     Maybe someone with experience of performance debugging might feel
>     able to help Compl?
>
>     Simon
>
>     *From:*Haskell-Cafe <haskell-cafe-bounces at haskell.org
>     <mailto:haskell-cafe-bounces at haskell.org>> *On Behalf Of *Ryan Yates
>     *Sent:* 29 July 2020 20:41
>     *To:* YueCompl <compl.yue at icloud.com <mailto:compl.yue at icloud.com>>
>     *Cc:* Haskell Cafe <haskell-cafe at haskell.org
>     <mailto:haskell-cafe at haskell.org>>
>     *Subject:* Re: [Haskell-cafe] STM friendly TreeMap (or similar
>     with range scan api) ? WAS: Best ways to achieve throughput, for
>     large M:N ratio of STM threads, with hot TVar updates?
>
>     Hi Compl,
>
>     There is a lot of overhead with TVars.  My thesis work addresses
>     this by incorporating mutable constructor fields with STM.  I
>     would like to get all that into GHC as soon as I can :D.  I
>     haven't looked closely at the `tskiplist` package, I'll take a
>     look and see if I see any potential issues.  There was some recent
>     work on concurrent B-tree that may be interesting to try.
>
>     Ryan
>
>     On Wed, Jul 29, 2020 at 10:24 AM YueCompl <compl.yue at icloud.com
>     <mailto:compl.yue at icloud.com>> wrote:
>
>         Hi Cafe and Ryan,
>
>         I tried Map/Set from stm-containers and TSkipList (added range
>         scan api against its internal data structure) from
>         http://hackage.haskell.org/package/tskiplist
>         <https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fhackage.haskell.org%2Fpackage%2Ftskiplist&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838761589&sdata=ZOvJVBqJgdGqx2k%2F49fhZeTYkWAd4GRY%2B8ZxH7cyEkI%3D&reserved=0> ,
>         with them I've got quite improved at scalability on concurrency.
>
>         But unfortunately then I hit another wall at single thread
>         scalability over working memory size, I suspect it's because
>         massively more TVars (those being pointers per se) are
>         introduced by those "contention-free" data structures, they
>         need to mutate separate pointers concurrently in avoiding
>         contentions anyway, but such pointer-intensive heap seems
>         imposing extraordinary pressure to GHC's garbage collector,
>         that GC will dominate CPU utilization with poor business
>         progress.
>
>         For example in my test, I use `+RTS -H2g` for the Haskell
>         server process, so GC is not triggered until after a while,
>         then spin off 3 Python client to insert new records
>         concurrently, in the first stage each Python process happily
>         taking ~90% CPU filling (through local mmap) the arrays
>         allocated from the server and logs of success scroll quickly,
>         while the server process utilizes only 30~40% CPU to serve
>         those 3 clients (insert meta data records into unique indices
>         merely); then the client processes' CPU utilization drop
>         drastically once Haskell server process' private memory
>         reached around 2gb, i.e. GC started engaging, the server
>         process's CPU utilization quickly approaches ~300%, while all
>         client processes' drop to 0% for most of the time, and
>         occasionally burst a tiny while with some log output showing
>         progress. And I disable parallel GC lately, enabling parallel
>         GC only makes it worse.
>
>         If I comment out the code updating the indices (those creating
>         many TVars), the overall throughput only drop slowly as more
>         data are inserted, the parallelism feels steady even after the
>         server process' private memory takes several GBs.
>
>         I didn't expect this, but appears to me that GC of GHC is
>         really not good at handling massive number of pointers in the
>         heap, while those pointers are essential to reduce contention
>         (and maybe expensive data copying too) at heavy
>         parallelism/concurrency.
>
>         Btw I tried `+RTS -xn` with GHC 8.10.1 too, no obvious
>         different behavior compared to 8.8.3; and also tried tweaking
>         GC related RTS options a bit, including increasing -G up to
>         10, no much difference too.
>
>         I feel hopeless at the moment, wondering if I'll have to
>         rewrite this in-memory db in Go/Rust or some other runtime ...
>
>         Btw I read
>         https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html
>         <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftech.channable.com%2Fposts%2F2020-04-07-lessons-in-managing-haskell-memory.html&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838761589&sdata=gqSH82%2FOYRaW4fzBDl%2BLDjhbRA%2BDRE6jaj4k1UI2gFE%3D&reserved=0> in
>         searching about the symptoms, and don't feel likely to convert
>         my DB managed data into immutable types thus to fit into
>         Compact Regions, not quite likely a live in-mem database
>         instance can do.
>
>         So seems there are good reasons no successful DBMS, at least
>         in-memory ones have been written in Haskell.
>
>         Best regards,
>
>         Compl
>
>
>
>             On 2020-07-25, at 22:07, Ryan Yates <fryguybob at gmail.com
>             <mailto:fryguybob at gmail.com>> wrote:
>
>             Unfortunately my STM benchmarks are rather disorganized. 
>             The most relevant paper using them is:
>
>             Leveraging hardware TM in Haskell (PPoPP '19)
>
>             https://dl.acm.org/doi/10.1145/3293883.3295711
>             <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdl.acm.org%2Fdoi%2F10.1145%2F3293883.3295711&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838771582&sdata=h3po1gPutR%2BsiCST1N0RNkM6irnVL0%2BVbYl3Vs8F8Oc%3D&reserved=0>
>
>             Or my thesis:
>
>             https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931
>             <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Furresearch.rochester.edu%2FinstitutionalPublicationPublicView.action%3FinstitutionalItemId%3D34931&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838771582&sdata=jBQMX5RRajIj0KbLWQCMt%2BMyMJIEmTpSuEHBWpq5Isg%3D&reserved=0>
>
>
>              The PPoPP benchmarks are on a branch (or the releases tab
>             on github):
>
>             https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src
>             <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffryguybob%2Fghc-stm-benchmarks%2Ftree%2Fwip%2Fmutable-fields%2Fbenchmarks%2FPPoPP2019%2Fsrc&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838771582&sdata=PinsrrGPgAB9TgxH61xngSItw1DcIRf1Niq39b%2BOe0s%3D&reserved=0>
>
>
>              All that to say, without an implementation of mutable
>             constructor fields (which I'm working on getting into GHC)
>             the scaling is limited.
>
>             Ryan
>
>             On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe
>             <haskell-cafe at haskell.org
>             <mailto:haskell-cafe at haskell.org>> wrote:
>
>                 Dear Cafe,
>
>                 As Chris Allen has suggested, I learned that
>                 https://hackage.haskell.org/package/stm-containers
>                 <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhackage.haskell.org%2Fpackage%2Fstm-containers&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838781576&sdata=ZwtAltlFRkny5q7M%2B7Pople6c4WA%2Bs8vZhwewUge7eg%3D&reserved=0>
>                 and https://hackage.haskell.org/package/ttrie
>                 <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhackage.haskell.org%2Fpackage%2Fttrie&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838781576&sdata=zMcZy%2BEzqklkQGjKglCgwg5ZoWyWZIyeRNaCcqtnECs%3D&reserved=0>
>                 can help a lot when used in place of traditional
>                 HashMap for stm tx processing, under heavy
>                 concurrency, yet still with automatic parallelism as
>                 GHC implemented them. Then I realized that in addition
>                 to hash map (used to implement dicts and scopes), I
>                 also need to find a TreeMap replacement data structure
>                 to implement the db index. I've been focusing on the
>                 uniqueness constraint aspect, but it's still an index,
>                 needs to provide range scan api for db clients, so
>                 hash map is not sufficient for the index.
>
>                 I see Ryan shared the code benchmarking RBTree with
>                 stm in mind:
>
>                 https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput
>                 <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffryguybob%2Fghc-stm-benchmarks%2Ftree%2Fmaster%2Fbenchmarks%2FRBTree-Throughput&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838791571&sdata=Nl2eN81Kjaf5qyNKEaxxc0ioMw6w4QoX4b5vAE5RaF8%3D&reserved=0>
>
>
>                 https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree
>                 <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffryguybob%2Fghc-stm-benchmarks%2Ftree%2Fmaster%2Fbenchmarks%2FRBTree&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838791571&sdata=%2BLp6HQCyROOlpA2pr8BR8DPls68oY5Y77GKgqbSKmno%3D&reserved=0>
>
>                 But can't find conclusion or interpretation of that
>                 benchmark suite. And here's a followup question:
>
>                 Where are some STM contention optimized data
>                 structures, that having keys ordered, with sub-range
>                 traversing api ?
>
>                 (of course production ready libraries most desirable)
>
>                 Thanks with regards,
>
>                 Compl
>
>                 On 2020/7/25 下午2:04, Compl Yue via Haskell-Cafe wrote:
>
>                     Shame on me for I have neither experienced with
>                     `perf`, I'd learn these essential tools soon to
>                     put them into good use.
>
>                     It's great to learn about how `orElse` actually
>                     works, I did get confused why there are so little
>                     retries captured, and now I know. So that little
>                     trick should definitely be removed before going
>                     production, as it does no much useful things at
>                     excessive cost. I put it there to help me
>                     understand internal working of stm, now I get even
>                     better knowledge ;-)
>
>                     I think a debugger will trap every single abort,
>                     isn't it annoying when many aborts would occur? If
>                     I'd like to count the number of aborts, ideally
>                     accounted per service endpoints, time periods,
>                     source modules etc. there some tricks for that?
>
>                     Thanks with best regards,
>
>                     Compl
>
>                     On 2020/7/25 上午2:02, Ryan Yates wrote:
>
>                         To be clear, I was trying to refer to Linux
>                         `perf` [^1]. Sampling based profiling can do a
>                         good job with concurrent and parallel programs
>                         where other methods are problematic.  For
>                         instance,
>
>                          changing the size of heap objects can
>                         drastically change cache performance and
>                         completely different behavior can show up.
>
>                         [^1]:
>                         https://en.wikipedia.org/wiki/Perf_(Linux)
>                         <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FPerf_(Linux)&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838801566&sdata=v%2Bv2aVaBITriAM26CqN%2Bp35yshLl%2BbY4BWVEIOSlStA%3D&reserved=0>
>
>                         The spinning in `readTVar` should always be
>                         very short and it typically shows up as
>                         intensive CPU use, though it may not be high
>                         energy use with `pause` in the loop on x86
>                         (looks like we don't have it [^2], I thought
>                         we did, but maybe that was only in some of my
>                         code... )
>
>                         [^2]:
>                         https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275
>                         <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fghc%2Fghc%2Fblob%2Fmaster%2Frts%2FSTM.c%23L1275&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838801566&sdata=YBLmeg4Xxby%2BJJmO8B5etdA6tDpBYOry7jdjEoRFd%2Fk%3D&reserved=0>
>
>
>                         All that to say, I doubt that you are spending
>                         much time spinning (but it would certainly be
>                         interesting to know if you are!  You would see
>                         `perf` attribute a large amount of time to
>                         `read_current_value`). The amount of code to
>                         execute for commit (the time when locks are
>                         held) is always much shorter than it takes to
>                         execute the transaction body.  As you add more
>                         conflicting threads this gets worse of course
>                         as commits sequence.
>
>                         The code you have will count commits of
>                         executions of `retry`.  Note that `retry` is a
>                         user level idea, that is, you are counting
>                         user level *explicit* retries.  This is
>                         different from a transaction failing to commit
>                         and starting again.  These are invisible to
>                         the user. Also using your trace will convert
>                         `retry` from the efficient wake on write
>                         implementation, to an active retry that will
>                         always attempt again.  We don't have cheap
>                         logging of transaction aborts in GHC, but I
>                         have built such logging in my work.  You can
>                         observe these aborts with a debugger by
>                         looking for execution of this line:
>
>                         https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123
>                         <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fghc%2Fghc%2Fblob%2Fmaster%2Frts%2FSTM.c%23L1123&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838811560&sdata=jAEm1CpEYQx6ORikerxVHOSlaOmrTzB3m9EVmOwo%2B8w%3D&reserved=0>
>
>                         Ryan
>
>                         On Fri, Jul 24, 2020 at 12:35 PM Compl Yue
>                         <compl.yue at icloud.com
>                         <mailto:compl.yue at icloud.com>> wrote:
>
>                             I'm not familiar with profiling GHC yet,
>                             may need more time to get myself
>                             proficient with it.
>
>                             And a bit more details of my test workload
>                             for diagnostic: the db clients are Python
>                             processes from a cluster of worker nodes,
>                             consulting the db server to register some
>                             path for data files, under a data dir
>                             within a shared filesystem, then mmap
>                             those data files and fill in actual array
>                             data. So the db server don't have much
>                             computation to perform, but puts the data
>                             file path into a global index, which at
>                             the same validates its uniqueness. As
>                             there are many client processes trying to
>                             insert one meta data record concurrently,
>                             with my naive implementation, the global
>                             index's TVar will almost always in locked
>                             state by one client after another, from a
>                             queue never fall empty.
>
>                             So if `readTVar` should spinning waiting,
>                             I doubt the threads should actually make
>                             high CPU utilization, because at any
>                             instant of time, all threads except the
>                             committing one will be doing that one thing.
>
>                             And I have something in my code to track
>                             STM retry like this:
>
>                             ```
>
>                             -- blocking wait not expected, track stm
>                             retries explicitly
>
>                             trackSTM:: Int-> IO(Either() a)
>
>                             trackSTM !rtc = do
>
>                             when -- todo increase the threshold of
>                             reporting?
>
>                             (rtc > 0) $ do
>
>                             -- trace out the retries so the end users
>                             can be aware of them
>
>                             tid <- myThreadId
>
>                             trace
>
>                             ( "🔙\n"
>
>                             <> show callCtx
>
>                             <> "🌀"
>
>                             <> show tid
>
>                             <> " stm retry #"
>
>                             <> show rtc
>
>                             )
>
>                             $ return ()
>
>                             atomically ((Just <$> stmJob) `orElse`
>                             return Nothing) >>= \case
>
>                             Nothing -> -- stm failed, do a tracked retry
>
>                             trackSTM (rtc + 1)
>
>                             Just ... -> ...
>
>                             ```
>
>                             No such trace msg fires during my test,
>                             neither in single thread run, nor in runs
>                             with pressure. I'm sure this tracing
>                             mechanism works, as I can see such traces
>                             fire, in case e.g. posting a TMVar to a
>                             TQueue for some other thread to fill it,
>                             then read the result out, if these 2 ops
>                             are composed into a single tx, then of
>                             course it's infinite retry loop, and a
>                             sequence of such msgs are logged with ever
>                             increasing rtc #.
>
>                             So I believe no retry has ever been triggered.
>
>                             What can going on there?
>
>                             On 2020/7/24 下午11:46, Ryan Yates wrote:
>
>                                 > Then to explain the low CPU
>                                 utilization (~10%), am I right to
>                                 understand it as that upon reading a
>                                 TVar locked by another committing tx,
>                                 a lightweight thread will put itself
>                                 into `waiting STM` and descheduled
>                                 state, so the CPUs can only stay idle
>                                 as not so many threads are willing to
>                                 proceed?
>
>                                 Since the commit happens in finite
>                                 steps, the expectation is that the
>                                 lock will be released very soon. Given
>                                 this when the body of a transaction
>                                 executes `readTVar` it spins (active
>                                 CPU!) until the `TVar` is observed
>                                 unlocked.  If a lock is observed while
>                                 commiting, it immediately starts the
>                                 transaction again from the beginning. 
>                                 To get the behavior of suspending a
>                                 transaction you have to successfully
>                                 commit a transaction that executed
>                                 `retry`.  Then the transaction is put
>                                 on the wakeup lists of its read set
>                                 and subsequent commits will wake it up
>                                 if its write set overlaps.
>
>                                 I don't think any of these things
>                                 would explain low CPU utilization. You
>                                 could try running with `perf` and see
>                                 if lots of time is spent in some
>                                 recognizable part of the RTS.
>
>                                 Ryan
>
>                                 On Fri, Jul 24, 2020 at 11:22 AM Compl
>                                 Yue <compl.yue at icloud.com
>                                 <mailto:compl.yue at icloud.com>> wrote:
>
>                                     Thanks very much for the
>                                     insightful information Ryan! I'm
>                                     glad my suspect was wrong about
>                                     the Haskell scheduler:
>
>                                     > The Haskell capability that is
>                                     committing a transaction will not
>                                     yield to another Haskell thread
>                                     while it is doing the commit.  The
>                                     OS thread may be preempted, but
>                                     once commit starts the haskell
>                                     scheduler is not invoked until
>                                     after locks are released.
>
>                                     So best effort had already been
>                                     made in GHC and I just need to
>                                     cooperate better with its design.
>                                     Then to explain the low CPU
>                                     utilization (~10%), am I right to
>                                     understand it as that upon reading
>                                     a TVar locked by another
>                                     committing tx, a lightweight
>                                     thread will put itself into
>                                     `waiting STM` and descheduled
>                                     state, so the CPUs can only stay
>                                     idle as not so many threads are
>                                     willing to proceed?
>
>                                     Anyway, I see light with better
>                                     data structures to improve my
>                                     situation, let me try them and
>                                     report back. Actually I later
>                                     changed `TVar (HaskMap k v)` to be
>                                     `TVar (HashMap k Int)` where the
>                                     `Int` being array index into `TVar
>                                     (Vector (TVar (Maybe v)))`, in
>                                     pursuing insertion order
>                                     preservation semantic of dict
>                                     entries (like that in Python
>                                     3.7+), then it's very hopeful to
>                                     incorporate stm-containers' Map or
>                                     ttrie to approach free of contention.
>
>                                     Thanks with regards,
>
>                                     Compl
>
>                                     On 2020/7/24 下午10:03, Ryan Yates
>                                     wrote:
>
>                                         Hi Compl,
>
>                                         Having a pool of
>                                         transaction processing threads
>                                         can be helpful in a certain
>                                         way. If the body of the
>                                         transaction takes more time to
>                                         execute then the Haskell
>                                         thread is allowed and it
>                                         yields, the suspended thread
>                                         won't get in the way of other
>                                         thread, but when it is
>                                         rescheduled, will have a low
>                                         probability of success.  Even
>                                         worse, it will probably not
>                                         discover that it is doomed to
>                                         failure until commit time.  If
>                                         transactions are more likely
>                                         to reach commit without
>                                         yielding, they are more likely
>                                         to succeed.  If the
>                                         transactions are not
>                                         conflicting, it doesn't make
>                                         much difference other than
>                                         cache churn.
>
>                                         The Haskell capability that is
>                                         committing a transaction will
>                                         not yield to another Haskell
>                                         thread while it is doing the
>                                         commit.  The OS thread may be
>                                         preempted, but once commit
>                                         starts the haskell scheduler
>                                         is not invoked until after
>                                         locks are released.
>
>                                         To get good performance from
>                                         STM you must pay attention to
>                                         what TVars are involved in a
>                                         commit.  All STM systems are
>                                         working under the assumption
>                                         of low contention, so you want
>                                         to minimize "false" conflicts
>                                         (conflicts that are not
>                                         essential to the computation).
>                                           Something like `TVar
>                                         (HashMap k v)` will work
>                                         pretty well for a low thread
>                                         count, but every transaction
>                                         that writes to that structure
>                                         will be in conflict with every
>                                         other transaction that
>                                         accesses it.  Pushing the
>                                         `TVar` into the nodes of the
>                                         structure reduces the
>                                         possibilities for conflict,
>                                         while increasing the amount of
>                                         bookkeeping STM has to do.  I
>                                         would like to reduce the cost
>                                         of that bookkeeping using
>                                         better structures, but we need
>                                         to do so without harming
>                                         performance in the low TVar
>                                         count case. Right now it is
>                                         optimized for good cache
>                                         performance with a handful of
>                                         TVars.
>
>                                         There is another way to play
>                                         with performance by moving
>                                         work into and out of the
>                                         transaction body.  A
>                                         transaction body that executes
>                                         quickly will reach commit
>                                         faster.  But it may be
>                                         delaying work that moves into
>                                         another transaction. Forcing
>                                         values at the right time can
>                                         make a big difference.
>
>                                         Ryan
>
>                                         On Fri, Jul 24, 2020 at 2:14
>                                         AM Compl Yue via Haskell-Cafe
>                                         <haskell-cafe at haskell.org
>                                         <mailto:haskell-cafe at haskell.org>>
>                                         wrote:
>
>                                             Thanks Chris, I confess I
>                                             didn't pay enough
>                                             attention to STM
>                                             specialized container
>                                             libraries by far, I
>                                             skimmed through the
>                                             description of
>                                             stm-containers and ttrie,
>                                             and feel they would
>                                             definitely improve my
>                                             code's performance in case
>                                             I limit the server's
>                                             parallelism within
>                                             hardware capabilities.
>                                             That may because I'm still
>                                             prototyping the api and
>                                             infrastructure for
>                                             correctness, so even `TVar
>                                             (HashMap k v)` performs
>                                             okay for me at the moment,
>                                             only if at low contention
>                                             (surely there're plenty of
>                                             CPU cycles to be optimized
>                                             out in next steps). I
>                                             model my data after graph
>                                             model, so most data, even
>                                             most indices are localized
>                                             to nodes and edges, those
>                                             can be manipulated without
>                                             conflict, that's why I
>                                             assumed I have a low
>                                             contention use case since
>                                             the very beginning - until
>                                             I found there are still
>                                             (though minor) needs for
>                                             global indices to
>                                             guarantee global
>                                             uniqueness, I feel
>                                             faithful with
>                                             stm-containers/ttrie to
>                                             implement a more scalable
>                                             global index data
>                                             structure, thanks for
>                                             hinting me.
>
>                                             So an evident solution
>                                             comes into my mind now, is
>                                             to run the server with a
>                                             pool of tx processing
>                                             threads, matching number
>                                             of CPU cores, client RPC
>                                             requests then get queued
>                                             to be executed in some
>                                             thread from the pool. But
>                                             I'm really fond of the
>                                             mechanism of M:N scheduler
>                                             which solves
>                                             massive/dynamic
>                                             concurrency so elegantly.
>                                             I had some good result
>                                             with Go in this regard,
>                                             and see GHC at par in
>                                             doing this, I don't want
>                                             to give up this enjoyable
>                                             machinery.
>
>                                             But looked at the stm
>                                             implementation in GHC, it
>                                             seems written TVars are
>                                             exclusively locked during
>                                             commit of a tx, I suspect
>                                             this is the culprit when
>                                             there're large M
>                                             lightweight threads
>                                             scheduled upon a small N
>                                             hardware capabilities,
>                                             that is when a lightweight
>                                             thread yield control
>                                             during an stm transaction
>                                             commit, the TVars it
>                                             locked will stay so until
>                                             it's scheduled again (and
>                                             again) till it can finish
>                                             the commit. This way,
>                                             descheduled threads could
>                                             hold live threads from
>                                             progressing. I haven't
>                                             gone into more details
>                                             there, but wonder if there
>                                             can be some improvement
>                                             for GHC RTS to keep an stm
>                                             committing thread from
>                                             descheduled, but seemingly
>                                             that may impose more
>                                             starvation potential; or
>                                             stm can be improved to
>                                             have its TVar locks
>                                             preemptable when the owner
>                                             trec/thread is in
>                                             descheduled state? Neither
>                                             should be easy but I'd
>                                             really love massive
>                                             lightweight threads doing
>                                             STM practically well.
>
>                                             Best regards,
>
>                                             Compl
>
>                                             On 2020/7/24 上午12:57,
>                                             Christopher Allen wrote:
>
>                                                 It seems like you know
>                                                 how to run practical
>                                                 tests for tuning
>                                                 thread count and
>                                                 contention for
>                                                 throughput. Part of
>                                                 the reason you haven't
>                                                 gotten a super clear
>                                                 answer is "it
>                                                 depends." You give up
>                                                 fairness when you use
>                                                 STM instead of MVars
>                                                 or equivalent
>                                                 structures. That means
>                                                 a long running
>                                                 transaction might get
>                                                 stampeded by many
>                                                 small ones
>                                                 invalidating it over
>                                                 and over. The
>                                                 long-running
>                                                 transaction might
>                                                 never clear if
>                                                 the small transactions
>                                                 keep moving the
>                                                 cheese. I mention this
>                                                 because transaction
>                                                 runtime and size and
>                                                 count all affect
>                                                 throughput and
>                                                 latency. What might be
>                                                 ideal for one pattern
>                                                 of work might not be
>                                                 ideal for another.
>                                                 Optimizing for overall
>                                                 throughput might make
>                                                 the contention and
>                                                 fairness problems
>                                                 worse too. I've done
>                                                 practical tests to
>                                                 optimize this in the
>                                                 past, both for STM in
>                                                 Haskell and for RDBMS
>                                                 workloads.
>
>                                                 The next step is
>                                                 sometimes figuring out
>                                                 whether you really
>                                                 need a data structure
>                                                 within a single STM
>                                                 container or if
>                                                 perhaps you can break
>                                                 up your STM container
>                                                 boundaries into zones
>                                                 or regions that
>                                                 roughly map onto
>                                                 update boundaries.
>                                                 That should make the
>                                                 transactions churn
>                                                 less. On the outside
>                                                 chance you do need to
>                                                 touch more than one
>                                                 container in a
>                                                 transaction, well,
>                                                 they compose.
>
>                                                 e.g.
>                                                 https://hackage.haskell.org/package/stm-containers
>                                                 <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhackage.haskell.org%2Fpackage%2Fstm-containers&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838811560&sdata=Lq1%2BGj0Z6%2BBGMRAZrSzcTAlYgj0B0A67RaQcyyCcXbk%3D&reserved=0>
>
>                                                 https://hackage.haskell.org/package/ttrie
>                                                 <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhackage.haskell.org%2Fpackage%2Fttrie&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838821555&sdata=PpaiVM2NrPM2HzK0bh%2BMR8YF90yHlxKnN9gwZVQHqR0%3D&reserved=0>
>
>                                                 It also sounds a bit
>                                                 like your question
>                                                 bumps into Amdahl's
>                                                 Law a bit.
>
>                                                 All else fails, stop
>                                                 using STM and find
>                                                 something more tuned
>                                                 to your problem space.
>
>                                                 Hope this helps,
>
>                                                 Chris Allen
>
>                                                 On Thu, Jul 23, 2020
>                                                 at 9:53 AM YueCompl
>                                                 via Haskell-Cafe
>                                                 <haskell-cafe at haskell.org
>                                                 <mailto:haskell-cafe at haskell.org>>
>                                                 wrote:
>
>                                                     Hello Cafe,
>
>                                                     I'm working on an
>                                                     in-memory
>                                                     database, in
>                                                     Client/Server mode
>                                                     I just let each
>                                                     connected client
>                                                     submit remote
>                                                     procedure call
>                                                     running in its
>                                                     dedicated
>                                                     lightweight
>                                                     thread, modifying
>                                                     TVars in RAM per
>                                                     its business
>                                                     needs, then in
>                                                     case many clients
>                                                     connected
>                                                     concurrently and
>                                                     trying to insert
>                                                     new data, if they
>                                                     are triggering
>                                                     global index (some
>                                                     TVar) update, the
>                                                     throughput would
>                                                     drop drastically.
>                                                     I reduced the
>                                                     shared state to a
>                                                     simple int counter
>                                                     by TVar, got same
>                                                     symptom. While the
>                                                     parallelism feels
>                                                     okay when there's
>                                                     no hot TVar
>                                                     conflicting, or M
>                                                     is not much
>                                                     greater than N.
>
>                                                     As an empirical
>                                                     test workload, I
>                                                     have a `+RTS -N10`
>                                                     server process, it
>                                                     handles 10
>                                                     concurrent clients
>                                                     okay, got ~5x of
>                                                     single thread
>                                                     throughput; but in
>                                                     handling 20
>                                                     concurrent
>                                                     clients, each of
>                                                     the 10 CPUs can
>                                                     only be driven to
>                                                     ~10% utilization,
>                                                     the throughput
>                                                     seems even worse
>                                                     than single
>                                                     thread. More
>                                                     clients can even
>                                                     drive it thrashing
>                                                     without much
>                                                      progressing.
>
>                                                      I can understand
>                                                     that pure STM
>                                                     doesn't scale well
>                                                     after reading [1],
>                                                     and I see it
>                                                     suggested [7]
>                                                     attractive and
>                                                     planned future
>                                                     work toward that
>                                                     direction.
>
>                                                     But I can't find
>                                                     certain libraries
>                                                     or frameworks
>                                                     addressing large M
>                                                     over small N
>                                                     scenarios, [1]
>                                                     experimented with
>                                                     designated N
>                                                     parallelism, and
>                                                     [7] is rather
>                                                     theoretical to my
>                                                     empirical needs.
>
>                                                     Can you direct me
>                                                     to some available
>                                                     library
>                                                     implementing the
>                                                     methodology
>                                                     proposed in [7] or
>                                                     other ways
>                                                     tackling this problem?
>
>                                                     I think the most
>                                                     difficult one is
>                                                     that a transaction
>                                                     should commit with
>                                                     global indices
>                                                     (with possibly
>                                                     unique
>                                                     constraints)
>                                                     atomically
>                                                     updated, and
>                                                     rollback with any
>                                                     violation of
>                                                     constraints, i.e.
>                                                     transactions have
>                                                     to cover global
>                                                     states like
>                                                     indices. Other
>                                                     problems seem more
>                                                     trivial than this.
>
>                                                     Specifically, [7]
>                                                     states:
>
>                                                     > It must be
>                                                     emphasized that
>                                                     all of the
>                                                     mechanisms we
>                                                     deploy originate,
>                                                     in one form or
>                                                     another, in the
>                                                     database
>                                                     literature from
>                                                     the 70s and 80s.
>                                                     Our contribution
>                                                     is to adapt these
>                                                     techniques to
>                                                     software
>                                                     transactional
>                                                     memory, providing
>                                                     more effective
>                                                     solutions to
>                                                     important STM
>                                                     problems than
>                                                     prior proposals.
>
>                                                     I wonder any STM
>                                                     based library has
>                                                     simplified those
>                                                     techniques to be
>                                                     composed right
>                                                     away? I don't
>                                                     really want to
>                                                     implement those
>                                                     mechanisms by
>                                                     myself, rebuilding
>                                                     many wheels from
>                                                     scratch.
>
>                                                     Best regards,
>
>                                                     Compl
>
>                                                     [1] Comparing the
>                                                     performance of
>                                                     concurrent
>                                                     linked-list
>                                                     implementations in
>                                                     Haskell
>
>                                                     https://simonmar.github.io/bib/papers/concurrent-data.pdf
>                                                     <https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsimonmar.github.io%2Fbib%2Fpapers%2Fconcurrent-data.pdf&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838821555&sdata=41Jaz8ZRmRfBHyGKxfhJlm4xR7q0pOtJShtO0jTlOwQ%3D&reserved=0>
>
>                                                     [7] M. Herlihy and
>                                                     E. Koskinen.
>                                                     Transactional
>                                                     boosting: a
>                                                     methodology for
>                                                     highly-concurrent
>                                                     transactional
>                                                     objects. In Proc.
>                                                     of PPoPP ’08,
>                                                     pages 207–216. ACM
>                                                     Press, 2008.
>
>                                                     https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf
>                                                     <https://nam06.safelinks.protection.outlook.com/?url=https:%2F%2Fwww.cs.stevens.edu%2F~ejk%2Fpapers%2Fboosting-ppopp08.pdf&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838831548&sdata=ya8Az1oC6f2xoMb90S9HCH57UTQ0nV9sg6SW%2B5JCPC4%3D&reserved=0>
>
>                                                     _______________________________________________
>                                                     Haskell-Cafe
>                                                     mailing list
>                                                     To (un)subscribe,
>                                                     modify options or
>                                                     view archives go to:
>                                                     http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>                                                     <https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmail.haskell.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fhaskell-cafe&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838831548&sdata=c2AV7CO42o3tcw0EuMzqedKkBCtQjWjvdMoUsb4llbY%3D&reserved=0>
>                                                     Only members
>                                                     subscribed via the
>                                                     mailman list are
>                                                     allowed to post.
>
>
>                                                 -- 
>
>                                                 Chris Allen
>
>                                                 Currently working on
>                                                 http://haskellbook.com
>                                                 <https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fhaskellbook.com%2F&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838831548&sdata=tIHFQFZPIQgRp8oqGRvyebm1YQdCvGD0VoMcflzJwKc%3D&reserved=0>
>
>                                             _______________________________________________
>                                             Haskell-Cafe mailing list
>                                             To (un)subscribe, modify
>                                             options or view archives
>                                             go to:
>                                             http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>                                             <https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmail.haskell.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fhaskell-cafe&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838841547&sdata=vdzv5WBA62cNwO6DA1D4KEHDCweyOerpn1PdMK0A%2BHw%3D&reserved=0>
>                                             Only members subscribed
>                                             via the mailman list are
>                                             allowed to post.
>
>                     _______________________________________________
>
>                     Haskell-Cafe mailing list
>
>                     To (un)subscribe, modify options or view archives go to:
>
>                     http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe  <https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmail.haskell.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fhaskell-cafe&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838841547&sdata=vdzv5WBA62cNwO6DA1D4KEHDCweyOerpn1PdMK0A%2BHw%3D&reserved=0>
>
>                     Only members subscribed via the mailman list are allowed to post.
>
>                 _______________________________________________
>                 Haskell-Cafe mailing list
>                 To (un)subscribe, modify options or view archives go to:
>                 http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>                 <https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmail.haskell.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fhaskell-cafe&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838851540&sdata=Btpa3sjfAjTf2ICO0QpQG5vVCawIjERNjUHji06uG5Y%3D&reserved=0>
>                 Only members subscribed via the mailman list are
>                 allowed to post.
>
>             _______________________________________________
>             Haskell-Cafe mailing list
>             To (un)subscribe, modify options or view archives go to:
>             http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>             <https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmail.haskell.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fhaskell-cafe&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838851540&sdata=Btpa3sjfAjTf2ICO0QpQG5vVCawIjERNjUHji06uG5Y%3D&reserved=0>
>             Only members subscribed via the mailman list are allowed
>             to post.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20200730/5c919c52/attachment-0001.html>