[Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?
Ryan Yates
fryguybob at gmail.com
Wed Jul 29 19:40:58 UTC 2020
Hi Compl,
There is a lot of overhead with TVars. My thesis work addresses this by
incorporating mutable constructor fields with STM. I would like to get all
that into GHC as soon as I can :D. I haven't looked closely at the
`tskiplist` package, I'll take a look and see if I see any potential
issues. There was some recent work on a concurrent B-tree that may be
interesting to try.
Ryan
On Wed, Jul 29, 2020 at 10:24 AM YueCompl <compl.yue at icloud.com> wrote:
> Hi Cafe and Ryan,
>
> I tried Map/Set from stm-containers, and TSkipList (with a range-scan api
> added against its internal data structure) from
> http://hackage.haskell.org/package/tskiplist , and with them I've got much
> improved scalability under concurrency.
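>
> For concreteness, a minimal sketch of that usage pattern (assuming the
> stm-containers 1.x api; `registerPath` and the key/value types here are
> made up for illustration):
>
> ```
> import Control.Concurrent.STM (STM)
> import qualified StmContainers.Map as StmMap
>
> -- register a file path in a unique index; False signals a duplicate
> registerPath :: StmMap.Map FilePath Int -> FilePath -> Int -> STM Bool
> registerPath idx path rid = do
>   existing <- StmMap.lookup path idx
>   case existing of
>     Just _ -> pure False -- duplicate: uniqueness violated
>     Nothing -> do
>       StmMap.insert rid path idx -- nb: stm-containers takes value before key
>       pure True
> ```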
>
> But unfortunately I then hit another wall, at single-thread scalability
> over working-memory size. I suspect it's because those "contention-free"
> data structures introduce massively more TVars (those being pointers per
> se), as they need to mutate separate pointers concurrently to avoid
> contention anyway; but such a pointer-intensive heap seems to impose
> extraordinary pressure on GHC's garbage collector, such that GC comes to
> dominate CPU utilization while business progress stays poor.
>
> For example, in my test I use `+RTS -H2g` for the Haskell server process,
> so GC is not triggered until after a while; then I spin off 3 Python
> clients to insert new records concurrently. In the first stage each Python
> process happily takes ~90% CPU, filling (through local mmap) the arrays
> allocated from the server, and logs of success scroll by quickly, while
> the server process uses only 30~40% CPU to serve those 3 clients (merely
> inserting metadata records into unique indices). Then the client
> processes' CPU utilization drops drastically once the Haskell server
> process' private memory reaches around 2gb, i.e. once GC starts engaging:
> the server process's CPU utilization quickly approaches ~300%, while all
> client processes' drop to 0% for most of the time, occasionally bursting
> for a tiny while with some log output showing progress. And I disabled
> parallel GC lately; enabling parallel GC only makes it worse.
>
> If I comment out the code updating the indices (the code creating many
> TVars), the overall throughput drops only slowly as more data is inserted,
> and the parallelism feels steady even after the server process' private
> memory takes several GBs.
>
> I didn't expect this, but it appears to me that GHC's GC is really not
> good at handling a massive number of pointers in the heap, even though
> those pointers are essential to reduce contention (and maybe expensive
> data copying too) under heavy parallelism/concurrency.
>
> Btw I tried `+RTS -xn` with GHC 8.10.1 too; no obviously different
> behavior compared to 8.8.3. I also tried tweaking GC-related RTS options a
> bit, including increasing -G up to 10, with not much difference either.
>
> I feel hopeless at the moment, wondering if I'll have to rewrite this
> in-memory db in Go/Rust or some other runtime ...
>
> Btw I read
> https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html while
> searching about the symptoms, and I don't see a likely way to convert my
> DB-managed data into immutable types so as to fit into Compact Regions;
> that doesn't seem feasible for a live in-mem database instance.
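>
> (For reference, this is roughly what the Compact Regions approach from
> that post looks like, sketched with the ghc-compact package; it only
> accepts fully evaluated immutable data, with no TVars inside, which is
> exactly the mismatch with a live mutable db:)
>
> ```
> import qualified Data.Map.Strict as Map
> import GHC.Compact (compact, getCompact)
>
> main :: IO ()
> main = do
>   let snapshot = Map.fromList [(i, show i) | i <- [1 .. 1000000 :: Int]]
>   region <- compact snapshot -- deep-copies into a GC-opaque region
>   print (Map.size (getCompact region)) -- GC no longer traces these pointers
> ```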
>
> So it seems there are good reasons that no successful DBMS, at least no
> in-memory one, has been written in Haskell.
>
> Best regards,
> Compl
>
>
> On 2020-07-25, at 22:07, Ryan Yates <fryguybob at gmail.com> wrote:
>
> Unfortunately my STM benchmarks are rather disorganized. The most
> relevant paper using them is:
>
> Leveraging hardware TM in Haskell (PPoPP '19)
> https://dl.acm.org/doi/10.1145/3293883.3295711
>
> Or my thesis:
>
> https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931
>
>
> The PPoPP benchmarks are on a branch (or the releases tab on github):
>
> https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src
>
>
>
> All that to say, without an implementation of mutable constructor fields
> (which I'm working on getting into GHC) the scaling is limited.
>
> Ryan
>
>
> On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe <
> haskell-cafe at haskell.org> wrote:
>
>> Dear Cafe,
>>
>> As Chris Allen has suggested, I learned that
>> https://hackage.haskell.org/package/stm-containers and
>> https://hackage.haskell.org/package/ttrie can help a lot when used in
>> place of a traditional HashMap for STM tx processing under heavy
>> concurrency, while still getting the automatic parallelism as GHC
>> implements it. Then I realized that in addition to a hash map (used to
>> implement dicts and scopes), I also need to find a TreeMap-replacement
>> data structure to implement the db index. I've been focusing on the
>> uniqueness-constraint aspect, but it's still an index and needs to provide
>> a range-scan api for db clients, so a hash map is not sufficient for the
>> index.
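>>
>> To make the requirement concrete, here is the shape of api I'm after,
>> sketched over a single coarse-grained TVar holding a Data.Map (correct,
>> but then every write conflicts with every scan, which is exactly the
>> contention problem in question):
>>
>> ```
>> import Control.Concurrent.STM (STM, TVar, readTVar)
>> import qualified Data.Map.Strict as Map
>>
>> type TreeIndex k v = TVar (Map.Map k v)
>>
>> -- all entries with lo <= k <= hi, in ascending key order
>> scanRange :: Ord k => TreeIndex k v -> k -> k -> STM [(k, v)]
>> scanRange idx lo hi = do
>>   m <- readTVar idx
>>   -- a real implementation would use Map.split for O(log n) bounds;
>>   -- the O(n) filter keeps the sketch obvious
>>   pure (Map.toAscList (Map.filterWithKey (\k _ -> k >= lo && k <= hi) m))
>> ```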
>>
>> I see Ryan shared the code benchmarking RBTree with stm in mind:
>>
>>
>> https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput
>>
>>
>> https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree
>>
>> But I can't find conclusions or interpretations of that benchmark suite.
>> And here's a followup question:
>>
>>
>> Where are some STM contention-optimized data structures that keep keys
>> ordered, with a sub-range traversal api?
>>
>> (Of course production-ready libraries are most desirable.)
>>
>>
>> Thanks with regards,
>>
>> Compl
>>
>>
>> On 2020/7/25 2:04, Compl Yue via Haskell-Cafe wrote:
>>
>> Shame on me, I have no experience with `perf` either; I'll learn these
>> essential tools soon and put them to good use.
>>
>> It's great to learn how `orElse` actually works; I was confused about why
>> so few retries were captured, and now I know. So that little trick should
>> definitely be removed before going to production, as it does little useful
>> work at excessive cost. I put it there to help me understand the internal
>> workings of STM, and now I have even better knowledge ;-)
>>
>> I think a debugger will trap every single abort; isn't that annoying when
>> many aborts occur? If I'd like to count the number of aborts, ideally
>> accounted per service endpoint, time period, source module etc., are there
>> some tricks for that?
>>
>> Thanks with best regards,
>>
>> Compl
>>
>>
>> On 2020/7/25 2:02, Ryan Yates wrote:
>>
>> To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based
>> profiling can do a good job with concurrent and parallel programs where
>> other methods are problematic. For instance,
>> changing the size of heap objects can drastically change cache
>> performance and completely different behavior can show up.
>>
>> [^1]: https://en.wikipedia.org/wiki/Perf_(Linux)
>>
>> The spinning in `readTVar` should always be very short and it typically
>> shows up as intensive CPU use, though it may not be high energy use with
>> `pause` in the loop on x86 (looks like we don't have it [^2], I thought we
>> did, but maybe that was only in some of my code... )
>>
>> [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275
>>
>> All that to say, I doubt that you are spending much time spinning (but it
>> would certainly be interesting to know if you are! You would see `perf`
>> attribute a large amount of time to `read_current_value`). The amount of
>> code to execute for commit (the time when locks are held) is always much
>> shorter than it takes to execute the transaction body. As you add more
>> conflicting threads this gets worse of course as commits sequence.
>>
>> The code you have will count commits of executions of `retry`. Note that
>> `retry` is a user-level idea; that is, you are counting user-level
>> *explicit* retries. This is different from a transaction failing to commit
>> and starting again; those are invisible to the user. Also, using your
>> trace will convert `retry` from the efficient wake-on-write implementation
>> to an active retry that will always attempt again. We don't have cheap
>> logging of transaction aborts in GHC, but I have built such logging in my
>> work. You can observe these aborts with a debugger by looking for
>> execution of this line:
>>
>> https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123
>>
>> Ryan
>>
>>
>>
>> On Fri, Jul 24, 2020 at 12:35 PM Compl Yue <compl.yue at icloud.com> wrote:
>>
>>> I'm not familiar with profiling GHC yet; I may need more time to get
>>> myself proficient with it.
>>>
>>> And a bit more detail on my test workload, for diagnosis: the db
>>> clients are Python processes from a cluster of worker nodes, consulting the
>>> db server to register paths for data files under a data dir within a
>>> shared filesystem, then mmap'ing those data files and filling in the actual
>>> array data. So the db server doesn't have much computation to perform; it
>>> just puts each data file path into a global index, which at the same time
>>> validates its uniqueness. As there are many client processes trying to
>>> insert one metadata record concurrently, with my naive implementation the
>>> global index's TVar will almost always be held locked by one client after
>>> another, from a queue that never falls empty.
>>>
>>> So if `readTVar` spins while waiting, I would expect the threads to
>>> show high CPU utilization, because at any instant of time, all threads
>>> except the committing one would be doing that one thing.
>>>
>>> And I have something in my code to track STM retries explicitly, like this:
>>>
>>> ```
>>> -- blocking wait not expected, track stm retries explicitly
>>> -- (`callCtx` and `stmJob` are bound in the enclosing scope)
>>> trackSTM :: Int -> IO (Either () a)
>>> trackSTM !rtc = do
>>>   when -- todo increase the threshold of reporting?
>>>     (rtc > 0) $ do
>>>       -- trace out the retries so the end users can be aware of them
>>>       tid <- myThreadId
>>>       trace
>>>         ( "π\n"
>>>             <> show callCtx
>>>             <> "π "
>>>             <> show tid
>>>             <> " stm retry #"
>>>             <> show rtc
>>>         )
>>>         $ return ()
>>>   atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case
>>>     Nothing -> -- stm failed, do a tracked retry
>>>       trackSTM (rtc + 1)
>>>     Just ... -> ...
>>> ```
>>>
>>> No such trace msg fires during my test, neither in a single-thread run,
>>> nor in runs under pressure. I'm sure this tracing mechanism works, as I
>>> can see such traces fire in other cases: e.g. if posting a TMVar to a
>>> TQueue for some other thread to fill, then reading the result back out,
>>> are composed into a single tx, that is of course an infinite retry loop,
>>> and a sequence of such msgs gets logged with an ever-increasing rtc #.
>>>
>>> So I believe no retry has ever been triggered.
>>>
>>> What can be going on there?
>>>
>>>
>>> On 2020/7/24 11:46, Ryan Yates wrote:
>>>
>>> > Then to explain the low CPU utilization (~10%), am I right to
>>> understand it as: upon reading a TVar locked by another committing tx, a
>>> lightweight thread will put itself into `waiting STM`, descheduled state,
>>> so the CPUs can only stay idle, as not so many threads are willing to
>>> proceed?
>>>
>>> Since the commit happens in finite steps, the expectation is that the
>>> lock will be released very soon. Given this when the body of a transaction
>>> executes `readTVar` it spins (active CPU!) until the `TVar` is observed
>>> unlocked. If a lock is observed while committing, it immediately starts the
>>> transaction again from the beginning. To get the behavior of suspending a
>>> transaction you have to successfully commit a transaction that executed
>>> `retry`. Then the transaction is put on the wakeup lists of its read set
>>> and subsequent commits will wake it up if its write set overlaps.
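>>>
>>> A tiny illustration of that suspend path (a standard blocking read,
>>> nothing GHC-internal; committing the `retry` parks the thread on the
>>> wakeup list of the TVar it read):
>>>
>>> ```
>>> import Control.Concurrent.STM
>>>
>>> takeItem :: TVar [a] -> STM a
>>> takeItem tv = do
>>>   xs <- readTVar tv
>>>   case xs of
>>>     [] -> retry -- suspend; woken when a later commit writes tv
>>>     (x : rest) -> do
>>>       writeTVar tv rest
>>>       pure x
>>> ```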
>>>
>>> I don't think any of these things would explain low CPU utilization.
>>> You could try running with `perf` and see if lots of time is spent in some
>>> recognizable part of the RTS.
>>>
>>> Ryan
>>>
>>>
>>> On Fri, Jul 24, 2020 at 11:22 AM Compl Yue <compl.yue at icloud.com> wrote:
>>>
>>>> Thanks very much for the insightful information Ryan! I'm glad my
>>>> suspect was wrong about the Haskell scheduler:
>>>>
>>>> > The Haskell capability that is committing a transaction will not
>>>> yield to another Haskell thread while it is doing the commit. The OS
>>>> thread may be preempted, but once commit starts the Haskell scheduler is
>>>> not invoked until after locks are released.
>>>> So best effort has already been made in GHC, and I just need to
>>>> cooperate better with its design. Then to explain the low CPU utilization
>>>> (~10%), am I right to understand it as: upon reading a TVar locked by
>>>> another committing tx, a lightweight thread will put itself into `waiting
>>>> STM`, descheduled state, so the CPUs can only stay idle, as not so many
>>>> threads are willing to proceed?
>>>>
>>>> Anyway, I see light with better data structures to improve my
>>>> situation; let me try them and report back. Actually I later changed `TVar
>>>> (HashMap k v)` to `TVar (HashMap k Int)`, with the `Int` being an array
>>>> index into `TVar (Vector (TVar (Maybe v)))`, in pursuit of the
>>>> insertion-order-preserving semantics of dict entries (like those in Python
>>>> 3.7+); so it's very hopeful to incorporate stm-containers' Map or ttrie to
>>>> approach freedom from contention.
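>>>>
>>>> (Roughly this shape, as a sketch; the names are made up and the vector
>>>> append is left naive:)
>>>>
>>>> ```
>>>> import Control.Concurrent.STM
>>>> import Data.Hashable (Hashable)
>>>> import qualified Data.HashMap.Strict as HM
>>>> import qualified Data.Vector as V
>>>>
>>>> -- each slot has its own TVar, so updating an existing entry
>>>> -- conflicts only on that entry
>>>> data IODict k v = IODict
>>>>   { iodIndex :: TVar (HM.HashMap k Int) -- key -> slot number
>>>>   , iodSlots :: TVar (V.Vector (TVar (Maybe v))) -- insertion order
>>>>   }
>>>>
>>>> iodInsert :: (Eq k, Hashable k) => IODict k v -> k -> v -> STM ()
>>>> iodInsert d k v = do
>>>>   hm <- readTVar (iodIndex d)
>>>>   case HM.lookup k hm of
>>>>     Just i -> do -- existing key: touch only its slot
>>>>       slots <- readTVar (iodSlots d)
>>>>       writeTVar (slots V.! i) (Just v)
>>>>     Nothing -> do -- new key: append a fresh slot
>>>>       slot <- newTVar (Just v)
>>>>       slots <- readTVar (iodSlots d)
>>>>       writeTVar (iodSlots d) (V.snoc slots slot) -- O(n) copy, sketch only
>>>>       writeTVar (iodIndex d) (HM.insert k (V.length slots) hm)
>>>> ```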
>>>>
>>>> Thanks with regards,
>>>>
>>>> Compl
>>>>
>>>>
>>>> On 2020/7/24 10:03, Ryan Yates wrote:
>>>>
>>>> Hi Compl,
>>>>
>>>> Having a pool of transaction processing threads can be helpful in a
>>>> certain way. If the body of the transaction takes more time to execute
>>>> than the Haskell thread is allowed and it yields, the suspended thread
>>>> won't get in the way of other threads, but when it is rescheduled it will
>>>> have a low probability of success. Even worse, it will probably not discover
>>>> that it is doomed to failure until commit time. If transactions are more
>>>> likely to reach commit without yielding, they are more likely to succeed.
>>>> If the transactions are not conflicting, it doesn't make much difference
>>>> other than cache churn.
>>>>
>>>> The Haskell capability that is committing a transaction will not yield
>>>> to another Haskell thread while it is doing the commit. The OS thread may
>>>> be preempted, but once commit starts the Haskell scheduler is not invoked
>>>> until after locks are released.
>>>>
>>>> To get good performance from STM you must pay attention to what TVars
>>>> are involved in a commit. All STM systems are working under the assumption
>>>> of low contention, so you want to minimize "false" conflicts (conflicts
>>>> that are not essential to the computation). Something like `TVar
>>>> (HashMap k v)` will work pretty well for a low thread count, but every
>>>> transaction that writes to that structure will be in conflict with every
>>>> other transaction that accesses it. Pushing the `TVar` into the nodes of
>>>> the structure reduces the possibilities for conflict, while increasing the
>>>> amount of bookkeeping STM has to do. I would like to reduce the cost of
>>>> that bookkeeping using better structures, but we need to do so without
>>>> harming performance in the low TVar count case. Right now it is optimized
>>>> for good cache performance with a handful of TVars.
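>>>>
>>>> A minimal sketch of that trade-off (illustrative names; only the update
>>>> path is shown):
>>>>
>>>> ```
>>>> import Control.Concurrent.STM
>>>> import Data.Hashable (Hashable)
>>>> import qualified Data.HashMap.Strict as HM
>>>>
>>>> -- coarse: every write conflicts with every transaction touching the map
>>>> type Coarse k v = TVar (HM.HashMap k v)
>>>>
>>>> -- finer: each value sits behind its own TVar
>>>> type Finer k v = TVar (HM.HashMap k (TVar v))
>>>>
>>>> updateFiner :: (Eq k, Hashable k) => Finer k v -> k -> v -> STM ()
>>>> updateFiner m k v = do
>>>>   hm <- readTVar m
>>>>   case HM.lookup k hm of
>>>>     Just slot -> writeTVar slot v -- touches only this entry's TVar
>>>>     Nothing -> do -- inserts still conflict on the outer TVar
>>>>       slot <- newTVar v
>>>>       writeTVar m (HM.insert k slot hm)
>>>> ```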
>>>>
>>>> There is another way to play with performance by moving work into and
>>>> out of the transaction body. A transaction body that executes quickly will
>>>> reach commit faster. But it may be delaying work that moves into another
>>>> transaction. Forcing values at the right time can make a big difference.
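>>>>
>>>> For example, a sketch of forcing a value before the transaction rather
>>>> than inside it (deepseq's `force` is one choice among several):
>>>>
>>>> ```
>>>> import Control.Concurrent.STM
>>>> import Control.DeepSeq (NFData, force)
>>>> import Control.Exception (evaluate)
>>>> import qualified Data.Map.Strict as Map
>>>>
>>>> -- evaluate the payload up front, so the transaction body
>>>> -- (and thus the window for conflicts) stays short
>>>> insertForced :: (Ord k, NFData v) => TVar (Map.Map k v) -> k -> v -> IO ()
>>>> insertForced tv k v = do
>>>>   v' <- evaluate (force v)
>>>>   atomically (modifyTVar' tv (Map.insert k v'))
>>>> ```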
>>>>
>>>> Ryan
>>>>
>>>> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe <
>>>> haskell-cafe at haskell.org> wrote:
>>>>
>>>>> Thanks Chris, I confess I haven't paid enough attention to
>>>>> STM-specialized container libraries so far. I skimmed through the
>>>>> descriptions of stm-containers and ttrie, and feel they would definitely
>>>>> improve my code's performance, provided I limit the server's parallelism
>>>>> to hardware capabilities. That may be because I'm still prototyping the
>>>>> api and infrastructure for correctness, so even `TVar (HashMap k v)`
>>>>> performs okay for me at the moment, though only at low contention (surely
>>>>> there are plenty of CPU cycles to be optimized away in next steps). I
>>>>> model my data after a graph model, so most data, and even most indices,
>>>>> are localized to nodes and edges, and those can be manipulated without
>>>>> conflict; that's why I assumed I had a low-contention use case from the
>>>>> very beginning - until I found there are still (though minor) needs for
>>>>> global indices to guarantee global uniqueness. I feel confident using
>>>>> stm-containers/ttrie to implement a more scalable global index data
>>>>> structure; thanks for the hint.
>>>>>
>>>>> So an evident solution that comes to my mind now is to run the server
>>>>> with a pool of tx-processing threads matching the number of CPU cores,
>>>>> with client RPC requests queued for execution by some thread from the
>>>>> pool. But I'm really fond of the M:N scheduler mechanism, which solves
>>>>> massive/dynamic concurrency so elegantly. I had good results with Go in
>>>>> this regard, and see GHC on par in doing this; I don't want to give up
>>>>> this enjoyable machinery.
>>>>>
>>>>> But looking at the STM implementation in GHC, it seems written TVars
>>>>> are exclusively locked during commit of a tx. I suspect this is the
>>>>> culprit when a large M of lightweight threads is scheduled upon a small N
>>>>> of hardware capabilities: when a lightweight thread yields control during
>>>>> an stm transaction commit, the TVars it locked will stay locked until it's
>>>>> scheduled again (and again) till it can finish the commit. This way,
>>>>> descheduled threads could hold live threads back from progressing. I
>>>>> haven't gone into more detail there, but I wonder if there could be some
>>>>> improvement in the GHC RTS to keep an stm-committing thread from being
>>>>> descheduled, though seemingly that may impose more starvation potential;
>>>>> or could stm be improved to have its TVar locks preemptable when the owner
>>>>> trec/thread is in a descheduled state? Neither should be easy, but I'd
>>>>> really love massive lightweight threads doing STM practically well.
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Compl
>>>>>
>>>>>
>>>>> On 2020/7/24 12:57, Christopher Allen wrote:
>>>>>
>>>>> It seems like you know how to run practical tests for tuning thread
>>>>> count and contention for throughput. Part of the reason you haven't gotten
>>>>> a super clear answer is "it depends." You give up fairness when you use STM
>>>>> instead of MVars or equivalent structures. That means a long running
>>>>> transaction might get stampeded by many small ones invalidating it over and
>>>>> over. The long-running transaction might never clear if the small
>>>>> transactions keep moving the cheese. I mention this because transaction
>>>>> runtime and size and count all affect throughput and latency. What might be
>>>>> ideal for one pattern of work might not be ideal for another. Optimizing
>>>>> for overall throughput might make the contention and fairness problems
>>>>> worse too. I've done practical tests to optimize this in the past, both for
>>>>> STM in Haskell and for RDBMS workloads.
>>>>>
>>>>> The next step is sometimes figuring out whether you really need a data
>>>>> structure within a single STM container or if perhaps you can break up your
>>>>> STM container boundaries into zones or regions that roughly map onto update
>>>>> boundaries. That should make the transactions churn less. On the outside
>>>>> chance you do need to touch more than one container in a transaction, well,
>>>>> they compose.
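>>>>>
>>>>> One concrete reading of the "zones or regions" idea, sketched as a
>>>>> striped counter (names are illustrative):
>>>>>
>>>>> ```
>>>>> import Control.Concurrent.STM
>>>>> import qualified Data.Vector as V
>>>>>
>>>>> -- stripe hot state across several TVars so unrelated updates stop
>>>>> -- conflicting; reading the total still touches every stripe
>>>>> newtype Striped = Striped (V.Vector (TVar Int))
>>>>>
>>>>> newStriped :: Int -> IO Striped
>>>>> newStriped n = Striped . V.fromList <$> mapM (const (newTVarIO 0)) [1 .. n]
>>>>>
>>>>> -- the caller picks a stripe, e.g. by a hash of its thread id
>>>>> bump :: Striped -> Int -> STM ()
>>>>> bump (Striped ts) i = modifyTVar' (ts V.! (i `mod` V.length ts)) (+ 1)
>>>>>
>>>>> total :: Striped -> STM Int
>>>>> total (Striped ts) = sum <$> mapM readTVar (V.toList ts)
>>>>> ```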
>>>>>
>>>>> e.g. https://hackage.haskell.org/package/stm-containers
>>>>> https://hackage.haskell.org/package/ttrie
>>>>>
>>>>> It also sounds like your question bumps into Amdahl's Law a bit.
>>>>>
>>>>> All else fails, stop using STM and find something more tuned to your
>>>>> problem space.
>>>>>
>>>>> Hope this helps,
>>>>> Chris Allen
>>>>>
>>>>>
>>>>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe <
>>>>> haskell-cafe at haskell.org> wrote:
>>>>>
>>>>>> Hello Cafe,
>>>>>>
>>>>>> I'm working on an in-memory database. In Client/Server mode I just
>>>>>> let each connected client submit remote procedure calls, each running in
>>>>>> its dedicated lightweight thread, modifying TVars in RAM per its business
>>>>>> needs. Then, when many clients connect concurrently and try to insert new
>>>>>> data, if they trigger updates to a global index (some TVar), the
>>>>>> throughput drops drastically. I reduced the shared state to a simple int
>>>>>> counter in a TVar and got the same symptom, while the parallelism feels
>>>>>> okay when there's no hot TVar conflicting, or M is not much greater than N.
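>>>>>>
>>>>>> A minimal repro of that hot-counter symptom (a sketch; uses the async
>>>>>> package, compile with -threaded and run with `+RTS -N`):
>>>>>>
>>>>>> ```
>>>>>> import Control.Concurrent.Async (replicateConcurrently_)
>>>>>> import Control.Concurrent.STM
>>>>>> import Control.Monad (replicateM_)
>>>>>>
>>>>>> main :: IO ()
>>>>>> main = do
>>>>>>   counter <- newTVarIO (0 :: Int)
>>>>>>   -- 20 "clients" hammering one TVar: commits serialize on its lock
>>>>>>   -- and invalidated transactions re-run from the start
>>>>>>   replicateConcurrently_ 20 $
>>>>>>     replicateM_ 100000 (atomically (modifyTVar' counter (+ 1)))
>>>>>>   readTVarIO counter >>= print
>>>>>> ```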
>>>>>>
>>>>>> As an empirical test workload, I have a `+RTS -N10` server process;
>>>>>> it handles 10 concurrent clients okay, with ~5x single-thread throughput,
>>>>>> but in handling 20 concurrent clients, each of the 10 CPUs can only be
>>>>>> driven to ~10% utilization, and the throughput seems even worse than
>>>>>> single thread. More clients can even drive it into thrashing without
>>>>>> making much progress.
>>>>>>
>>>>>> I can understand that pure STM doesn't scale well, after reading [1],
>>>>>> and I see it suggests [7] as attractive, planned future work toward that
>>>>>> direction.
>>>>>>
>>>>>> But I can't find libraries or frameworks addressing the scenario of
>>>>>> large M over small N; [1] experimented with designated N parallelism, and
>>>>>> [7] is rather theoretical relative to my empirical needs.
>>>>>>
>>>>>> Can you direct me to some available library implementing the
>>>>>> methodology proposed in [7], or other ways of tackling this problem?
>>>>>>
>>>>>> I think the most difficult part is that a transaction should commit
>>>>>> with global indices (with possibly unique constraints) atomically
>>>>>> updated, and roll back upon any violation of constraints, i.e.
>>>>>> transactions have to cover global state like indices. Other problems seem
>>>>>> more trivial than this.
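>>>>>>
>>>>>> To be concrete about the rollback half, which does compose in plain
>>>>>> stm already, a sketch (`UniqueViolation` is a made-up name):
>>>>>>
>>>>>> ```
>>>>>> import Control.Concurrent.STM
>>>>>> import Control.Exception (Exception)
>>>>>> import qualified Data.Map.Strict as Map
>>>>>>
>>>>>> newtype UniqueViolation = UniqueViolation String deriving Show
>>>>>> instance Exception UniqueViolation
>>>>>>
>>>>>> -- throwSTM aborts the transaction and rolls back all of its effects,
>>>>>> -- so the index update and any other writes in the tx stay atomic
>>>>>> insertUnique :: Ord k => TVar (Map.Map k v) -> k -> v -> STM ()
>>>>>> insertUnique idx k v = do
>>>>>>   m <- readTVar idx
>>>>>>   case Map.lookup k m of
>>>>>>     Just _ -> throwSTM (UniqueViolation "duplicate key")
>>>>>>     Nothing -> writeTVar idx (Map.insert k v m)
>>>>>> ```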
>>>>>>
>>>>>> Specifically, [7] states:
>>>>>>
>>>>>> > It must be emphasized that all of the mechanisms we deploy
>>>>>> originate, in one form or another, in the database literature from the 70s
>>>>>> and 80s. Our contribution is to adapt these techniques to software
>>>>>> transactional memory, providing more effective solutions to important STM
>>>>>> problems than prior proposals.
>>>>>>
>>>>>> I wonder whether any STM-based library has simplified those techniques
>>>>>> to be composable right away? I don't really want to implement those
>>>>>> mechanisms myself, rebuilding many wheels from scratch.
>>>>>>
>>>>>> Best regards,
>>>>>> Compl
>>>>>>
>>>>>>
>>>>>> [1] Comparing the performance of concurrent linked-list
>>>>>> implementations in Haskell
>>>>>> https://simonmar.github.io/bib/papers/concurrent-data.pdf
>>>>>>
>>>>>> [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology
>>>>>> for highly-concurrent transactional objects. In Proc. of PPoPP '08, pages
>>>>>> 207-216. ACM Press, 2008.
>>>>>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Chris Allen
>>>>> Currently working on http://haskellbook.com
>>>>>
>>>>
>>>>
>>
>
>
>
>