[Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?

Ryan Yates fryguybob at gmail.com
Sat Jul 25 14:07:10 UTC 2020

Unfortunately my STM benchmarks are rather disorganized.  The most relevant
paper using them is:

Leveraging hardware TM in Haskell (PPoPP '19)

Or my thesis:

 The PPoPP benchmarks are on a branch (or the releases tab on github):

 All that to say, without an implementation of mutable constructor fields
(which I'm working on getting into GHC) the scaling is limited.


On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe <
haskell-cafe at haskell.org> wrote:

> Dear Cafe,
> As Chris Allen has suggested, I learned that
> https://hackage.haskell.org/package/stm-containers and
> https://hackage.haskell.org/package/ttrie can help a lot when used in
> place of traditional HashMap for stm tx processing, under heavy
> concurrency, yet still with automatic parallelism as GHC implemented them.
> Then I realized that in addition to hash map (used to implement dicts and
> scopes), I also need to find a TreeMap replacement data structure to
> implement the db index. I've been focusing on the uniqueness constraint
> aspect, but it's still an index, needs to provide range scan api for db
> clients, so hash map is not sufficient for the index.
> I see Ryan shared the code benchmarking RBTree with stm in mind:
> https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput
> https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree
> But can't find conclusion or interpretation of that benchmark suite. And
> here's a followup question:
> Where are some STM contention optimized data structures, that having keys
> ordered, with sub-range traversing api ?
> (of course production ready libraries most desirable)
> Thanks with regards,
> Compl
> On 2020/7/25 δΈ‹εˆ2:04, Compl Yue via Haskell-Cafe wrote:
> Shame on me for I have neither experienced with `perf`, I'd learn these
> essential tools soon to put them into good use.
> It's great to learn about how `orElse` actually works, I did get confused
> why there are so little retries captured, and now I know. So that little
> trick should definitely be removed before going production, as it does no
> much useful things at excessive cost. I put it there to help me understand
> internal working of stm, now I get even better knowledge ;-)
> I think a debugger will trap every single abort, isn't it annoying when
> many aborts would occur? If I'd like to count the number of aborts, ideally
> accounted per service endpoints, time periods, source modules etc. there
> some tricks for that?
> Thanks with best regards,
> Compl
> On 2020/7/25 上午2:02, Ryan Yates wrote:
> To be clear, I was trying to refer to Linux `perf` [^1].  Sampling based
> profiling can do a good job with concurrent and parallel programs where
> other methods are problematic.  For instance,
>  changing the size of heap objects can drastically change cache
> performance and completely different behavior can show up.
> [^1]: https://en.wikipedia.org/wiki/Perf_(Linux)
> The spinning in `readTVar` should always be very short and it typically
> shows up as intensive CPU use, though it may not be high energy use with
> `pause` in the loop on x86 (looks like we don't have it [^2], I thought we
> did, but maybe that was only in some of my code... )
> [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275
> All that to say, I doubt that you are spending much time spinning (but it
> would certainly be interesting to know if you are!  You would see `perf`
> attribute a large amount of time to `read_current_value`).  The amount of
> code to execute for commit (the time when locks are held) is always much
> shorter than it takes to execute the transaction body.  As you add more
> conflicting threads this gets worse of course as commits sequence.
> The code you have will count commits of executions of `retry`.  Note that
> `retry` is a user level idea, that is, you are counting user level
> *explicit* retries.  This is different from a transaction failing to commit
> and starting again.  These are invisible to the user.  Also using your
> trace will convert `retry` from the efficient wake on write implementation,
> to an active retry that will always attempt again.  We don't have cheap
> logging of transaction aborts in GHC, but I have built such logging in my
> work.  You can observe these aborts with a debugger by looking for
> execution of this line:
> https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123
> Ryan
> On Fri, Jul 24, 2020 at 12:35 PM Compl Yue <compl.yue at icloud.com> wrote:
>> I'm not familiar with profiling GHC yet, may need more time to get myself
>> proficient with it.
>> And a bit more details of my test workload for diagnostic: the db clients
>> are Python processes from a cluster of worker nodes, consulting the db
>> server to register some path for data files, under a data dir within a
>> shared filesystem, then mmap those data files and fill in actual array
>> data. So the db server don't have much computation to perform, but puts the
>> data file path into a global index, which at the same validates its
>> uniqueness. As there are many client processes trying to insert one meta
>> data record concurrently, with my naive implementation, the global index's
>> TVar will almost always in locked state by one client after another, from a
>> queue never fall empty.
>> So if `readTVar` should spinning waiting, I doubt the threads should
>> actually make high CPU utilization, because at any instant of time, all
>> threads except the committing one will be doing that one thing.
>> And I have something in my code to track STM retry like this:
>> ```
>> -- blocking wait not expected, track stm retries explicitly
>> trackSTM :: Int -> IO (Either () a)
>> trackSTM !rtc = do
>> when -- todo increase the threshold of reporting?
>> (rtc > 0) $ do
>> -- trace out the retries so the end users can be aware of them
>> tid <- myThreadId
>> trace
>> ( "πŸ”™\n"
>> <> show callCtx
>> <> "πŸŒ€ "
>> <> show tid
>> <> " stm retry #"
>> <> show rtc
>> )
>> $ return ()
>> atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case
>> Nothing -> -- stm failed, do a tracked retry
>> trackSTM (rtc + 1)
>> Just ... -> ...
>> ```
>> No such trace msg fires during my test, neither in single thread run, nor
>> in runs with pressure. I'm sure this tracing mechanism works, as I can see
>> such traces fire, in case e.g. posting a TMVar to a TQueue for some other
>> thread to fill it, then read the result out, if these 2 ops are composed
>> into a single tx, then of course it's infinite retry loop, and a sequence
>> of such msgs are logged with ever increasing rtc #.
>> So I believe no retry has ever been triggered.
>> What can going on there?
>> On 2020/7/24 δΈ‹εˆ11:46, Ryan Yates wrote:
>> > Then to explain the low CPU utilization (~10%), am I right to
>> understand it as that upon reading a TVar locked by another committing tx,
>> a lightweight thread will put itself into `waiting STM` and descheduled
>> state, so the CPUs can only stay idle as not so many threads are willing to
>> proceed?
>> Since the commit happens in finite steps, the expectation is that the
>> lock will be released very soon.  Given this when the body of a transaction
>> executes `readTVar` it spins (active CPU!) until the `TVar` is observed
>> unlocked.  If a lock is observed while commiting, it immediately starts the
>> transaction again from the beginning.  To get the behavior of suspending a
>> transaction you have to successfully commit a transaction that executed
>> `retry`.  Then the transaction is put on the wakeup lists of its read set
>> and subsequent commits will wake it up if its write set overlaps.
>> I don't think any of these things would explain low CPU utilization.  You
>> could try running with `perf` and see if lots of time is spent in some
>> recognizable part of the RTS.
>> Ryan
>> On Fri, Jul 24, 2020 at 11:22 AM Compl Yue <compl.yue at icloud.com> wrote:
>>> Thanks very much for the insightful information Ryan! I'm glad my
>>> suspect was wrong about the Haskell scheduler:
>>> > The Haskell capability that is committing a transaction will not yield
>>> to another Haskell thread while it is doing the commit.  The OS thread may
>>> be preempted, but once commit starts the haskell scheduler is not invoked
>>> until after locks are released.
>>> So best effort had already been made in GHC and I just need to cooperate
>>> better with its design. Then to explain the low CPU utilization (~10%), am
>>> I right to understand it as that upon reading a TVar locked by another
>>> committing tx, a lightweight thread will put itself into `waiting STM` and
>>> descheduled state, so the CPUs can only stay idle as not so many threads
>>> are willing to proceed?
>>> Anyway, I see light with better data structures to improve my situation,
>>> let me try them and report back. Actually I later changed `TVar (HaskMap k
>>> v)` to be `TVar (HashMap k Int)` where the `Int` being array index into
>>> `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order preservation
>>> semantic of dict entries (like that in Python 3.7+), then it's very hopeful
>>> to incorporate stm-containers' Map or ttrie to approach free of contention.
>>> Thanks with regards,
>>> Compl
>>> On 2020/7/24 δΈ‹εˆ10:03, Ryan Yates wrote:
>>> Hi Compl,
>>> Having a pool of transaction processing threads can be helpful in a
>>> certain way.  If the body of the transaction takes more time to execute
>>> then the Haskell thread is allowed and it yields, the suspended thread
>>> won't get in the way of other thread, but when it is rescheduled, will have
>>> a low probability of success.  Even worse, it will probably not discover
>>> that it is doomed to failure until commit time.  If transactions are more
>>> likely to reach commit without yielding, they are more likely to succeed.
>>> If the transactions are not conflicting, it doesn't make much difference
>>> other than cache churn.
>>> The Haskell capability that is committing a transaction will not yield
>>> to another Haskell thread while it is doing the commit.  The OS thread may
>>> be preempted, but once commit starts the haskell scheduler is not invoked
>>> until after locks are released.
>>> To get good performance from STM you must pay attention to what TVars
>>> are involved in a commit.  All STM systems are working under the assumption
>>> of low contention, so you want to minimize "false" conflicts (conflicts
>>> that are not essential to the computation).    Something like `TVar
>>> (HashMap k v)` will work pretty well for a low thread count, but every
>>> transaction that writes to that structure will be in conflict with every
>>> other transaction that accesses it.  Pushing the `TVar` into the nodes of
>>> the structure reduces the possibilities for conflict, while increasing the
>>> amount of bookkeeping STM has to do.  I would like to reduce the cost of
>>> that bookkeeping using better structures, but we need to do so without
>>> harming performance in the low TVar count case.  Right now it is optimized
>>> for good cache performance with a handful of TVars.
>>> There is another way to play with performance by moving work into and
>>> out of the transaction body.  A transaction body that executes quickly will
>>> reach commit faster.  But it may be delaying work that moves into another
>>> transaction.  Forcing values at the right time can make a big difference.
>>> Ryan
>>> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe <
>>> haskell-cafe at haskell.org> wrote:
>>>> Thanks Chris, I confess I didn't pay enough attention to STM
>>>> specialized container libraries by far, I skimmed through the description
>>>> of stm-containers and ttrie, and feel they would definitely improve my
>>>> code's performance in case I limit the server's parallelism within hardware
>>>> capabilities. That may because I'm still prototyping the api and
>>>> infrastructure for correctness, so even `TVar (HashMap k v)` performs okay
>>>> for me at the moment, only if at low contention (surely there're plenty of
>>>> CPU cycles to be optimized out in next steps). I model my data after graph
>>>> model, so most data, even most indices are localized to nodes and edges,
>>>> those can be manipulated without conflict, that's why I assumed I have a
>>>> low contention use case since the very beginning - until I found there are
>>>> still (though minor) needs for global indices to guarantee global
>>>> uniqueness, I feel faithful with stm-containers/ttrie to implement a more
>>>> scalable global index data structure, thanks for hinting me.
>>>> So an evident solution comes into my mind now, is to run the server
>>>> with a pool of tx processing threads, matching number of CPU cores, client
>>>> RPC requests then get queued to be executed in some thread from the pool.
>>>> But I'm really fond of the mechanism of M:N scheduler which solves
>>>> massive/dynamic concurrency so elegantly. I had some good result with Go in
>>>> this regard, and see GHC at par in doing this, I don't want to give up this
>>>> enjoyable machinery.
>>>> But looked at the stm implementation in GHC, it seems written TVars are
>>>> exclusively locked during commit of a tx, I suspect this is the culprit
>>>> when there're large M lightweight threads scheduled upon a small N hardware
>>>> capabilities, that is when a lightweight thread yield control during an stm
>>>> transaction commit, the TVars it locked will stay so until it's scheduled
>>>> again (and again) till it can finish the commit. This way, descheduled
>>>> threads could hold live threads from progressing. I haven't gone into more
>>>> details there, but wonder if there can be some improvement for GHC RTS to
>>>> keep an stm committing thread from descheduled, but seemingly that may
>>>> impose more starvation potential; or stm can be improved to have its TVar
>>>> locks preemptable when the owner trec/thread is in descheduled state?
>>>> Neither should be easy but I'd really love massive lightweight threads
>>>> doing STM practically well.
>>>> Best regards,
>>>> Compl
>>>> On 2020/7/24 上午12:57, Christopher Allen wrote:
>>>> It seems like you know how to run practical tests for tuning thread
>>>> count and contention for throughput. Part of the reason you haven't gotten
>>>> a super clear answer is "it depends." You give up fairness when you use STM
>>>> instead of MVars or equivalent structures. That means a long running
>>>> transaction might get stampeded by many small ones invalidating it over and
>>>> over. The long-running transaction might never clear if the small
>>>> transactions keep moving the cheese. I mention this because transaction
>>>> runtime and size and count all affect throughput and latency. What might be
>>>> ideal for one pattern of work might not be ideal for another. Optimizing
>>>> for overall throughput might make the contention and fairness problems
>>>> worse too. I've done practical tests to optimize this in the past, both for
>>>> STM in Haskell and for RDBMS workloads.
>>>> The next step is sometimes figuring out whether you really need a data
>>>> structure within a single STM container or if perhaps you can break up your
>>>> STM container boundaries into zones or regions that roughly map onto update
>>>> boundaries. That should make the transactions churn less. On the outside
>>>> chance you do need to touch more than one container in a transaction, well,
>>>> they compose.
>>>> e.g. https://hackage.haskell.org/package/stm-containers
>>>> https://hackage.haskell.org/package/ttrie
>>>> It also sounds a bit like your question bumps into Amdahl's Law a bit.
>>>> All else fails, stop using STM and find something more tuned to your
>>>> problem space.
>>>> Hope this helps,
>>>> Chris Allen
>>>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe <
>>>> haskell-cafe at haskell.org> wrote:
>>>>> Hello Cafe,
>>>>> I'm working on an in-memory database, in Client/Server mode I just let
>>>>> each connected client submit remote procedure call running in its dedicated
>>>>> lightweight thread, modifying TVars in RAM per its business needs, then in
>>>>> case many clients connected concurrently and trying to insert new data, if
>>>>> they are triggering global index (some TVar) update, the throughput would
>>>>> drop drastically. I reduced the shared state to a simple int counter by
>>>>> TVar, got same symptom. While the parallelism feels okay when there's no
>>>>> hot TVar conflicting, or M is not much greater than N.
>>>>> As an empirical test workload, I have a `+RTS -N10` server process, it
>>>>> handles 10 concurrent clients okay, got ~5x of single thread throughput;
>>>>> but in handling 20 concurrent clients, each of the 10 CPUs can only be
>>>>> driven to ~10% utilization, the throughput seems even worse than single
>>>>> thread. More clients can even drive it thrashing without much  progressing.
>>>>>  I can understand that pure STM doesn't scale well after reading [1],
>>>>> and I see it suggested [7] attractive and planned future work toward that
>>>>> direction.
>>>>> But I can't find certain libraries or frameworks addressing large M
>>>>> over small N scenarios, [1] experimented with designated N parallelism, and
>>>>> [7] is rather theoretical to my empirical needs.
>>>>> Can you direct me to some available library implementing the
>>>>> methodology proposed in [7] or other ways tackling this problem?
>>>>> I think the most difficult one is that a transaction should commit
>>>>> with global indices (with possibly unique constraints) atomically updated,
>>>>> and rollback with any violation of constraints, i.e. transactions have to
>>>>> cover global states like indices. Other problems seem more trivial than
>>>>> this.
>>>>> Specifically, [7] states:
>>>>> > It must be emphasized that all of the mechanisms we deploy
>>>>> originate, in one form or another, in the database literature from the 70s
>>>>> and 80s. Our contribution is to adapt these techniques to software
>>>>> transactional memory, providing more effective solutions to important STM
>>>>> problems than prior proposals.
>>>>> I wonder any STM based library has simplified those techniques to be
>>>>> composed right away? I don't really want to implement those mechanisms by
>>>>> myself, rebuilding many wheels from scratch.
>>>>> Best regards,
>>>>> Compl
>>>>> [1] Comparing the performance of concurrent linked-list
>>>>> implementations in Haskell
>>>>> https://simonmar.github.io/bib/papers/concurrent-data.pdf
>>>>> [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology
>>>>> for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages
>>>>> 207–216. ACM Press, 2008.
>>>>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf
>>>>> _______________________________________________
>>>>> Haskell-Cafe mailing list
>>>>> To (un)subscribe, modify options or view archives go to:
>>>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>>>>> Only members subscribed via the mailman list are allowed to post.
>>>> --
>>>> Chris Allen
>>>> Currently working on http://haskellbook.com
>>>> _______________________________________________
>>>> Haskell-Cafe mailing list
>>>> To (un)subscribe, modify options or view archives go to:
>>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>>>> Only members subscribed via the mailman list are allowed to post.
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20200725/cb0d6d9a/attachment-0001.html>

More information about the Haskell-Cafe mailing list