[Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?
Compl Yue
compl.yue at icloud.com
Fri Jul 24 15:22:32 UTC 2020
Thanks very much for the insightful information Ryan! I'm glad my
suspect was wrong about the Haskell scheduler:
> The Haskell capability that is committing a transaction will not
yield to another Haskell thread while it is doing the commit. The OS
thread may be preempted, but once commit starts the haskell scheduler is
not invoked until after locks are released.
So best effort had already been made in GHC and I just need to cooperate
better with its design. Then to explain the low CPU utilization (~10%),
am I right to understand it as that upon reading a TVar locked by
another committing tx, a lightweight thread will put itself into
`waiting STM` and descheduled state, so the CPUs can only stay idle as
not so many threads are willing to proceed?
Anyway, I see light with better data structures to improve my situation,
let me try them and report back. Actually I later changed `TVar (HaskMap
k v)` to be `TVar (HashMap k Int)` where the `Int` being array index
into `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order
preservation semantic of dict entries (like that in Python 3.7+), then
it's very hopeful to incorporate stm-containers' Map or ttrie to
approach free of contention.
Thanks with regards,
Compl
On 2020/7/24 下午10:03, Ryan Yates wrote:
> Hi Compl,
>
> Having a pool of transaction processing threads can be helpful in a
> certain way. If the body of the transaction takes more time to
> execute then the Haskell thread is allowed and it yields, the
> suspended thread won't get in the way of other thread, but when it is
> rescheduled, will have a low probability of success. Even worse, it
> will probably not discover that it is doomed to failure until commit
> time. If transactions are more likely to reach commit without
> yielding, they are more likely to succeed. If the transactions are
> not conflicting, it doesn't make much difference other than cache churn.
>
> The Haskell capability that is committing a transaction will not yield
> to another Haskell thread while it is doing the commit. The OS thread
> may be preempted, but once commit starts the haskell scheduler is not
> invoked until after locks are released.
>
> To get good performance from STM you must pay attention to what TVars
> are involved in a commit. All STM systems are working under the
> assumption of low contention, so you want to minimize "false"
> conflicts (conflicts that are not essential to the computation).
> Something like `TVar (HashMap k v)` will work pretty well for a low
> thread count, but every transaction that writes to that structure will
> be in conflict with every other transaction that accesses it. Pushing
> the `TVar` into the nodes of the structure reduces the possibilities
> for conflict, while increasing the amount of bookkeeping STM has to
> do. I would like to reduce the cost of that bookkeeping using better
> structures, but we need to do so without harming performance in the
> low TVar count case. Right now it is optimized for good cache
> performance with a handful of TVars.
>
> There is another way to play with performance by moving work into and
> out of the transaction body. A transaction body that executes quickly
> will reach commit faster. But it may be delaying work that moves into
> another transaction. Forcing values at the right time can make a big
> difference.
>
> Ryan
>
> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe
> <haskell-cafe at haskell.org <mailto:haskell-cafe at haskell.org>> wrote:
>
> Thanks Chris, I confess I didn't pay enough attention to STM
> specialized container libraries by far, I skimmed through the
> description of stm-containers and ttrie, and feel they would
> definitely improve my code's performance in case I limit the
> server's parallelism within hardware capabilities. That may
> because I'm still prototyping the api and infrastructure for
> correctness, so even `TVar (HashMap k v)` performs okay for me at
> the moment, only if at low contention (surely there're plenty of
> CPU cycles to be optimized out in next steps). I model my data
> after graph model, so most data, even most indices are localized
> to nodes and edges, those can be manipulated without conflict,
> that's why I assumed I have a low contention use case since the
> very beginning - until I found there are still (though minor)
> needs for global indices to guarantee global uniqueness, I feel
> faithful with stm-containers/ttrie to implement a more scalable
> global index data structure, thanks for hinting me.
>
> So an evident solution comes into my mind now, is to run the
> server with a pool of tx processing threads, matching number of
> CPU cores, client RPC requests then get queued to be executed in
> some thread from the pool. But I'm really fond of the mechanism of
> M:N scheduler which solves massive/dynamic concurrency so
> elegantly. I had some good result with Go in this regard, and see
> GHC at par in doing this, I don't want to give up this enjoyable
> machinery.
>
> But looked at the stm implementation in GHC, it seems written
> TVars are exclusively locked during commit of a tx, I suspect this
> is the culprit when there're large M lightweight threads scheduled
> upon a small N hardware capabilities, that is when a lightweight
> thread yield control during an stm transaction commit, the TVars
> it locked will stay so until it's scheduled again (and again) till
> it can finish the commit. This way, descheduled threads could hold
> live threads from progressing. I haven't gone into more details
> there, but wonder if there can be some improvement for GHC RTS to
> keep an stm committing thread from descheduled, but seemingly that
> may impose more starvation potential; or stm can be improved to
> have its TVar locks preemptable when the owner trec/thread is in
> descheduled state? Neither should be easy but I'd really love
> massive lightweight threads doing STM practically well.
>
> Best regards,
>
> Compl
>
>
> On 2020/7/24 上午12:57, Christopher Allen wrote:
>> It seems like you know how to run practical tests for tuning
>> thread count and contention for throughput. Part of the reason
>> you haven't gotten a super clear answer is "it depends." You give
>> up fairness when you use STM instead of MVars or equivalent
>> structures. That means a long running transaction might get
>> stampeded by many small ones invalidating it over and over. The
>> long-running transaction might never clear if the small
>> transactions keep moving the cheese. I mention this because
>> transaction runtime and size and count all affect throughput and
>> latency. What might be ideal for one pattern of work might not be
>> ideal for another. Optimizing for overall throughput might make
>> the contention and fairness problems worse too. I've done
>> practical tests to optimize this in the past, both for STM in
>> Haskell and for RDBMS workloads.
>>
>> The next step is sometimes figuring out whether you really need a
>> data structure within a single STM container or if perhaps you
>> can break up your STM container boundaries into zones or regions
>> that roughly map onto update boundaries. That should make the
>> transactions churn less. On the outside chance you do need to
>> touch more than one container in a transaction, well, they compose.
>>
>> e.g. https://hackage.haskell.org/package/stm-containers
>> https://hackage.haskell.org/package/ttrie
>>
>> It also sounds a bit like your question bumps into Amdahl's Law a
>> bit.
>>
>> All else fails, stop using STM and find something more tuned to
>> your problem space.
>>
>> Hope this helps,
>> Chris Allen
>>
>>
>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe
>> <haskell-cafe at haskell.org <mailto:haskell-cafe at haskell.org>> wrote:
>>
>> Hello Cafe,
>>
>> I'm working on an in-memory database, in Client/Server mode I
>> just let each connected client submit remote procedure call
>> running in its dedicated lightweight thread, modifying TVars
>> in RAM per its business needs, then in case many clients
>> connected concurrently and trying to insert new data, if they
>> are triggering global index (some TVar) update, the
>> throughput would drop drastically. I reduced the shared state
>> to a simple int counter by TVar, got same symptom. While the
>> parallelism feels okay when there's no hot TVar conflicting,
>> or M is not much greater than N.
>>
>> As an empirical test workload, I have a `+RTS -N10` server
>> process, it handles 10 concurrent clients okay, got ~5x of
>> single thread throughput; but in handling 20 concurrent
>> clients, each of the 10 CPUs can only be driven to ~10%
>> utilization, the throughput seems even worse than single
>> thread. More clients can even drive it thrashing without much
>> progressing.
>>
>> I can understand that pure STM doesn't scale well after
>> reading [1], and I see it suggested [7] attractive and
>> planned future work toward that direction.
>>
>> But I can't find certain libraries or frameworks addressing
>> large M over small N scenarios, [1] experimented with
>> designated N parallelism, and [7] is rather theoretical to my
>> empirical needs.
>>
>> Can you direct me to some available library implementing the
>> methodology proposed in [7] or other ways tackling this problem?
>>
>> I think the most difficult one is that a transaction should
>> commit with global indices (with possibly unique constraints)
>> atomically updated, and rollback with any violation of
>> constraints, i.e. transactions have to cover global states
>> like indices. Other problems seem more trivial than this.
>>
>> Specifically, [7] states:
>>
>> > It must be emphasized that all of the mechanisms we deploy
>> originate, in one form or another, in the database literature
>> from the 70s and 80s. Our contribution is to adapt these
>> techniques to software transactional memory, providing more
>> effective solutions to important STM problems than prior
>> proposals.
>>
>> I wonder any STM based library has simplified those
>> techniques to be composed right away? I don't really want to
>> implement those mechanisms by myself, rebuilding many wheels
>> from scratch.
>>
>> Best regards,
>> Compl
>>
>>
>> [1] Comparing the performance of concurrent linked-list
>> implementations in Haskell
>> https://simonmar.github.io/bib/papers/concurrent-data.pdf
>>
>> [7] M. Herlihy and E. Koskinen. Transactional boosting: a
>> methodology for highly-concurrent transactional objects. In
>> Proc. of PPoPP ’08, pages 207–216. ACM Press, 2008.
>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf
>>
>> _______________________________________________
>> Haskell-Cafe mailing list
>> To (un)subscribe, modify options or view archives go to:
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>> Only members subscribed via the mailman list are allowed to post.
>>
>>
>>
>> --
>> Chris Allen
>> Currently working on http://haskellbook.com
> _______________________________________________
> Haskell-Cafe mailing list
> To (un)subscribe, modify options or view archives go to:
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> Only members subscribed via the mailman list are allowed to post.
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20200724/67476b77/attachment.html>
More information about the Haskell-Cafe
mailing list