[Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?
compl.yue at icloud.com
Sat Jul 25 06:04:27 UTC 2020
Shame on me for I have neither experienced with `perf`, I'd learn these
essential tools soon to put them into good use.
It's great to learn about how `orElse` actually works, I did get
confused why there are so little retries captured, and now I know. So
that little trick should definitely be removed before going production,
as it does no much useful things at excessive cost. I put it there to
help me understand internal working of stm, now I get even better
I think a debugger will trap every single abort, isn't it annoying when
many aborts would occur? If I'd like to count the number of aborts,
ideally accounted per service endpoints, time periods, source modules
etc. there some tricks for that?
Thanks with best regards,
On 2020/7/25 上午2:02, Ryan Yates wrote:
> To be clear, I was trying to refer to Linux `perf` [^1]. Sampling
> based profiling can do a good job with concurrent and parallel
> programs where other methods are problematic. For instance,
> changing the size of heap objects can drastically change cache
> performance and completely different behavior can show up.
> [^1]: https://en.wikipedia.org/wiki/Perf_(Linux)
> The spinning in `readTVar` should always be very short and it
> typically shows up as intensive CPU use, though it may not be high
> energy use with `pause` in the loop on x86 (looks like we don't have
> it [^2], I thought we did, but maybe that was only in some of my code... )
> [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275
> All that to say, I doubt that you are spending much time spinning (but
> it would certainly be interesting to know if you are! You would see
> `perf` attribute a large amount of time to `read_current_value`). The
> amount of code to execute for commit (the time when locks are held) is
> always much shorter than it takes to execute the transaction body. As
> you add more conflicting threads this gets worse of course as commits
> The code you have will count commits of executions of `retry`. Note
> that `retry` is a user level idea, that is, you are counting user
> level *explicit* retries. This is different from a transaction
> failing to commit and starting again. These are invisible to the
> user. Also using your trace will convert `retry` from the efficient
> wake on write implementation, to an active retry that will always
> attempt again. We don't have cheap logging of transaction aborts in
> GHC, but I have built such logging in my work. You can observe these
> aborts with a debugger by looking for execution of this line:
> On Fri, Jul 24, 2020 at 12:35 PM Compl Yue <compl.yue at icloud.com
> <mailto:compl.yue at icloud.com>> wrote:
> I'm not familiar with profiling GHC yet, may need more time to get
> myself proficient with it.
> And a bit more details of my test workload for diagnostic: the db
> clients are Python processes from a cluster of worker nodes,
> consulting the db server to register some path for data files,
> under a data dir within a shared filesystem, then mmap those data
> files and fill in actual array data. So the db server don't have
> much computation to perform, but puts the data file path into a
> global index, which at the same validates its uniqueness. As there
> are many client processes trying to insert one meta data record
> concurrently, with my naive implementation, the global index's
> TVar will almost always in locked state by one client after
> another, from a queue never fall empty.
> So if `readTVar` should spinning waiting, I doubt the threads
> should actually make high CPU utilization, because at any instant
> of time, all threads except the committing one will be doing that
> one thing.
> And I have something in my code to track STM retry like this:
> -- blocking wait not expected, track stm retries explicitly
> trackSTM:: Int-> IO(Either() a)
> trackSTM !rtc = do
> when -- todo increase the threshold of reporting?
> (rtc > 0) $ do
> -- trace out the retries so the end users can be aware of them
> tid <- myThreadId
> ( "🔙\n"
> <> show callCtx
> <> "🌀 "
> <> show tid
> <> " stm retry #"
> <> show rtc
> $ return ()
> atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case
> Nothing -> -- stm failed, do a tracked retry
> trackSTM (rtc + 1)
> Just ... -> ...
> No such trace msg fires during my test, neither in single thread
> run, nor in runs with pressure. I'm sure this tracing mechanism
> works, as I can see such traces fire, in case e.g. posting a TMVar
> to a TQueue for some other thread to fill it, then read the result
> out, if these 2 ops are composed into a single tx, then of course
> it's infinite retry loop, and a sequence of such msgs are logged
> with ever increasing rtc #.
> So I believe no retry has ever been triggered.
> What can going on there?
> On 2020/7/24 下午11:46, Ryan Yates wrote:
>> > Then to explain the low CPU utilization (~10%), am I right to
>> understand it as that upon reading a TVar locked by another
>> committing tx, a lightweight thread will put itself into `waiting
>> STM` and descheduled state, so the CPUs can only stay idle as not
>> so many threads are willing to proceed?
>> Since the commit happens in finite steps, the expectation is that
>> the lock will be released very soon. Given this when the body of
>> a transaction executes `readTVar` it spins (active CPU!) until
>> the `TVar` is observed unlocked. If a lock is observed while
>> commiting, it immediately starts the transaction again from the
>> beginning. To get the behavior of suspending a transaction you
>> have to successfully commit a transaction that executed `retry`.
>> Then the transaction is put on the wakeup lists of its read set
>> and subsequent commits will wake it up if its write set overlaps.
>> I don't think any of these things would explain low CPU
>> utilization. You could try running with `perf` and see if lots
>> of time is spent in some recognizable part of the RTS.
>> On Fri, Jul 24, 2020 at 11:22 AM Compl Yue <compl.yue at icloud.com
>> <mailto:compl.yue at icloud.com>> wrote:
>> Thanks very much for the insightful information Ryan! I'm
>> glad my suspect was wrong about the Haskell scheduler:
>> > The Haskell capability that is committing a transaction
>> will not yield to another Haskell thread while it is doing
>> the commit. The OS thread may be preempted, but once commit
>> starts the haskell scheduler is not invoked until after locks
>> are released.
>> So best effort had already been made in GHC and I just need
>> to cooperate better with its design. Then to explain the low
>> CPU utilization (~10%), am I right to understand it as that
>> upon reading a TVar locked by another committing tx, a
>> lightweight thread will put itself into `waiting STM` and
>> descheduled state, so the CPUs can only stay idle as not so
>> many threads are willing to proceed?
>> Anyway, I see light with better data structures to improve my
>> situation, let me try them and report back. Actually I later
>> changed `TVar (HaskMap k v)` to be `TVar (HashMap k Int)`
>> where the `Int` being array index into `TVar (Vector (TVar
>> (Maybe v)))`, in pursuing insertion order preservation
>> semantic of dict entries (like that in Python 3.7+), then
>> it's very hopeful to incorporate stm-containers' Map or ttrie
>> to approach free of contention.
>> Thanks with regards,
>> On 2020/7/24 下午10:03, Ryan Yates wrote:
>>> Hi Compl,
>>> Having a pool of transaction processing threads can be
>>> helpful in a certain way. If the body of the transaction
>>> takes more time to execute then the Haskell thread is
>>> allowed and it yields, the suspended thread won't get in the
>>> way of other thread, but when it is rescheduled, will have a
>>> low probability of success. Even worse, it will probably
>>> not discover that it is doomed to failure until commit
>>> time. If transactions are more likely to reach commit
>>> without yielding, they are more likely to succeed. If the
>>> transactions are not conflicting, it doesn't make much
>>> difference other than cache churn.
>>> The Haskell capability that is committing a transaction will
>>> not yield to another Haskell thread while it is doing the
>>> commit. The OS thread may be preempted, but once commit
>>> starts the haskell scheduler is not invoked until after
>>> locks are released.
>>> To get good performance from STM you must pay attention to
>>> what TVars are involved in a commit. All STM systems are
>>> working under the assumption of low contention, so you want
>>> to minimize "false" conflicts (conflicts that are not
>>> essential to the computation). Something like `TVar
>>> (HashMap k v)` will work pretty well for a low thread count,
>>> but every transaction that writes to that structure will be
>>> in conflict with every other transaction that accesses it.
>>> Pushing the `TVar` into the nodes of the structure reduces
>>> the possibilities for conflict, while increasing the amount
>>> of bookkeeping STM has to do. I would like to reduce the
>>> cost of that bookkeeping using better structures, but we
>>> need to do so without harming performance in the low TVar
>>> count case. Right now it is optimized for good cache
>>> performance with a handful of TVars.
>>> There is another way to play with performance by moving work
>>> into and out of the transaction body. A transaction body
>>> that executes quickly will reach commit faster. But it may
>>> be delaying work that moves into another transaction.
>>> Forcing values at the right time can make a big difference.
>>> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe
>>> <haskell-cafe at haskell.org <mailto:haskell-cafe at haskell.org>>
>>> Thanks Chris, I confess I didn't pay enough attention to
>>> STM specialized container libraries by far, I skimmed
>>> through the description of stm-containers and ttrie, and
>>> feel they would definitely improve my code's performance
>>> in case I limit the server's parallelism within hardware
>>> capabilities. That may because I'm still prototyping the
>>> api and infrastructure for correctness, so even `TVar
>>> (HashMap k v)` performs okay for me at the moment, only
>>> if at low contention (surely there're plenty of CPU
>>> cycles to be optimized out in next steps). I model my
>>> data after graph model, so most data, even most indices
>>> are localized to nodes and edges, those can be
>>> manipulated without conflict, that's why I assumed I
>>> have a low contention use case since the very beginning
>>> - until I found there are still (though minor) needs for
>>> global indices to guarantee global uniqueness, I feel
>>> faithful with stm-containers/ttrie to implement a more
>>> scalable global index data structure, thanks for hinting me.
>>> So an evident solution comes into my mind now, is to run
>>> the server with a pool of tx processing threads,
>>> matching number of CPU cores, client RPC requests then
>>> get queued to be executed in some thread from the pool.
>>> But I'm really fond of the mechanism of M:N scheduler
>>> which solves massive/dynamic concurrency so elegantly. I
>>> had some good result with Go in this regard, and see GHC
>>> at par in doing this, I don't want to give up this
>>> enjoyable machinery.
>>> But looked at the stm implementation in GHC, it seems
>>> written TVars are exclusively locked during commit of a
>>> tx, I suspect this is the culprit when there're large M
>>> lightweight threads scheduled upon a small N hardware
>>> capabilities, that is when a lightweight thread yield
>>> control during an stm transaction commit, the TVars it
>>> locked will stay so until it's scheduled again (and
>>> again) till it can finish the commit. This way,
>>> descheduled threads could hold live threads from
>>> progressing. I haven't gone into more details there, but
>>> wonder if there can be some improvement for GHC RTS to
>>> keep an stm committing thread from descheduled, but
>>> seemingly that may impose more starvation potential; or
>>> stm can be improved to have its TVar locks preemptable
>>> when the owner trec/thread is in descheduled state?
>>> Neither should be easy but I'd really love massive
>>> lightweight threads doing STM practically well.
>>> Best regards,
>>> On 2020/7/24 上午12:57, Christopher Allen wrote:
>>>> It seems like you know how to run practical tests for
>>>> tuning thread count and contention for throughput. Part
>>>> of the reason you haven't gotten a super clear answer
>>>> is "it depends." You give up fairness when you use STM
>>>> instead of MVars or equivalent structures. That means a
>>>> long running transaction might get stampeded by many
>>>> small ones invalidating it over and over. The
>>>> long-running transaction might never clear if the small
>>>> transactions keep moving the cheese. I mention this
>>>> because transaction runtime and size and count all
>>>> affect throughput and latency. What might be ideal for
>>>> one pattern of work might not be ideal for another.
>>>> Optimizing for overall throughput might make the
>>>> contention and fairness problems worse too. I've done
>>>> practical tests to optimize this in the past, both for
>>>> STM in Haskell and for RDBMS workloads.
>>>> The next step is sometimes figuring out whether you
>>>> really need a data structure within a single STM
>>>> container or if perhaps you can break up your STM
>>>> container boundaries into zones or regions that roughly
>>>> map onto update boundaries. That should make the
>>>> transactions churn less. On the outside chance you do
>>>> need to touch more than one container in a transaction,
>>>> well, they compose.
>>>> e.g. https://hackage.haskell.org/package/stm-containers
>>>> It also sounds a bit like your question bumps into
>>>> Amdahl's Law a bit.
>>>> All else fails, stop using STM and find something more
>>>> tuned to your problem space.
>>>> Hope this helps,
>>>> Chris Allen
>>>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via
>>>> Haskell-Cafe <haskell-cafe at haskell.org
>>>> <mailto:haskell-cafe at haskell.org>> wrote:
>>>> Hello Cafe,
>>>> I'm working on an in-memory database, in
>>>> Client/Server mode I just let each connected client
>>>> submit remote procedure call running in its
>>>> dedicated lightweight thread, modifying TVars in
>>>> RAM per its business needs, then in case many
>>>> clients connected concurrently and trying to insert
>>>> new data, if they are triggering global index (some
>>>> TVar) update, the throughput would drop
>>>> drastically. I reduced the shared state to a simple
>>>> int counter by TVar, got same symptom. While the
>>>> parallelism feels okay when there's no hot TVar
>>>> conflicting, or M is not much greater than N.
>>>> As an empirical test workload, I have a `+RTS -N10`
>>>> server process, it handles 10 concurrent clients
>>>> okay, got ~5x of single thread throughput; but in
>>>> handling 20 concurrent clients, each of the 10 CPUs
>>>> can only be driven to ~10% utilization, the
>>>> throughput seems even worse than single thread.
>>>> More clients can even drive it thrashing without
>>>> much progressing.
>>>> I can understand that pure STM doesn't scale well
>>>> after reading , and I see it suggested 
>>>> attractive and planned future work toward that
>>>> But I can't find certain libraries or frameworks
>>>> addressing large M over small N scenarios, 
>>>> experimented with designated N parallelism, and 
>>>> is rather theoretical to my empirical needs.
>>>> Can you direct me to some available library
>>>> implementing the methodology proposed in  or
>>>> other ways tackling this problem?
>>>> I think the most difficult one is that a
>>>> transaction should commit with global indices (with
>>>> possibly unique constraints) atomically updated,
>>>> and rollback with any violation of constraints,
>>>> i.e. transactions have to cover global states like
>>>> indices. Other problems seem more trivial than this.
>>>> Specifically,  states:
>>>> > It must be emphasized that all of the mechanisms
>>>> we deploy originate, in one form or another, in the
>>>> database literature from the 70s and 80s. Our
>>>> contribution is to adapt these techniques to
>>>> software transactional memory, providing more
>>>> effective solutions to important STM problems than
>>>> prior proposals.
>>>> I wonder any STM based library has simplified those
>>>> techniques to be composed right away? I don't
>>>> really want to implement those mechanisms by
>>>> myself, rebuilding many wheels from scratch.
>>>> Best regards,
>>>>  Comparing the performance of concurrent
>>>> linked-list implementations in Haskell
>>>>  M. Herlihy and E. Koskinen. Transactional
>>>> boosting: a methodology for highly-concurrent
>>>> transactional objects. In Proc. of PPoPP ’08, pages
>>>> 207–216. ACM Press, 2008.
>>>> Haskell-Cafe mailing list
>>>> To (un)subscribe, modify options or view archives
>>>> go to:
>>>> Only members subscribed via the mailman list are
>>>> allowed to post.
>>>> Chris Allen
>>>> Currently working on http://haskellbook.com
>>> Haskell-Cafe mailing list
>>> To (un)subscribe, modify options or view archives go to:
>>> Only members subscribed via the mailman list are allowed
>>> to post.
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Haskell-Cafe