Hi Cafe and Ryan,

I tried Map/Set from stm-containers, and TSkipList (to which I added a range-scan API against its internal data structure) from http://hackage.haskell.org/package/tskiplist ; with them I got much improved scalability under concurrency.

But unfortunately I then hit another wall: single-thread scalability over working-memory size. I suspect it's because massively more TVars (each being a pointer per se) are introduced by those "contention-free" data structures; they have to mutate separate pointers concurrently to avoid contention, but such a pointer-intensive heap seems to put extraordinary pressure on GHC's garbage collector, to the point that GC dominates CPU utilization with poor business progress.

For example, in my test I use `+RTS -H2g` for the Haskell server process, so GC isn't triggered until after a while. I then spin off 3 Python clients to insert new records concurrently. In the first stage each Python process happily takes ~90% CPU filling (through local mmap) the arrays allocated from the server, and success logs scroll quickly, while the server process uses only 30~40% CPU to serve those 3 clients (it merely inserts metadata records into unique indices). Then the clients' CPU utilization drops drastically once the Haskell server process's private memory reaches around 2 GB, i.e. once GC starts engaging: the server's CPU utilization quickly approaches ~300%, while all client processes drop to 0% most of the time, with only occasional tiny bursts of log output showing progress.
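A quick way to confirm that GC really dominates is to sample the RTS stats from inside the server. A minimal sketch, assuming the process runs with `+RTS -T` so the stats are populated (names here are illustrative, not from the actual server code):

```
-- sketch: report what share of CPU time the RTS spent in GC
-- (requires running with +RTS -T; all names illustrative)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

reportGcShare :: IO ()
reportGcShare = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "re-run with +RTS -T to enable RTS stats"
    else do
      s <- getRTSStats
      let share = fromIntegral (gc_cpu_ns s)
                / fromIntegral (max 1 (cpu_ns s)) :: Double
      putStrLn $  "major GCs: " <> show (major_gcs s)
               <> ", GC share of CPU: "
               <> show (round (share * 100) :: Int) <> "%"
```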
Lately I disabled the parallel GC; enabling it only makes things worse.

If I comment out the code updating the indices (the code that creates many TVars), overall throughput drops only slowly as more data is inserted, and the parallelism feels steady even after the server process's private memory reaches several GB.

I didn't expect this, but it appears to me that GHC's GC is really not good at handling a massive number of pointers in the heap, while those pointers are essential to reducing contention (and maybe expensive data copying too) under heavy parallelism/concurrency.

Btw I tried `+RTS -xn` (the non-moving GC) with GHC 8.10.1 too: no obviously different behavior compared to 8.8.3. I also tried tweaking GC-related RTS options a bit, including increasing -G up to 10, with not much difference either.

I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime...

Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html while searching for these symptoms, and I don't think I can convert my DB-managed data into immutable types so as to fit into compact regions; that seems unlikely for a live in-memory database instance.

So it seems there are good reasons that no successful DBMS, at least no in-memory one, has been written in Haskell.

Best regards,
Compl


On 2020-07-25, at 22:07, Ryan Yates <fryguybob@gmail.com> wrote:

Unfortunately my STM benchmarks are rather disorganized.
The most relevant paper using them is:

Leveraging hardware TM in Haskell (PPoPP '19)
https://dl.acm.org/doi/10.1145/3293883.3295711

Or my thesis:
https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931

The PPoPP benchmarks are on a branch (or the releases tab on GitHub):
https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src

All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited.

Ryan


On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe <haskell-cafe@haskell.org> wrote:
Dear Cafe,

As Chris Allen suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of a traditional HashMap for STM tx processing under heavy concurrency, yet still with the automatic parallelism GHC gives them. Then I realized that in addition to the hash map (used to implement dicts and scopes), I also need a TreeMap-like replacement data structure to implement the db index. I've been focusing on the uniqueness-constraint aspect, but it's still an index, and it needs to provide a range-scan API for db clients, so a hash map is not sufficient for the index.

I see Ryan shared code benchmarking an RBTree with STM in mind:

https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput
https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree

But I can't find a conclusion or interpretation of that benchmark suite. And here's a followup question:

Are there any STM contention-optimized data structures that keep keys ordered, with a sub-range traversal API? (Production-ready libraries would of course be most desirable.)

Thanks with regards,
Compl
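For concreteness, here is a sketch of the range-scan shape being asked for, over a single coarse `TVar` holding an ordered `Data.Map` (this is exactly the contention-prone design a better library would replace; `Index` and `scanRange` are hypothetical names, and `takeWhileAntitone`/`dropWhileAntitone` need containers >= 0.5.8):

```
import Control.Concurrent.STM
import qualified Data.Map.Strict as Map

-- hypothetical ordered index with an inclusive range-scan API;
-- the single spine TVar is what contends under concurrent writers
newtype Index k v = Index (TVar (Map.Map k v))

scanRange :: Ord k => Index k v -> k -> k -> STM [(k, v)]
scanRange (Index tv) lo hi = do
  m <- readTVar tv
  pure . Map.toAscList
       . Map.takeWhileAntitone (<= hi)
       $ Map.dropWhileAntitone (< lo) m
```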
On 2020/7/25 at 2:04 PM, Compl Yue via Haskell-Cafe wrote:
Shame on me, for I have no experience with `perf` either; I'll learn these essential tools soon and put them to good use.

It's great to learn how `orElse` actually works. I did get confused about why so few retries were captured, and now I know. So that little trick should definitely be removed before going to production, as it does little useful work at excessive cost. I put it there to help me understand the internal workings of STM, and now I have even better knowledge ;-)

I think a debugger will trap every single abort; isn't that annoying when many aborts occur? If I'd like to count the number of aborts, ideally accounted per service endpoint, time period, source module etc., are there some tricks for that?

Thanks with best regards,
Compl
On 2020/7/25 at 2:02 PM, Ryan Yates wrote:
To be clear, I was trying to refer to Linux `perf` [^1]. Sampling-based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance, changing the size of heap objects can drastically change cache performance and completely different behavior can show up.

[^1]: https://en.wikipedia.org/wiki/Perf_(Linux)

The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code...)

[^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275

All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence.

The code you have will count commits of executions of `retry`. Note that `retry` is a user-level idea, that is, you are counting user-level *explicit* retries. This is different from a transaction failing to commit and starting again; those are invisible to the user. Also, using your trace will convert `retry` from the efficient wake-on-write implementation to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line:

https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123

Ryan
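To make the `orElse` point concrete, here are the two shapes side by side (a sketch; `stmJob` stands for any transaction body):

```
import Control.Concurrent.STM

-- wake-on-write: a committed `retry` parks the thread on the TVars it
-- read, and a later commit that writes any of them wakes it up
blockingRun :: STM a -> IO a
blockingRun stmJob = atomically stmJob

-- what wrapping the body in `orElse` does instead: the `retry` is
-- caught, the transaction commits with Nothing, and the loop actively
-- re-runs the body rather than sleeping
activeRun :: STM a -> IO a
activeRun stmJob = do
  r <- atomically $ (Just <$> stmJob) `orElse` return Nothing
  case r of
    Nothing -> activeRun stmJob
    Just a  -> return a
```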
On Fri, Jul 24, 2020 at 12:35 PM Compl Yue <compl.yue@icloud.com> wrote:
I'm not familiar with profiling GHC yet; I may need more time to get proficient with it.

And a bit more detail on my test workload, for diagnosis: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files under a data dir within a shared filesystem, then mmap'ing those data files and filling in the actual array data. So the db server doesn't have much computation to perform; it just puts each data file path into a global index, which at the same time validates its uniqueness. As there are many client processes trying to insert one metadata record concurrently, with my naive implementation the global index's TVar will almost always be in a locked state, held by one client after another, from a queue that never falls empty.

So if `readTVar` spin-waits, I doubt the threads would actually show high CPU utilization, because at any instant of time all threads except the committing one will be doing that one thing.

And I have something in my code to track STM retries explicitly, like this:

```
-- blocking wait not expected, track stm retries explicitly
trackSTM :: Int -> IO (Either () a)
trackSTM !rtc = do
  when -- todo increase the threshold of reporting?
      (rtc > 0) $ do
    -- trace out the retries so the end users can be aware of them
    tid <- myThreadId
    trace
        ( "π\n"
            <> show callCtx
            <> "π "
            <> show tid
            <> " stm retry #"
            <> show rtc
        )
      $ return ()
  atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case
    Nothing -> -- stm failed, do a tracked retry
      trackSTM (rtc + 1)
    Just ... -> ...
```

No such trace message fires during my test, neither in single-threaded runs nor in runs under pressure. I'm sure this tracing mechanism works, as I can see such traces fire in other cases: e.g. posting a TMVar to a TQueue for some other thread to fill, then reading the result out; if those 2 ops are composed into a single tx, then of course it's an infinite retry loop, and a sequence of such messages gets logged with an ever increasing rtc #.

So I believe no retry has ever been triggered.

What can be going on there?
On 2020/7/24 at 11:46 PM, Ryan Yates wrote:
> Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?

Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this, when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while committing, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set, and subsequent commits will wake it up if its write set overlaps.

I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS.

Ryan
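As a tiny illustration of that suspend-on-`retry` behavior (a sketch, not code from this thread): a blocking read that commits a `retry` when the slot is empty, so the thread is parked on the TVar's wakeup list and sleeps until some writer's commit touches it.

```
import Control.Concurrent.STM

-- blocks (descheduled, zero CPU) until the slot holds a value
waitJust :: TVar (Maybe a) -> STM a
waitJust slot = readTVar slot >>= maybe retry return
```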
On Fri, Jul 24, 2020 at 11:22 AM Compl Yue <compl.yue@icloud.com> wrote:
Thanks very much for the insightful information, Ryan! I'm glad my suspicion about the Haskell scheduler was wrong:

> The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released.

So best effort has already been made in GHC and I just need to cooperate better with its design. Then, to explain the low CPU utilization (~10%), am I right to understand it as: upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` / descheduled state, so the CPUs can only stay idle, as not many threads are willing to proceed?

Anyway, I see light with better data structures to improve my situation; let me try them and report back. Actually I later changed `TVar (HashMap k v)` to `TVar (HashMap k Int)`, where the `Int` is an array index into `TVar (Vector (TVar (Maybe v)))`, in pursuit of insertion-order-preserving semantics for dict entries (like those in Python 3.7+). It then looks very hopeful to incorporate stm-containers' Map or ttrie to approach freedom from contention.

Thanks with regards,
Compl
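A sketch of the insertion-order-preserving structure described above (all names hypothetical; note the two spine TVars still conflict on inserts, only updates to existing keys are contention-free):

```
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- the HashMap gives key lookup, the Vector's slot order is insertion
-- order, and per-slot TVars let updates avoid touching the spines
data OrderedDict k v = OrderedDict
  { odIndex :: TVar (HM.HashMap k Int)
  , odSlots :: TVar (V.Vector (TVar (Maybe v)))
  }

insertOD :: (Eq k, Hashable k) => OrderedDict k v -> k -> v -> STM ()
insertOD (OrderedDict idxV slotsV) k v = do
  idx <- readTVar idxV
  case HM.lookup k idx of
    Just i -> do
      slots <- readTVar slotsV
      writeTVar (slots V.! i) (Just v)  -- touches only this slot's TVar
    Nothing -> do
      slots <- readTVar slotsV
      slot  <- newTVar (Just v)
      writeTVar slotsV (V.snoc slots slot)  -- O(n) copy; fine for a sketch
      writeTVar idxV (HM.insert k (V.length slots) idx)
```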
On 2020/7/24 at 10:03 PM, Ryan Yates wrote:
Hi Compl,

Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute than the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other threads, but when it is rescheduled it will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn.

The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released.

To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars.

There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference.

Ryan
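A minimal sketch of that granularity contrast (type shapes only; names assumed):

```
import Control.Concurrent.STM
import Data.HashMap.Strict (HashMap)

-- coarse: one TVar over the whole map, so any writer's commit conflicts
-- with every transaction that so much as read the map
type CoarseMap k v = TVar (HashMap k v)

-- TVar pushed one level in: everyone still reads the spine, but updating
-- an existing key writes only that entry's own TVar, so value updates on
-- different keys commit without conflict (inserts/deletes still write
-- the spine)
type FineMap k v = TVar (HashMap k (TVar v))
```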
On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe <haskell-cafe@haskell.org> wrote:
Thanks Chris. I confess I haven't paid enough attention to STM-specialized container libraries so far. I skimmed through the descriptions of stm-containers and ttrie, and feel they would definitely improve my code's performance in case I limit the server's parallelism to within hardware capabilities. That may be because I'm still prototyping the API and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, though only under low contention (surely there are plenty of CPU cycles to be optimized away in later steps). I model my data after a graph model, so most data, even most indices, are localized to nodes and edges, which can be manipulated without conflict. That's why I assumed I had a low-contention use case from the very beginning, until I found there are still (though minor) needs for global indices to guarantee global uniqueness. I feel confident that stm-containers/ttrie can be used to implement a more scalable global index data structure; thanks for the hint.

So an evident solution comes to mind now: run the server with a pool of tx processing threads matching the number of CPU cores, with client RPC requests queued to be executed on some thread from the pool. But I'm really fond of the M:N scheduler mechanism, which solves massive/dynamic concurrency so elegantly. I had some good results with Go in this regard, and see GHC on par in doing this; I don't want to give up this enjoyable machinery.

But looking at the STM implementation in GHC, it seems written TVars are exclusively locked during commit of a tx. I suspect this is the culprit when a large number M of lightweight threads is scheduled onto a small number N of hardware capabilities: when a lightweight thread yields control during an STM transaction commit, the TVars it locked will stay so until it's scheduled again (and again) till it can finish the commit. This way, descheduled threads can hold live threads back from progressing. I haven't gone into more detail there, but I wonder if GHC's RTS could be improved to keep an STM-committing thread from being descheduled, though seemingly that may impose more starvation potential; or STM could be improved to have its TVar locks preemptible when the owning trec/thread is in a descheduled state? Neither should be easy, but I'd really love massive lightweight threads doing STM practically well.

Best regards,
Compl
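A minimal sketch of the pool idea above (all names assumed):

```
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever, join, replicateM_)
import GHC.Conc (getNumCapabilities)

-- a fixed pool of tx-processing threads, one per capability; RPC
-- handlers enqueue work instead of running transactions on their own
-- green threads
startTxPool :: IO (TQueue (IO ()))
startTxPool = do
  jobs <- newTQueueIO
  n    <- getNumCapabilities
  replicateM_ n $ forkIO $ forever $
    join (atomically (readTQueue jobs))  -- pop a job, run it
  return jobs
```

A handler would then do `atomically . writeTQueue jobs $ someJob`, waiting on a reply TMVar if it needs the result, rather than calling `atomically` on its own green thread.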
On 2020/7/24 at 12:57 PM, Christopher Allen wrote:
It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." You give up fairness when you use STM instead of MVars or equivalent structures. That means a long-running transaction might get stampeded by many small ones invalidating it over and over. The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads.

The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose.

e.g. https://hackage.haskell.org/package/stm-containers
https://hackage.haskell.org/package/ttrie

It also sounds a bit like your question bumps into Amdahl's Law a bit.

All else fails, stop using STM and find something more tuned to your problem space.

Hope this helps,
Chris Allen
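A sketch of that zoning idea: shard one hot map into fixed sub-maps so transactions on different shards don't conflict (names assumed, not from any of the libraries mentioned):

```
import Control.Concurrent.STM
import Data.Hashable (Hashable, hash)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- K independent TVar'd sub-maps; a transaction touching key k only
-- reads/writes the one shard k hashes to
newtype Sharded k v = Sharded (V.Vector (TVar (HM.HashMap k v)))

newSharded :: Int -> IO (Sharded k v)
newSharded n = Sharded <$> V.replicateM n (newTVarIO HM.empty)

shardFor :: Hashable k => Sharded k v -> k -> TVar (HM.HashMap k v)
shardFor (Sharded shards) k = shards V.! (hash k `mod` V.length shards)

insertS :: (Eq k, Hashable k) => Sharded k v -> k -> v -> STM ()
insertS s k v = modifyTVar' (shardFor s k) (HM.insert k v)
```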
On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe <haskell-cafe@haskell.org> wrote:
Hello Cafe,

I'm working on an in-memory database. In Client/Server mode I just let each connected client submit remote procedure calls running in its dedicated lightweight thread, modifying TVars in RAM per its business needs. Then, when many clients are connected concurrently and trying to insert new data, if they trigger an update of the global index (some TVar), the throughput drops drastically. I reduced the shared state to a simple int counter in a TVar and got the same symptom, while the parallelism feels okay when there's no hot TVar conflicting, or M is not much greater than N.

As an empirical test workload, I have a `+RTS -N10` server process; it handles 10 concurrent clients okay, getting ~5x single-thread throughput. But in handling 20 concurrent clients, each of the 10 CPUs can only be driven to ~10% utilization, and the throughput seems even worse than single-threaded. More clients can even drive it to thrash without much progress.

I can understand that pure STM doesn't scale well after reading [1], and I see [7] suggested as an attractive and planned direction for future work.

But I can't find concrete libraries or frameworks addressing large-M-over-small-N scenarios; [1] experimented with designated N parallelism, and [7] is rather theoretical relative to my empirical needs.

Can you direct me to some available library implementing the methodology proposed in [7], or other ways of tackling this problem?

I think the most difficult part is that a transaction should commit with global indices (with possibly unique constraints) atomically updated, and roll back on any violation of constraints, i.e. transactions have to cover global state like indices. Other problems seem more trivial than this.

Specifically, [7] states:

> It must be emphasized that all of the mechanisms we deploy originate, in one form or another, in the database literature from the 70s and 80s. Our contribution is to adapt these techniques to software transactional memory, providing more effective solutions to important STM problems than prior proposals.

I wonder whether any STM-based library has simplified those techniques so they can be composed right away? I don't really want to implement those mechanisms by myself, rebuilding many wheels from scratch.

Best regards,
Compl


[1] Comparing the performance of concurrent linked-list implementations in Haskell
https://simonmar.github.io/bib/papers/concurrent-data.pdf

[7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In Proc. of PPoPP '08, pages 207-216. ACM Press, 2008.
https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf
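For reference, a self-contained sketch of the reduced counter test described above (a minimal reconstruction, not the actual server code):

```
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forM_, replicateM_)

-- M green threads all bumping one shared TVar: every commit conflicts
-- with every other one, so adding threads adds aborted/re-run
-- transactions rather than throughput
main :: IO ()
main = do
  counter <- newTVarIO (0 :: Int)
  done    <- newTVarIO (0 :: Int)
  let m = 20; bumps = 100000 :: Int
  forM_ [1 .. m] $ \_ -> forkIO $ do
    replicateM_ bumps $ atomically $ modifyTVar' counter (+ 1)
    atomically $ modifyTVar' done (+ 1)
  atomically $ readTVar done >>= check . (== m)  -- wait for all workers
  readTVarIO counter >>= print                   -- expect m * bumps
```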
--
Chris Allen
Currently working on http://haskellbook.com
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.