<div dir="ltr">Hi Compl,<div><br></div><div>There is a lot of overhead with TVars.  My thesis work addresses this by incorporating mutable constructor fields with STM.  I would like to get all that into GHC as soon as I can :D.  I haven't looked closely at the `tskiplist` package, I'll take a look and see if I see any potential issues.  There was some recent work on concurrent B-tree that may be interesting to try.</div><div><br></div><div>Ryan</div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Wed, Jul 29, 2020 at 10:24 AM YueCompl <<a href="mailto:compl.yue@icloud.com">compl.yue@icloud.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div style="overflow-wrap: break-word;">Hi Cafe and Ryan,<div><br></div><div>I tried Map/Set from stm-containers and TSkipList (added range scan api against its internal data structure) from <a href="http://hackage.haskell.org/package/tskiplist" target="_blank">http://hackage.haskell.org/package/tskiplist</a> , with them I've got quite improved at scalability on concurrency. </div><div><br></div><div>But unfortunately then I hit another wall at single thread scalability over working memory size, I suspect it's because massively more TVars (those being pointers per se) are introduced by those "contention-free" data structures, they need to mutate separate pointers concurrently in avoiding contentions anyway, but such pointer-intensive heap seems imposing extraordinary pressure to GHC's garbage collector, that GC will dominate CPU utilization with poor business progress. </div><div><br></div><div>For example in my test, I use `+RTS -H2g` for the Haskell server process, so GC is not triggered until after a while, then spin off 3 Python client to insert new records concurrently, in the first stage each Python process happily taking ~90% CPU filling (through local mmap) the arrays allocated from the server and logs of success scroll quickly, while the server process utilizes only 30~40% CPU to serve those 3 clients (insert meta data records into unique indices merely); then the client processes' CPU utilization drop drastically once Haskell server process' private memory reached around 2gb, i.e. GC started engaging, the server process's CPU utilization quickly approaches ~300%, while all client processes' drop to 0% for most of the time, and occasionally burst a tiny while with some log output showing progress. And I disable parallel GC lately, enabling parallel GC only makes it worse.</div><div><br></div><div>If I comment out the code updating the indices (those creating many TVars), the overall throughput only drop slowly as more data are inserted, the parallelism feels steady even after the server process' private memory takes several GBs.</div><div><br></div><div>I didn't expect this, but appears to me that GC of GHC is really not good at handling massive number of pointers in the heap, while those pointers are essential to reduce contention (and maybe expensive data copying too) at heavy parallelism/concurrency.</div><div><br></div><div>Btw I tried `+RTS -xn` with GHC 8.10.1 too, no obvious different behavior compared to 8.8.3; and also tried tweaking GC related RTS options a bit, including increasing -G up to 10, no much difference too.</div><div><br></div><div>I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime ...</div><div><br></div><div>Btw I read <a href="https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html" target="_blank">https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html</a> in searching about the symptoms, and don't feel likely to convert my DB managed data into immutable types thus to fit into Compact Regions, not quite likely a live in-mem database instance can do.</div><div><br></div><div>So seems there are good reasons no successful DBMS, at least in-memory ones have been written in Haskell.</div><div><br></div><div>Best regards,</div><div>Compl</div><div><br><div><br><blockquote type="cite"><div>On 2020-07-25, at 22:07, Ryan Yates <<a href="mailto:fryguybob@gmail.com" target="_blank">fryguybob@gmail.com</a>> wrote:</div><br><div><div dir="ltr">Unfortunately my STM benchmarks are rather disorganized.  The most relevant paper using them is:<div><br></div><div>Leveraging hardware TM in Haskell (PPoPP '19)<br></div><div><a href="https://dl.acm.org/doi/10.1145/3293883.3295711" target="_blank">https://dl.acm.org/doi/10.1145/3293883.3295711</a></div><div><br></div><div>Or my thesis:</div><div><a href="https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931" target="_blank">https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931</a> <br></div><div><br></div><div> The PPoPP benchmarks are on a branch (or the releases tab on github):</div><div><a href="https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src" target="_blank">https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src</a> </div><div><br></div><div><br></div><div> All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited.</div><div><br></div><div>Ryan</div><div> <br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe <<a href="mailto:haskell-cafe@haskell.org" target="_blank">haskell-cafe@haskell.org</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div><p>Dear Cafe,</p><p>As Chris Allen has suggested, I learned that  <a href="https://hackage.haskell.org/package/stm-containers" target="_blank">https://hackage.haskell.org/package/stm-containers</a>
      and <a href="https://hackage.haskell.org/package/ttrie" target="_blank">https://hackage.haskell.org/package/ttrie</a>
      can help a lot when used in place of traditional HashMap for stm
      tx processing, under heavy concurrency, yet still with automatic
      parallelism as GHC implemented them. Then I realized that in
      addition to hash map (used to implement dicts and scopes), I also
      need to find a TreeMap replacement data structure to implement the
      db index. I've been focusing on the uniqueness constraint aspect,
      but it's still an index, needs to provide range scan api for db
      clients, so hash map is not sufficient for the index.<br>
    </p><p>I see Ryan shared the code benchmarking RBTree with stm in mind:</p><p> <a href="https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput" target="_blank">https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput</a>
      <br>
    </p><p><a href="https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree" target="_blank">https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree</a></p><p>But can't find conclusion or interpretation of that benchmark
      suite. And here's a followup question:</p><p><br>
    </p><p>Where are some STM contention optimized data structures, that
      having keys ordered, with sub-range traversing api ? <br>
    </p><p>(of course production ready libraries most desirable)<br>
    </p><p><br>
    </p><p>Thanks with regards,</p><p>Compl</p><p><br>
    </p>
    <div>On 2020/7/25 下午2:04, Compl Yue via
      Haskell-Cafe wrote:<br>
    </div>
    <blockquote type="cite"><p>Shame on me for I have neither experienced with `perf`, I'd
        learn these essential tools soon to put them into good use.</p><p>It's great to learn about how `orElse` actually works, I did
        get confused why there are so little retries captured, and now I
        know. So that little trick should definitely be removed before
        going production, as it does no much useful things at excessive
        cost. I put it there to help me understand internal working of
        stm, now I get even better knowledge ;-)<br>
      </p><p>I think a debugger will trap every single abort, isn't it
        annoying when many aborts would occur? If I'd like to count the
        number of aborts, ideally accounted per service endpoints, time
        periods, source modules etc. there some tricks for that?</p><p>Thanks with best regards,</p><p>Compl</p><p><br>
      </p>
      <div>On 2020/7/25 上午2:02, Ryan Yates
        wrote:<br>
      </div>
      <blockquote type="cite">
        
        <div dir="ltr">To be clear, I was trying to refer to Linux
          `perf` [^1].  Sampling based profiling can do a good job with
          concurrent and parallel programs where other methods are
          problematic.  For instance,
          <div> changing the size of heap objects can drastically change
            cache performance and completely different behavior can show
            up.</div>
          <div><br>
          </div>
          <div>[^1]: <a href="https://en.wikipedia.org/wiki/Perf_(Linux)" target="_blank">https://en.wikipedia.org/wiki/Perf_(Linux)</a></div>
          <div><br>
          </div>
          <div>The spinning in `readTVar` should always be very short
            and it typically shows up as intensive CPU use, though it
            may not be high energy use with `pause` in the loop on x86
            (looks like we don't have it [^2], I thought we did, but
            maybe that was only in some of my code... )</div>
          <div><br>
          </div>
          <div>[^2]: <a href="https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275" target="_blank">https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275</a> <br>
          </div>
          <div><br>
          </div>
          <div>All that to say, I doubt that you are spending much time
            spinning (but it would certainly be interesting to know if
            you are!  You would see `perf` attribute a large amount of
            time to `read_current_value`).  The amount of code to
            execute for commit (the time when locks are held) is always
            much shorter than it takes to execute the transaction body. 
            As you add more conflicting threads this gets worse of
            course as commits sequence.</div>
          <div><br>
          </div>
          <div>The code you have will count commits of executions of
            `retry`.  Note that `retry` is a user level idea, that is,
            you are counting user level *explicit* retries.  This is
            different from a transaction failing to commit and starting
            again.  These are invisible to the user.  Also using your
            trace will convert `retry` from the efficient wake on write
            implementation, to an active retry that will always attempt
            again.  We don't have cheap logging of transaction aborts in
            GHC, but I have built such logging in my work.  You can
            observe these aborts with a debugger by looking for
            execution of this line:</div>
          <div><br>
          </div>
          <div><a href="https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123" target="_blank">https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123</a></div>
          <div><br>
          </div>
          <div>Ryan </div>
          <div><br>
          </div>
          <div><br>
          </div>
        </div>
        <br>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">On Fri, Jul 24, 2020 at
            12:35 PM Compl Yue <<a href="mailto:compl.yue@icloud.com" target="_blank">compl.yue@icloud.com</a>> wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div><p>I'm not familiar with profiling GHC yet, may need more
                time to get myself proficient with it.</p><p>And a bit more details of my test workload for
                diagnostic: the db clients are Python processes from a
                cluster of worker nodes, consulting the db server to
                register some path for data files, under a data dir
                within a shared filesystem, then mmap those data files
                and fill in actual array data. So the db server don't
                have much computation to perform, but puts the data file
                path into a global index, which at the same validates
                its uniqueness. As there are many client processes
                trying to insert one meta data record concurrently, with
                my naive implementation, the global index's TVar will
                almost always in locked state by one client after
                another, from a queue never fall empty.</p><p>So if `readTVar` should spinning waiting, I doubt the
                threads should actually make high CPU utilization,
                because at any instant of time, all threads except the
                committing one will be doing that one thing.<br>
              </p><p>And I have something in my code to track STM retry like
                this:</p><p>```</p>
              <div style="color:rgb(212,212,212);background-color:rgb(30,30,30);font-weight:normal;font-size:17px;line-height:23px;white-space:pre-wrap">
<div><span style="color:rgb(212,212,212)">    </span><span style="color:rgb(106,153,85)">-- blocking wait not expected, track stm retries explicitly</span></div><div><span style="color:rgb(212,212,212)">    </span><span style="color:rgb(220,220,170)">trackSTM</span><span style="color:rgb(212,212,212)"> :: </span><span style="color:rgb(86,156,214)">Int</span><span style="color:rgb(212,212,212)"> -> </span><span style="color:rgb(86,156,214)">IO</span><span style="color:rgb(212,212,212)"> (</span><span style="color:rgb(86,156,214)">Either</span><span style="color:rgb(212,212,212)"> () </span><span style="color:rgb(156,220,254)">a</span><span style="color:rgb(212,212,212)">)</span></div><div><span style="color:rgb(212,212,212)">    trackSTM !rtc = </span><span style="color:rgb(197,134,192)">do</span></div>
<div><span style="color:rgb(212,212,212)">      when </span><span style="color:rgb(106,153,85)">-- todo increase the threshold of reporting?</span></div><div><span style="color:rgb(212,212,212)">           (rtc > </span><span style="color:rgb(181,206,168)">0</span><span style="color:rgb(212,212,212)">) $ </span><span style="color:rgb(197,134,192)">do</span></div><div><span style="color:rgb(212,212,212)">        </span><span style="color:rgb(106,153,85)">-- trace out the retries so the end users can be aware of them</span></div><div><span style="color:rgb(212,212,212)">        tid <- myThreadId</span></div><div><span style="color:rgb(212,212,212)">        trace</span></div><div><span style="color:rgb(212,212,212)">            (  </span><span style="color:rgb(206,145,120)">"🔙</span><span style="color:rgb(215,186,125)">\n</span><span style="color:rgb(206,145,120)">"</span></div><div><span style="color:rgb(212,212,212)">            <> show callCtx</span></div><div><span style="color:rgb(212,212,212)">            <> </span><span style="color:rgb(206,145,120)">"🌀 "</span></div><div><span style="color:rgb(212,212,212)">            <> show tid</span></div><div><span style="color:rgb(212,212,212)">            <> </span><span style="color:rgb(206,145,120)">" stm retry #"</span></div><div><span style="color:rgb(212,212,212)">            <> show rtc</span></div><div><span style="color:rgb(212,212,212)">            )</span></div><div><span style="color:rgb(212,212,212)">          $ return </span><span style="color:rgb(86,156,214)">()</span></div>
<div><span style="color:rgb(212,212,212)">      atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \</span><span style="color:rgb(197,134,192)">case</span></div><div><span style="color:rgb(212,212,212)">         Nothing -> </span><span style="color:rgb(106,153,85)">-- stm failed, do a tracked retry</span></div><div><span style="color:rgb(212,212,212)">          trackSTM (rtc + </span><span style="color:rgb(181,206,168)">1</span><span style="color:rgb(212,212,212)">)</span></div><div><span style="color:rgb(212,212,212)">        Just ... -> ...</span></div><div><span style="color:rgb(212,212,212)">
</span></div></div><p>```</p><p>No such trace msg fires during my test, neither in
                single thread run, nor in runs with pressure. I'm sure
                this tracing mechanism works, as I can see such traces
                fire, in case e.g. posting a TMVar to a TQueue for some
                other thread to fill it, then read the result out, if
                these 2 ops are composed into a single tx, then of
                course it's infinite retry loop, and a sequence of such
                msgs are logged with ever increasing rtc #.<br>
              </p><p>So I believe no retry has ever been triggered.</p><p>What can going on there?</p><p><br>
              </p>
              <div>On 2020/7/24 下午11:46, Ryan Yates wrote:<br>
              </div>
              <blockquote type="cite">
                <div dir="ltr">
                  <div dir="ltr">
                    <div>> Then to explain the low CPU utilization
                      (~10%), am I right to understand it as that upon
                      reading a TVar locked by another committing tx, a
                      lightweight thread will put itself into `waiting
                      STM` and descheduled state, so the CPUs can only
                      stay idle as not so many threads are willing to
                      proceed?<br>
                    </div>
                    <div><br>
                    </div>
                    <div>Since the commit happens in finite steps, the
                      expectation is that the lock will be released very
                      soon.  Given this when the body of a transaction
                      executes `readTVar` it spins (active CPU!) until
                      the `TVar` is observed unlocked.  If a lock is
                      observed while commiting, it immediately starts
                      the transaction again from the beginning.  To get
                      the behavior of suspending a transaction you have
                      to successfully commit a transaction that executed
                      `retry`.  Then the transaction is put on the
                      wakeup lists of its read set and subsequent
                      commits will wake it up if its write set overlaps.</div>
                    <div><br>
                    </div>
                    <div>I don't think any of these things would explain
                      low CPU utilization.  You could try running with
                      `perf` and see if lots of time is spent in some
                      recognizable part of the RTS.</div>
                    <div><br>
                    </div>
                    <div>Ryan</div>
                  </div>
                  <div><br>
                  </div>
                  <br>
                  <div class="gmail_quote">
                    <div dir="ltr" class="gmail_attr">On Fri, Jul 24,
                      2020 at 11:22 AM Compl Yue <<a href="mailto:compl.yue@icloud.com" target="_blank">compl.yue@icloud.com</a>>
                      wrote:<br>
                    </div>
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                      <div><p>Thanks very much for the insightful
                          information Ryan! I'm glad my suspect was
                          wrong about the Haskell scheduler:<br>
                        </p><p>> The Haskell capability that is
                          committing a transaction will not yield to
                          another Haskell thread while it is doing the
                          commit.  The OS thread may be preempted, but
                          once commit starts the haskell scheduler is
                          not invoked until after locks are released.</p>
                        <div>So best effort had already been made in GHC
                          and I just need to cooperate better with its
                          design. Then to explain the low CPU
                          utilization (~10%), am I right to understand
                          it as that upon reading a TVar locked by
                          another committing tx, a lightweight thread
                          will put itself into `waiting STM` and
                          descheduled state, so the CPUs can only stay
                          idle as not so many threads are willing to
                          proceed?<br>
                        </div>
                        <div><br>
                        </div>
                        <div>Anyway, I see light with better data
                          structures to improve my situation, let me try
                          them and report back. Actually I later changed
                          `TVar (HaskMap k v)` to be `TVar (HashMap k
                          Int)` where the `Int` being array index into
                          `TVar (Vector (TVar (Maybe v)))`, in pursuing
                          insertion order preservation semantic of dict
                          entries (like that in Python 3.7+), then it's
                          very hopeful to incorporate stm-containers'
                          Map or ttrie to approach free of contention.<br>
                        </div><p>Thanks with regards,</p><p>Compl</p><p><br>
                        </p>
                        <div>On 2020/7/24 下午10:03, Ryan Yates wrote:<br>
                        </div>
                        <blockquote type="cite">
                          <div dir="ltr">
                            <div>Hi Compl,</div>
                            <div><br>
                            </div>
                            <div>Having a pool of transaction processing
                              threads can be helpful in a certain way. 
                              If the body of the transaction takes more
                              time to execute then the Haskell thread is
                              allowed and it yields, the
                              suspended thread won't get in the way of
                              other thread, but when it is rescheduled,
                              will have a low probability of success. 
                              Even worse, it will probably not discover
                              that it is doomed to failure until commit
                              time.  If transactions are more likely to
                              reach commit without yielding, they are
                              more likely to succeed.  If the
                              transactions are not conflicting, it
                              doesn't make much difference other than
                              cache churn.</div>
                            <div><br>
                            </div>
                            <div>The Haskell capability that is
                              committing a transaction will not yield to
                              another Haskell thread while it is doing
                              the commit.  The OS thread may be
                              preempted, but once commit starts the
                              haskell scheduler is not invoked until
                              after locks are released.</div>
                            <div><br>
                            </div>
                            <div>To get good performance from STM you
                              must pay attention to what TVars are
                              involved in a commit.  All STM systems are
                              working under the assumption of low
                              contention, so you want to minimize
                              "false" conflicts (conflicts that are not
                              essential to the computation).   
                              Something like `TVar (HashMap k v)` will
                              work pretty well for a low thread count,
                              but every transaction that writes to that
                              structure will be in conflict with every
                              other transaction that accesses it. 
                              Pushing the `TVar` into the nodes of the
                              structure reduces the possibilities for
                              conflict, while increasing the amount of
                              bookkeeping STM has to do.  I would like
                              to reduce the cost of that bookkeeping
                              using better structures, but we need to do
                              so without harming performance in the low
                              TVar count case.  Right now it is
                              optimized for good cache performance with
                              a handful of TVars.</div>
                            <div><br>
                            </div>
                            <div>There is another way to play with
                              performance by moving work into and out of
                              the transaction body.  A transaction body
                              that executes quickly will reach commit
                              faster.  But it may be delaying work that
                              moves into another transaction.  Forcing
                              values at the right time can make a big
                              difference.</div>
                            <div><br>
                            </div>
                            <div>Ryan</div>
                            <div dir="ltr"><br>
                            </div>
                            <div dir="ltr">On Fri, Jul 24, 2020 at 2:14
                              AM Compl Yue via Haskell-Cafe <<a href="mailto:haskell-cafe@haskell.org" target="_blank">haskell-cafe@haskell.org</a>>
                              wrote:<br>
                            </div>
                            <div class="gmail_quote">
                              <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                <div><p>Thanks Chris, I confess I didn't
                                    pay enough attention to STM
                                    specialized container libraries by
                                    far, I skimmed through the
                                    description of stm-containers and
                                    ttrie, and feel they would
                                    definitely improve my code's
                                    performance in case I limit the
                                    server's parallelism within hardware
                                    capabilities. That may because I'm
                                    still prototyping the api and
                                    infrastructure for correctness, so
                                    even `TVar (HashMap k v)` performs
                                    okay for me at the moment, only if
                                    at low contention (surely there're
                                    plenty of CPU cycles to be optimized
                                    out in next steps). I model my data
                                    after graph model, so most data,
                                    even most indices are localized to
                                    nodes and edges, those can be
                                    manipulated without conflict, that's
                                    why I assumed I have a low
                                    contention use case since the very
                                    beginning - until I found there are
                                    still (though minor) needs for
                                    global indices to guarantee global
                                    uniqueness, I feel faithful with
                                    stm-containers/ttrie to implement a
                                    more scalable global index data
                                    structure, thanks for hinting me.</p><p>So an evident solution comes into
                                    my mind now, is to run the server
                                    with a pool of tx processing
                                    threads, matching number of CPU
                                    cores, client RPC requests then get
                                    queued to be executed in some thread
                                    from the pool. But I'm really fond
                                    of the mechanism of M:N scheduler
                                    which solves massive/dynamic
                                    concurrency so elegantly. I had some
                                    good result with Go in this regard,
                                    and see GHC at par in doing this, I
                                    don't want to give up this enjoyable
                                    machinery.</p><p>But looked at the stm
                                    implementation in GHC, it seems
                                    written TVars are exclusively locked
                                    during commit of a tx, I suspect
                                    this is the culprit when there're
                                    large M lightweight threads
                                    scheduled upon a small N hardware
                                    capabilities, that is when a
                                    lightweight thread yield control
                                    during an stm transaction commit,
                                    the TVars it locked will stay so
                                    until it's scheduled again (and
                                    again) till it can finish the
                                    commit. This way, descheduled
                                    threads could hold live threads from
                                    progressing. I haven't gone into
                                    more details there, but wonder if
                                    there can be some improvement for
                                    GHC RTS to keep an stm committing
                                    thread from descheduled, but
                                    seemingly that may impose more
                                    starvation potential; or stm can be
                                    improved to have its TVar locks
                                    preemptable when the owner
                                    trec/thread is in descheduled state?
                                    Neither should be easy but I'd
                                    really love massive lightweight
                                    threads doing STM practically well.<br>
                                  </p><p>Best regards,</p><p>Compl</p><p><br>
                                  </p>
                                  <div>On 2020/7/24 上午12:57, Christopher
                                    Allen wrote:<br>
                                  </div>
                                  <blockquote type="cite">
                                    <div dir="ltr">It seems like you
                                      know how to run practical tests
                                      for tuning thread count and
                                      contention for throughput. Part of
                                      the reason you haven't gotten a
                                      super clear answer is "it
                                      depends." You give up fairness
                                      when you use STM instead of MVars
                                      or equivalent structures. That
                                      means a long running transaction
                                      might get stampeded by many small
                                      ones invalidating it over and
                                      over. The long-running transaction
                                      might never clear if the small
                                      transactions keep moving the
                                      cheese. I mention this because
                                      transaction runtime and size and
                                      count all affect throughput and
                                      latency. What might be ideal for
                                      one pattern of work might not be
                                      ideal for another. Optimizing for
                                      overall throughput might make the
                                      contention and fairness problems
                                      worse too. I've done practical
                                      tests to optimize this in the
                                      past, both for STM in Haskell and
                                      for RDBMS workloads.
                                      <div><br>
                                      </div>
                                      <div>The next step is sometimes
                                        figuring out whether you really
                                        need a data structure within a
                                        single STM container or if
                                        perhaps you can break up your
                                        STM container boundaries into
                                        zones or regions that roughly
                                        map onto update boundaries. That
                                        should make the transactions
                                        churn less. On the outside
                                        chance you do need to touch more
                                        than one container in a
                                        transaction, well, they compose.
                                        <div><br>
                                        </div>
                                        <div>e.g. <a href="https://hackage.haskell.org/package/stm-containers" target="_blank">https://hackage.haskell.org/package/stm-containers</a></div>
                                        <div><a href="https://hackage.haskell.org/package/ttrie" target="_blank">https://hackage.haskell.org/package/ttrie</a><br>
                                        </div>
                                        <div><br>
                                        </div>
                                        <div>It also sounds a bit like
                                          your question bumps into
                                          Amdahl's Law a bit.</div>
                                        <div><br>
                                        </div>
                                        <div>All else fails, stop using
                                          STM and find something more
                                          tuned to your problem space.</div>
                                        <div><br>
                                        </div>
                                        <div>Hope this helps,</div>
                                        <div>Chris Allen</div>
                                        <div><br>
                                        </div>
                                      </div>
                                    </div>
                                    <br>
                                    <div class="gmail_quote">
                                      <div dir="ltr" class="gmail_attr">On
                                        Thu, Jul 23, 2020 at 9:53 AM
                                        YueCompl via Haskell-Cafe <<a href="mailto:haskell-cafe@haskell.org" target="_blank">haskell-cafe@haskell.org</a>>
                                        wrote:<br>
                                      </div>
                                      <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                                        <div>Hello Cafe,
                                          <div><br>
                                          </div>
                                          <div>I'm working on an
                                            in-memory database, in
                                            Client/Server mode I just
                                            let each connected client
                                            submit remote procedure call
                                            running in its dedicated
                                            lightweight thread,
                                            modifying TVars in RAM per
                                            its business needs, then in
                                            case many clients connected
                                            concurrently and trying to
                                            insert new data, if they are
                                            triggering global index
                                            (some TVar) update, the
                                            throughput would drop
                                            drastically. I reduced the
                                            shared state to a simple int
                                            counter by TVar, got same
                                            symptom. While the
                                            parallelism feels okay when
                                            there's no hot TVar
                                            conflicting, or M is not
                                            much greater than N.</div>
                                          <div><br>
                                          </div>
                                          <div>As an empirical test
                                            workload, I have a `+RTS
                                            -N10` server process, it
                                            handles 10 concurrent
                                            clients okay, got ~5x of
                                            single thread throughput;
                                            but in handling 20
                                            concurrent clients, each of
                                            the 10 CPUs can only be
                                            driven to ~10% utilization,
                                            the throughput seems even
                                            worse than single thread.
                                            More clients can even drive
                                            it thrashing without much
                                             progressing.</div>
                                          <div><br>
                                          </div>
                                          <div> I can understand that
                                            pure STM doesn't scale well
                                            after reading [1], and I see
                                            it suggested [7] attractive
                                            and planned future work
                                            toward that direction.</div>
                                          <div><br>
                                          </div>
                                          <div>But I can't find certain
                                            libraries or frameworks
                                            addressing large M over
                                            small N scenarios, [1]
                                            experimented with designated
                                            N parallelism, and [7] is
                                            rather theoretical to my
                                            empirical needs.</div>
                                          <div><br>
                                          </div>
                                          <div>Can you direct me to some
                                            available library
                                            implementing the methodology
                                            proposed in [7] or other
                                            ways tackling this problem?</div>
                                          <div><br>
                                          </div>
                                          <div>I think the most
                                            difficult one is that a
                                            transaction should commit
                                            with global indices (with
                                            possibly unique constraints)
                                            atomically updated, and
                                            rollback with any violation
                                            of constraints, i.e.
                                            transactions have to cover
                                            global states like indices.
                                            Other problems seem more
                                            trivial than this.</div>
                                          <div><br>
                                          </div>
                                          <div>Specifically, [7] states:</div>
                                          <div><br>
                                          </div>
                                          <div>> It must be
                                            emphasized that all of the
                                            mechanisms we deploy
                                            originate, in one form or
                                            another, in the database
                                            literature from the 70s and
                                            80s. Our contribution is to
                                            adapt these techniques to
                                            software transactional
                                            memory, providing more
                                            effective solutions to
                                            important STM problems than
                                            prior proposals.</div>
                                          <div><br>
                                          </div>
                                          <div>I wonder any STM based
                                            library has simplified those
                                            techniques to be composed
                                            right away? I don't really
                                            want to implement those
                                            mechanisms by myself,
                                            rebuilding many wheels from
                                            scratch.</div>
                                          <div><br>
                                          </div>
                                          <div>Best regards,</div>
                                          <div>Compl</div>
                                          <div><br>
                                          </div>
                                          <div><br>
                                          </div>
                                          <div>[1] Comparing the
                                            performance of concurrent
                                            linked-list implementations
                                            in Haskell </div>
                                          <div><a href="https://simonmar.github.io/bib/papers/concurrent-data.pdf" target="_blank">https://simonmar.github.io/bib/papers/concurrent-data.pdf</a></div>
                                          <div><br>
                                          </div>
                                          <div>[7] M. Herlihy and E.
                                            Koskinen. Transactional
                                            boosting: a methodology for
                                            highly-concurrent
                                            transactional objects. In
                                            Proc. of PPoPP ’08, pages
                                            207–216. ACM Press, 2008.</div>
                                          <div><a href="https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf" target="_blank">https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf</a></div>
                                          <div><br>
                                          </div>
                                        </div>
_______________________________________________<br>
                                        Haskell-Cafe mailing list<br>
                                        To (un)subscribe, modify options
                                        or view archives go to:<br>
                                        <a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" rel="noreferrer" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>
                                        Only members subscribed via the
                                        mailman list are allowed to
                                        post.</blockquote>
                                    </div>
                                    <br clear="all">
                                    <div><br>
                                    </div>
                                    -- <br>
                                    <div dir="ltr">
                                      <div dir="ltr">
                                        <div>
                                          <div dir="ltr">
                                            <div dir="ltr">Chris Allen<br>
                                              <div><span style="font-size:12.8px">Currently
                                                  working on </span><a href="http://haskellbook.com/" target="_blank">http://haskellbook.com</a></div>
                                            </div>
                                          </div>
                                        </div>
                                      </div>
                                    </div>
                                  </blockquote>
                                </div>
_______________________________________________<br>
                                Haskell-Cafe mailing list<br>
                                To (un)subscribe, modify options or view
                                archives go to:<br>
                                <a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" rel="noreferrer" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>
                                Only members subscribed via the mailman
                                list are allowed to post.</blockquote>
                            </div>
                          </div>
                        </blockquote>
                      </div>
                    </blockquote>
                  </div>
                </div>
              </blockquote>
            </div>
          </blockquote>
        </div>
      </blockquote>
      <br>
      <fieldset></fieldset>
      <pre>_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
<a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a>
Only members subscribed via the mailman list are allowed to post.</pre>
    </blockquote>
  </div>

_______________________________________________<br>
Haskell-Cafe mailing list<br>
To (un)subscribe, modify options or view archives go to:<br>
<a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" rel="noreferrer" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>
Only members subscribed via the mailman list are allowed to post.</blockquote></div>
_______________________________________________<br>Haskell-Cafe mailing list<br>To (un)subscribe, modify options or view archives go to:<br><a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>Only members subscribed via the mailman list are allowed to post.</div></blockquote></div><br></div></div></blockquote></div>