<div dir="ltr">To be clear, I was trying to refer to Linux `perf` [^1].  Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic.  For instance,<div> changing the size of heap objects can drastically change cache performance and completely different behavior can show up.</div><div><br></div><div>[^1]: <a href="https://en.wikipedia.org/wiki/Perf_(Linux)">https://en.wikipedia.org/wiki/Perf_(Linux)</a></div><div><br></div><div>The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... )</div><div><br></div><div>[^2]: <a href="https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275">https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275</a> <br></div><div><br></div><div>All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are!  You would see `perf` attribute a large amount of time to `read_current_value`).  The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body.  As you add more conflicting threads this gets worse of course as commits sequence.</div><div><br></div><div>The code you have will count commits of executions of `retry`.  Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries.  This is different from a transaction failing to commit and starting again.  These are invisible to the user.  Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again.  We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work.  You can observe these aborts with a debugger by looking for execution of this line:</div><div><br></div><div><a href="https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123">https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123</a></div><div><br></div><div>Ryan </div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 24, 2020 at 12:35 PM Compl Yue <<a href="mailto:compl.yue@icloud.com">compl.yue@icloud.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
  
    
  
  <div>
    <p>I'm not familiar with profiling GHC programs yet; I may need
      more time to get proficient with it.</p>
    <p>And a bit more detail on my test workload, for diagnosis: the db
      clients are Python processes from a cluster of worker nodes,
      consulting the db server to register some path for data files,
      under a data dir within a shared filesystem, then mmap'ing those
      data files and filling in the actual array data. So the db server
      doesn't have much computation to perform, but it puts the data
      file path into a global index, which at the same time validates
      its uniqueness. As there are many client processes trying to
      insert one metadata record each, concurrently, with my naive
      implementation the global index's TVar will almost always be
      locked by one client after another, from a queue that never
      falls empty.</p>
    <p>So if `readTVar` does spin-wait, I would expect the threads to
      drive high CPU utilization, because at any instant of time, all
      threads except the committing one would be doing that one
      thing.<br>
    </p>
    <p>And I have something in my code to track STM retries explicitly,
      like this:</p>
    <p>```</p>
    <div style="color:rgb(212,212,212);background-color:rgb(30,30,30);font-weight:normal;font-size:17px;line-height:23px;white-space:pre-wrap">
<div><span style="color:rgb(212,212,212)">    </span><span style="color:rgb(106,153,85)">-- blocking wait not expected, track stm retries explicitly</span></div><div><span style="color:rgb(212,212,212)">    </span><span style="color:rgb(220,220,170)">trackSTM</span><span style="color:rgb(212,212,212)"> :: </span><span style="color:rgb(86,156,214)">Int</span><span style="color:rgb(212,212,212)"> -> </span><span style="color:rgb(86,156,214)">IO</span><span style="color:rgb(212,212,212)"> (</span><span style="color:rgb(86,156,214)">Either</span><span style="color:rgb(212,212,212)"> () </span><span style="color:rgb(156,220,254)">a</span><span style="color:rgb(212,212,212)">)</span></div><div><span style="color:rgb(212,212,212)">    trackSTM !rtc = </span><span style="color:rgb(197,134,192)">do</span></div>
<div><span style="color:rgb(212,212,212)">      when </span><span style="color:rgb(106,153,85)">-- todo increase the threshold of reporting?</span></div><div><span style="color:rgb(212,212,212)">           (rtc > </span><span style="color:rgb(181,206,168)">0</span><span style="color:rgb(212,212,212)">) $ </span><span style="color:rgb(197,134,192)">do</span></div><div><span style="color:rgb(212,212,212)">        </span><span style="color:rgb(106,153,85)">-- trace out the retries so the end users can be aware of them</span></div><div><span style="color:rgb(212,212,212)">        tid <- myThreadId</span></div><div><span style="color:rgb(212,212,212)">        trace</span></div><div><span style="color:rgb(212,212,212)">            (  </span><span style="color:rgb(206,145,120)">"🔙</span><span style="color:rgb(215,186,125)">\n</span><span style="color:rgb(206,145,120)">"</span></div><div><span style="color:rgb(212,212,212)">            <> show callCtx</span></div><div><span style="color:rgb(212,212,212)">            <> </span><span style="color:rgb(206,145,120)">"🌀 "</span></div><div><span style="color:rgb(212,212,212)">            <> show tid</span></div><div><span style="color:rgb(212,212,212)">            <> </span><span style="color:rgb(206,145,120)">" stm retry #"</span></div><div><span style="color:rgb(212,212,212)">            <> show rtc</span></div><div><span style="color:rgb(212,212,212)">            )</span></div><div><span style="color:rgb(212,212,212)">          $ return </span><span style="color:rgb(86,156,214)">()</span></div>
<div><span style="color:rgb(212,212,212)">      atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \</span><span style="color:rgb(197,134,192)">case</span></div><div><span style="color:rgb(212,212,212)">         Nothing -> </span><span style="color:rgb(106,153,85)">-- stm failed, do a tracked retry</span></div><div><span style="color:rgb(212,212,212)">          trackSTM (rtc + </span><span style="color:rgb(181,206,168)">1</span><span style="color:rgb(212,212,212)">)</span></div><div><span style="color:rgb(212,212,212)">        Just ... -> ...</span></div><div><span style="color:rgb(212,212,212)">
</span></div></div>
    <p>```</p>
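    <p>(For reference, a minimal sketch of the two driving styles this
      wrapper chooses between, with a hypothetical `job :: STM a`: the
      `orElse` form polls actively, while the plain form lets `retry`
      park the thread until one of the TVars it read is written:)</p>
    <p>```</p>
    <pre>
import Control.Concurrent.STM

-- Plain form: `retry` inside the job parks the thread; the RTS wakes
-- it only when a TVar in its read set is written (wake on write).
blockingRun :: STM a -> IO a
blockingRun job = atomically job

-- Wrapped form, as in `trackSTM` above: `retry` commits the `orElse`
-- alternative instead, so the caller loops and actively re-attempts.
pollingRun :: STM a -> IO a
pollingRun job =
  atomically ((Just <$> job) `orElse` return Nothing) >>= \r ->
    case r of
      Nothing -> pollingRun job  -- busy re-attempt
      Just a  -> return a
    </pre>
    <p>```</p>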
    <p>No such trace message fires during my tests, neither in the
      single-thread run nor in runs under pressure. I'm sure this
      tracing mechanism works, as I can see such traces fire in other
      cases: e.g. if posting a TMVar to a TQueue for some other thread
      to fill, then reading the result out, are composed into a single
      tx, that is of course an infinite retry loop, and a sequence of
      such messages gets logged with an ever-increasing rtc #.<br>
    </p>
    <p>So I believe no retry has ever been triggered.</p>
    <p>What could be going on there?</p>
    <p><br>
    </p>
    <div>On 2020/7/24 11:46 PM, Ryan Yates wrote:<br>
    </div>
    <blockquote type="cite">
      
      <div dir="ltr">
        <div dir="ltr">
          <div>> Then to explain the low CPU utilization (~10%), am I
            right to understand that upon reading a TVar locked by
            another committing tx, a lightweight thread will put itself
            into the `waiting STM` descheduled state, so the CPUs can
            only stay idle because not many threads are willing to
            proceed?<br>
          </div>
          <div><br>
          </div>
          <div>Since the commit happens in finite steps, the expectation
            is that the lock will be released very soon.  Given this,
            when the body of a transaction executes `readTVar` it spins
            (active CPU!) until the `TVar` is observed unlocked.  If a
            lock is observed while committing, it immediately starts the
            transaction again from the beginning.  To get the behavior
            of suspending a transaction you have to successfully commit
            a transaction that executed `retry`.  Then the transaction
            is put on the wakeup lists of its read set, and subsequent
            commits will wake it up if its write set overlaps.</div>
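          <div><br>
          </div>
          <div>(A minimal illustration of that last point, with a
            hypothetical counter TVar: the `retry` below does not spin;
            once the transaction that executed `retry` commits, the
            thread is parked until `tv` is written:)</div>
          <div>```</div>
          <pre>
import Control.Concurrent.STM

-- Blocks until the counter becomes non-zero; the thread is suspended,
-- not spinning, and is woken by a commit that writes `tv`.
waitNonZero :: TVar Int -> IO Int
waitNonZero tv = atomically $ do
  x <- readTVar tv
  if x == 0
    then retry        -- suspend; woken by a write to `tv`
    else return x
          </pre>
          <div>```</div>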
          <div><br>
          </div>
          <div>I don't think any of these things would explain low CPU
            utilization.  You could try running under `perf` to see if
            lots of time is spent in some recognizable part of the RTS.</div>
          <div><br>
          </div>
          <div>Ryan</div>
        </div>
        <div><br>
        </div>
        <br>
        <div class="gmail_quote">
          <div dir="ltr" class="gmail_attr">On Fri, Jul 24, 2020 at
            11:22 AM Compl Yue <<a href="mailto:compl.yue@icloud.com" target="_blank">compl.yue@icloud.com</a>> wrote:<br>
          </div>
          <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
            <div>
              <p>Thanks very much for the insightful information, Ryan!
                I'm glad my suspicion about the Haskell scheduler was
                wrong:<br>
              </p>
              <p>> The Haskell capability that is committing a
                transaction will not yield to another Haskell thread
                while it is doing the commit.  The OS thread may be
                preempted, but once commit starts the Haskell scheduler
                is not invoked until after locks are released.</p>
              <div>So a best effort has already been made in GHC, and I
                just need to cooperate better with its design. Then, to
                explain the low CPU utilization (~10%), am I right to
                understand that upon reading a TVar locked by another
                committing tx, a lightweight thread will put itself into
                the `waiting STM` descheduled state, so the CPUs can
                only stay idle because not many threads are willing to
                proceed?<br>
              </div>
              <div><br>
              </div>
              <div>Anyway, I see light in better data structures to
                improve my situation; let me try them and report back.
                Actually, I later changed `TVar (HashMap k v)` to
                `TVar (HashMap k Int)`, with the `Int` being an array
                index into `TVar (Vector (TVar (Maybe v)))`, in pursuit
                of insertion-order-preserving semantics for dict entries
                (like those in Python 3.7+). So I'm hopeful that
                incorporating stm-containers' Map or ttrie will get
                close to freedom from contention.<br>
              </div>
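              <div><br>
              </div>
              <div>(Roughly this shape, if I spell out the types;
                `OrderedDict` and the field names are just illustrative
                names for the sketch:)</div>
              <div>```</div>
              <pre>
import Control.Concurrent.STM
import Data.HashMap.Strict (HashMap)
import Data.Vector (Vector)

-- key -> slot index, plus an append-only vector of value slots,
-- so iterating in slot order preserves insertion order of entries.
data OrderedDict k v = OrderedDict
  { odIndex :: TVar (HashMap k Int)
  , odSlots :: TVar (Vector (TVar (Maybe v)))
  }
              </pre>
              <div>```</div>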
              <p>Thanks with regards,</p>
              <p>Compl</p>
              <p><br>
              </p>
              <div>On 2020/7/24 10:03 PM, Ryan Yates wrote:<br>
              </div>
              <blockquote type="cite">
                <div dir="ltr">
                  <div>Hi Compl,</div>
                  <div><br>
                  </div>
                  <div>Having a pool of transaction processing threads
                    can be helpful in a certain way.  If the body of the
                    transaction takes more time to execute than the
                    Haskell thread is allowed and it yields, the
                    suspended thread won't get in the way of other
                    threads, but when it is rescheduled it will have a
                    low probability of success.  Even worse, it will
                    probably not discover that it is doomed to failure
                    until commit time.  If transactions are more likely
                    to reach commit without yielding, they are more
                    likely to succeed.  If the transactions are not
                    conflicting, it doesn't make much difference other
                    than cache churn.</div>
                  <div><br>
                  </div>
                  <div>The Haskell capability that is committing a
                    transaction will not yield to another Haskell thread
                    while it is doing the commit.  The OS thread may be
                    preempted, but once commit starts the Haskell
                    scheduler is not invoked until after locks are
                    released.</div>
                  <div><br>
                  </div>
                  <div>To get good performance from STM you must pay
                    attention to what TVars are involved in a commit. 
                    All STM systems are working under the assumption of
                    low contention, so you want to minimize "false"
                    conflicts (conflicts that are not essential to the
                    computation).    Something like `TVar (HashMap k v)`
                    will work pretty well for a low thread count, but
                    every transaction that writes to that structure will
                    be in conflict with every other transaction that
                    accesses it.  Pushing the `TVar` into the nodes of
                    the structure reduces the possibilities for
                    conflict, while increasing the amount of bookkeeping
                    STM has to do.  I would like to reduce the cost of
                    that bookkeeping using better structures, but we
                    need to do so without harming performance in the low
                    TVar count case.  Right now it is optimized for good
                    cache performance with a handful of TVars.</div>
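                  <div><br>
                  </div>
                  <div>(To make that trade-off concrete, a rough sketch;
                    the type names here are illustrative only:)</div>
                  <div>```</div>
                  <pre>
import Control.Concurrent.STM
import Data.HashMap.Strict (HashMap)

-- Coarse: one TVar guards the whole map, so every write conflicts
-- with every other transaction touching the map.
type CoarseMap k v = TVar (HashMap k v)

-- Finer: the outer TVar changes only on insert/delete; updates to
-- existing keys touch only that key's TVar, so they don't conflict.
-- The cost is more TVars for STM to track per transaction.
type FineMap k v = TVar (HashMap k (TVar v))
                  </pre>
                  <div>```</div>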
                  <div><br>
                  </div>
                  <div>There is another way to play with performance by
                    moving work into and out of the transaction body.  A
                    transaction body that executes quickly will reach
                    commit faster.  But it may be delaying work that
                    moves into another transaction.  Forcing values at
                    the right time can make a big difference.</div>
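                  <div><br>
                  </div>
                  <div>(For instance, something along these lines;
                    `updateForced` and `expensiveFunction` are
                    stand-ins, not anything from a real API:)</div>
                  <div>```</div>
                  <pre>
import Control.Concurrent.STM
import Control.Exception (evaluate)

-- Force the value before entering the transaction, so the tx body
-- (and hence the window for conflicts) stays as short as possible.
updateForced :: TVar Int -> Int -> IO ()
updateForced tv x = do
  v <- evaluate (expensiveFunction x)  -- work happens out here
  atomically (writeTVar tv v)          -- tiny tx body
  where
    expensiveFunction = (* 2)          -- stand-in for real work
                  </pre>
                  <div>```</div>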
                  <div><br>
                  </div>
                  <div>Ryan</div>
                  <div dir="ltr"><br>
                  </div>
                  <div dir="ltr">On Fri, Jul 24, 2020 at 2:14 AM Compl
                    Yue via Haskell-Cafe <<a href="mailto:haskell-cafe@haskell.org" target="_blank">haskell-cafe@haskell.org</a>>
                    wrote:<br>
                  </div>
                  <div class="gmail_quote">
                    <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                      <div>
                        <p>Thanks Chris. I confess I haven't paid enough
                          attention to STM-specialized container
                          libraries so far. I skimmed through the
                          descriptions of stm-containers and ttrie, and
                          feel they would definitely improve my code's
                          performance, provided I limit the server's
                          parallelism to hardware capabilities. That may
                          be because I'm still prototyping the API and
                          infrastructure for correctness, so even `TVar
                          (HashMap k v)` performs okay for me at the
                          moment, but only at low contention (surely
                          there are plenty of CPU cycles to be optimized
                          out in the next steps). I model my data after
                          a graph model, so most data, even most
                          indices, are localized to nodes and edges, and
                          those can be manipulated without conflict;
                          that's why I assumed I had a low-contention
                          use case from the very beginning, until I
                          found there are still (though minor) needs for
                          global indices to guarantee global uniqueness.
                          I feel confident that stm-containers/ttrie can
                          implement a more scalable global index data
                          structure; thanks for the hint.</p>
                        <p>So an evident solution comes to my mind now:
                          run the server with a pool of tx processing
                          threads matching the number of CPU cores, with
                          client RPC requests queued to be executed by
                          some thread from the pool (a sketch follows
                          below). But I'm really fond of the M:N
                          scheduler mechanism, which solves
                          massive/dynamic concurrency so elegantly. I
                          had some good results with Go in this regard,
                          and I see GHC on par in doing this; I don't
                          want to give up this enjoyable machinery.</p>
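                        <p>(A minimal sketch of the pool idea, with jobs
                          queued as plain IO actions; `startPool` is a
                          hypothetical name, not an existing API:)</p>
                        <p>```</p>
                        <pre>
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- N worker threads drain a shared queue of jobs, so at most N
-- transactions are in flight no matter how many clients connect.
startPool :: Int -> TQueue (IO ()) -> IO ()
startPool n q = replicateM_ n $ forkIO $ forever $ do
  job <- atomically (readTQueue q)
  job
                        </pre>
                        <p>```</p>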
                        <p>But looking at the STM implementation in GHC,
                          it seems written TVars are exclusively locked
                          during commit of a tx. I suspect this is the
                          culprit when a large M of lightweight threads
                          is scheduled upon a small N of hardware
                          capabilities: when a lightweight thread yields
                          control during an STM transaction commit, the
                          TVars it locked will stay locked until it's
                          scheduled again (and again) and can finish the
                          commit. This way, descheduled threads could
                          keep live threads from progressing. I haven't
                          gone into more detail there, but I wonder
                          whether the GHC RTS could be improved to keep
                          a committing STM thread from being
                          descheduled, though seemingly that may impose
                          more starvation potential; or whether STM
                          could be improved to make its TVar locks
                          preemptible when the owner trec/thread is in a
                          descheduled state? Neither would be easy, but
                          I'd really love massive lightweight threads
                          doing STM practically well.<br>
                        </p>
                        <p>Best regards,</p>
                        <p>Compl</p>
                        <p><br>
                        </p>
                        <div>On 2020/7/24 12:57 AM, Christopher Allen
                          wrote:<br>
                        </div>
                        <blockquote type="cite">
                          <div dir="ltr">It seems like you know how to
                            run practical tests for tuning thread count
                            and contention for throughput. Part of the
                            reason you haven't gotten a super clear
                            answer is "it depends." You give up fairness
                            when you use STM instead of MVars or
                            equivalent structures. That means a
                            long-running transaction might get stampeded by
                            many small ones invalidating it over and
                            over. The long-running transaction might
                            never clear if the small transactions keep
                            moving the cheese. I mention this because
                            transaction runtime and size and count all
                            affect throughput and latency. What might be
                            ideal for one pattern of work might not be
                            ideal for another. Optimizing for overall
                            throughput might make the contention and
                            fairness problems worse too. I've done
                            practical tests to optimize this in the
                            past, both for STM in Haskell and for RDBMS
                            workloads.
                            <div><br>
                            </div>
                            <div>The next step is sometimes figuring out
                              whether you really need a data structure
                              within a single STM container or if
                              perhaps you can break up your STM
                              container boundaries into zones or regions
                              that roughly map onto update boundaries.
                              That should make the transactions churn
                              less. On the outside chance you do need to
                              touch more than one container in a
                              transaction, well, they compose.
                              <div><br>
                              </div>
                              <div>e.g. <a href="https://hackage.haskell.org/package/stm-containers" target="_blank">https://hackage.haskell.org/package/stm-containers</a></div>
                              <div><a href="https://hackage.haskell.org/package/ttrie" target="_blank">https://hackage.haskell.org/package/ttrie</a><br>
                              </div>
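                              <div><br>
                              </div>
                              <div>(A rough sketch of the zoning idea,
                                purely illustrative, not from either
                                library:)</div>
                              <div>```</div>
                              <pre>
import Control.Concurrent.STM
import Data.Hashable (Hashable, hash)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- Shard one big map across a fixed vector of TVars chosen by key
-- hash; transactions hitting different shards never conflict.
type Sharded k v = V.Vector (TVar (HM.HashMap k v))

insertSharded :: (Eq k, Hashable k) => Sharded k v -> k -> v -> STM ()
insertSharded shards k v =
  modifyTVar' (shards V.! (hash k `mod` V.length shards)) (HM.insert k v)
                              </pre>
                              <div>```</div>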
                              <div><br>
                              </div>
                              <div>It also sounds like your question
                                bumps into Amdahl's Law a bit.</div>
                              <div><br>
                              </div>
                              <div>If all else fails, stop using STM
                                and find something more tuned to your
                                problem space.</div>
                              <div><br>
                              </div>
                              <div>Hope this helps,</div>
                              <div>Chris Allen</div>
                              <div><br>
                              </div>
                            </div>
                          </div>
                          <br>
                          <div class="gmail_quote">
                            <div dir="ltr" class="gmail_attr">On Thu,
                              Jul 23, 2020 at 9:53 AM YueCompl via
                              Haskell-Cafe <<a href="mailto:haskell-cafe@haskell.org" target="_blank">haskell-cafe@haskell.org</a>>
                              wrote:<br>
                            </div>
                            <blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
                              <div>Hello Cafe,
                                <div><br>
                                </div>
                                <div>I'm working on an in-memory
                                  database. In Client/Server mode I just
                                  let each connected client submit
                                  remote procedure calls that run in its
                                  dedicated lightweight thread,
                                  modifying TVars in RAM per its
                                  business needs. When many clients
                                  connect concurrently and try to insert
                                  new data, if they trigger a global
                                  index (some TVar) update, the
                                  throughput drops drastically. I
                                  reduced the shared state to a simple
                                  int counter in a TVar and got the same
                                  symptom. Meanwhile the parallelism
                                  feels okay when there's no hot TVar
                                  conflicting, or M is not much greater
                                  than N.</div>
                                <div><br>
                                </div>
                                <div>As an empirical test workload, I
                                  have a `+RTS -N10` server process; it
                                  handles 10 concurrent clients okay,
                                  getting ~5x single-thread throughput.
                                  But when handling 20 concurrent
                                  clients, each of the 10 CPUs can only
                                  be driven to ~10% utilization, and the
                                  throughput seems even worse than a
                                  single thread. More clients can drive
                                  it to thrash without making much
                                  progress.</div>
                                <div><br>
                                </div>
                                <div>I can understand that pure STM
                                  doesn't scale well after reading [1],
                                  and I see [7] suggested there as
                                  attractive, planned future work in
                                  that direction.</div>
                                <div><br>
                                </div>
                                <div>But I can't find libraries or
                                  frameworks addressing large M over
                                  small N scenarios; [1] experimented
                                  with a designated N of parallelism,
                                  and [7] is rather theoretical for my
                                  empirical needs.</div>
                                <div><br>
                                </div>
                                <div>Can you direct me to some available
                                  library implementing the methodology
                                  proposed in [7], or to other ways of
                                  tackling this problem?</div>
                                <div><br>
                                </div>
                                <div>I think the most difficult part is
                                  that a transaction should commit with
                                  global indices (possibly with unique
                                  constraints) atomically updated, and
                                  roll back on any violation of the
                                  constraints; i.e., transactions have
                                  to cover global state like indices.
                                  The other problems seem more trivial
                                  than this.</div>
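                                <div><br>
                                </div>
                                <div>(For the uniqueness check alone,
                                  the STM composition is
                                  straightforward; a sketch against a
                                  hypothetical global index TVar:)</div>
                                <div>```</div>
                                <pre>
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM

-- Insert into a global index iff the key is not already present;
-- composed into a larger transaction, a Left result lets the caller
-- abort everything atomically.
insertUnique :: (Eq k, Hashable k)
             => TVar (HM.HashMap k v) -> k -> v -> STM (Either String ())
insertUnique idx k v = do
  m <- readTVar idx
  if HM.member k m
    then return (Left "unique constraint violated")
    else do
      writeTVar idx (HM.insert k v m)
      return (Right ())
                                </pre>
                                <div>```</div>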
                                <div><br>
                                </div>
                                <div>Specifically, [7] states:</div>
                                <div><br>
                                </div>
                                <div>> It must be emphasized that all
                                  of the mechanisms we deploy originate,
                                  in one form or another, in the
                                  database literature from the 70s and
                                  80s. Our contribution is to adapt
                                  these techniques to software
                                  transactional memory, providing more
                                  effective solutions to important STM
                                  problems than prior proposals.</div>
                                <div><br>
                                </div>
                                <div>I wonder whether any STM-based
                                  library has simplified those
                                  techniques so they can be composed
                                  right away? I don't really want to
                                  implement those mechanisms myself,
                                  rebuilding many wheels from
                                  scratch.</div>
                                <div><br>
                                </div>
                                <div>Best regards,</div>
                                <div>Compl</div>
                                <div><br>
                                </div>
                                <div><br>
                                </div>
                                <div>[1] Comparing the performance of
                                  concurrent linked-list implementations
                                  in Haskell </div>
                                <div><a href="https://simonmar.github.io/bib/papers/concurrent-data.pdf" target="_blank">https://simonmar.github.io/bib/papers/concurrent-data.pdf</a></div>
                                <div><br>
                                </div>
                                <div>[7] M. Herlihy and E. Koskinen.
                                  Transactional boosting: a methodology
                                  for highly-concurrent transactional
                                  objects. In Proc. of PPoPP ’08, pages
                                  207–216. ACM Press, 2008.</div>
                                <div><a href="https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf" target="_blank">https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf</a></div>
                                <div><br>
                                </div>
                              </div>
</blockquote>
                          </div>
                          <br clear="all">
                          <div><br>
                          </div>
                          -- <br>
                          <div dir="ltr">
                            <div dir="ltr">
                              <div>
                                <div dir="ltr">
                                  <div dir="ltr">Chris Allen<br>
                                    <div><span style="font-size:12.8px">Currently
                                        working on </span><a href="http://haskellbook.com" target="_blank">http://haskellbook.com</a></div>
                                  </div>
                                </div>
                              </div>
                            </div>
                          </div>
                        </blockquote>
                      </div>
</blockquote>
                  </div>
                </div>
              </blockquote>
            </div>
          </blockquote>
        </div>
      </div>
    </blockquote>
  </div>

</blockquote></div>