<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>I'm not familiar with profiling GHC yet, may need more time to

      get myself proficient with it.</p>

    <p>And a bit more details of my test workload for diagnostic: the db

      clients are Python processes from a cluster of worker nodes,

      consulting the db server to register some path for data files,

      under a data dir within a shared filesystem, then mmap those data

      files and fill in actual array data. So the db server don't have

      much computation to perform, but puts the data file path into a

      global index, which at the same validates its uniqueness. As there

      are many client processes trying to insert one meta data record

      concurrently, with my naive implementation, the global index's

      TVar will almost always in locked state by one client after

      another, from a queue never fall empty.</p>

    <p>So if `readTVar` should spinning waiting, I doubt the threads

      should actually make high CPU utilization, because at any instant

      of time, all threads except the committing one will be doing that

      one thing.<br>

    </p>

    <p>And I have something in my code to track STM retry like this:</p>

    <p>```</p>

    <div style="color: #d4d4d4;background-color: #1e1e1e;font-family: ''Ubuntu Mono', 'Droid Sans Mono', 'monospace', monospace, 'Droid Sans Fallback'', 'Droid Sans Mono', 'monospace', monospace, 'Droid Sans Fallback';font-weight: normal;font-size: 17px;line-height: 23px;white-space: pre;">

<div><span style="color: #d4d4d4;">    </span><span style="color: #6a9955;">-- blocking wait not expected, track stm retries explicitly</span></div><div><span style="color: #d4d4d4;">    </span><span style="color: #dcdcaa;">trackSTM</span><span style="color: #d4d4d4;"> :: </span><span style="color: #569cd6;">Int</span><span style="color: #d4d4d4;"> -> </span><span style="color: #569cd6;">IO</span><span style="color: #d4d4d4;"> (</span><span style="color: #569cd6;">Either</span><span style="color: #d4d4d4;"> () </span><span style="color: #9cdcfe;">a</span><span style="color: #d4d4d4;">)</span></div><div><span style="color: #d4d4d4;">    trackSTM !rtc = </span><span style="color: #c586c0;">do</span></div>

<div><span style="color: #d4d4d4;">      when </span><span style="color: #6a9955;">-- todo increase the threshold of reporting?</span></div><div><span style="color: #d4d4d4;">           (rtc > </span><span style="color: #b5cea8;">0</span><span style="color: #d4d4d4;">) $ </span><span style="color: #c586c0;">do</span></div><div><span style="color: #d4d4d4;">        </span><span style="color: #6a9955;">-- trace out the retries so the end users can be aware of them</span></div><div><span style="color: #d4d4d4;">        tid <- myThreadId</span></div><div><span style="color: #d4d4d4;">        trace</span></div><div><span style="color: #d4d4d4;">            (  </span><span style="color: #ce9178;">"🔙</span><span style="color: #d7ba7d;">\n</span><span style="color: #ce9178;">"</span></div><div><span style="color: #d4d4d4;">            <> show callCtx</span></div><div><span style="color: #d4d4d4;">            <> </span><span style="color: #ce9178;">"🌀 "</span></div><div><span style="color: #d4d4d4;">            <> show tid</span></div><div><span style="color: #d4d4d4;">            <> </span><span style="color: #ce9178;">" stm retry #"</span></div><div><span style="color: #d4d4d4;">            <> show rtc</span></div><div><span style="color: #d4d4d4;">            )</span></div><div><span style="color: #d4d4d4;">          $ return </span><span style="color: #569cd6;">()</span></div>

<div><span style="color: #d4d4d4;">      atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \</span><span style="color: #c586c0;">case</span></div><div><span style="color: #d4d4d4;">         Nothing -> </span><span style="color: #6a9955;">-- stm failed, do a tracked retry</span></div><div><span style="color: #d4d4d4;">          trackSTM (rtc + </span><span style="color: #b5cea8;">1</span><span style="color: #d4d4d4;">)</span></div><div><span style="color: #d4d4d4;">        Just ... -> ...</span></div><div><span style="color: #d4d4d4;">

</span></div></div>

    <p>```</p>

    <p>No such trace msg fires during my test, neither in single thread

      run, nor in runs with pressure. I'm sure this tracing mechanism

      works, as I can see such traces fire, in case e.g. posting a TMVar

      to a TQueue for some other thread to fill it, then read the result

      out, if these 2 ops are composed into a single tx, then of course

      it's infinite retry loop, and a sequence of such msgs are logged

      with ever increasing rtc #.<br>

    </p>

    <p>So I believe no retry has ever been triggered.</p>

    <p>What can going on there?</p>

    <p><br>

    </p>

    <div class="moz-cite-prefix">On 2020/7/24 下午11:46, Ryan Yates wrote:<br>

    </div>

    <blockquote type="cite"

cite="mid:CAO27hRoyaC8EAYbvr5ARV45XP1F6=Ja2UC_cqTOj_WEX3Zi_aw@mail.gmail.com">

      <meta http-equiv="content-type" content="text/html; charset=UTF-8">

      <div dir="ltr">

        <div dir="ltr">

          <div>> Then to explain the low CPU utilization (~10%), am I

            right to understand it as that upon reading a TVar locked by

            another committing tx, a lightweight thread will put itself

            into `waiting STM` and descheduled state, so the CPUs can

            only stay idle as not so many threads are willing to

            proceed?<br>

          </div>

          <div><br>

          </div>

          <div>Since the commit happens in finite steps, the expectation

            is that the lock will be released very soon.  Given this

            when the body of a transaction executes `readTVar` it spins

            (active CPU!) until the `TVar` is observed unlocked.  If a

            lock is observed while commiting, it immediately starts the

            transaction again from the beginning.  To get the behavior

            of suspending a transaction you have to successfully commit

            a transaction that executed `retry`.  Then the transaction

            is put on the wakeup lists of its read set and subsequent

            commits will wake it up if its write set overlaps.</div>

          <div><br>

          </div>

          <div>I don't think any of these things would explain low CPU

            utilization.  You could try running with `perf` and see if

            lots of time is spent in some recognizable part of the RTS.</div>

          <div><br>

          </div>

          <div>Ryan</div>

        </div>

        <div><br>

        </div>

        <br>

        <div class="gmail_quote">

          <div dir="ltr" class="gmail_attr">On Fri, Jul 24, 2020 at

            11:22 AM Compl Yue <<a href="mailto:compl.yue@icloud.com"

              moz-do-not-send="true">compl.yue@icloud.com</a>> wrote:<br>

          </div>

          <blockquote class="gmail_quote" style="margin:0px 0px 0px

            0.8ex;border-left:1px solid

            rgb(204,204,204);padding-left:1ex">

            <div>

              <p>Thanks very much for the insightful information Ryan!

                I'm glad my suspect was wrong about the Haskell

                scheduler:<br>

              </p>

              <p>> The Haskell capability that is committing a

                transaction will not yield to another Haskell thread

                while it is doing the commit.  The OS thread may be

                preempted, but once commit starts the haskell scheduler

                is not invoked until after locks are released.</p>

              <div>So best effort had already been made in GHC and I

                just need to cooperate better with its design. Then to

                explain the low CPU utilization (~10%), am I right to

                understand it as that upon reading a TVar locked by

                another committing tx, a lightweight thread will put

                itself into `waiting STM` and descheduled state, so the

                CPUs can only stay idle as not so many threads are

                willing to proceed?<br>

              </div>

              <div><br>

              </div>

              <div>Anyway, I see light with better data structures to

                improve my situation, let me try them and report back.

                Actually I later changed `TVar (HaskMap k v)` to be

                `TVar (HashMap k Int)` where the `Int` being array index

                into `TVar (Vector (TVar (Maybe v)))`, in pursuing

                insertion order preservation semantic of dict entries

                (like that in Python 3.7+), then it's very hopeful to

                incorporate stm-containers' Map or ttrie to approach

                free of contention.<br>

              </div>

              <p>Thanks with regards,</p>

              <p>Compl</p>

              <p><br>

              </p>

              <div>On 2020/7/24 下午10:03, Ryan Yates wrote:<br>

              </div>

              <blockquote type="cite">

                <div dir="ltr">

                  <div>Hi Compl,</div>

                  <div><br>

                  </div>

                  <div>Having a pool of transaction processing threads

                    can be helpful in a certain way.  If the body of the

                    transaction takes more time to execute then the

                    Haskell thread is allowed and it yields, the

                    suspended thread won't get in the way of other

                    thread, but when it is rescheduled, will have a low

                    probability of success.  Even worse, it will

                    probably not discover that it is doomed to failure

                    until commit time.  If transactions are more likely

                    to reach commit without yielding, they are more

                    likely to succeed.  If the transactions are not

                    conflicting, it doesn't make much difference other

                    than cache churn.</div>

                  <div><br>

                  </div>

                  <div>The Haskell capability that is committing a

                    transaction will not yield to another Haskell thread

                    while it is doing the commit.  The OS thread may be

                    preempted, but once commit starts the haskell

                    scheduler is not invoked until after locks are

                    released.</div>

                  <div><br>

                  </div>

                  <div>To get good performance from STM you must pay

                    attention to what TVars are involved in a commit. 

                    All STM systems are working under the assumption of

                    low contention, so you want to minimize "false"

                    conflicts (conflicts that are not essential to the

                    computation).    Something like `TVar (HashMap k v)`

                    will work pretty well for a low thread count, but

                    every transaction that writes to that structure will

                    be in conflict with every other transaction that

                    accesses it.  Pushing the `TVar` into the nodes of

                    the structure reduces the possibilities for

                    conflict, while increasing the amount of bookkeeping

                    STM has to do.  I would like to reduce the cost of

                    that bookkeeping using better structures, but we

                    need to do so without harming performance in the low

                    TVar count case.  Right now it is optimized for good

                    cache performance with a handful of TVars.</div>

                  <div><br>

                  </div>

                  <div>There is another way to play with performance by

                    moving work into and out of the transaction body.  A

                    transaction body that executes quickly will reach

                    commit faster.  But it may be delaying work that

                    moves into another transaction.  Forcing values at

                    the right time can make a big difference.</div>

                  <div><br>

                  </div>

                  <div>Ryan</div>

                  <div dir="ltr"><br>

                  </div>

                  <div dir="ltr">On Fri, Jul 24, 2020 at 2:14 AM Compl

                    Yue via Haskell-Cafe <<a

                      href="mailto:haskell-cafe@haskell.org"

                      target="_blank" moz-do-not-send="true">haskell-cafe@haskell.org</a>>

                    wrote:<br>

                  </div>

                  <div class="gmail_quote">

                    <blockquote class="gmail_quote" style="margin:0px

                      0px 0px 0.8ex;border-left:1px solid

                      rgb(204,204,204);padding-left:1ex">

                      <div>

                        <p>Thanks Chris, I confess I didn't pay enough

                          attention to STM specialized container

                          libraries by far, I skimmed through the

                          description of stm-containers and ttrie, and

                          feel they would definitely improve my code's

                          performance in case I limit the server's

                          parallelism within hardware capabilities. That

                          may because I'm still prototyping the api and

                          infrastructure for correctness, so even `TVar

                          (HashMap k v)` performs okay for me at the

                          moment, only if at low contention (surely

                          there're plenty of CPU cycles to be optimized

                          out in next steps). I model my data after

                          graph model, so most data, even most indices

                          are localized to nodes and edges, those can be

                          manipulated without conflict, that's why I

                          assumed I have a low contention use case since

                          the very beginning - until I found there are

                          still (though minor) needs for global indices

                          to guarantee global uniqueness, I feel

                          faithful with stm-containers/ttrie to

                          implement a more scalable global index data

                          structure, thanks for hinting me.</p>

                        <p>So an evident solution comes into my mind

                          now, is to run the server with a pool of tx

                          processing threads, matching number of CPU

                          cores, client RPC requests then get queued to

                          be executed in some thread from the pool. But

                          I'm really fond of the mechanism of M:N

                          scheduler which solves massive/dynamic

                          concurrency so elegantly. I had some good

                          result with Go in this regard, and see GHC at

                          par in doing this, I don't want to give up

                          this enjoyable machinery.</p>

                        <p>But looked at the stm implementation in GHC,

                          it seems written TVars are exclusively locked

                          during commit of a tx, I suspect this is the

                          culprit when there're large M lightweight

                          threads scheduled upon a small N hardware

                          capabilities, that is when a lightweight

                          thread yield control during an stm transaction

                          commit, the TVars it locked will stay so until

                          it's scheduled again (and again) till it can

                          finish the commit. This way, descheduled

                          threads could hold live threads from

                          progressing. I haven't gone into more details

                          there, but wonder if there can be some

                          improvement for GHC RTS to keep an stm

                          committing thread from descheduled, but

                          seemingly that may impose more starvation

                          potential; or stm can be improved to have its

                          TVar locks preemptable when the owner

                          trec/thread is in descheduled state? Neither

                          should be easy but I'd really love massive

                          lightweight threads doing STM practically

                          well.<br>

                        </p>

                        <p>Best regards,</p>

                        <p>Compl</p>

                        <p><br>

                        </p>

                        <div>On 2020/7/24 上午12:57, Christopher Allen

                          wrote:<br>

                        </div>

                        <blockquote type="cite">

                          <div dir="ltr">It seems like you know how to

                            run practical tests for tuning thread count

                            and contention for throughput. Part of the

                            reason you haven't gotten a super clear

                            answer is "it depends." You give up fairness

                            when you use STM instead of MVars or

                            equivalent structures. That means a long

                            running transaction might get stampeded by

                            many small ones invalidating it over and

                            over. The long-running transaction might

                            never clear if the small transactions keep

                            moving the cheese. I mention this because

                            transaction runtime and size and count all

                            affect throughput and latency. What might be

                            ideal for one pattern of work might not be

                            ideal for another. Optimizing for overall

                            throughput might make the contention and

                            fairness problems worse too. I've done

                            practical tests to optimize this in the

                            past, both for STM in Haskell and for RDBMS

                            workloads.

                            <div><br>

                            </div>

                            <div>The next step is sometimes figuring out

                              whether you really need a data structure

                              within a single STM container or if

                              perhaps you can break up your STM

                              container boundaries into zones or regions

                              that roughly map onto update boundaries.

                              That should make the transactions churn

                              less. On the outside chance you do need to

                              touch more than one container in a

                              transaction, well, they compose.

                              <div><br>

                              </div>

                              <div>e.g. <a

                                  href="https://hackage.haskell.org/package/stm-containers"

                                  target="_blank" moz-do-not-send="true">https://hackage.haskell.org/package/stm-containers</a></div>

                              <div><a

                                  href="https://hackage.haskell.org/package/ttrie"

                                  target="_blank" moz-do-not-send="true">https://hackage.haskell.org/package/ttrie</a><br>

                              </div>

                              <div><br>

                              </div>

                              <div>It also sounds a bit like

                                your question bumps into Amdahl's Law a

                                bit.</div>

                              <div><br>

                              </div>

                              <div>All else fails, stop using STM and

                                find something more tuned to your

                                problem space.</div>

                              <div><br>

                              </div>

                              <div>Hope this helps,</div>

                              <div>Chris Allen</div>

                              <div><br>

                              </div>

                            </div>

                          </div>

                          <br>

                          <div class="gmail_quote">

                            <div dir="ltr" class="gmail_attr">On Thu,

                              Jul 23, 2020 at 9:53 AM YueCompl via

                              Haskell-Cafe <<a

                                href="mailto:haskell-cafe@haskell.org"

                                target="_blank" moz-do-not-send="true">haskell-cafe@haskell.org</a>>

                              wrote:<br>

                            </div>

                            <blockquote class="gmail_quote"

                              style="margin:0px 0px 0px

                              0.8ex;border-left:1px solid

                              rgb(204,204,204);padding-left:1ex">

                              <div>Hello Cafe,

                                <div><br>

                                </div>

                                <div>I'm working on an in-memory

                                  database, in Client/Server mode I just

                                  let each connected client submit

                                  remote procedure call running in its

                                  dedicated lightweight thread,

                                  modifying TVars in RAM per its

                                  business needs, then in case many

                                  clients connected concurrently and

                                  trying to insert new data, if they are

                                  triggering global index (some TVar)

                                  update, the throughput would drop

                                  drastically. I reduced the shared

                                  state to a simple int counter by TVar,

                                  got same symptom. While the

                                  parallelism feels okay when there's no

                                  hot TVar conflicting, or M is not much

                                  greater than N.</div>

                                <div><br>

                                </div>

                                <div>As an empirical test workload, I

                                  have a `+RTS -N10` server process, it

                                  handles 10 concurrent clients okay,

                                  got ~5x of single thread throughput;

                                  but in handling 20 concurrent clients,

                                  each of the 10 CPUs can only be driven

                                  to ~10% utilization, the throughput

                                  seems even worse than single thread.

                                  More clients can even drive it

                                  thrashing without much  progressing.</div>

                                <div><br>

                                </div>

                                <div> I can understand that pure STM

                                  doesn't scale well after reading [1],

                                  and I see it suggested [7] attractive

                                  and planned future work toward that

                                  direction.</div>

                                <div><br>

                                </div>

                                <div>But I can't find certain libraries

                                  or frameworks addressing large M over

                                  small N scenarios, [1] experimented

                                  with designated N parallelism, and [7]

                                  is rather theoretical to my empirical

                                  needs.</div>

                                <div><br>

                                </div>

                                <div>Can you direct me to some available

                                  library implementing the methodology

                                  proposed in [7] or other ways tackling

                                  this problem?</div>

                                <div><br>

                                </div>

                                <div>I think the most difficult one is

                                  that a transaction should commit with

                                  global indices (with possibly unique

                                  constraints) atomically updated, and

                                  rollback with any violation of

                                  constraints, i.e. transactions have to

                                  cover global states like indices.

                                  Other problems seem more trivial than

                                  this.</div>

                                <div><br>

                                </div>

                                <div>Specifically, [7] states:</div>

                                <div><br>

                                </div>

                                <div>> It must be emphasized that all

                                  of the mechanisms we deploy originate,

                                  in one form or another, in the

                                  database literature from the 70s and

                                  80s. Our contribution is to adapt

                                  these techniques to software

                                  transactional memory, providing more

                                  effective solutions to important STM

                                  problems than prior proposals.</div>

                                <div><br>

                                </div>

                                <div>I wonder any STM based library has

                                  simplified those techniques to be

                                  composed right away? I don't really

                                  want to implement those mechanisms by

                                  myself, rebuilding many wheels from

                                  scratch.</div>

                                <div><br>

                                </div>

                                <div>Best regards,</div>

                                <div>Compl</div>

                                <div><br>

                                </div>

                                <div><br>

                                </div>

                                <div>[1] Comparing the performance of

                                  concurrent linked-list implementations

                                  in Haskell </div>

                                <div><a

                                    href="https://simonmar.github.io/bib/papers/concurrent-data.pdf"

                                    target="_blank"

                                    moz-do-not-send="true">https://simonmar.github.io/bib/papers/concurrent-data.pdf</a></div>

                                <div><br>

                                </div>

                                <div>[7] M. Herlihy and E. Koskinen.

                                  Transactional boosting: a methodology

                                  for highly-concurrent transactional

                                  objects. In Proc. of PPoPP ’08, pages

                                  207–216. ACM Press, 2008.</div>

                                <div><a

                                    href="https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf"

                                    target="_blank"

                                    moz-do-not-send="true">https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf</a></div>

                                <div><br>

                                </div>

                              </div>

_______________________________________________<br>

                              Haskell-Cafe mailing list<br>

                              To (un)subscribe, modify options or view

                              archives go to:<br>

                              <a

                                href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe"

                                rel="noreferrer" target="_blank"

                                moz-do-not-send="true">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>

                              Only members subscribed via the mailman

                              list are allowed to post.</blockquote>

                          </div>

                          <br clear="all">

                          <div><br>

                          </div>

                          -- <br>

                          <div dir="ltr">

                            <div dir="ltr">

                              <div>

                                <div dir="ltr">

                                  <div dir="ltr">Chris Allen<br>

                                    <div><span style="font-size:12.8px">Currently

                                        working on </span><a

                                        href="http://haskellbook.com"

                                        target="_blank"

                                        moz-do-not-send="true">http://haskellbook.com</a></div>

                                  </div>

                                </div>

                              </div>

                            </div>

                          </div>

                        </blockquote>

                      </div>

                      _______________________________________________<br>

                      Haskell-Cafe mailing list<br>

                      To (un)subscribe, modify options or view archives

                      go to:<br>

                      <a

                        href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe"

                        rel="noreferrer" target="_blank"

                        moz-do-not-send="true">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>

                      Only members subscribed via the mailman list are

                      allowed to post.</blockquote>

                  </div>

                </div>

              </blockquote>

            </div>

          </blockquote>

        </div>

      </div>

    </blockquote>

  </body>

</html>