<div dir="ltr"><div dir="ltr"><div>> Then to explain the low CPU
utilization (~10%), am I right to understand that upon reading a
TVar locked by another committing tx, a lightweight thread puts
itself into the `waiting STM` descheduled state, so the CPUs stay
mostly idle because few threads are able to proceed?<br>
</div><div><br></div><div>Since the commit happens in a finite number of steps, the expectation is that the lock will be released very soon. Given this, when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while committing, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set, and subsequent commits will wake it up if their write sets overlap that read set.</div><div><br></div><div>I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS.</div><div><br></div><div>Ryan</div>
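<div>A minimal sketch of the retry/wakeup behavior described above, assuming only the standard `stm` package (the `waitForFlag` and `demo` names are illustrative): a transaction that commits a `retry` is parked until some later commit writes a `TVar` it read.</div>

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM

-- Blocks (via `retry`) until another transaction sets the flag to True.
-- Committing the `retry` parks this thread on the wakeup list of `flag`.
waitForFlag :: TVar Bool -> STM ()
waitForFlag flag = do
  v <- readTVar flag
  if v then pure () else retry

demo :: IO Bool
demo = do
  flag <- newTVarIO False
  _ <- forkIO $ do
    threadDelay 10000                 -- give the waiter time to block
    atomically $ writeTVar flag True  -- this commit wakes the waiter
  atomically (waitForFlag flag)       -- parked until the write commits
  readTVarIO flag
```

<div>While parked, the waiting thread consumes no CPU; contrast this with the spinning on a locked `TVar` during `readTVar`, which does.</div>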
<div></div></div><div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 24, 2020 at 11:22 AM Compl Yue <<a href="mailto:compl.yue@icloud.com">compl.yue@icloud.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Thanks very much for the insightful information, Ryan! I'm glad my
suspicion about the Haskell scheduler was wrong:<br>
</p>
<p>> The Haskell capability that is committing a transaction will
not yield to another Haskell thread while it is doing the commit.
The OS thread may be preempted, but once commit starts the Haskell
scheduler is not invoked until after locks are released.</p>
<div>So GHC already makes a best effort here, and I just need to
cooperate better with its design. Then to explain the low CPU
utilization (~10%), am I right to understand that upon reading a
TVar locked by another committing tx, a lightweight thread puts
itself into the `waiting STM` descheduled state, so the CPUs stay
mostly idle because few threads are able to proceed?<br>
</div>
<div><br>
</div>
<div>Anyway, I see a path forward with better data structures to
improve my situation; let me try them and report back. Actually, I
later changed `TVar (HashMap k v)` to `TVar (HashMap k Int)`, where
the `Int` is an array index into `TVar (Vector (TVar (Maybe v)))`,
to get insertion-order-preserving semantics for dict entries (like
Python 3.7+). With that, I'm hopeful that incorporating
stm-containers' Map or ttrie will get close to contention-free
operation.<br>
</div>
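<div>A sketch of the structure described above, assuming the `unordered-containers` and `vector` packages (the `OrderedDict` name and helpers are illustrative, not from any library): the outer index maps keys to slots in an append-only vector, so iterating the vector follows insertion order, and updating an existing key touches only that key's own slot `TVar`.</div>

```haskell
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- Insertion-order-preserving dict: keys map to slot indices in an
-- append-only vector of per-entry TVars.
data OrderedDict k v = OrderedDict
  { odIndex :: TVar (HM.HashMap k Int)            -- key -> slot index
  , odSlots :: TVar (V.Vector (TVar (Maybe v)))   -- slots in insertion order
  }

newOrderedDict :: IO (OrderedDict k v)
newOrderedDict = OrderedDict <$> newTVarIO HM.empty <*> newTVarIO V.empty

insertOD :: (Eq k, Hashable k) => OrderedDict k v -> k -> v -> STM ()
insertOD (OrderedDict idxVar slotsVar) k v = do
  idx <- readTVar idxVar
  case HM.lookup k idx of
    Just i -> do                  -- existing key: write only its slot TVar;
      slots <- readTVar slotsVar  -- neither outer TVar is written
      writeTVar (slots V.! i) (Just v)
    Nothing -> do                 -- new key: append a slot, update the index
      slots <- readTVar slotsVar
      slot <- newTVar (Just v)
      writeTVar slotsVar (V.snoc slots slot)
      writeTVar idxVar (HM.insert k (V.length slots) idx)

lookupOD :: (Eq k, Hashable k) => OrderedDict k v -> k -> STM (Maybe v)
lookupOD (OrderedDict idxVar slotsVar) k = do
  idx <- readTVar idxVar
  case HM.lookup k idx of
    Nothing -> pure Nothing
    Just i  -> readTVar slotsVar >>= \slots -> readTVar (slots V.! i)
```

<div>Note that updating an existing key writes neither outer `TVar`, so such updates to different keys do not conflict with each other; only insertions of new keys contend on the index and the slot vector.</div>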
<p>Thanks and regards,</p>
<p>Compl</p>
<p><br>
</p>
<div>On 2020/7/24 10:03 PM, Ryan Yates wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Hi Compl,</div>
<div><br>
</div>
<div>Having a pool of transaction processing threads can be
helpful in a certain way.  If the body of the transaction
takes more time to execute than the Haskell thread is allowed
and it yields, the suspended thread won't get in the way of
other threads, but when it is rescheduled it will have a low
probability of success.  Even worse, it will probably not
discover that it is doomed to fail until commit time.  If
transactions are more likely to reach commit without yielding,
they are more likely to succeed.  If the transactions are not
conflicting, it doesn't make much difference other than cache
churn.</div>
<div><br>
</div>
<div>The Haskell capability that is committing a transaction
will not yield to another Haskell thread while it is doing the
commit.  The OS thread may be preempted, but once commit
starts the Haskell scheduler is not invoked until after locks
are released.</div>
<div><br>
</div>
<div>To get good performance from STM you must pay attention to
which TVars are involved in a commit.  All STM systems work
under the assumption of low contention, so you want to
minimize "false" conflicts (conflicts that are not essential
to the computation).  Something like `TVar (HashMap k v)`
will work pretty well for a low thread count, but every
transaction that writes to that structure will be in conflict
with every other transaction that accesses it.  Pushing the
`TVar` into the nodes of the structure reduces the
possibilities for conflict, while increasing the amount of
bookkeeping STM has to do.  I would like to reduce the cost of
that bookkeeping using better structures, but we need to do so
without harming performance in the low TVar count case.  Right
now it is optimized for good cache performance with a handful
of TVars.</div>
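<div>A sketch of the contrast, assuming only the `stm` and `containers` packages (the `Coarse`/`Fine` names are illustrative): in the finer-grained version the outer `TVar` is written only on structural changes, so transactions that update different existing keys no longer conflict on it.</div>

```haskell
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- Coarse: every write replaces the whole map, so any two writers conflict.
type Coarse k v = TVar (M.Map k v)

updateCoarse :: Ord k => Coarse k v -> k -> v -> STM ()
updateCoarse tv k v = modifyTVar' tv (M.insert k v)

-- Finer: the outer TVar is written only when a key is added; updating an
-- existing key touches just that key's own TVar, so writers to different
-- existing keys don't conflict with each other.
type Fine k v = TVar (M.Map k (TVar v))

updateFine :: Ord k => Fine k v -> k -> v -> STM ()
updateFine tv k v = do
  m <- readTVar tv
  case M.lookup k m of
    Just slot -> writeTVar slot v          -- no write to the outer TVar
    Nothing   -> do
      slot <- newTVar v
      writeTVar tv (M.insert k slot m)     -- structural change: outer write
```

<div>The cost, as noted above, is more `TVar`s for the STM machinery to track per transaction.</div>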
<div><br>
</div>
<div>There is another way to play with performance: moving
work into and out of the transaction body.  A transaction body
that executes quickly will reach commit faster, but it may
simply be deferring work into another transaction.  Forcing
values at the right time can make a big difference.</div>
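<div>One way the forcing point matters can be sketched like this (the helper names are illustrative): storing an unevaluated thunk commits quickly but shifts the cost into whichever later transaction forces it, while evaluating before `atomically` keeps the body short and avoids repeating the work if the transaction re-executes after a conflict.</div>

```haskell
import Control.Concurrent.STM
import Control.Exception (evaluate)

expensive :: Int -> Int
expensive n = sum [1..n]       -- stand-in for real work

-- Commits a lazy thunk: the transaction body is short, but the cost of
-- `expensive` is paid by whoever later forces the stored value.
storeLazy :: TVar Int -> Int -> STM ()
storeLazy tv x = writeTVar tv (expensive x)

-- Forces the value outside the transaction: the work is done exactly
-- once, and the transaction body stays short.
storeForced :: TVar Int -> Int -> IO ()
storeForced tv x = do
  v <- evaluate (expensive x)  -- pay the cost before entering STM
  atomically (writeTVar tv v)
```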
<div><br>
</div>
<div>Ryan</div>
<div dir="ltr"><br>
</div>
<div dir="ltr">On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via
Haskell-Cafe <<a href="mailto:haskell-cafe@haskell.org" target="_blank">haskell-cafe@haskell.org</a>>
wrote:<br>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Thanks Chris. I confess I haven't paid enough attention to
STM-specialized container libraries so far. I skimmed the
descriptions of stm-containers and ttrie, and feel they would
definitely improve my code's performance, provided I limit the
server's parallelism to the hardware's capabilities. That may
be because I'm still prototyping the api and infrastructure
for correctness, so even `TVar (HashMap k v)` performs okay
for me at the moment, as long as contention stays low (surely
there are plenty of CPU cycles to be optimized away in later
steps). I model my data after a graph model, so most data,
and even most indices, are localized to nodes and edges;
those can be manipulated without conflict. That's why I
assumed I had a low-contention use case from the very
beginning, until I found there are still (though minor) needs
for global indices to guarantee global uniqueness. I feel
confident that stm-containers/ttrie can help implement a more
scalable global index data structure; thanks for the hint.</p>
<p>So an obvious solution comes to mind: run the server with a
pool of tx processing threads matching the number of CPU
cores, with client RPC requests queued for execution by some
thread from the pool. But I'm really fond of the M:N
scheduler, which solves massive/dynamic concurrency so
elegantly. I had good results with Go in this regard, and I
see GHC on par in doing this; I don't want to give up this
enjoyable machinery.</p>
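<div>A minimal sketch of such a pool, assuming only the `stm` package's `TQueue` (the `newPool`/`submit` names are illustrative, and a real version would also deliver results back to clients, e.g. via a `TMVar` per request): one worker per capability drains a queue of transactions, so at most N transactions compete for commit at any time while any number of client threads may enqueue.</div>

```haskell
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- Spawn one worker per capability; each worker repeatedly takes a job
-- off the queue and runs it as its own transaction.
newPool :: IO (TQueue (STM ()))
newPool = do
  q <- newTQueueIO
  n <- getNumCapabilities
  replicateM_ n $ forkIO $ forever $ do
    job <- atomically (readTQueue q)  -- blocks when the queue is empty
    atomically job                    -- only N of these run concurrently
  pure q

-- Client threads enqueue work instead of running transactions directly.
submit :: TQueue (STM ()) -> STM () -> IO ()
submit q job = atomically (writeTQueue q job)
```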
<p>But looking at the STM implementation in GHC, it seems
written TVars are exclusively locked during the commit of a
tx. I suspect this is the culprit when a large number M of
lightweight threads is scheduled onto a small number N of
hardware capabilities: when a lightweight thread yields
control during an STM transaction commit, the TVars it locked
stay locked until it is scheduled again (and again) and can
finish the commit. This way, descheduled threads can hold
live threads back from progressing. I haven't gone into more
detail there, but I wonder whether the GHC RTS could be
improved to keep a committing STM thread from being
descheduled, though seemingly that may create more potential
for starvation; or whether STM could be improved to make its
TVar locks preemptible when the owning trec/thread is
descheduled? Neither would be easy, but I'd really love to
see massive lightweight threads doing STM well in practice.<br>
</p>
<p>Best regards,</p>
<p>Compl</p>
<p><br>
</p>
<div>On 2020/7/24 12:57 AM, Christopher Allen wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">It seems like you know how to run
practical tests for tuning thread count and contention
for throughput. Part of the reason you haven't gotten
a super clear answer is "it depends." You give up
fairness when you use STM instead of MVars or
equivalent structures. That means a long running
transaction might get stampeded by many small ones
invalidating it over and over. The long-running
transaction might never clear if the small
transactions keep moving the cheese. I mention this
because transaction runtime and size and count all
affect throughput and latency. What might be ideal for
one pattern of work might not be ideal for another.
Optimizing for overall throughput might make the
contention and fairness problems worse too. I've done
practical tests to optimize this in the past, both for
STM in Haskell and for RDBMS workloads.
<div><br>
</div>
<div>The next step is sometimes figuring out whether
you really need a data structure within a single STM
container or if perhaps you can break up your STM
container boundaries into zones or regions that
roughly map onto update boundaries. That should make
the transactions churn less. On the outside chance
you do need to touch more than one container in a
transaction, well, they compose.
<div><br>
</div>
<div>e.g. <a href="https://hackage.haskell.org/package/stm-containers" target="_blank">https://hackage.haskell.org/package/stm-containers</a></div>
<div><a href="https://hackage.haskell.org/package/ttrie" target="_blank">https://hackage.haskell.org/package/ttrie</a><br>
</div>
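<div>The zoning idea above can be sketched with plain `TVar`s (`moveBetweenZones` is an illustrative name): transactions that stay inside one zone never conflict with the other zone, yet a cross-zone update is still one atomic commit because STM actions compose.</div>

```haskell
import Control.Concurrent.STM

-- Two independent "zones" held in separate TVars.  Updates confined to
-- one zone don't touch the other, so they can't conflict with it; but
-- an update spanning both is still a single atomic transaction.
moveBetweenZones :: TVar Int -> TVar Int -> Int -> STM ()
moveBetweenZones from to n = do
  modifyTVar' from (subtract n)
  modifyTVar' to (+ n)
```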
<div><br>
</div>
<div>It also sounds like your question bumps into
Amdahl's Law a bit.</div>
<div><br>
</div>
<div>If all else fails, stop using STM and find
something more tuned to your problem space.</div>
<div><br>
</div>
<div>Hope this helps,</div>
<div>Chris Allen</div>
<div><br>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Jul 23, 2020
at 9:53 AM YueCompl via Haskell-Cafe <<a href="mailto:haskell-cafe@haskell.org" target="_blank">haskell-cafe@haskell.org</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>Hello Cafe,
<div><br>
</div>
<div>I'm working on an in-memory database. In
Client/Server mode I let each connected client submit
remote procedure calls that run in a dedicated
lightweight thread, modifying TVars in RAM per its
business needs. When many clients connect concurrently
and try to insert new data, if they trigger updates to
a global index (some TVar), the throughput drops
drastically. I reduced the shared state to a simple int
counter in a TVar and got the same symptom. Meanwhile
the parallelism feels okay when there is no hot TVar
conflict, or when M is not much greater than N.</div>
<div><br>
</div>
<div>As an empirical test workload, I have a `+RTS
-N10` server process. It handles 10 concurrent clients
okay, getting ~5x single-thread throughput; but when
handling 20 concurrent clients, each of the 10 CPUs can
only be driven to ~10% utilization, and the throughput
seems even worse than single-threaded. More clients can
even drive it into thrashing without making much
progress.</div>
<div><br>
</div>
<div>After reading [1] I can understand that pure STM
doesn't scale well, and I see it suggests [7] as
attractive and planned future work in that
direction.</div>
<div><br>
</div>
<div>But I can't find libraries or frameworks
addressing large-M-over-small-N scenarios; [1]
experimented with a designated N of parallelism, and
[7] is rather theoretical for my empirical needs.</div>
<div><br>
</div>
<div>Can you direct me to an available library
implementing the methodology proposed in [7], or to
other ways of tackling this problem?</div>
<div><br>
</div>
<div>I think the most difficult part is that a
transaction should commit with global indices (possibly
with unique constraints) atomically updated, and roll
back on any constraint violation; i.e. transactions
have to cover global state such as indices. Other
problems seem easier by comparison.</div>
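<div>For what it's worth, the unique-constraint part can be sketched with plain STM composition, assuming only the `stm` and `containers` packages (the `insertUnique`/`UniqueViolation` names are illustrative): throwing with `throwSTM` inside a transaction discards all of its writes, so inserting a record and updating a global unique index in the same `atomically` block either commit together or roll back together.</div>

```haskell
import Control.Concurrent.STM
import Control.Exception (Exception)
import qualified Data.Map.Strict as M

data UniqueViolation = UniqueViolation deriving Show
instance Exception UniqueViolation

-- Insert into a global unique index; throwing inside the transaction
-- aborts it, discarding every write made so far in the same tx.
insertUnique :: Ord k => TVar (M.Map k v) -> k -> v -> STM ()
insertUnique idxVar k v = do
  idx <- readTVar idxVar
  case M.lookup k idx of
    Just _  -> throwSTM UniqueViolation   -- rolls back the whole tx
    Nothing -> writeTVar idxVar (M.insert k v idx)

-- Convenience wrapper: report the violation as False instead of
-- propagating the exception.
tryInsertUnique :: Ord k => TVar (M.Map k v) -> k -> v -> STM Bool
tryInsertUnique idxVar k v =
  (insertUnique idxVar k v >> pure True)
    `catchSTM` \UniqueViolation -> pure False
```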
<div><br>
</div>
<div>Specifically, [7] states:</div>
<div><br>
</div>
<div>> It must be emphasized that all of the
mechanisms we deploy originate, in one form or
another, in the database literature from the 70s
and 80s. Our contribution is to adapt these
techniques to software transactional memory,
providing more effective solutions to important
STM problems than prior proposals.</div>
<div><br>
</div>
<div>I wonder whether any STM-based library has
packaged those techniques so they can be composed
right away? I don't really want to implement those
mechanisms myself, rebuilding many wheels from
scratch.</div>
<div><br>
</div>
<div>Best regards,</div>
<div>Compl</div>
<div><br>
</div>
<div><br>
</div>
<div>[1] Comparing the performance of concurrent
linked-list implementations in Haskell </div>
<div><a href="https://simonmar.github.io/bib/papers/concurrent-data.pdf" target="_blank">https://simonmar.github.io/bib/papers/concurrent-data.pdf</a></div>
<div><br>
</div>
<div>[7] M. Herlihy and E. Koskinen. Transactional
boosting: a methodology for highly-concurrent
transactional objects. In Proc. of PPoPP ’08,
pages 207–216. ACM Press, 2008.</div>
<div><a href="https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf" target="_blank">https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf</a></div>
<div><br>
</div>
</div>
_______________________________________________<br>
Haskell-Cafe mailing list<br>
To (un)subscribe, modify options or view archives go
to:<br>
<a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" rel="noreferrer" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>
Only members subscribed via the mailman list are
allowed to post.</blockquote>
</div>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr">
<div dir="ltr">
<div>
<div dir="ltr">
<div dir="ltr">Chris Allen<br>
<div><span style="font-size:12.8px">Currently
working on </span><a href="http://haskellbook.com" target="_blank">http://haskellbook.com</a></div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote></div></div>