<div dir="ltr">To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance,<div> changing the size of heap objects can drastically change cache performance and completely different behavior can show up.</div><div><br></div><div>[^1]: <a href="https://en.wikipedia.org/wiki/Perf_(Linux)">https://en.wikipedia.org/wiki/Perf_(Linux)</a></div><div><br></div><div>The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... )</div><div><br></div><div>[^2]: <a href="https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275">https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275</a> <br></div><div><br></div><div>All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence.</div><div><br></div><div>The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line:</div><div><br></div><div><a href="https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123">https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123</a></div><div><br></div><div>Ryan </div><div><br></div><div><br></div></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 24, 2020 at 12:35 PM Compl Yue <<a href="mailto:compl.yue@icloud.com">compl.yue@icloud.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
On Fri, Jul 24, 2020 at 12:35 PM Compl Yue <compl.yue@icloud.com> wrote:

I'm not familiar with profiling GHC yet; I may need more time to get myself proficient with it.
And a bit more detail of my test workload, for diagnosis: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files under a data dir within a shared filesystem, then mmap'ing those data files and filling in the actual array data. So the db server doesn't have much computation to perform, but it puts the data file path into a global index, which at the same time validates its uniqueness. As there are many client processes trying to insert one metadata record concurrently, with my naive implementation the global index's TVar will almost always be in a locked state, held by one client after another, from a queue that never falls empty.
So if `readTVar` does spin while waiting, I'd expect the threads to actually show high CPU utilization, because at any instant of time, all threads except the committing one would be doing that one thing.
And I have something in my code to track STM retries like this:
```haskell
-- blocking wait not expected, track stm retries explicitly
trackSTM :: Int -> IO (Either () a)
trackSTM !rtc = do
  when -- todo increase the threshold of reporting?
      (rtc > 0) $ do
    -- trace out the retries so the end users can be aware of them
    tid <- myThreadId
    trace
        ( "🔙\n"
            <> show callCtx
            <> "🌀 "
            <> show tid
            <> " stm retry #"
            <> show rtc
        )
      $ return ()
  atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case
    Nothing ->
      -- stm failed, do a tracked retry
      trackSTM (rtc + 1)
    Just ... -> ...
```
No such trace message fires during my test, neither in a single-threaded run nor in runs under pressure. I'm sure this tracing mechanism works, as I can see such traces fire in other cases: e.g. posting a TMVar to a TQueue for some other thread to fill, then reading the result out; if these two ops are composed into a single tx, it's of course an infinite retry loop, and a sequence of such messages gets logged with an ever-increasing rtc #.

So I believe no retry has ever been triggered.

What can be going on there?
On 2020/7/24, 11:46 PM, Ryan Yates wrote:
> Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?
Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this, when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while committing, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set, and subsequent commits will wake it up if their write sets overlap its read set.
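For example, a transaction that suspends itself until a counter becomes nonzero could look like this minimal sketch (the function is illustrative, not from any library):

```haskell
import Control.Concurrent.STM

-- Executing `retry` here commits an execution of `retry` in the sense
-- above: the thread is parked on the wakeup list of `tv` and is only
-- rescheduled after some other transaction writes to `tv`.
waitNonZero :: TVar Int -> STM Int
waitNonZero tv = do
  n <- readTVar tv
  if n == 0 then retry else return n
```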
I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS.

Ryan
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Jul 24, 2020 at
11:22 AM Compl Yue <<a href="mailto:compl.yue@icloud.com" target="_blank">compl.yue@icloud.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
Thanks very much for the insightful information, Ryan! I'm glad my suspicion was wrong about the Haskell scheduler:

> The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the Haskell scheduler is not invoked until after locks are released.
So best effort had already been made in GHC and I just need to cooperate better with its design. Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?
Anyway, I see light with better data structures to improve my situation; let me try them and report back. Actually I later changed `TVar (HashMap k v)` to `TVar (HashMap k Int)`, where the `Int` is an array index into `TVar (Vector (TVar (Maybe v)))`, in pursuit of insertion-order-preserving semantics for dict entries (like in Python 3.7+); then it's very hopeful to incorporate stm-containers' Map or ttrie to approach freedom from contention.
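A minimal sketch of that layout (the type and field names are mine, and `updateExisting` covers only the no-insert path):

```haskell
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HM
import Data.Vector (Vector)
import qualified Data.Vector as V

-- Keys map to slot indices; slots hold values in insertion order, each
-- behind its own TVar.
data OrderedDict k v = OrderedDict
  { odIndex :: TVar (HashMap k Int)
  , odSlots :: TVar (Vector (TVar (Maybe v)))
  }

-- Updating an existing entry reads the index and slot vector but writes
-- only the entry's own TVar, so concurrent value updates to different
-- keys do not conflict with each other (index inserts still would).
updateExisting :: (Eq k, Hashable k) => OrderedDict k v -> k -> v -> STM Bool
updateExisting d k v = do
  idx <- readTVar (odIndex d)
  case HM.lookup k idx of
    Nothing -> return False
    Just i -> do
      slots <- readTVar (odSlots d)
      writeTVar (slots V.! i) (Just v)
      return True
```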
Thanks with regards,

Compl

On 2020/7/24, 10:03 PM, Ryan Yates wrote:

Hi Compl,
Having a pool of transaction-processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute than the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other threads, but when it is rescheduled it will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn.
The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the Haskell scheduler is not invoked until after locks are released.
To get good performance from STM you must pay attention to which TVars are involved in a commit. All STM systems work under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar`s into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low-TVar-count case. Right now it is optimized for good cache performance with a handful of TVars.
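As one illustration of pushing `TVar`s inward, here is a sketch of a fixed-width sharded map; the sharding scheme is an assumption of the sketch, not how any particular library does it:

```haskell
import Control.Concurrent.STM
import Data.Hashable (Hashable, hash)
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HM
import Data.Vector (Vector)
import qualified Data.Vector as V

-- One TVar per shard: transactions touching different shards no longer
-- conflict, at the cost of one more indirection per access.
newtype ShardedMap k v = ShardedMap (Vector (TVar (HashMap k v)))

newShardedMap :: Int -> IO (ShardedMap k v)
newShardedMap n = ShardedMap <$> V.replicateM n (newTVarIO HM.empty)

shardFor :: Hashable k => ShardedMap k v -> k -> TVar (HashMap k v)
shardFor (ShardedMap shards) k = shards V.! (hash k `mod` V.length shards)

insert :: (Eq k, Hashable k) => ShardedMap k v -> k -> v -> STM ()
insert m k v = modifyTVar' (shardFor m k) (HM.insert k v)
```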
There is another way to play with performance: moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference.
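A tiny sketch of that last point, with a stand-in `expensive` function: forcing the value before entering `atomically` keeps the evaluation out of the transaction, so a conflict-induced restart does not redo it and the commit window stays short.

```haskell
{-# LANGUAGE BangPatterns #-}
import Control.Concurrent.STM

expensive :: Int -> Int
expensive x = sum [1 .. x] -- stand-in for real work

-- The bang pattern forces the value on the caller's time, outside the
-- transaction; the STM body is then just a cheap write.
storeForced :: TVar Int -> Int -> IO ()
storeForced tv x = do
  let !v = expensive x
  atomically (writeTVar tv v)
```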
Ryan

On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe <haskell-cafe@haskell.org> wrote:
Thanks Chris. I confess I didn't pay enough attention to STM-specialized container libraries so far; I skimmed through the descriptions of stm-containers and ttrie, and feel they would definitely improve my code's performance, provided I limit the server's parallelism to within hardware capabilities. That may be because I'm still prototyping the api and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, though only at low contention (surely there are plenty of CPU cycles to be optimized out in later steps). I model my data after a graph model, so most data, even most indices, are localized to nodes and edges; those can be manipulated without conflict. That's why I assumed I had a low-contention use case from the very beginning, until I found there are still (though minor) needs for global indices to guarantee global uniqueness. I feel confident that stm-containers/ttrie can implement a more scalable global index data structure; thanks for the hint.
So an evident solution comes to my mind now: run the server with a pool of tx-processing threads matching the number of CPU cores, with client RPC requests queued for execution on some thread from the pool. But I'm really fond of the M:N scheduler mechanism, which solves massive/dynamic concurrency so elegantly. I had some good results with Go in this regard, and seeing GHC at par in doing this, I don't want to give up this enjoyable machinery.
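A minimal sketch of such a pool (names are mine, error handling omitted): N workers drain a shared `TQueue` of jobs, so at most N transactions run at once no matter how many clients are connected.

```haskell
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- Start one worker per capability; clients enqueue jobs with
-- `atomically (writeTQueue q job)` instead of running them on
-- their own lightweight thread.
startPool :: IO (TQueue (IO ()))
startPool = do
  q <- newTQueueIO
  n <- getNumCapabilities
  replicateM_ n $ forkIO $ forever $ do
    job <- atomically (readTQueue q)
    job
  return q
```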
But looking at the STM implementation in GHC, it seems written TVars are exclusively locked during commit of a tx. I suspect this is the culprit when a large number M of lightweight threads is scheduled upon a small number N of hardware capabilities: when a lightweight thread yields control during an STM transaction commit, the TVars it locked will stay locked until it's scheduled again (and again) until it can finish the commit. This way, descheduled threads could keep live threads from progressing. I haven't gone into more detail there, but I wonder if there could be some improvement in the GHC RTS to keep an STM-committing thread from being descheduled, though seemingly that might impose more starvation potential; or STM could be improved to have its TVar locks preemptible when the owner trec/thread is in a descheduled state? Neither should be easy, but I'd really love massive lightweight threads doing STM practically well.
Best regards,

Compl

On 2020/7/24, 12:57 AM, Christopher Allen wrote:
<div dir="ltr">It seems like you know how to
run practical tests for tuning thread count
and contention for throughput. Part of the
reason you haven't gotten a super clear
answer is "it depends." You give up fairness
when you use STM instead of MVars or
equivalent structures. That means a long
running transaction might get stampeded by
many small ones invalidating it over and
over. The long-running transaction might
never clear if the small transactions keep
moving the cheese. I mention this because
transaction runtime and size and count all
affect throughput and latency. What might be
ideal for one pattern of work might not be
ideal for another. Optimizing for overall
throughput might make the contention and
fairness problems worse too. I've done
practical tests to optimize this in the
past, both for STM in Haskell and for RDBMS
workloads.
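A small runnable sketch of that stampede (illustrative only): eight small writers keep bumping one hot `TVar` while a single long transaction scans every `TVar`, and each writer commit that lands mid-scan invalidates the scan, so it may restart many times before (if ever) completing.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forM_, forever)

demo :: IO ()
demo = do
  tvs <- mapM newTVarIO (replicate 1000 (0 :: Int))
  -- writers: many tiny transactions on one hot TVar
  forM_ [1 .. 8 :: Int] $ \_ ->
    forkIO $ forever $ atomically $ modifyTVar' (head tvs) (+ 1)
  -- reader: one long transaction over all 1000 TVars; it only commits
  -- if none of them changed while it ran
  total <- atomically (sum <$> mapM readTVar tvs)
  print total
```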
The next step is sometimes figuring out whether you really need a data structure within a single STM container, or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose; a sketch follows the links below.

e.g. https://hackage.haskell.org/package/stm-containers
https://hackage.haskell.org/package/ttrie
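For instance, a sketch of touching two containers in one transaction. This assumes the stm-containers 1.x `StmContainers.Map` API; check the package docs for the exact signatures (note the value-before-key argument order of `insert` there):

```haskell
import Control.Concurrent.STM (STM)
import Data.Hashable (Hashable)
import qualified StmContainers.Map as StmMap

-- Move an entry from one map to the other atomically: both maps are
-- updated, or (on conflict and restart) neither is, since the whole
-- thing runs in a single STM transaction.
moveEntry :: (Eq k, Hashable k)
          => StmMap.Map k v -> StmMap.Map k v -> k -> STM Bool
moveEntry src dst k = do
  mv <- StmMap.lookup k src
  case mv of
    Nothing -> return False
    Just v -> do
      StmMap.delete k src
      StmMap.insert v k dst
      return True
```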
It also sounds like your question bumps into Amdahl's Law a bit.

If all else fails, stop using STM and find something more tuned to your problem space.

Hope this helps,
Chris Allen
On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe <haskell-cafe@haskell.org> wrote:

Hello Cafe,
I'm working on an in-memory database. In Client/Server mode I just let each connected client submit remote procedure calls, each running in its dedicated lightweight thread, modifying TVars in RAM per its business needs. Then, in case many clients connect concurrently and try to insert new data, if they trigger a global index (some TVar) update, the throughput drops drastically. I reduced the shared state to a simple int counter in a TVar and got the same symptom. Meanwhile, the parallelism feels okay when there's no hot TVar conflicting, or M is not much greater than N.
As an empirical test workload, I have a `+RTS -N10` server process; it handles 10 concurrent clients okay, getting ~5x the single-thread throughput; but when handling 20 concurrent clients, each of the 10 CPUs can only be driven to ~10% utilization, and the throughput seems even worse than single-threaded. More clients can even drive it into thrashing without much progress.
I can understand that pure STM doesn't scale well, after reading [1], and I see it suggests [7] as attractive, planned future work in that direction.

But I can't find libraries or frameworks addressing large-M-over-small-N scenarios; [1] experimented with a designated N of parallelism, and [7] is rather theoretical for my empirical needs.
</div>
<div>Can you direct me to some available
library implementing the methodology
proposed in [7] or other ways tackling
this problem?</div>
<div><br>
</div>
I think the most difficult part is that a transaction should commit with global indices (with possibly unique constraints) atomically updated, and roll back on any violation of constraints, i.e. transactions have to cover global states like indices. Other problems seem more trivial than this.
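For the uniqueness requirement in particular, here is a minimal sketch of enforcing a unique constraint inside a transaction (the exception type and names are mine): `throwSTM` discards every write of the transaction, so the index update and any accompanying writes commit or roll back together.

```haskell
import Control.Concurrent.STM
import Control.Exception (Exception)
import Data.Hashable (Hashable)
import Data.HashMap.Strict (HashMap)
import qualified Data.HashMap.Strict as HM

data UniqueViolation = UniqueViolation deriving Show
instance Exception UniqueViolation

-- All writes made in the same `atomically` block as this call are
-- discarded if the key already exists.
insertUnique :: (Eq k, Hashable k) => TVar (HashMap k v) -> k -> v -> STM ()
insertUnique idx k v = do
  m <- readTVar idx
  case HM.lookup k m of
    Just _  -> throwSTM UniqueViolation -- rolls back the whole tx
    Nothing -> writeTVar idx (HM.insert k v m)
```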
Specifically, [7] states:

> It must be emphasized that all of the mechanisms we deploy originate, in one form or another, in the database literature from the 70s and 80s. Our contribution is to adapt these techniques to software transactional memory, providing more effective solutions to important STM problems than prior proposals.
I wonder whether any STM-based library has simplified those techniques so they can be composed right away? I don't really want to implement those mechanisms by myself, rebuilding many wheels from scratch.

Best regards,
Compl
[1] Comparing the performance of concurrent linked-list implementations in Haskell
https://simonmar.github.io/bib/papers/concurrent-data.pdf

[7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In Proc. of PPoPP '08, pages 207–216. ACM Press, 2008.
https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf
--
Chris Allen
Currently working on http://haskellbook.com