<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Shame on me, for I have no experience with `perf` either; I'll
learn these essential tools soon and put them to good use.</p>
<p>It's great to learn how `orElse` actually works; I was
confused about why so few retries were captured, and now I know.
So that little trick should definitely be removed before going to
production, as it does little useful work at excessive cost. I
put it there to help me understand the internal workings of STM,
and now I have even better knowledge ;-)<br>
</p>
<p>I think a debugger will trap every single abort; wouldn't that
be annoying when many aborts occur? If I'd like to count the
number of aborts, ideally accounted per service endpoint, time
period, source module etc., are there some tricks for that?</p>
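<p>For the explicit retries my `trackSTM` does see, I imagine
aggregating them per endpoint with an atomic counter map; a
minimal sketch of what I have in mind (the `Endpoint` label and
`counters` ref are hypothetical, not in my real code yet):</p>
<pre>
import Data.IORef (IORef, atomicModifyIORef')
import qualified Data.Map.Strict as Map

type Endpoint = String  -- hypothetical per-endpoint label

-- bump the explicit-retry count for one endpoint, outside any tx
countRetry :: IORef (Map.Map Endpoint Int) -> Endpoint -> IO ()
countRetry counters ep =
  atomicModifyIORef' counters $ \m ->
    (Map.insertWith (+) ep 1 m, ())
</pre>
<p>Though as you say, the invisible commit-failure aborts are
another matter entirely.</p>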
<p>Thanks with best regards,</p>
<p>Compl</p>
<p><br>
</p>
<div class="moz-cite-prefix">On 2020/7/25 上午2:02, Ryan Yates wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAO27hRpGPqddoB9cnNoEg9phJg8gDexE572fOgscCezTS+grog@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">To be clear, I was trying to refer to Linux `perf`
[^1]. Sampling based profiling can do a good job with
concurrent and parallel programs where other methods are
problematic. For instance,
<div> changing the size of heap objects can drastically change
cache performance and completely different behavior can show
up.</div>
<div><br>
</div>
<div>[^1]: <a href="https://en.wikipedia.org/wiki/Perf_(Linux)"
moz-do-not-send="true">https://en.wikipedia.org/wiki/Perf_(Linux)</a></div>
<div><br>
</div>
<div>The spinning in `readTVar` should always be very short and
it typically shows up as intensive CPU use, though it may not
be high energy use with `pause` in the loop on x86 (looks like
we don't have it [^2], I thought we did, but maybe that was
only in some of my code... )</div>
<div><br>
</div>
<div>[^2]: <a
href="https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275"
moz-do-not-send="true">https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275</a> <br>
</div>
<div><br>
</div>
<div>All that to say, I doubt that you are spending much time
spinning (but it would certainly be interesting to know if you
are! You would see `perf` attribute a large amount of time to
`read_current_value`). The amount of code to execute for
commit (the time when locks are held) is always much shorter
than it takes to execute the transaction body. As you add
more conflicting threads this gets worse of course as commits
sequence.</div>
<div><br>
</div>
<div>The code you have will count commits of transactions that
executed `retry`.  Note that `retry` is a user level idea, that is, you
are counting user level *explicit* retries. This is different
from a transaction failing to commit and starting again.
These are invisible to the user. Also using your trace will
convert `retry` from the efficient wake on write
implementation, to an active retry that will always attempt
again. We don't have cheap logging of transaction aborts in
GHC, but I have built such logging in my work. You can
observe these aborts with a debugger by looking for execution
of this line:</div>
<div><br>
</div>
<div><a
href="https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123"
moz-do-not-send="true">https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123</a></div>
<div><br>
</div>
<div>Ryan </div>
<div><br>
</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Jul 24, 2020 at 12:35
PM Compl Yue <<a href="mailto:compl.yue@icloud.com"
moz-do-not-send="true">compl.yue@icloud.com</a>> wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>I'm not familiar with profiling GHC yet; I may need more
time to get proficient with it.</p>
<p>And a bit more detail on my test workload, for diagnosis: the
db clients are Python processes from a cluster of worker nodes,
consulting the db server to register paths for data files under
a data dir within a shared filesystem, then mmap'ing those data
files and filling in the actual array data. So the db server
doesn't have much computation to perform, but it puts each data
file path into a global index, which at the same time validates
its uniqueness. As there are many client processes trying to
insert one metadata record concurrently, with my naive
implementation the global index's TVar will almost always be
held locked by one client after another, fed from a queue that
never falls empty.</p>
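<p>Roughly, the registration tx looks like this (a simplified
sketch of my real code; names are illustrative):</p>
<pre>
import Control.Concurrent.STM
import qualified Data.HashMap.Strict as HM

-- one global TVar holds the whole index, so every registration
-- conflicts with every other access to it
registerPath :: TVar (HM.HashMap FilePath Int) -> FilePath -> STM Bool
registerPath idx path = do
  m <- readTVar idx
  case HM.lookup path m of
    Just _  -> return False  -- uniqueness violated
    Nothing -> do
      writeTVar idx $! HM.insert path (HM.size m) m
      return True
</pre>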
<p>So if `readTVar` does spin while waiting, I'd expect the
threads to actually show high CPU utilization, because at any
instant of time, all threads except the committing one would be
doing exactly that.<br>
</p>
<p>And I have something in my code to track STM retries, like
this:</p>
<pre>
  -- blocking wait not expected, track stm retries explicitly
  trackSTM :: Int -> IO (Either () a)
  trackSTM !rtc = do
    when -- todo increase the threshold of reporting?
        (rtc > 0) $ do
      -- trace out the retries so the end users can be aware of them
      tid <- myThreadId
      trace
          ( "🔙\n"
              <> show callCtx
              <> "🌀 "
              <> show tid
              <> " stm retry #"
              <> show rtc
          )
        $ return ()
    atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case
      Nothing -> -- stm failed, do a tracked retry
        trackSTM (rtc + 1)
      Just ... -> ...
</pre>
<p>No such trace message fires during my test, neither in
single-threaded runs nor in runs under pressure. I'm sure this
tracing mechanism works, as I can see such traces fire in other
cases: e.g. if posting a TMVar to a TQueue for some other thread
to fill, then reading the result out, are composed into a single
tx, then of course it's an infinite retry loop, and a sequence
of such messages gets logged with an ever-increasing rtc #.<br>
</p>
<p>So I believe no retry has ever been triggered.</p>
<p>What can be going on there?</p>
<p><br>
</p>
<div>On 2020/7/24 下午11:46, Ryan Yates wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div dir="ltr">
<div>> Then to explain the low CPU utilization
(~10%), am I right to understand it as that upon
reading a TVar locked by another committing tx, a
lightweight thread will put itself into `waiting
STM` and descheduled state, so the CPUs can only
stay idle as not so many threads are willing to
proceed?<br>
</div>
<div><br>
</div>
<div>Since the commit happens in finite steps, the
expectation is that the lock will be released very
soon. Given this when the body of a transaction
executes `readTVar` it spins (active CPU!) until the
`TVar` is observed unlocked. If a lock is observed
while committing, it immediately starts the
transaction again from the beginning. To get the
behavior of suspending a transaction you have to
successfully commit a transaction that executed
`retry`. Then the transaction is put on the wakeup
lists of its read set and subsequent commits will
wake it up if its write set overlaps.</div>
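<div><br>
</div>
<div>For instance (a sketch): the waiter's tx below executes
`retry` via `check`, commits a wait entry on `v`'s wakeup
list, and the thread is descheduled until the writer's
commit wakes it:</div>
<pre>
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM

main :: IO ()
main = do
  v <- newTVarIO (0 :: Int)
  _ <- forkIO $ do
    threadDelay 1000000          -- a second later...
    atomically $ writeTVar v 42  -- this commit wakes the waiter
  n <- atomically $ do
    x <- readTVar v
    check (x /= 0)               -- executes `retry` while x == 0
    return x
  print n                        -- prints 42
</pre>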
<div><br>
</div>
<div>I don't think any of these things would explain
low CPU utilization. You could try running with
`perf` and see if lots of time is spent in some
recognizable part of the RTS.</div>
<div><br>
</div>
<div>Ryan</div>
</div>
<div><br>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Fri, Jul 24, 2020
at 11:22 AM Compl Yue <<a
href="mailto:compl.yue@icloud.com" target="_blank"
moz-do-not-send="true">compl.yue@icloud.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px
0px 0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>Thanks very much for the insightful information
Ryan! I'm glad my suspect was wrong about the
Haskell scheduler:<br>
</p>
<p>> The Haskell capability that is
committing a transaction will not yield to
another Haskell thread while it is doing the
commit. The OS thread may be preempted, but
once commit starts the haskell scheduler is not
invoked until after locks are released.</p>
<div>So best effort had already been made in GHC
and I just need to cooperate better with its
design. Then to explain the low CPU utilization
(~10%), am I right to understand it as that upon
reading a TVar locked by another committing tx,
a lightweight thread will put itself into
`waiting STM` and descheduled state, so the CPUs
can only stay idle as not so many threads are
willing to proceed?<br>
</div>
<div><br>
</div>
<div>Anyway, I see light with better data
structures to improve my situation, let me try
them and report back. Actually I later changed
`TVar (HashMap k v)` to `TVar (HashMap k
Int)`, where the `Int` is an array index into
`TVar (Vector (TVar (Maybe v)))`, in pursuit of
insertion-order-preserving semantics for dict
entries (like in Python 3.7+); so it's
very hopeful to incorporate stm-containers' Map
or ttrie to approach freedom from contention.<br>
</div>
</div>
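<div><br>
</div>
<div>In sketch form (simplified; `Dict` and the field names
are just illustrative):</div>
<pre>
import Control.Concurrent.STM
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- keys map to slots in an append-only vector, so iterating in
-- slot order preserves insertion order (like Python 3.7+ dicts)
data Dict k v = Dict
  { dictIndex :: TVar (HM.HashMap k Int)
  , dictSlots :: TVar (V.Vector (TVar (Maybe v)))
  }
</pre>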
<p>Thanks with regards,</p>
<p>Compl</p>
<p><br>
</p>
<div>On 2020/7/24 下午10:03, Ryan Yates wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Hi Compl,</div>
<div><br>
</div>
<div>Having a pool of transaction processing
threads can be helpful in a certain way. If
the body of the transaction takes more time
to execute than the Haskell thread is
allowed and it yields, the suspended thread
won't get in the way of other threads, but
when it is rescheduled, it will have a low
probability of success.  Even worse, it will
probably not discover that it is doomed to
failure until commit time. If transactions
are more likely to reach commit without
yielding, they are more likely to succeed.
If the transactions are not conflicting, it
doesn't make much difference other than
cache churn.</div>
<div><br>
</div>
<div>The Haskell capability that is
committing a transaction will not yield to
another Haskell thread while it is doing the
commit. The OS thread may be preempted, but
once commit starts the haskell scheduler is
not invoked until after locks are released.</div>
<div><br>
</div>
<div>To get good performance from STM you must
pay attention to what TVars are involved in
a commit. All STM systems are working under
the assumption of low contention, so you
want to minimize "false" conflicts
(conflicts that are not essential to the
computation). Something like `TVar
(HashMap k v)` will work pretty well for a
low thread count, but every transaction that
writes to that structure will be in conflict
with every other transaction that accesses
it. Pushing the `TVar` into the nodes of
the structure reduces the possibilities for
conflict, while increasing the amount of
bookkeeping STM has to do. I would like to
reduce the cost of that bookkeeping using
better structures, but we need to do so
without harming performance in the low TVar
count case. Right now it is optimized for
good cache performance with a handful of
TVars.</div>
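<div><br>
</div>
<div>As a sketch of that trade-off (illustrative types only):</div>
<pre>
import Control.Concurrent.STM
import qualified Data.HashMap.Strict as HM

-- Coarse: one TVar guards the whole map; any write conflicts
-- with every transaction that touches the structure.
type CoarseMap k v = TVar (HM.HashMap k v)

-- Finer: updates to an existing key touch only that key's TVar,
-- conflicting only with transactions on the same key. Inserts
-- and deletes still conflict on the spine; stm-containers goes
-- further and pushes TVars into the nodes of the structure.
type FinerMap k v = TVar (HM.HashMap k (TVar v))
</pre>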
<div><br>
</div>
<div>There is another way to play with
performance by moving work into and out of
the transaction body. A transaction body
that executes quickly will reach commit
faster. But it may be delaying work that
moves into another transaction. Forcing
values at the right time can make a big
difference.</div>
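<div><br>
</div>
<div>For example (a sketch): forcing an expensive value before
entering the transaction keeps the body, and thus the window
for conflict, short:</div>
<pre>
import Control.Concurrent.STM
import Control.Exception (evaluate)

updateWith :: TVar Int -> Int -> IO ()
updateWith t x = do
  v <- evaluate (expensive x)   -- work forced outside the tx
  atomically $ writeTVar t v    -- tx body stays tiny
  where expensive = (* 2)       -- stand-in for real work
</pre>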
<div><br>
</div>
<div>Ryan</div>
<div dir="ltr"><br>
</div>
<div dir="ltr">On Fri, Jul 24, 2020 at 2:14 AM
Compl Yue via Haskell-Cafe <<a
href="mailto:haskell-cafe@haskell.org"
target="_blank" moz-do-not-send="true">haskell-cafe@haskell.org</a>>
wrote:<br>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>
<p>Thanks Chris, I confess I haven't paid
enough attention to STM-specialized
container libraries so far. I skimmed
through the descriptions of
stm-containers and ttrie, and feel
they would definitely improve my
code's performance in case I limit the
server's parallelism within hardware
capabilities. That may be because I'm
still prototyping the api and
infrastructure for correctness, so
even `TVar (HashMap k v)` performs
okay for me at the moment, though only at
low contention (surely there're plenty
of CPU cycles to be optimized out in
next steps). I model my data after a
graph model, so most data, even most
indices, are localized to nodes and
edges and can be manipulated
without conflict; that's why I assumed
I had a low-contention use case from
the very beginning - until I found
there are still (though minor) needs
for global indices to guarantee global
uniqueness. I feel confident that
stm-containers/ttrie can implement a
more scalable global index data
structure, thanks for hinting me.</p>
<p>So an evident solution comes to my
mind now: run the server with a
pool of tx-processing threads,
matching the number of CPU cores; client
RPC requests then get queued to be
executed on some thread from the pool.
But I'm really fond of the mechanism
of the M:N scheduler, which solves
massive/dynamic concurrency so
elegantly. I had some good results with
Go in this regard, and see GHC at par
in doing this; I don't want to give up
this enjoyable machinery.</p>
<p>But looking at the stm implementation
in GHC, it seems written TVars are
exclusively locked during commit of a
tx; I suspect this is the culprit when
there're large M lightweight threads
scheduled upon small N hardware
capabilities: when a
lightweight thread yields control
during an stm transaction commit, the
TVars it locked will stay so until
it's scheduled again (and again) till
it can finish the commit. This way,
descheduled threads could hold live
threads back from progressing. I haven't
gone into more details there, but
wonder if there can be some
improvement for GHC RTS to keep an stm
committing thread from being descheduled,
though seemingly that may impose more
starvation potential; or stm could be
improved to have its TVar locks
preemptible when the owner trec/thread
is in a descheduled state? Neither
would be easy, but I'd really love to see
massive lightweight threads doing STM
practically well.<br>
</p>
<p>Best regards,</p>
<p>Compl</p>
<p><br>
</p>
<div>On 2020/7/24 上午12:57, Christopher
Allen wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">It seems like you know
how to run practical tests for
tuning thread count and contention
for throughput. Part of the reason
you haven't gotten a super clear
answer is "it depends." You give up
fairness when you use STM instead of
MVars or equivalent structures. That
means a long running transaction
might get stampeded by many small
ones invalidating it over and over.
The long-running transaction might
never clear if the small
transactions keep moving the cheese.
I mention this because transaction
runtime and size and count all
affect throughput and latency. What
might be ideal for one pattern of
work might not be ideal for another.
Optimizing for overall throughput
might make the contention and
fairness problems worse too. I've
done practical tests to optimize
this in the past, both for STM in
Haskell and for RDBMS workloads.
<div><br>
</div>
<div>The next step is sometimes
figuring out whether you really
need a data structure within a
single STM container or if perhaps
you can break up your STM
container boundaries into zones or
regions that roughly map onto
update boundaries. That should
make the transactions churn less.
On the outside chance you do need
to touch more than one container
in a transaction, well, they
compose.
<div><br>
</div>
<div>e.g. <a
href="https://hackage.haskell.org/package/stm-containers"
target="_blank"
moz-do-not-send="true">https://hackage.haskell.org/package/stm-containers</a></div>
<div><a
href="https://hackage.haskell.org/package/ttrie"
target="_blank"
moz-do-not-send="true">https://hackage.haskell.org/package/ttrie</a><br>
</div>
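<div><br>
</div>
<div>e.g. a sketch of the zoning idea (illustrative, not a
library API): shard one global index into fixed regions by
key hash, so transactions touching different zones never
conflict:</div>
<pre>
import Control.Concurrent.STM
import Data.Hashable (Hashable, hash)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

newtype Zoned k v = Zoned (V.Vector (TVar (HM.HashMap k v)))

newZoned :: Int -> IO (Zoned k v)
newZoned n = Zoned <$> V.replicateM n (newTVarIO HM.empty)

zoneFor :: Hashable k => Zoned k v -> k -> TVar (HM.HashMap k v)
zoneFor (Zoned zs) k = zs V.! (hash k `mod` V.length zs)

insert :: (Eq k, Hashable k) => Zoned k v -> k -> v -> STM ()
insert z k v = modifyTVar' (zoneFor z k) (HM.insert k v)
</pre>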
<div><br>
</div>
<div>It also sounds like
your question bumps into
Amdahl's Law a bit.</div>
<div><br>
</div>
<div>All else fails, stop using
STM and find something more
tuned to your problem space.</div>
<div><br>
</div>
<div>Hope this helps,</div>
<div>Chris Allen</div>
<div><br>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On
Thu, Jul 23, 2020 at 9:53 AM
YueCompl via Haskell-Cafe <<a
href="mailto:haskell-cafe@haskell.org"
target="_blank"
moz-do-not-send="true">haskell-cafe@haskell.org</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote"
style="margin:0px 0px 0px
0.8ex;border-left:1px solid
rgb(204,204,204);padding-left:1ex">
<div>Hello Cafe,
<div><br>
</div>
<div>I'm working on an in-memory
database. In Client/Server
mode I just let each connected
client submit remote procedure
calls running in its dedicated
lightweight thread, modifying
TVars in RAM per its business
needs. Then, in case many
clients connect concurrently
and try to insert new data,
if they trigger a global
index (some TVar) update, the
throughput drops
drastically. I reduced the
shared state to a simple int
counter in a TVar and got the
same symptom. Meanwhile the
parallelism feels okay when
there's no hot TVar
conflicting, or M is not
much greater than N.</div>
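<div><br>
</div>
<div>The reduced test is essentially this (a sketch): every
client tx bumps one shared counter, so all commits serialize
on a single TVar:</div>
<pre>
import Control.Concurrent.STM

bumpCounter :: TVar Int -> STM ()
bumpCounter c = modifyTVar' c (+ 1)
</pre>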
<div><br>
</div>
<div>As an empirical test
workload, I have a `+RTS -N10`
server process; it handles 10
concurrent clients okay, getting
~5x single-thread
throughput; but in handling 20
concurrent clients, each of
the 10 CPUs can only be driven
to ~10% utilization, and the
throughput seems even worse
than single-threaded. More
clients can even drive it into
thrashing without much
progress.</div>
<div><br>
</div>
<div> I can understand that pure
STM doesn't scale well after
reading [1], and I see it
suggested that [7] is attractive
and planned future work in
that direction.</div>
<div><br>
</div>
<div>But I can't find any
libraries or frameworks
addressing large-M-over-small-N
scenarios; [1] experimented
with designated N parallelism,
and [7] is rather theoretical
for my empirical needs.</div>
<div><br>
</div>
<div>Can you direct me to some
available library implementing
the methodology proposed in
[7] or other ways tackling
this problem?</div>
<div><br>
</div>
<div>I think the most difficult
one is that a transaction
should commit with global
indices (with possibly unique
constraints) atomically
updated, and roll back on any
violation of constraints, i.e.
transactions have to cover
global states like indices.
Other problems seem more
trivial than this.</div>
<div><br>
</div>
<div>Specifically, [7] states:</div>
<div><br>
</div>
<div>> It must be emphasized
that all of the mechanisms we
deploy originate, in one form
or another, in the database
literature from the 70s and
80s. Our contribution is to
adapt these techniques to
software transactional memory,
providing more effective
solutions to important STM
problems than prior proposals.</div>
<div><br>
</div>
<div>I wonder if any STM-based
library has simplified those
techniques to be composed
right away? I don't really
want to implement those
mechanisms by myself,
rebuilding many wheels from
scratch.</div>
<div><br>
</div>
<div>Best regards,</div>
<div>Compl</div>
<div><br>
</div>
<div><br>
</div>
<div>[1] Comparing the
performance of concurrent
linked-list implementations in
Haskell </div>
<div><a
href="https://simonmar.github.io/bib/papers/concurrent-data.pdf"
target="_blank"
moz-do-not-send="true">https://simonmar.github.io/bib/papers/concurrent-data.pdf</a></div>
<div><br>
</div>
<div>[7] M. Herlihy and E.
Koskinen. Transactional
boosting: a methodology for
highly-concurrent
transactional objects. In
Proc. of PPoPP ’08, pages
207–216. ACM Press, 2008.</div>
<div><a
href="https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf"
target="_blank"
moz-do-not-send="true">https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf</a></div>
<div><br>
</div>
</div>
</blockquote>
</div>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr">
<div dir="ltr">
<div>
<div dir="ltr">
<div dir="ltr">Chris Allen<br>
<div><span
style="font-size:12.8px">Currently
working on </span><a
href="http://haskellbook.com"
target="_blank"
moz-do-not-send="true">http://haskellbook.com</a></div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
</body>
</html>