<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Thanks Ryan, and I'm honored to get Simon's attention.</p>
<p>I did have some worry about the tskiplist package: its github
repository seems to have been withdrawn. I emailed the maintainer, Peter
Robinson, recently but have received no response so far. What
particularly worries me is that the first sentence of the Readme
changed from 1.0.0 to 1.0.1 (which is the current version) as follows:
</p>
<p>> - <span style="color: rgb(200, 195, 188); font-family:
"PT Sans", -apple-system, BlinkMacSystemFont,
"Segoe UI", Roboto, Oxygen-Sans, Cantarell,
"Helvetica Neue", sans-serif; font-size: 17px;
font-style: normal; font-variant-ligatures: normal;
font-variant-caps: normal; font-weight: 400; letter-spacing:
0.024px; orphans: 2; text-align: left; text-indent: 0px;
text-transform: none; white-space: normal; widows: 2;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(25, 27, 28); text-decoration-style:
initial; text-decoration-color: initial; display: inline
!important; float: none;">This package provides an
implementation of a skip list in STM.</span></p>
<p>>+ <span style="color: rgb(200, 195, 188); font-family:
"PT Sans", -apple-system, BlinkMacSystemFont,
"Segoe UI", Roboto, Oxygen-Sans, Cantarell,
"Helvetica Neue", sans-serif; font-size: 17px;
font-style: normal; font-variant-ligatures: normal;
font-variant-caps: normal; font-weight: 400; letter-spacing:
0.024px; orphans: 2; text-align: left; text-indent: 0px;
text-transform: none; white-space: normal; widows: 2;
word-spacing: 0px; -webkit-text-stroke-width: 0px;
background-color: rgb(25, 27, 28); text-decoration-style:
initial; text-decoration-color: initial; display: inline
!important; float: none;">This package provides a
proof-of-concept implementation of a skip list in STM</span></p>
<p>This has to mean something, but I can't figure out what yet.<br>
</p>
<p>Dear Peter Robinson, I hope you can see this message and join
the discussion. <br>
</p>
<p>Despite that, I don't think the overhead of TVars themselves is the
most serious issue in my situation: before GC engages, just as many
TVars are being allocated and updated without business progress
getting stuck. And now I realize that what pressures GC in my
situation is not only the large number of pointers (TVars), but also
that they form many circular structures, which might be a nightmare
for a GC. As I model my data after a graph model, in my test workload
there are many FeatureSet instances, each being an entity/node
object; then there are many Feature instances per FeatureSet object,
each Feature instance being a unary relationship/edge object with a
reference attribute (via a TVar) pointing to the FeatureSet object it
belongs to. Circular structures form because I maintain an index at
each FeatureSet object, sorted by weight etc., but ultimately pointing
back (via TVars) to all Feature objects belonging to the set.<br>
</p>
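<p>To make that shape concrete, here is a minimal sketch (illustrative
only, not my real code; the names and the use of Data.Map for the
weight index are simplifications) of the kind of cyclic TVar structure
I mean:<br>
</p>
<pre>import Control.Concurrent.STM
import qualified Data.Map.Strict as Map

-- An entity/node object; its index points back (via TVars) to its Features.
data FeatureSet = FeatureSet
  { fsName  :: String
  , fsIndex :: TVar (Map.Map Double (TVar Feature))  -- sorted by weight
  }

-- A unary relationship/edge object; points back to the owning FeatureSet.
data Feature = Feature
  { featWeight :: Double
  , featOwner  :: TVar FeatureSet
  }

-- Inserting a Feature creates both directions of the cycle.
addFeature :: TVar FeatureSet -> Double -> STM (TVar Feature)
addFeature fsVar w = do
  fv <- newTVar (Feature w fsVar)
  fs <- readTVar fsVar
  modifyTVar' (fsIndex fs) (Map.insert w fv)
  pure fv
</pre>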
<p>I'm still curious why the new non-moving GC in 8.10.1 still doesn't
get business progressing obviously better in my situation. I tested it
on my Mac yesterday, but there I don't know how to see how CPU time is
distributed over threads within a process; I'll further test it
with some Linux boxes to try to understand it better.<br>
</p>
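<p>(For reference, the way I plan to inspect this: an eventlog viewed
in ThreadScope should show per-capability activity and GC pauses, and
`+RTS -s` gives the mutator vs GC time split. The flags below are just
the standard ones as I understand them, not yet verified on my
machines:)<br>
</p>
<pre># build with the threaded RTS and eventlog support
ghc -threaded -eventlog -rtsopts Main.hs
# run with the non-moving GC, an RTS summary, and an eventlog for ThreadScope
./Main +RTS -N10 -xn -s -l
</pre>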
<p>Best regards,</p>
<p>Compl</p>
<p><br>
</p>
<div class="moz-cite-prefix">On 2020/7/30 上午10:05, Ryan Yates wrote:<br>
</div>
<blockquote type="cite"
cite="mid:CAO27hRp3vUoq=9EDGLa+Pw9_CQUsdsh3LoarFmq6Xe1CgTJXjQ@mail.gmail.com">
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<div dir="ltr">Simon, I certainly want to help get to the bottom
of the performance issue at hand :D. Sorry if my reply was
misleading. The constant factor overhead of pushing `TVar`s
into the internal structure may be pressuring unacceptable GC
behavior to happen sooner. My impression was that given the
same size problem performance loss shifted from synchronization
to GC.
<div><br>
</div>
<div>Compl, I'm not aware of mutable heap objects being
problematic in particular for GHC's GC. There are lots of
special cases to handle them of course. I have successfully
written Haskell programs that get good performance from the GC
with the dominant fraction of heap objects being mutable. I
looked a little more at `TSkipList` and one tricky aspect of
an STM based skip list is how to manage randomness. In
`TSkipList`'s code there is the following comment:</div>
<div>
<div><br>
</div>
</div>
<div>
<pre style="color:rgb(0,0,0)"><span class="gmail-hs-comment" style="color:rgb(138,138,138)">-- | Returns a randomly chosen level. Used for inserting new elements. /O(1)./</span>
<a name="line-98" moz-do-not-send="true"></a><span class="gmail-hs-comment" style="color:rgb(138,138,138)">-- For performance reasons, this function uses 'unsafePerformIO' to access the</span>
<a name="line-99" moz-do-not-send="true"></a><span class="gmail-hs-comment" style="color:rgb(138,138,138)">-- random number generator. (It would be possible to store the random number</span>
<a name="line-100" moz-do-not-send="true"></a><span class="gmail-hs-comment" style="color:rgb(138,138,138)">-- generator in a 'TVar' and thus be able to access it safely from within the</span>
<a name="line-101" moz-do-not-send="true"></a><span class="gmail-hs-comment" style="color:rgb(138,138,138)">-- STM monad. This, however, might cause high contention among threads.)</span></pre>
<pre style="color:rgb(0,0,0)"><span class="gmail-hs-comment" style="color:rgb(138,138,138)"><pre style="color:rgb(0,0,0)"><span class="gmail-hs-identifier" style="color:rgb(7,54,66)">chooseLevel</span> <span class="gmail-hs-glyph" style="color:rgb(220,50,47)">::</span> <a href="http://hackage.haskell.org/package/tskiplist-1.0.1/docs/src/Control.Concurrent.STM.TSkipList.Internal.html#TSkipList" class="gmail-" style="text-decoration-line:none;border-bottom:1px solid rgb(238,232,213)" moz-do-not-send="true"><span class="gmail-hs-identifier gmail-hs-type" style="color:rgb(95,95,175)">TSkipList</span></a> <a href="http://hackage.haskell.org/package/tskiplist-1.0.1/docs/src/Control.Concurrent.STM.TSkipList.Internal.html#local-6989586621679028835" class="gmail-" style="text-decoration-line:none;border-bottom:1px solid rgb(238,232,213)" moz-do-not-send="true"><span class="gmail-hs-identifier gmail-hs-type" style="color:rgb(95,95,175)">k</span></a> <a href="http://hackage.haskell.org/package/tskiplist-1.0.1/docs/src/Control.Concurrent.STM.TSkipList.Internal.html#local-6989586621679028836" style="text-decoration-line:none;border-bottom:1px solid rgb(238,232,213)" moz-do-not-send="true"><span class="gmail-hs-identifier gmail-hs-type" style="color:rgb(95,95,175)">a</span></a> <span class="gmail-hs-glyph" style="color:rgb(220,50,47)">-></span> <span class="gmail-hs-identifier gmail-hs-type" style="color:rgb(95,95,175)">Int</span></pre></span></pre>
</div>
<div><br>
</div>
<div>This level is chosen on insertion to determine the height
of the node. When writing my own STM skiplist I found that
the details in unsafely accessing randomness had a significant
impact on performance. We went with an unboxed array of PCG
states that had an entry for each capability giving constant
memory overhead in the number of capabilities. `TSkipList`
uses `newStdGen` which involves allocation and
synchronization.</div>
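<div><br>
</div>
<div>For illustration, a rough sketch of that per-capability idea
(illustrative only: it uses a plain `StdGen` per capability instead of
unboxed PCG states, and ignores the benign races on each slot) could
look like:</div>
<pre>import Control.Concurrent (getNumCapabilities, myThreadId, threadCapability)
import Data.IORef
import qualified Data.Vector as V
import System.IO.Unsafe (unsafePerformIO)
import System.Random (StdGen, mkStdGen, randomR)

-- One PRNG state per capability, built once at first use.
perCapGens :: V.Vector (IORef StdGen)
perCapGens = unsafePerformIO $ do
  n <- getNumCapabilities
  V.generateM n (newIORef . mkStdGen)
{-# NOINLINE perCapGens #-}

-- Choose a geometric level using the current capability's generator,
-- so threads on different capabilities never share a PRNG state.
chooseLevelIO :: Int -> IO Int
chooseLevelIO maxLevel = do
  (cap, _) <- myThreadId >>= threadCapability
  let ref = perCapGens V.! (cap `mod` V.length perCapGens)
  gen0 <- readIORef ref
  let go lvl g
        | lvl >= maxLevel = (lvl, g)
        | otherwise =
            let (coin, g') = randomR (0, 1 :: Int) g
             in if coin == 1 then go (lvl + 1) g' else (lvl, g')
      (level, gen') = go 1 gen0
  writeIORef ref gen'
  pure level
</pre>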
<div><br>
</div>
<div>Again, I'm not pointing this out to say that this is the
entirety of the issue you are encountering; rather, I do think
the `TSkipList` library could be improved to allocate much
less. Others can speak to how to tell where the time is going
in GC (my knowledge of this is likely out of date).</div>
<div><br>
</div>
<div>Ryan</div>
<div><br>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Wed, Jul 29, 2020 at 4:57
PM Simon Peyton Jones <<a
href="mailto:simonpj@microsoft.com" moz-do-not-send="true">simonpj@microsoft.com</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px
0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div lang="EN-GB">
<div class="gmail-m_-3608968087787866939WordSection1">
<p class="MsoNormal"><span>Compl’s problem is (apparently)
that execution becomes dominated by GC. That doesn’t
sound like a constant-factor overhead from TVars, no
matter how efficient (or otherwise) they are. It
sounds more like a space leak to me; perhaps you need
some strict evaluation or something.</span></p>
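<p class="MsoNormal"><span> </span></p>
<p class="MsoNormal"><span>For concreteness, "some strict evaluation"
at the point of writing into a TVar can be as small as this sketch
(illustrative only, assuming an NFData instance for the stored
value):</span></p>
<pre>import Control.Concurrent.STM
import Control.DeepSeq (NFData, force)

-- Fully evaluate a value before storing it, so thunks can't pile up
-- behind the TVar and leak space.
writeTVarDeep :: NFData a => TVar a -> a -> STM ()
writeTVarDeep var x = writeTVar var $! force x</pre>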
<p class="MsoNormal"><span> </span></p>
<p class="MsoNormal"><span>My point is only: before
re-engineering STM it would make sense to get a much
more detailed insight into what is actually happening,
and where the space and time is going. We have tools
to do this (heap profiling, Threadscope, …) but I know
they need some skill and insight to use well. But we
don’t have nearly enough insight to draw meaningful
conclusions yet.</span></p>
<p class="MsoNormal"><span> </span></p>
<p class="MsoNormal"><span>Maybe someone with experience
of performance debugging might feel able to help
Compl?</span></p>
<p class="MsoNormal"><span> </span></p>
<p class="MsoNormal"><span>Simon</span></p>
<p class="MsoNormal"><span> </span></p>
<div
style="border-top:none;border-right:none;border-bottom:none;border-left:1.5pt
solid blue;padding:0cm 0cm 0cm 4pt">
<div>
<div
style="border-right:none;border-bottom:none;border-left:none;border-top:1pt
solid rgb(225,225,225);padding:3pt 0cm 0cm">
<p class="MsoNormal"><b><span lang="EN-US">From:</span></b><span
lang="EN-US"> Haskell-Cafe <<a
href="mailto:haskell-cafe-bounces@haskell.org"
target="_blank" moz-do-not-send="true">haskell-cafe-bounces@haskell.org</a>>
<b>On Behalf Of </b>Ryan Yates<br>
<b>Sent:</b> 29 July 2020 20:41<br>
<b>To:</b> YueCompl <<a
href="mailto:compl.yue@icloud.com"
target="_blank" moz-do-not-send="true">compl.yue@icloud.com</a>><br>
<b>Cc:</b> Haskell Cafe <<a
href="mailto:haskell-cafe@haskell.org"
target="_blank" moz-do-not-send="true">haskell-cafe@haskell.org</a>><br>
<b>Subject:</b> Re: [Haskell-cafe] STM friendly
TreeMap (or similar with range scan api) ? WAS:
Best ways to achieve throughput, for large M:N
ratio of STM threads, with hot TVar updates?</span></p>
</div>
</div>
<p class="MsoNormal"> </p>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
Hi Compl,</p>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
There is a lot of overhead with TVars. My thesis
work addresses this by incorporating mutable
constructor fields with STM. I would like to get
all that into GHC as soon as I can :D. I haven't
looked closely at the `tskiplist` package, I'll
take a look and see if I see any potential
issues. There was some recent work on concurrent
B-tree that may be interesting to try.</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
Ryan</p>
</div>
</div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
<div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
On Wed, Jul 29, 2020 at 10:24 AM YueCompl <<a
href="mailto:compl.yue@icloud.com"
target="_blank" moz-do-not-send="true">compl.yue@icloud.com</a>>
wrote:</p>
</div>
<blockquote
style="border-top:none;border-right:none;border-bottom:none;border-left:1pt
solid rgb(204,204,204);padding:0cm 0cm 0cm
6pt;margin-left:4.8pt;margin-right:0cm">
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
Hi Cafe and Ryan,</p>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
I tried Map/Set from stm-containers and
TSkipList (added range scan api against its
internal data structure) from <a
href="https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fhackage.haskell.org%2Fpackage%2Ftskiplist&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838761589&sdata=ZOvJVBqJgdGqx2k%2F49fhZeTYkWAd4GRY%2B8ZxH7cyEkI%3D&reserved=0"
target="_blank" moz-do-not-send="true">http://hackage.haskell.org/package/tskiplist</a> ,
and with them I've got quite an improvement in
scalability under concurrency. </p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
But unfortunately I then hit another wall, at
single-thread scalability over working-memory
size. I suspect it's because massively more
TVars (those being pointers per se) are
introduced by those "contention-free" data
structures; they need to mutate separate
pointers concurrently to avoid contention
anyway, but such a pointer-intensive heap seems
to impose extraordinary pressure on GHC's
garbage collector, so that GC will dominate CPU
utilization with poor business progress. </p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
For example, in my test I use `+RTS -H2g` for
the Haskell server process, so GC is not
triggered until after a while, then spin off 3
Python clients to insert new records
concurrently. In the first stage each Python
process happily takes ~90% CPU filling
(through local mmap) the arrays allocated from
the server, and logs of success scroll quickly,
while the server process utilizes only 30~40%
CPU to serve those 3 clients (merely inserting
metadata records into unique indices). Then the
client processes' CPU utilization drops
drastically once the Haskell server process'
private memory reaches around 2gb, i.e. GC
starts engaging: the server process's CPU
utilization quickly approaches ~300%, while
all client processes' drop to 0% for most of
the time, occasionally bursting for a tiny while
with some log output showing progress. And I
disabled parallel GC lately; enabling parallel
GC only makes it worse.</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
If I comment out the code updating the indices
(the code creating many TVars), the overall
throughput only drops slowly as more data are
inserted, and the parallelism feels steady even
after the server process' private memory reaches
several GBs.</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
I didn't expect this, but it appears to me that
GHC's GC is really not good at handling a
massive number of pointers in the heap, while
those pointers are essential to reduce
contention (and maybe expensive data copying
too) at heavy parallelism/concurrency.</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
Btw I tried `+RTS -xn` with GHC 8.10.1 too, with
no obviously different behavior compared to 8.8.3;
I also tried tweaking GC-related RTS options
a bit, including increasing -G up to 10, with not
much difference either.</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
I feel hopeless at the moment, wondering if
I'll have to rewrite this in-memory db in
Go/Rust or some other runtime ...</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
Btw I read <a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Ftech.channable.com%2Fposts%2F2020-04-07-lessons-in-managing-haskell-memory.html&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838761589&sdata=gqSH82%2FOYRaW4fzBDl%2BLDjhbRA%2BDRE6jaj4k1UI2gFE%3D&reserved=0"
target="_blank" moz-do-not-send="true">https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html</a> in
searching about the symptoms, but I don't feel
it likely that I can convert my DB-managed data into
immutable types so as to fit into Compact
Regions; that's not something a live in-mem
database instance can easily do.</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
So it seems there are good reasons that no
successful DBMS, at least no in-memory one, has
been written in Haskell.</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
Best regards,</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
Compl</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
<br>
<br>
</p>
<blockquote
style="margin-top:5pt;margin-bottom:5pt">
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
On 2020-07-25, at 22:07, Ryan Yates <<a
href="mailto:fryguybob@gmail.com"
target="_blank" moz-do-not-send="true">fryguybob@gmail.com</a>>
wrote:</p>
</div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
<div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
Unfortunately my STM benchmarks are
rather disorganized. The most
relevant paper using them is:</p>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
Leveraging hardware TM in Haskell
(PPoPP '19)</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
<a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fdl.acm.org%2Fdoi%2F10.1145%2F3293883.3295711&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838771582&sdata=h3po1gPutR%2BsiCST1N0RNkM6irnVL0%2BVbYl3Vs8F8Oc%3D&reserved=0"
target="_blank"
moz-do-not-send="true">https://dl.acm.org/doi/10.1145/3293883.3295711</a></p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
Or my thesis:</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
<a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Furresearch.rochester.edu%2FinstitutionalPublicationPublicView.action%3FinstitutionalItemId%3D34931&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838771582&sdata=jBQMX5RRajIj0KbLWQCMt%2BMyMJIEmTpSuEHBWpq5Isg%3D&reserved=0"
target="_blank"
moz-do-not-send="true">https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931</a> </p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
The PPoPP benchmarks are on a
branch (or the releases tab on
github):</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
<a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffryguybob%2Fghc-stm-benchmarks%2Ftree%2Fwip%2Fmutable-fields%2Fbenchmarks%2FPPoPP2019%2Fsrc&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838771582&sdata=PinsrrGPgAB9TgxH61xngSItw1DcIRf1Niq39b%2BOe0s%3D&reserved=0"
target="_blank"
moz-do-not-send="true">https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src</a> </p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
All that to say, without an
implementation of mutable
constructor fields (which I'm
working on getting into GHC) the
scaling is limited.</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
Ryan</p>
</div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
</div>
</div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
</p>
<div>
<div>
<p class="MsoNormal"
style="margin-right:0cm;margin-bottom:6pt;margin-left:0cm">
On Sat, Jul 25, 2020 at 3:45 AM
Compl Yue via Haskell-Cafe <<a
href="mailto:haskell-cafe@haskell.org"
target="_blank"
moz-do-not-send="true">haskell-cafe@haskell.org</a>>
wrote:</p>
</div>
<blockquote
style="border-top:none;border-right:none;border-bottom:none;border-left:1pt
solid rgb(204,204,204);padding:0cm 0cm
0cm
6pt;margin-left:4.8pt;margin-right:0cm">
<div>
<p>Dear Cafe,</p>
<p>As Chris Allen has suggested, I
learned that <a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhackage.haskell.org%2Fpackage%2Fstm-containers&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838781576&sdata=ZwtAltlFRkny5q7M%2B7Pople6c4WA%2Bs8vZhwewUge7eg%3D&reserved=0"
target="_blank"
moz-do-not-send="true">
https://hackage.haskell.org/package/stm-containers</a> and <a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhackage.haskell.org%2Fpackage%2Fttrie&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838781576&sdata=zMcZy%2BEzqklkQGjKglCgwg5ZoWyWZIyeRNaCcqtnECs%3D&reserved=0"
target="_blank"
moz-do-not-send="true">
https://hackage.haskell.org/package/ttrie</a> can help a lot when used
in place of traditional HashMap
for stm tx processing, under heavy
concurrency, yet still with
automatic parallelism as GHC
implemented them. Then I realized
that in addition to hash map (used
to implement dicts and scopes), I
also need to find a TreeMap
replacement data structure to
implement the db index. I've been
focusing on the uniqueness
constraint aspect, but it's still
an index, needs to provide range
scan api for db clients, so hash
map is not sufficient for the
index.</p>
<p>I see Ryan shared the code
benchmarking RBTree with stm in
mind:</p>
<p><a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffryguybob%2Fghc-stm-benchmarks%2Ftree%2Fmaster%2Fbenchmarks%2FRBTree-Throughput&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838791571&sdata=Nl2eN81Kjaf5qyNKEaxxc0ioMw6w4QoX4b5vAE5RaF8%3D&reserved=0"
target="_blank"
moz-do-not-send="true">https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput</a>
</p>
<p><a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Ffryguybob%2Fghc-stm-benchmarks%2Ftree%2Fmaster%2Fbenchmarks%2FRBTree&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838791571&sdata=%2BLp6HQCyROOlpA2pr8BR8DPls68oY5Y77GKgqbSKmno%3D&reserved=0"
target="_blank"
moz-do-not-send="true">https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree</a></p>
<p>But I can't find conclusions or an
interpretation of that benchmark
suite. And here's a followup
question:</p>
<p> </p>
<p>Where are some STM contention-optimized
data structures that keep keys ordered,
with a sub-range traversal api?
</p>
<p>(of course production ready
libraries most desirable)</p>
<p> </p>
<p>Thanks with regards,</p>
<p>Compl</p>
<p> </p>
<div>
<p class="MsoNormal">On 2020/7/25
<span
style="font-family:"MS
Gothic"">下午</span>2:04,
Compl Yue via Haskell-Cafe
wrote:</p>
</div>
<blockquote
style="margin-top:5pt;margin-bottom:5pt">
<p>Shame on me, for I have no
experience with `perf` either; I'd
like to learn these essential tools
soon and put them to good use.</p>
<p>It's great to learn how
`orElse` actually works. I did
get confused about why so few
retries were captured, and now
I know. So that little trick
should definitely be removed
before going to production, as it
does little useful work at
excessive cost. I put it there
to help me understand the internal
workings of stm, and now I have even
better knowledge ;-)</p>
<p>I think a debugger will trap
every single abort; isn't that
annoying when many aborts
occur? If I'd like to count the
number of aborts, ideally
accounted per service endpoint,
time period, source module
etc., are there tricks for that?</p>
<p>Thanks with best regards,</p>
<p>Compl</p>
<p> </p>
<div>
<p class="MsoNormal">On
2020/7/25 <span
style="font-family:"MS
Gothic"">上午</span>2:02,
Ryan Yates wrote:</p>
</div>
<blockquote
style="margin-top:5pt;margin-bottom:5pt">
<div>
<p class="MsoNormal">To be
clear, I was trying to refer
to Linux `perf` [^1].
Sampling based profiling can
do a good job with
concurrent and parallel
programs where other methods
are problematic. For
instance,
</p>
<div>
<p class="MsoNormal"> changing
the size of heap objects
can drastically change
cache performance and
completely different
behavior can show up.</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">[^1]: <a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FPerf_(Linux)&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838801566&sdata=v%2Bv2aVaBITriAM26CqN%2Bp35yshLl%2BbY4BWVEIOSlStA%3D&reserved=0"
target="_blank"
moz-do-not-send="true">https://en.wikipedia.org/wiki/Perf_(Linux)</a></p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">The
spinning in `readTVar`
should always be very
short and it typically
shows up as intensive CPU
use, though it may not be
high energy use with
`pause` in the loop on x86
(looks like we don't have
it [^2], I thought we did,
but maybe that was only in
some of my code... )</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">[^2]: <a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fghc%2Fghc%2Fblob%2Fmaster%2Frts%2FSTM.c%23L1275&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838801566&sdata=YBLmeg4Xxby%2BJJmO8B5etdA6tDpBYOry7jdjEoRFd%2Fk%3D&reserved=0"
target="_blank"
moz-do-not-send="true">https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275</a> </p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">All
that to say, I doubt that
you are spending much time
spinning (but it would
certainly be interesting
to know if you are! You
would see `perf` attribute
a large amount of time to
`read_current_value`).
The amount of code to
execute for commit (the
time when locks are held)
is always much shorter
than it takes to execute
the transaction body. As
you add more conflicting
threads this gets worse of
course as commits
sequence.</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">The
code you have will count
commits of executions of
`retry`. Note that
`retry` is a user level
idea, that is, you are
counting user level
*explicit* retries. This
is different from a
transaction failing to
commit and starting
again. These are
invisible to the user.
Also using your trace will
convert `retry` from the
efficient wake on write
implementation, to an
active retry that will
always attempt again. We
don't have cheap logging
of transaction aborts in
GHC, but I have built such
logging in my work. You
can observe these aborts
with a debugger by looking
for execution of this
line:</p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal"><a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fghc%2Fghc%2Fblob%2Fmaster%2Frts%2FSTM.c%23L1123&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838811560&sdata=jAEm1CpEYQx6ORikerxVHOSlaOmrTzB3m9EVmOwo%2B8w%3D&reserved=0"
target="_blank"
moz-do-not-send="true">https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123</a></p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal">Ryan </p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
</div>
<p class="MsoNormal"> </p>
<div>
<div>
<p class="MsoNormal">On Fri,
Jul 24, 2020 at 12:35 PM
Compl Yue <<a
href="mailto:compl.yue@icloud.com"
target="_blank"
moz-do-not-send="true">compl.yue@icloud.com</a>>
wrote:</p>
</div>
<blockquote
style="border-top:none;border-right:none;border-bottom:none;border-left:1pt
solid
rgb(204,204,204);padding:0cm
0cm 0cm
6pt;margin-left:4.8pt;margin-right:0cm">
<div>
<p>I'm not familiar with
profiling GHC yet; I may
need more time to get
myself proficient with
it.</p>
<p>And a bit more detail on
my test workload for
diagnosis: the db
clients are Python
processes from a cluster
of worker nodes,
consulting the db server
to register some path
for data files, under a
data dir within a shared
filesystem, then they mmap
those data files and
fill in actual array
data. So the db server
doesn't have much
computation to perform,
but puts the data file
path into a global
index, which at the same
time validates its
uniqueness. As there are
many client processes
trying to insert one
metadata record
concurrently, with my
naive implementation,
the global index's TVar
will almost always be in
a locked state, held by one
client after another,
from a queue that never
falls empty.</p>
<p>So if `readTVar` does
spin-wait, I'd expect the
threads to actually show high
CPU utilization, because at
any instant of time, all
threads except the
committing one will be
doing that one thing.</p>
<p>And I have something in
my code to track STM
retries, like this:</p>
<p>```</p>
<pre>-- blocking wait not expected, track stm retries explicitly
trackSTM :: Int -> IO (Either () a)
trackSTM !rtc = do
  when -- todo increase the threshold of reporting?
      (rtc > 0) $ do
      -- trace out the retries so the end users can be aware of them
      tid <- myThreadId
      trace
        ( "🔙\n"
            <> show callCtx
            <> "🌀 "
            <> show tid
            <> " stm retry #"
            <> show rtc
        )
        $ return ()
  atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case
    Nothing -> -- stm failed, do a tracked retry
      trackSTM (rtc + 1)
    Just ... -> ...
</pre>
<p>```</p>
<p>No such trace msg fires during my test,
neither in a single-thread run nor in runs
under pressure. I'm sure this tracing
mechanism works, as I can see such traces
fire in cases where, e.g., posting a TMVar
to a TQueue for some other thread to fill,
and then reading the result out, are
composed into a single tx - then of course
it's an infinite retry loop, and a sequence
of such msgs is logged with an
ever-increasing rtc #.</p>
<p>So I believe no retry
has ever been triggered.</p>
<p>What can be going on
there?</p>
<p> </p>
<div>
<p class="MsoNormal">On
2020/7/24 <span
style="font-family:"MS
Gothic"">下午</span>11:46,
Ryan Yates wrote:</p>
</div>
<blockquote
style="margin-top:5pt;margin-bottom:5pt">
<div>
<div>
<div>
<p
class="MsoNormal">>
Then to explain
the low CPU
utilization
(~10%), am I
right to
understand it as
that upon
reading a TVar
locked by
another
committing tx, a
lightweight
thread will put
itself into
`waiting STM`
and descheduled
state, so the
CPUs can only
stay idle as not
so many threads
are willing to
proceed?</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">Since
the commit
happens in
finite steps,
the expectation
is that the lock
will be released
very soon.
Given this when
the body of a
transaction
executes
`readTVar` it
spins (active
CPU!) until the
`TVar` is
observed
unlocked. If a
lock is observed
while commiting,
it immediately
starts the
transaction
again from the
beginning. To
get the behavior
of suspending a
transaction you
have to
successfully
commit a
transaction that
executed
`retry`. Then
the transaction
is put on the
wakeup lists of
its read set and
subsequent
commits will
wake it up if
its write set
overlaps.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">I
don't think any
of these things
would explain
low CPU
utilization.
You could try
running with
`perf` and see
if lots of time
is spent in some
recognizable
part of the RTS.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">Ryan</p>
</div>
</div>
<div>
<p class="MsoNormal"> </p>
</div>
<p class="MsoNormal"> </p>
<div>
<div>
<p
class="MsoNormal">On
Fri, Jul 24,
2020 at 11:22 AM
Compl Yue <<a
href="mailto:compl.yue@icloud.com" target="_blank"
moz-do-not-send="true">compl.yue@icloud.com</a>>
wrote:</p>
</div>
<blockquote
style="border-top:none;border-right:none;border-bottom:none;border-left:1pt
solid
rgb(204,204,204);padding:0cm
0cm 0cm
6pt;margin-left:4.8pt;margin-right:0cm">
<div>
<p>Thanks very
much for the
insightful
information
Ryan! I'm glad
my suspicion was
wrong about
the Haskell
scheduler:</p>
<p>> The
Haskell
capability
that is
committing a
transaction
will not yield
to another
Haskell thread
while it is
doing the
commit. The
OS thread may
be
preempted, but
once commit
starts the
haskell
scheduler is
not invoked
until after
locks are
released.</p>
<div>
<p
class="MsoNormal">So
best effort
had already
been made in
GHC and I just
need to
cooperate
better with
its design.
Then to
explain the
low CPU
utilization
(~10%), am I
right to
understand it
as that upon
reading a TVar
locked by
another
committing tx,
a lightweight
thread will
put itself
into `waiting
STM` and
descheduled
state, so the
CPUs can only
stay idle as
not so many
threads are
willing to
proceed?</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">Anyway,
I see light
with better
data
structures to
improve my
situation, let
me try them
and report
back. Actually
I later
changed `TVar
(HaskMap k v)`
to be `TVar
(HashMap k
Int)` where
the `Int`
being array
index into
`TVar (Vector
(TVar (Maybe
v)))`, in
pursuing
insertion
order
preservation
semantic of
dict entries
(like that in
Python 3.7+),
then it's very
hopeful to
incorporate
stm-containers'
Map or ttrie
to approach
free of
contention.</p>
</div>
<p>Thanks with
regards,</p>
<p>Compl</p>
<p> </p>
<div>
<p class="MsoNormal">On
2020/7/24 at 10:03 PM, Ryan Yates
wrote:</p>
</div>
<blockquote
style="margin-top:5pt;margin-bottom:5pt">
<div>
<div>
<p
class="MsoNormal">Hi
Compl,</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">Having
a pool of
transaction processing
threads can be
helpful in a
certain way.
If the body of
the
transaction
takes more
time to
execute then
the Haskell
thread is
allowed and it
yields, the
suspended thread
won't get in
the way of
other thread,
but when it is
rescheduled,
will have a
low
probability of
success. Even
worse, it will
probably not
discover that
it is doomed
to failure
until commit
time. If
transactions
are more
likely to
reach commit
without
yielding, they
are more
likely to
succeed. If
the
transactions
are not
conflicting,
it doesn't
make much
difference
other than
cache churn.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">The
Haskell
capability
that is
committing a
transaction
will not yield
to another
Haskell thread
while it is
doing the
commit. The
OS thread may
be
preempted, but
once commit
starts the
haskell
scheduler is
not invoked
until after
locks are
released.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">To
get good
performance
from STM you
must pay
attention to
what TVars are
involved in a
commit. All
STM systems
are working
under the
assumption of
low
contention, so
you want to
minimize
"false"
conflicts
(conflicts
that are not
essential to
the
computation).
Something
like `TVar
(HashMap k v)`
will work
pretty well
for a low
thread count,
but every
transaction
that writes to
that structure
will be in
conflict with
every other
transaction
that accesses
it. Pushing
the `TVar`
into the nodes
of the
structure
reduces the
possibilities
for conflict,
while
increasing the
amount of
bookkeeping
STM has to
do. I would
like to reduce
the cost of
that
bookkeeping
using better
structures,
but we need to
do so without
harming
performance in
the low TVar
count case.
Right now it
is optimized
for good cache
performance
with a handful
of TVars.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">There
is another way
to play with
performance by
moving work
into and out
of the
transaction
body. A
transaction
body that
executes
quickly will
reach commit
faster. But
it may be
delaying work
that moves
into another
transaction.
Forcing values
at the right
time can make
a big
difference.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">Ryan</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">On
Fri, Jul 24,
2020 at 2:14
AM Compl Yue
via
Haskell-Cafe
<<a
href="mailto:haskell-cafe@haskell.org"
target="_blank" moz-do-not-send="true">haskell-cafe@haskell.org</a>>
wrote:</p>
</div>
<div>
<blockquote
style="border-top:none;border-right:none;border-bottom:none;border-left:1pt
solid
rgb(204,204,204);padding:0cm
0cm 0cm
6pt;margin-left:4.8pt;margin-right:0cm">
<div>
<p>Thanks Chris, I confess I
haven't paid enough attention
to STM-specialized container
libraries so far. I skimmed
through the descriptions of
stm-containers and ttrie, and
feel they would definitely
improve my code's performance
in case I limit the server's
parallelism to within hardware
capabilities. That may be
because I'm still prototyping
the api and infrastructure for
correctness, so even `TVar
(HashMap k v)` performs okay
for me at the moment, though
only at low contention (surely
there are plenty of CPU cycles
to be optimized out in next
steps). I model my data after
a graph model, so most data,
even most indices, are
localized to nodes and edges,
and those can be manipulated
without conflict; that's why I
assumed I had a low-contention
use case from the very
beginning - until I found
there are still (though minor)
needs for global indices to
guarantee global uniqueness. I
feel confident about using
stm-containers/ttrie to
implement a more scalable
global index data structure,
thanks for hinting me toward
them.</p>
<p>So an evident solution
comes to my mind now: run the
server with a pool of tx
processing threads, matching
the number of CPU cores, with
client RPC requests queued to
be executed on some thread
from the pool. But I'm really
fond of the M:N scheduler
mechanism, which solves
massive/dynamic concurrency so
elegantly. I had some good
results with Go in this
regard, and see GHC on par in
doing this; I don't want to
give up this enjoyable
machinery.</p>
<p>But looking at the stm
implementation in GHC, it
seems written TVars are
exclusively locked during
commit of a tx. I suspect this
is the culprit when there are
a large number M of
lightweight threads scheduled
upon a small number N of
hardware capabilities: when a
lightweight thread yields
control during an stm
transaction commit, the TVars
it locked will stay locked
until it's scheduled again
(and again) until it can
finish the commit. This way,
descheduled threads could keep
live threads from progressing.
I haven't gone into more
detail there, but I wonder if
the GHC RTS could be improved
to keep an stm-committing
thread from being descheduled,
though seemingly that may
impose more starvation
potential; or could stm be
improved to have its TVar
locks preemptable when the
owning trec/thread is in a
descheduled state? Neither
should be easy, but I'd really
love massive lightweight
threads doing STM practically
well.</p>
<p>Best
regards,</p>
<p>Compl</p>
<p> </p>
<div>
<p
class="MsoNormal">On
2020/7/24 at 12:57 AM, Christopher
Allen wrote:</p>
</div>
<blockquote
style="margin-top:5pt;margin-bottom:5pt">
<div>
<p
class="MsoNormal">It
seems like you
know how to
run practical
tests for
tuning thread
count and
contention for
throughput.
Part of the
reason you
haven't gotten
a super clear
answer is "it
depends." You
give up
fairness when
you use STM
instead of
MVars or
equivalent
structures.
That means a
long running
transaction
might get
stampeded by
many small
ones
invalidating
it over and
over. The
long-running
transaction
might never
clear if
the small
transactions
keep moving
the cheese. I
mention this
because
transaction
runtime and
size and count
all affect
throughput and
latency. What
might be ideal
for one
pattern of
work might not
be ideal for
another.
Optimizing for
overall
throughput
might make the
contention and
fairness
problems worse
too. I've done
practical
tests to
optimize this
in the past,
both for STM
in Haskell and
for RDBMS
workloads. </p>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">The
next step is
sometimes
figuring out
whether you
really need a
data structure
within a
single STM
container or
if perhaps you
can break up
your STM
container
boundaries
into zones or
regions that
roughly map
onto update
boundaries.
That should
make the
transactions
churn less. On
the outside
chance you do
need to touch
more than one
container in a
transaction,
well, they
compose.
</p>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">e.g. <a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhackage.haskell.org%2Fpackage%2Fstm-containers&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838811560&sdata=Lq1%2BGj0Z6%2BBGMRAZrSzcTAlYgj0B0A67RaQcyyCcXbk%3D&reserved=0"
target="_blank" moz-do-not-send="true">https://hackage.haskell.org/package/stm-containers</a></p>
</div>
<div>
<p
class="MsoNormal"><a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fhackage.haskell.org%2Fpackage%2Fttrie&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838821555&sdata=PpaiVM2NrPM2HzK0bh%2BMR8YF90yHlxKnN9gwZVQHqR0%3D&reserved=0"
target="_blank" moz-do-not-send="true">https://hackage.haskell.org/package/ttrie</a></p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">It
also sounds a
bit like
your question
bumps into
Amdahl's Law a
bit.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">All
else fails,
stop using STM
and find
something more
tuned to your
problem space.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">Hope
this helps,</p>
</div>
<div>
<p
class="MsoNormal">Chris
Allen</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
</div>
</div>
<p
class="MsoNormal"> </p>
<div>
<div>
<p
class="MsoNormal">On
Thu, Jul 23,
2020 at 9:53
AM YueCompl
via
Haskell-Cafe
<<a
href="mailto:haskell-cafe@haskell.org"
target="_blank" moz-do-not-send="true">haskell-cafe@haskell.org</a>>
wrote:</p>
</div>
<blockquote
style="border-top:none;border-right:none;border-bottom:none;border-left:1pt
solid
rgb(204,204,204);padding:0cm
0cm 0cm
6pt;margin-left:4.8pt;margin-right:0cm">
<div>
<p
class="MsoNormal">Hello
Cafe, </p>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">I'm
working on an
in-memory
database, in
Client/Server
mode I just
let each
connected
client submit
remote
procedure call
running in its
dedicated
lightweight
thread,
modifying
TVars in RAM
per its
business
needs, then in
case many
clients
connected
concurrently
and trying to
insert new
data, if they
are triggering
global index
(some TVar)
update, the
throughput
would drop
drastically. I
reduced the
shared state
to a simple
int counter by
TVar, got same
symptom. While
the
parallelism
feels okay
when there's
no hot TVar
conflicting,
or M is not
much greater
than N.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">As
an empirical
test workload,
I have a `+RTS
-N10` server
process, it
handles 10
concurrent
clients okay,
got ~5x of
single thread
throughput;
but in
handling 20
concurrent
clients, each
of the 10 CPUs
can only be
driven to ~10%
utilization,
the throughput
seems even
worse than
single thread.
More clients
can even drive
it thrashing
without much
progressing.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal"> I
can understand
that pure STM
doesn't scale
well after
reading [1],
and I see it
suggested [7]
attractive and
planned future
work toward
that
direction.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">But
I can't find
certain
libraries or
frameworks
addressing
large M over
small N
scenarios, [1]
experimented
with
designated N
parallelism,
and [7] is
rather
theoretical to
my empirical
needs.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">Can
you direct me
to some
available
library
implementing
the
methodology
proposed in
[7] or other
ways tackling
this problem?</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">I
think the most
difficult one
is that a
transaction
should commit
with global
indices (with
possibly
unique
constraints)
atomically
updated, and
rollback with
any violation
of
constraints,
i.e.
transactions
have to cover
global states
like indices.
Other problems
seem more
trivial than
this.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">Specifically,
[7] states:</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">> It
must be
emphasized
that all of
the mechanisms
we deploy
originate, in
one form or
another, in
the database
literature
from the 70s
and 80s. Our
contribution
is to adapt
these
techniques to
software
transactional
memory,
providing more
effective
solutions to
important STM
problems than
prior
proposals.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">I
wonder any STM
based library
has simplified
those
techniques to
be composed
right away? I
don't really
want to
implement
those
mechanisms by
myself,
rebuilding
many wheels
from scratch.</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">Best
regards,</p>
</div>
<div>
<p
class="MsoNormal">Compl</p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">[1] Comparing
the
performance of
concurrent
linked-list
implementations
in Haskell </p>
</div>
<div>
<p
class="MsoNormal"><a
href="https://nam06.safelinks.protection.outlook.com/?url=https%3A%2F%2Fsimonmar.github.io%2Fbib%2Fpapers%2Fconcurrent-data.pdf&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838821555&sdata=41Jaz8ZRmRfBHyGKxfhJlm4xR7q0pOtJShtO0jTlOwQ%3D&reserved=0"
target="_blank" moz-do-not-send="true">https://simonmar.github.io/bib/papers/concurrent-data.pdf</a></p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
<div>
<p
class="MsoNormal">[7]
M. Herlihy and
E. Koskinen.
Transactional
boosting: a
methodology
for
highly-concurrent
transactional
objects. In
Proc. of PPoPP
’08, pages
207–216. ACM
Press, 2008.</p>
</div>
<div>
<p
class="MsoNormal"><a
href="https://nam06.safelinks.protection.outlook.com/?url=https:%2F%2Fwww.cs.stevens.edu%2F~ejk%2Fpapers%2Fboosting-ppopp08.pdf&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838831548&sdata=ya8Az1oC6f2xoMb90S9HCH57UTQ0nV9sg6SW%2B5JCPC4%3D&reserved=0"
target="_blank" moz-do-not-send="true">https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf</a></p>
</div>
<div>
<p
class="MsoNormal"> </p>
</div>
</div>
</blockquote>
</div>
<p
class="MsoNormal"><br
clear="all">
</p>
<div>
<p
class="MsoNormal"> </p>
</div>
<p
class="MsoNormal">--
</p>
<div>
<div>
<div>
<div>
<div>
<p
class="MsoNormal">Chris
Allen</p>
<div>
<p
class="MsoNormal"><span
style="font-size:9.5pt">Currently working on </span><a
href="https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fhaskellbook.com%2F&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838831548&sdata=tIHFQFZPIQgRp8oqGRvyebm1YQdCvGD0VoMcflzJwKc%3D&reserved=0"
target="_blank" moz-do-not-send="true">http://haskellbook.com</a></p>
</div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</blockquote>
<p class="MsoNormal"> </p>
<pre>_______________________________________________</pre>
<pre>Haskell-Cafe mailing list</pre>
<pre>To (un)subscribe, modify options or view archives go to:</pre>
<pre><a href="https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmail.haskell.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fhaskell-cafe&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838841547&sdata=vdzv5WBA62cNwO6DA1D4KEHDCweyOerpn1PdMK0A%2BHw%3D&reserved=0" target="_blank" moz-do-not-send="true">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a></pre>
<pre>Only members subscribed via the mailman list are allowed to post.</pre>
</blockquote>
</div>
<p class="MsoNormal">_______________________________________________<br>
Haskell-Cafe mailing list<br>
To (un)subscribe, modify options or
view archives go to:<br>
<a
href="https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmail.haskell.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fhaskell-cafe&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838851540&sdata=Btpa3sjfAjTf2ICO0QpQG5vVCawIjERNjUHji06uG5Y%3D&reserved=0"
target="_blank"
moz-do-not-send="true">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>
Only members subscribed via the
mailman list are allowed to post.</p>
</blockquote>
</div>
<p class="MsoNormal">_______________________________________________<br>
Haskell-Cafe mailing list<br>
To (un)subscribe, modify options or view
archives go to:<br>
<a
href="https://nam06.safelinks.protection.outlook.com/?url=http%3A%2F%2Fmail.haskell.org%2Fcgi-bin%2Fmailman%2Flistinfo%2Fhaskell-cafe&data=02%7C01%7Csimonpj%40microsoft.com%7C8ebd68bca55140cebaae08d833f888f2%7C72f988bf86f141af91ab2d7cd011db47%7C1%7C0%7C637316489838851540&sdata=Btpa3sjfAjTf2ICO0QpQG5vVCawIjERNjUHji06uG5Y%3D&reserved=0"
target="_blank" moz-do-not-send="true">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>
Only members subscribed via the mailman
list are allowed to post.</p>
</div>
</blockquote>
</div>
<p class="MsoNormal"> </p>
</div>
</div>
</blockquote>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</body>
</html>