Hi Cafe and Ryan,

I tried Map/Set from stm-containers, and TSkipList (to which I added a range-scan API against its internal data structure) from http://hackage.haskell.org/package/tskiplist ; with them I got much improved scalability under concurrency.

But unfortunately I then hit another wall: single-thread scalability over working-memory size. I suspect it's because massively more TVars (each being a pointer per se) are introduced by those "contention-free" data structures; they have to mutate separate pointers concurrently to avoid contention, but such a pointer-intensive heap seems to put extraordinary pressure on GHC's garbage collector, to the point that GC dominates CPU utilization with poor business progress.

For example, in my test I use `+RTS -H2g` for the Haskell server process, so GC isn't triggered until after a while. I then spin off 3 Python clients to insert new records concurrently. In the first stage each Python process happily takes ~90% CPU filling (through local mmap) the arrays allocated from the server, and success logs scroll quickly, while the server process uses only 30~40% CPU to serve those 3 clients (it merely inserts metadata records into unique indices). Then the clients' CPU utilization drops drastically once the Haskell server process's private memory reaches around 2 GB, i.e. once GC starts engaging: the server's CPU utilization quickly approaches ~300%, while all client processes drop to 0% most of the time, with only occasional tiny bursts of log output showing progress.
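A quick way to confirm that GC really dominates is to sample the RTS stats from inside the server. A minimal sketch, assuming the process runs with `+RTS -T` so the stats are populated (names here are illustrative, not from the actual server code):

```
-- sketch: report what share of CPU time the RTS spent in GC
-- (requires running with +RTS -T; all names illustrative)
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)

reportGcShare :: IO ()
reportGcShare = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "re-run with +RTS -T to enable RTS stats"
    else do
      s <- getRTSStats
      let share = fromIntegral (gc_cpu_ns s)
                / fromIntegral (max 1 (cpu_ns s)) :: Double
      putStrLn $  "major GCs: " <> show (major_gcs s)
               <> ", GC share of CPU: "
               <> show (round (share * 100) :: Int) <> "%"
```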
Lately I disabled the parallel GC; enabling it only makes things worse.

If I comment out the code updating the indices (the code that creates many TVars), overall throughput drops only slowly as more data is inserted, and the parallelism feels steady even after the server process's private memory reaches several GB.

I didn't expect this, but it appears to me that GHC's GC is really not good at handling a massive number of pointers in the heap, while those pointers are essential to reducing contention (and maybe expensive data copying too) under heavy parallelism/concurrency.

Btw I tried `+RTS -xn` (the non-moving GC) with GHC 8.10.1 too: no obviously different behavior compared to 8.8.3. I also tried tweaking GC-related RTS options a bit, including increasing -G up to 10, with not much difference either.

I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime...

Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html while searching for these symptoms, and I don't think I can convert my DB-managed data into immutable types so as to fit into compact regions; that seems unlikely for a live in-memory database instance.

So it seems there are good reasons that no successful DBMS, at least no in-memory one, has been written in Haskell.

Best regards,
Compl


On 2020-07-25, at 22:07, Ryan Yates <fryguybob@gmail.com> wrote:

Unfortunately my STM benchmarks are rather disorganized.
The most relevant paper using them is:

Leveraging hardware TM in Haskell (PPoPP '19)
https://dl.acm.org/doi/10.1145/3293883.3295711

Or my thesis:
https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931

The PPoPP benchmarks are on a branch (or the releases tab on GitHub):
https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src

All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited.

Ryan


On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe <haskell-cafe@haskell.org> wrote:
Dear Cafe,

As Chris Allen suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of a traditional HashMap for STM tx processing under heavy concurrency, yet still with the automatic parallelism GHC gives them. Then I realized that in addition to the hash map (used to implement dicts and scopes), I also need a TreeMap-like replacement data structure to implement the db index. I've been focusing on the uniqueness-constraint aspect, but it's still an index, and it needs to provide a range-scan API for db clients, so a hash map is not sufficient for the index.

I see Ryan shared code benchmarking an RBTree with STM in mind:

https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput
https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree

But I can't find a conclusion or interpretation of that benchmark suite. And here's a followup question:

Are there any STM contention-optimized data structures that keep keys ordered, with a sub-range traversal API? (Production-ready libraries would of course be most desirable.)

Thanks with regards,
Compl
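For concreteness, here is a sketch of the range-scan shape being asked for, over a single coarse `TVar` holding an ordered `Data.Map` (this is exactly the contention-prone design a better library would replace; `Index` and `scanRange` are hypothetical names, and `takeWhileAntitone`/`dropWhileAntitone` need containers >= 0.5.8):

```
import Control.Concurrent.STM
import qualified Data.Map.Strict as Map

-- hypothetical ordered index with an inclusive range-scan API;
-- the single spine TVar is what contends under concurrent writers
newtype Index k v = Index (TVar (Map.Map k v))

scanRange :: Ord k => Index k v -> k -> k -> STM [(k, v)]
scanRange (Index tv) lo hi = do
  m <- readTVar tv
  pure . Map.toAscList
       . Map.takeWhileAntitone (<= hi)
       $ Map.dropWhileAntitone (< lo) m
```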
On 2020/7/25 at 2:04 PM, Compl Yue via Haskell-Cafe wrote:
Shame on me, for I have no experience with `perf` either; I'll learn these essential tools soon and put them to good use.

It's great to learn how `orElse` actually works. I did get confused about why so few retries were captured, and now I know. So that little trick should definitely be removed before going to production, as it does little useful work at excessive cost. I put it there to help me understand the internal workings of STM, and now I have even better knowledge ;-)

I think a debugger will trap every single abort; isn't that annoying when many aborts occur? If I'd like to count the number of aborts, ideally accounted per service endpoint, time period, source module etc., are there some tricks for that?

Thanks with best regards,
Compl
On 2020/7/25 at 2:02 PM, Ryan Yates wrote:
To be clear, I was trying to refer to Linux `perf` [^1]. Sampling-based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance, changing the size of heap objects can drastically change cache performance and completely different behavior can show up.

[^1]: https://en.wikipedia.org/wiki/Perf_(Linux)

The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code...)

[^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275

All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence.

The code you have will count commits of executions of `retry`. Note that `retry` is a user-level idea, that is, you are counting user-level *explicit* retries. This is different from a transaction failing to commit and starting again; those are invisible to the user. Also, using your trace will convert `retry` from the efficient wake-on-write implementation to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line:

https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123

Ryan
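To make the `orElse` point concrete, here are the two shapes side by side (a sketch; `stmJob` stands for any transaction body):

```
import Control.Concurrent.STM

-- wake-on-write: a committed `retry` parks the thread on the TVars it
-- read, and a later commit that writes any of them wakes it up
blockingRun :: STM a -> IO a
blockingRun stmJob = atomically stmJob

-- what wrapping the body in `orElse` does instead: the `retry` is
-- caught, the transaction commits with Nothing, and the loop actively
-- re-runs the body rather than sleeping
activeRun :: STM a -> IO a
activeRun stmJob = do
  r <- atomically $ (Just <$> stmJob) `orElse` return Nothing
  case r of
    Nothing -> activeRun stmJob
    Just a  -> return a
```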
On Fri, Jul 24, 2020 at 12:35 PM Compl Yue <compl.yue@icloud.com> wrote:
I'm not familiar with profiling GHC yet; I may need more time to get proficient with it.

And a bit more detail on my test workload, for diagnosis: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files under a data dir within a shared filesystem, then mmap'ing those data files and filling in the actual array data. So the db server doesn't have much computation to perform; it just puts each data file path into a global index, which at the same time validates its uniqueness. As there are many client processes trying to insert one metadata record concurrently, with my naive implementation the global index's TVar will almost always be in a locked state, held by one client after another, from a queue that never falls empty.

So if `readTVar` spin-waits, I doubt the threads would actually show high CPU utilization, because at any instant of time all threads except the committing one will be doing that one thing.

And I have something in my code to track STM retries explicitly, like this:

```
-- blocking wait not expected, track stm retries explicitly
trackSTM :: Int -> IO (Either () a)
trackSTM !rtc = do
  when -- todo increase the threshold of reporting?
      (rtc > 0) $ do
    -- trace out the retries so the end users can be aware of them
    tid <- myThreadId
    trace
        ( "π\n"
            <> show callCtx
            <> "π "
            <> show tid
            <> " stm retry #"
            <> show rtc
        )
      $ return ()
  atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case
    Nothing -> -- stm failed, do a tracked retry
      trackSTM (rtc + 1)
    Just ... -> ...
```

No such trace message fires during my test, neither in single-threaded runs nor in runs under pressure. I'm sure this tracing mechanism works, as I can see such traces fire in other cases: e.g. posting a TMVar to a TQueue for some other thread to fill, then reading the result out; if those 2 ops are composed into a single tx, then of course it's an infinite retry loop, and a sequence of such messages gets logged with an ever increasing rtc #.

So I believe no retry has ever been triggered.

What can be going on there?
On 2020/7/24 at 11:46 PM, Ryan Yates wrote:
> Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed?

Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this, when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while committing, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set, and subsequent commits will wake it up if its write set overlaps.

I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS.

Ryan
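As a tiny illustration of that suspend-on-`retry` behavior (a sketch, not code from this thread): a blocking read that commits a `retry` when the slot is empty, so the thread is parked on the TVar's wakeup list and sleeps until some writer's commit touches it.

```
import Control.Concurrent.STM

-- blocks (descheduled, zero CPU) until the slot holds a value
waitJust :: TVar (Maybe a) -> STM a
waitJust slot = readTVar slot >>= maybe retry return
```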
On Fri, Jul 24, 2020 at 11:22 AM Compl Yue <compl.yue@icloud.com> wrote:
Thanks very much for the insightful information, Ryan! I'm glad my suspicion about the Haskell scheduler was wrong:

> The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released.

So best effort has already been made in GHC and I just need to cooperate better with its design. Then, to explain the low CPU utilization (~10%), am I right to understand it as: upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` / descheduled state, so the CPUs can only stay idle, as not many threads are willing to proceed?

Anyway, I see light with better data structures to improve my situation; let me try them and report back. Actually I later changed `TVar (HashMap k v)` to `TVar (HashMap k Int)`, where the `Int` is an array index into `TVar (Vector (TVar (Maybe v)))`, in pursuit of insertion-order-preserving semantics for dict entries (like those in Python 3.7+). It then looks very hopeful to incorporate stm-containers' Map or ttrie to approach freedom from contention.

Thanks with regards,
Compl
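A sketch of the insertion-order-preserving structure described above (all names hypothetical; note the two spine TVars still conflict on inserts, only updates to existing keys are contention-free):

```
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- the HashMap gives key lookup, the Vector's slot order is insertion
-- order, and per-slot TVars let updates avoid touching the spines
data OrderedDict k v = OrderedDict
  { odIndex :: TVar (HM.HashMap k Int)
  , odSlots :: TVar (V.Vector (TVar (Maybe v)))
  }

insertOD :: (Eq k, Hashable k) => OrderedDict k v -> k -> v -> STM ()
insertOD (OrderedDict idxV slotsV) k v = do
  idx <- readTVar idxV
  case HM.lookup k idx of
    Just i -> do
      slots <- readTVar slotsV
      writeTVar (slots V.! i) (Just v)  -- touches only this slot's TVar
    Nothing -> do
      slots <- readTVar slotsV
      slot  <- newTVar (Just v)
      writeTVar slotsV (V.snoc slots slot)  -- O(n) copy; fine for a sketch
      writeTVar idxV (HM.insert k (V.length slots) idx)
```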
On 2020/7/24 at 10:03 PM, Ryan Yates wrote:
Hi Compl,

Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute than the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other threads, but when it is rescheduled it will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn.

The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released.

To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars.

There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference.

Ryan
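A minimal sketch of that granularity contrast (type shapes only; names assumed):

```
import Control.Concurrent.STM
import Data.HashMap.Strict (HashMap)

-- coarse: one TVar over the whole map, so any writer's commit conflicts
-- with every transaction that so much as read the map
type CoarseMap k v = TVar (HashMap k v)

-- TVar pushed one level in: everyone still reads the spine, but updating
-- an existing key writes only that entry's own TVar, so value updates on
-- different keys commit without conflict (inserts/deletes still write
-- the spine)
type FineMap k v = TVar (HashMap k (TVar v))
```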
On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe <haskell-cafe@haskell.org> wrote:
Thanks Chris. I confess I haven't paid enough attention to STM-specialized container libraries so far. I skimmed through the descriptions of stm-containers and ttrie, and feel they would definitely improve my code's performance in case I limit the server's parallelism to within hardware capabilities. That may be because I'm still prototyping the API and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, though only under low contention (surely there are plenty of CPU cycles to be optimized away in later steps). I model my data after a graph model, so most data, even most indices, are localized to nodes and edges, which can be manipulated without conflict. That's why I assumed I had a low-contention use case from the very beginning, until I found there are still (though minor) needs for global indices to guarantee global uniqueness. I feel confident that stm-containers/ttrie can be used to implement a more scalable global index data structure; thanks for the hint.

So an evident solution comes to mind now: run the server with a pool of tx processing threads matching the number of CPU cores, with client RPC requests queued to be executed on some thread from the pool. But I'm really fond of the M:N scheduler mechanism, which solves massive/dynamic concurrency so elegantly. I had some good results with Go in this regard, and see GHC on par in doing this; I don't want to give up this enjoyable machinery.

But looking at the STM implementation in GHC, it seems written TVars are exclusively locked during commit of a tx. I suspect this is the culprit when a large number M of lightweight threads is scheduled onto a small number N of hardware capabilities: when a lightweight thread yields control during an STM transaction commit, the TVars it locked will stay so until it's scheduled again (and again) till it can finish the commit. This way, descheduled threads can hold live threads back from progressing. I haven't gone into more detail there, but I wonder if GHC's RTS could be improved to keep an STM-committing thread from being descheduled, though seemingly that may impose more starvation potential; or STM could be improved to have its TVar locks preemptible when the owning trec/thread is in a descheduled state? Neither should be easy, but I'd really love massive lightweight threads doing STM practically well.

Best regards,
Compl
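A minimal sketch of the pool idea above (all names assumed):

```
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forever, join, replicateM_)
import GHC.Conc (getNumCapabilities)

-- a fixed pool of tx-processing threads, one per capability; RPC
-- handlers enqueue work instead of running transactions on their own
-- green threads
startTxPool :: IO (TQueue (IO ()))
startTxPool = do
  jobs <- newTQueueIO
  n    <- getNumCapabilities
  replicateM_ n $ forkIO $ forever $
    join (atomically (readTQueue jobs))  -- pop a job, run it
  return jobs
```

A handler would then do `atomically . writeTQueue jobs $ someJob`, waiting on a reply TMVar if it needs the result, rather than calling `atomically` on its own green thread.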
On 2020/7/24 at 12:57 PM, Christopher Allen wrote:
It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." You give up fairness when you use STM instead of MVars or equivalent structures. That means a long-running transaction might get stampeded by many small ones invalidating it over and over. The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads.

The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose.

e.g. https://hackage.haskell.org/package/stm-containers
https://hackage.haskell.org/package/ttrie

It also sounds a bit like your question bumps into Amdahl's Law a bit.

All else fails, stop using STM and find something more tuned to your problem space.

Hope this helps,
Chris Allen
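A sketch of that zoning idea: shard one hot map into fixed sub-maps so transactions on different shards don't conflict (names assumed, not from any of the libraries mentioned):

```
import Control.Concurrent.STM
import Data.Hashable (Hashable, hash)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- K independent TVar'd sub-maps; a transaction touching key k only
-- reads/writes the one shard k hashes to
newtype Sharded k v = Sharded (V.Vector (TVar (HM.HashMap k v)))

newSharded :: Int -> IO (Sharded k v)
newSharded n = Sharded <$> V.replicateM n (newTVarIO HM.empty)

shardFor :: Hashable k => Sharded k v -> k -> TVar (HM.HashMap k v)
shardFor (Sharded shards) k = shards V.! (hash k `mod` V.length shards)

insertS :: (Eq k, Hashable k) => Sharded k v -> k -> v -> STM ()
insertS s k v = modifyTVar' (shardFor s k) (HM.insert k v)
```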
On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe <haskell-cafe@haskell.org> wrote:
Hello Cafe,

I'm working on an in-memory database. In Client/Server mode I just let each connected client submit remote procedure calls running in its dedicated lightweight thread, modifying TVars in RAM per its business needs. Then, when many clients are connected concurrently and trying to insert new data, if they trigger an update of the global index (some TVar), the throughput drops drastically. I reduced the shared state to a simple int counter in a TVar and got the same symptom, while the parallelism feels okay when there's no hot TVar conflicting, or M is not much greater than N.

As an empirical test workload, I have a `+RTS -N10` server process; it handles 10 concurrent clients okay, getting ~5x single-thread throughput. But in handling 20 concurrent clients, each of the 10 CPUs can only be driven to ~10% utilization, and the throughput seems even worse than single-threaded. More clients can even drive it to thrash without much progress.

I can understand that pure STM doesn't scale well after reading [1], and I see [7] suggested as an attractive and planned direction for future work.

But I can't find concrete libraries or frameworks addressing large-M-over-small-N scenarios; [1] experimented with designated N parallelism, and [7] is rather theoretical relative to my empirical needs.

Can you direct me to some available library implementing the methodology proposed in [7], or other ways of tackling this problem?

I think the most difficult part is that a transaction should commit with global indices (with possibly unique constraints) atomically updated, and roll back on any violation of constraints, i.e. transactions have to cover global state like indices. Other problems seem more trivial than this.

Specifically, [7] states:

> It must be emphasized that all of the mechanisms we deploy originate, in one form or another, in the database literature from the 70s and 80s. Our contribution is to adapt these techniques to software transactional memory, providing more effective solutions to important STM problems than prior proposals.

I wonder whether any STM-based library has simplified those techniques so they can be composed right away? I don't really want to implement those mechanisms by myself, rebuilding many wheels from scratch.

Best regards,
Compl


[1] Comparing the performance of concurrent linked-list implementations in Haskell
https://simonmar.github.io/bib/papers/concurrent-data.pdf

[7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In Proc. of PPoPP '08, pages 207-216. ACM Press, 2008.
https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf
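For reference, a self-contained sketch of the reduced counter test described above (a minimal reconstruction, not the actual server code):

```
import Control.Concurrent (forkIO)
import Control.Concurrent.STM
import Control.Monad (forM_, replicateM_)

-- M green threads all bumping one shared TVar: every commit conflicts
-- with every other one, so adding threads adds aborted/re-run
-- transactions rather than throughput
main :: IO ()
main = do
  counter <- newTVarIO (0 :: Int)
  done    <- newTVarIO (0 :: Int)
  let m = 20; bumps = 100000 :: Int
  forM_ [1 .. m] $ \_ -> forkIO $ do
    replicateM_ bumps $ atomically $ modifyTVar' counter (+ 1)
    atomically $ modifyTVar' done (+ 1)
  atomically $ readTVar done >>= check . (== m)  -- wait for all workers
  readTVarIO counter >>= print                   -- expect m * bumps
```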
--
Chris Allen
Currently working on http://haskellbook.com
_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.