<div dir="ltr"><div dir="ltr"><div>> Then to explain the low CPU
utilization (~10%), am I right to understand that upon reading a
TVar locked by another committing tx, a lightweight thread puts
itself into the `waiting STM` descheduled state, so the CPUs stay
mostly idle because few threads are able to proceed?<br>
</div><div><br></div><div>Since the commit happens in a finite number of steps, the expectation is that the lock will be released very soon. Given this, when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while committing, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set, and subsequent commits will wake it up if their write sets overlap that read set.</div><div><br></div><div>I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS.</div><div><br></div><div>Ryan</div>
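<div>A minimal sketch of the retry/wakeup behavior described above, assuming only the standard `stm` package (the `waitForFlag` and `demo` names are illustrative): a transaction that commits a `retry` is parked until some later commit writes a `TVar` it read.</div>

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Concurrent.STM

-- Blocks (via `retry`) until another transaction sets the flag to True.
-- Committing the `retry` parks this thread on the wakeup list of `flag`.
waitForFlag :: TVar Bool -> STM ()
waitForFlag flag = do
  v <- readTVar flag
  if v then pure () else retry

demo :: IO Bool
demo = do
  flag <- newTVarIO False
  _ <- forkIO $ do
    threadDelay 10000                 -- give the waiter time to block
    atomically $ writeTVar flag True  -- this commit wakes the waiter
  atomically (waitForFlag flag)       -- parked until the write commits
  readTVarIO flag
```

<div>While parked, the waiting thread consumes no CPU; contrast this with the spinning on a locked `TVar` during `readTVar`, which does.</div>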
<div></div></div><div><br></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Fri, Jul 24, 2020 at 11:22 AM Compl Yue <<a href="mailto:compl.yue@icloud.com">compl.yue@icloud.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Thanks very much for the insightful information, Ryan! I'm glad my
suspicion about the Haskell scheduler was wrong:<br>
</p>
<p>> The Haskell capability that is committing a transaction will
not yield to another Haskell thread while it is doing the commit.
The OS thread may be preempted, but once commit starts the Haskell
scheduler is not invoked until after locks are released.</p>
<div>So GHC already makes a best effort here, and I just need to
cooperate better with its design. Then to explain the low CPU
utilization (~10%), am I right to understand that upon reading a
TVar locked by another committing tx, a lightweight thread puts
itself into the `waiting STM` descheduled state, so the CPUs stay
mostly idle because few threads are able to proceed?<br>
</div>
<div><br>
</div>
<div>Anyway, I see a path forward with better data structures to
improve my situation; let me try them and report back. Actually, I
later changed `TVar (HashMap k v)` to `TVar (HashMap k Int)`, where
the `Int` is an array index into `TVar (Vector (TVar (Maybe v)))`,
to get insertion-order-preserving semantics for dict entries (like
Python 3.7+). With that, I'm hopeful that incorporating
stm-containers' Map or ttrie will get close to contention-free
operation.<br>
</div>
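<div>A sketch of the structure described above, assuming the `unordered-containers` and `vector` packages (the `OrderedDict` name and helpers are illustrative, not from any library): the outer index maps keys to slots in an append-only vector, so iterating the vector follows insertion order, and updating an existing key touches only that key's own slot `TVar`.</div>

```haskell
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- Insertion-order-preserving dict: keys map to slot indices in an
-- append-only vector of per-entry TVars.
data OrderedDict k v = OrderedDict
  { odIndex :: TVar (HM.HashMap k Int)            -- key -> slot index
  , odSlots :: TVar (V.Vector (TVar (Maybe v)))   -- slots in insertion order
  }

newOrderedDict :: IO (OrderedDict k v)
newOrderedDict = OrderedDict <$> newTVarIO HM.empty <*> newTVarIO V.empty

insertOD :: (Eq k, Hashable k) => OrderedDict k v -> k -> v -> STM ()
insertOD (OrderedDict idxVar slotsVar) k v = do
  idx <- readTVar idxVar
  case HM.lookup k idx of
    Just i -> do                  -- existing key: write only its slot TVar;
      slots <- readTVar slotsVar  -- neither outer TVar is written
      writeTVar (slots V.! i) (Just v)
    Nothing -> do                 -- new key: append a slot, update the index
      slots <- readTVar slotsVar
      slot <- newTVar (Just v)
      writeTVar slotsVar (V.snoc slots slot)
      writeTVar idxVar (HM.insert k (V.length slots) idx)

lookupOD :: (Eq k, Hashable k) => OrderedDict k v -> k -> STM (Maybe v)
lookupOD (OrderedDict idxVar slotsVar) k = do
  idx <- readTVar idxVar
  case HM.lookup k idx of
    Nothing -> pure Nothing
    Just i  -> readTVar slotsVar >>= \slots -> readTVar (slots V.! i)
```

<div>Note that updating an existing key writes neither outer `TVar`, so such updates to different keys do not conflict with each other; only insertions of new keys contend on the index and the slot vector.</div>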
<p>Thanks and regards,</p>
<p>Compl</p>
<p><br>
</p>
<div>On 2020/7/24 10:03 PM, Ryan Yates wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">
<div>Hi Compl,</div>
<div><br>
</div>
<div>Having a pool of transaction processing threads can be
helpful in a certain way.  If the body of the transaction
takes more time to execute than the Haskell thread is allowed
and it yields, the suspended thread won't get in the way of
other threads, but when it is rescheduled it will have a low
probability of success.  Even worse, it will probably not
discover that it is doomed to fail until commit time.  If
transactions are more likely to reach commit without yielding,
they are more likely to succeed.  If the transactions are not
conflicting, it doesn't make much difference other than cache
churn.</div>
<div><br>
</div>
<div>The Haskell capability that is committing a transaction
will not yield to another Haskell thread while it is doing the
commit.  The OS thread may be preempted, but once commit
starts the Haskell scheduler is not invoked until after locks
are released.</div>
<div><br>
</div>
<div>To get good performance from STM you must pay attention to
which TVars are involved in a commit.  All STM systems work
under the assumption of low contention, so you want to
minimize "false" conflicts (conflicts that are not essential
to the computation).  Something like `TVar (HashMap k v)`
will work pretty well for a low thread count, but every
transaction that writes to that structure will be in conflict
with every other transaction that accesses it.  Pushing the
`TVar` into the nodes of the structure reduces the
possibilities for conflict, while increasing the amount of
bookkeeping STM has to do.  I would like to reduce the cost of
that bookkeeping using better structures, but we need to do so
without harming performance in the low TVar count case.  Right
now it is optimized for good cache performance with a handful
of TVars.</div>
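<div>A sketch of the contrast, assuming only the `stm` and `containers` packages (the `Coarse`/`Fine` names are illustrative): in the finer-grained version the outer `TVar` is written only on structural changes, so transactions that update different existing keys no longer conflict on it.</div>

```haskell
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- Coarse: every write replaces the whole map, so any two writers conflict.
type Coarse k v = TVar (M.Map k v)

updateCoarse :: Ord k => Coarse k v -> k -> v -> STM ()
updateCoarse tv k v = modifyTVar' tv (M.insert k v)

-- Finer: the outer TVar is written only when a key is added; updating an
-- existing key touches just that key's own TVar, so writers to different
-- existing keys don't conflict with each other.
type Fine k v = TVar (M.Map k (TVar v))

updateFine :: Ord k => Fine k v -> k -> v -> STM ()
updateFine tv k v = do
  m <- readTVar tv
  case M.lookup k m of
    Just slot -> writeTVar slot v          -- no write to the outer TVar
    Nothing   -> do
      slot <- newTVar v
      writeTVar tv (M.insert k slot m)     -- structural change: outer write
```

<div>The cost, as noted above, is more `TVar`s for the STM machinery to track per transaction.</div>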
<div><br>
</div>
<div>There is another way to play with performance: moving
work into and out of the transaction body.  A transaction body
that executes quickly will reach commit faster, but it may
simply be deferring work into another transaction.  Forcing
values at the right time can make a big difference.</div>
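<div>One way the forcing point matters can be sketched like this (the helper names are illustrative): storing an unevaluated thunk commits quickly but shifts the cost into whichever later transaction forces it, while evaluating before `atomically` keeps the body short and avoids repeating the work if the transaction re-executes after a conflict.</div>

```haskell
import Control.Concurrent.STM
import Control.Exception (evaluate)

expensive :: Int -> Int
expensive n = sum [1..n]       -- stand-in for real work

-- Commits a lazy thunk: the transaction body is short, but the cost of
-- `expensive` is paid by whoever later forces the stored value.
storeLazy :: TVar Int -> Int -> STM ()
storeLazy tv x = writeTVar tv (expensive x)

-- Forces the value outside the transaction: the work is done exactly
-- once, and the transaction body stays short.
storeForced :: TVar Int -> Int -> IO ()
storeForced tv x = do
  v <- evaluate (expensive x)  -- pay the cost before entering STM
  atomically (writeTVar tv v)
```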
<div><br>
</div>
<div>Ryan</div>
<div dir="ltr"><br>
</div>
<div dir="ltr">On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via
Haskell-Cafe <<a href="mailto:haskell-cafe@haskell.org" target="_blank">haskell-cafe@haskell.org</a>>
wrote:<br>
</div>
<div class="gmail_quote">
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>
<p>Thanks Chris. I confess I haven't paid enough attention to
STM-specialized container libraries so far. I skimmed the
descriptions of stm-containers and ttrie, and feel they would
definitely improve my code's performance, provided I limit the
server's parallelism to the hardware's capabilities. That may
be because I'm still prototyping the api and infrastructure
for correctness, so even `TVar (HashMap k v)` performs okay
for me at the moment, as long as contention stays low (surely
there are plenty of CPU cycles to be optimized away in later
steps). I model my data after a graph model, so most data,
and even most indices, are localized to nodes and edges;
those can be manipulated without conflict. That's why I
assumed I had a low-contention use case from the very
beginning, until I found there are still (though minor) needs
for global indices to guarantee global uniqueness. I feel
confident that stm-containers/ttrie can help implement a more
scalable global index data structure; thanks for the hint.</p>
<p>So an obvious solution comes to mind: run the server with a
pool of tx processing threads matching the number of CPU
cores, with client RPC requests queued for execution by some
thread from the pool. But I'm really fond of the M:N
scheduler, which solves massive/dynamic concurrency so
elegantly. I had good results with Go in this regard, and I
see GHC on par in doing this; I don't want to give up this
enjoyable machinery.</p>
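<div>A minimal sketch of such a pool, assuming only the `stm` package's `TQueue` (the `newPool`/`submit` names are illustrative, and a real version would also deliver results back to clients, e.g. via a `TMVar` per request): one worker per capability drains a queue of transactions, so at most N transactions compete for commit at any time while any number of client threads may enqueue.</div>

```haskell
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- Spawn one worker per capability; each worker repeatedly takes a job
-- off the queue and runs it as its own transaction.
newPool :: IO (TQueue (STM ()))
newPool = do
  q <- newTQueueIO
  n <- getNumCapabilities
  replicateM_ n $ forkIO $ forever $ do
    job <- atomically (readTQueue q)  -- blocks when the queue is empty
    atomically job                    -- only N of these run concurrently
  pure q

-- Client threads enqueue work instead of running transactions directly.
submit :: TQueue (STM ()) -> STM () -> IO ()
submit q job = atomically (writeTQueue q job)
```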
<p>But looking at the STM implementation in GHC, it seems
written TVars are exclusively locked during the commit of a
tx. I suspect this is the culprit when a large number M of
lightweight threads is scheduled onto a small number N of
hardware capabilities: when a lightweight thread yields
control during an STM transaction commit, the TVars it locked
stay locked until it is scheduled again (and again) and can
finish the commit. This way, descheduled threads can hold
live threads back from progressing. I haven't gone into more
detail there, but I wonder whether the GHC RTS could be
improved to keep a committing STM thread from being
descheduled, though seemingly that may create more potential
for starvation; or whether STM could be improved to make its
TVar locks preemptible when the owning trec/thread is
descheduled? Neither would be easy, but I'd really love to
see massive lightweight threads doing STM well in practice.<br>
</p>
<p>Best regards,</p>
<p>Compl</p>
<p><br>
</p>
<div>On 2020/7/24 12:57 AM, Christopher Allen wrote:<br>
</div>
<blockquote type="cite">
<div dir="ltr">It seems like you know how to run
practical tests for tuning thread count and contention
for throughput. Part of the reason you haven't gotten
a super clear answer is "it depends." You give up
fairness when you use STM instead of MVars or
equivalent structures. That means a long running
transaction might get stampeded by many small ones
invalidating it over and over. The long-running
transaction might never clear if the small
transactions keep moving the cheese. I mention this
because transaction runtime and size and count all
affect throughput and latency. What might be ideal for
one pattern of work might not be ideal for another.
Optimizing for overall throughput might make the
contention and fairness problems worse too. I've done
practical tests to optimize this in the past, both for
STM in Haskell and for RDBMS workloads.
<div><br>
</div>
<div>The next step is sometimes figuring out whether
you really need a data structure within a single STM
container or if perhaps you can break up your STM
container boundaries into zones or regions that
roughly map onto update boundaries. That should make
the transactions churn less. On the outside chance
you do need to touch more than one container in a
transaction, well, they compose.
<div><br>
</div>
<div>e.g. <a href="https://hackage.haskell.org/package/stm-containers" target="_blank">https://hackage.haskell.org/package/stm-containers</a></div>
<div><a href="https://hackage.haskell.org/package/ttrie" target="_blank">https://hackage.haskell.org/package/ttrie</a><br>
</div>
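<div>The zoning idea above can be sketched with plain `TVar`s (`moveBetweenZones` is an illustrative name): transactions that stay inside one zone never conflict with the other zone, yet a cross-zone update is still one atomic commit because STM actions compose.</div>

```haskell
import Control.Concurrent.STM

-- Two independent "zones" held in separate TVars.  Updates confined to
-- one zone don't touch the other, so they can't conflict with it; but
-- an update spanning both is still a single atomic transaction.
moveBetweenZones :: TVar Int -> TVar Int -> Int -> STM ()
moveBetweenZones from to n = do
  modifyTVar' from (subtract n)
  modifyTVar' to (+ n)
```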
<div><br>
</div>
<div>It also sounds like your question bumps into
Amdahl's Law a bit.</div>
<div><br>
</div>
<div>If all else fails, stop using STM and find
something more tuned to your problem space.</div>
<div><br>
</div>
<div>Hope this helps,</div>
<div>Chris Allen</div>
<div><br>
</div>
</div>
</div>
<br>
<div class="gmail_quote">
<div dir="ltr" class="gmail_attr">On Thu, Jul 23, 2020
at 9:53 AM YueCompl via Haskell-Cafe <<a href="mailto:haskell-cafe@haskell.org" target="_blank">haskell-cafe@haskell.org</a>>
wrote:<br>
</div>
<blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
<div>Hello Cafe,
<div><br>
</div>
<div>I'm working on an in-memory database. In
Client/Server mode I let each connected client submit
remote procedure calls that run in a dedicated
lightweight thread, modifying TVars in RAM per its
business needs. When many clients connect concurrently
and try to insert new data, if they trigger updates to
a global index (some TVar), the throughput drops
drastically. I reduced the shared state to a simple int
counter in a TVar and got the same symptom. Meanwhile
the parallelism feels okay when there is no hot TVar
conflict, or when M is not much greater than N.</div>
<div><br>
</div>
<div>As an empirical test workload, I have a `+RTS
-N10` server process. It handles 10 concurrent clients
okay, getting ~5x single-thread throughput; but when
handling 20 concurrent clients, each of the 10 CPUs can
only be driven to ~10% utilization, and the throughput
seems even worse than single-threaded. More clients can
even drive it into thrashing without making much
progress.</div>
<div><br>
</div>
<div>After reading [1] I can understand that pure STM
doesn't scale well, and I see it suggests [7] as
attractive and planned future work in that
direction.</div>
<div><br>
</div>
<div>But I can't find libraries or frameworks
addressing large-M-over-small-N scenarios; [1]
experimented with a designated N of parallelism, and
[7] is rather theoretical for my empirical needs.</div>
<div><br>
</div>
<div>Can you direct me to an available library
implementing the methodology proposed in [7], or to
other ways of tackling this problem?</div>
<div><br>
</div>
<div>I think the most difficult part is that a
transaction should commit with global indices (possibly
with unique constraints) atomically updated, and roll
back on any constraint violation; i.e. transactions
have to cover global state such as indices. Other
problems seem easier by comparison.</div>
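<div>For what it's worth, the unique-constraint part can be sketched with plain STM composition, assuming only the `stm` and `containers` packages (the `insertUnique`/`UniqueViolation` names are illustrative): throwing with `throwSTM` inside a transaction discards all of its writes, so inserting a record and updating a global unique index in the same `atomically` block either commit together or roll back together.</div>

```haskell
import Control.Concurrent.STM
import Control.Exception (Exception)
import qualified Data.Map.Strict as M

data UniqueViolation = UniqueViolation deriving Show
instance Exception UniqueViolation

-- Insert into a global unique index; throwing inside the transaction
-- aborts it, discarding every write made so far in the same tx.
insertUnique :: Ord k => TVar (M.Map k v) -> k -> v -> STM ()
insertUnique idxVar k v = do
  idx <- readTVar idxVar
  case M.lookup k idx of
    Just _  -> throwSTM UniqueViolation   -- rolls back the whole tx
    Nothing -> writeTVar idxVar (M.insert k v idx)

-- Convenience wrapper: report the violation as False instead of
-- propagating the exception.
tryInsertUnique :: Ord k => TVar (M.Map k v) -> k -> v -> STM Bool
tryInsertUnique idxVar k v =
  (insertUnique idxVar k v >> pure True)
    `catchSTM` \UniqueViolation -> pure False
```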
<div><br>
</div>
<div>Specifically, [7] states:</div>
<div><br>
</div>
<div>> It must be emphasized that all of the
mechanisms we deploy originate, in one form or
another, in the database literature from the 70s
and 80s. Our contribution is to adapt these
techniques to software transactional memory,
providing more effective solutions to important
STM problems than prior proposals.</div>
<div><br>
</div>
<div>I wonder whether any STM-based library has
packaged those techniques so they can be composed
right away? I don't really want to implement those
mechanisms myself, rebuilding many wheels from
scratch.</div>
<div><br>
</div>
<div>Best regards,</div>
<div>Compl</div>
<div><br>
</div>
<div><br>
</div>
<div>[1] Comparing the performance of concurrent
linked-list implementations in Haskell </div>
<div><a href="https://simonmar.github.io/bib/papers/concurrent-data.pdf" target="_blank">https://simonmar.github.io/bib/papers/concurrent-data.pdf</a></div>
<div><br>
</div>
<div>[7] M. Herlihy and E. Koskinen. Transactional
boosting: a methodology for highly-concurrent
transactional objects. In Proc. of PPoPP ’08,
pages 207–216. ACM Press, 2008.</div>
<div><a href="https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf" target="_blank">https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf</a></div>
<div><br>
</div>
</div>
_______________________________________________<br>
Haskell-Cafe mailing list<br>
To (un)subscribe, modify options or view archives go
to:<br>
<a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" rel="noreferrer" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>
Only members subscribed via the mailman list are
allowed to post.</blockquote>
</div>
<br clear="all">
<div><br>
</div>
-- <br>
<div dir="ltr">
<div dir="ltr">
<div>
<div dir="ltr">
<div dir="ltr">Chris Allen<br>
<div><span style="font-size:12.8px">Currently
working on </span><a href="http://haskellbook.com" target="_blank">http://haskellbook.com</a></div>
</div>
</div>
</div>
</div>
</div>
</blockquote>
</div>
</blockquote>
</div>
</div>
</blockquote>
</div>
</blockquote></div></div>