<html><head><meta http-equiv="Content-Type" content="text/html; charset=us-ascii"></head><body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">And I have some uneducated guess also in <a href="https://mail.haskell.org/pipermail/haskell-cafe/2020-July/132559.html" class="">https://mail.haskell.org/pipermail/haskell-cafe/2020-July/132559.html</a><div class=""><br class=""></div><div class="">> <span style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class="">And per my understanding, GHC's GC doesn't seek free segments within a heap, it instead will copy all live objects to a new heap after then swap the new heap to be the live one, so I assume memory (address space) fragmentation doesn't make much trouble for a GHC process, as for other runtimes.</span></div><div class=""><span style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class=""><br class=""></span></div><div class="">> <span style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class="">I suspect the difficulty resides in the detection of circular/cyclic circumstances wrt live data structures within the old heap, especially the circles form with arbitrary number of pointers of indirection. If the GC has to perform some dict lookup to decide if an object has been copied to new heap, that's O(n*log(n)) complexity in best case, where n is number of live objects in the heap.</span></div><br style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class=""><span style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class="">> To efficiently copy circular structures, one optimization I can imagine is to have a `new ptr` field in every heap object, then in copying another object with a pointer to one object, the `new ptr` can be read out and if not nil, assign the pointer field on another object' in the new heap to that value and it's done; or copy one object' to the new heap, and update the field on one object in the old heap pointing to the new heap. But I don't know details of GHC GC and can't imagine even feasibility of this technique. And even the new nonmoving GC may have similar difficulty to jump out of a circle when following pointers.</span><br style="caret-color: rgb(0, 0, 0); color: rgb(0, 0, 0);" class=""><div class=""><br class=""></div><div class="">FYI<br class=""><div><br class=""><blockquote type="cite" class=""><div class="">On 2020-07-30, at 21:53, YueCompl via ghc-devs <<a href="mailto:ghc-devs@haskell.org" class="">ghc-devs@haskell.org</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><meta http-equiv="Content-Type" content="text/html; charset=us-ascii" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Dear GHC Devs,<div class=""><br class=""></div><div class="">I realize I should seek your helps regarding my current situation.</div><div class=""><br class=""></div><div class="">TL;DR the most serious suffering is: After the heap loaded with many TVars with cyclic structures, GC will dominate my CPU utilization with little business progressing.</div><div class=""><br class=""></div><div class="">Nonmoving GC `-xn` with 8.10.1 helps, but the ceiling is not high enough for my case, obvious performance degrade starts at about ~350MB RSS, and falls unusable as RSS approaching 1GB. While I expect a GHC compiled process to serve 20~30GB in-mem data practically okay.</div><div class=""><br class=""></div><div class=""><a href="https://mail.haskell.org/pipermail/haskell-cafe/2020-July/132556.html" class="">https://mail.haskell.org/pipermail/haskell-cafe/2020-July/132556.html</a> contains most interesting conversations.</div><div class=""><br class=""></div><div class="">I found <a href="https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html" class="">https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html</a> much relevant, they managed to keep 100GB per instance, but switching to immutable data structures to reside within compact regions is not immediately feasible for my case, as the schema has to be re-designed, though I have my colleges started evaluating that option.</div><div class=""><br class=""></div><div class="">Best regards,</div><div class="">Compl</div><div class=""><br class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">Begin forwarded message:</div><br class="Apple-interchange-newline"><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;" class=""><span style="font-family: -webkit-system-font, "Helvetica Neue", Helvetica, sans-serif;" class=""><b class="">From: </b></span><span style="font-family: -webkit-system-font, Helvetica Neue, Helvetica, sans-serif;" class="">YueCompl via Haskell-Cafe <<a href="mailto:haskell-cafe@haskell.org" class="">haskell-cafe@haskell.org</a>><br class=""></span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;" class=""><span style="font-family: -webkit-system-font, "Helvetica Neue", Helvetica, sans-serif;" class=""><b class="">Subject: </b></span><span style="font-family: -webkit-system-font, Helvetica Neue, Helvetica, sans-serif;" class=""><b class="">[Haskell-cafe] For STM to practically serve large in-mem datasets with cyclic structures WAS: STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?</b><br class=""></span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;" class=""><span style="font-family: -webkit-system-font, "Helvetica Neue", Helvetica, sans-serif;" class=""><b class="">Date: </b></span><span style="font-family: -webkit-system-font, Helvetica Neue, Helvetica, sans-serif;" class="">2020-07-30 at 21:28:31 GMT+8<br class=""></span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;" class=""><span style="font-family: -webkit-system-font, "Helvetica Neue", Helvetica, sans-serif;" class=""><b class="">To: </b></span><span style="font-family: -webkit-system-font, Helvetica Neue, Helvetica, sans-serif;" class="">Haskell Cafe <<a href="mailto:haskell-cafe@haskell.org" class="">haskell-cafe@haskell.org</a>><br class=""></span></div><div style="margin-top: 0px; margin-right: 0px; margin-bottom: 0px; margin-left: 0px;" class=""><span style="font-family: -webkit-system-font, "Helvetica Neue", Helvetica, sans-serif;" class=""><b class="">Reply-To: </b></span><span style="font-family: -webkit-system-font, Helvetica Neue, Helvetica, sans-serif;" class="">YueCompl <<a href="mailto:compl.yue@icloud.com" class="">compl.yue@icloud.com</a>><br class=""></span></div><br class=""><div class=""><meta http-equiv="Content-Type" content="text/html; charset=us-ascii" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">For the record, overhead of STM over IO (or other means where manual composition of transactions needed) based concurrency control, is a price I'm willing to pay in my use case, as it's not machine-performance critical in distributing input data + parameters to a cluster of worker nodes, and collecting their results into permanent storage or a data pipeline. But to keep professionally crafting well synced, race-free scheduling code is barely affordable by my org, as shape of datasets, relationship between them and algorithms processing them are varying at fast paces, we have difficulty, or lack the willingness, to hire some workforce specifically to keep each new data pipeline race free, it has to be, but better at cost of machine-hours instead of human head counts. <div class=""><br class=""></div><div class="">While easily compositing stm code, wrapped in scriptable procedures, will enable our analysts to author the scheduling scripts without too much concerns. Then our programmers can focus on performance critical parts of the data processing code, like optimization of tight-loops.<div class=""><br class=""></div><div class="">Only if not in the tight loops, I think it's acceptable by us, that up to 2~3 order of <span style="caret-color: rgb(0, 0, 0);" class="">magnitude</span> slower for an stm solution compared to its best rivals, as long as it's scalable. For a (maybe cheating) example, if fully optimized code can return result in 10 ms after an analyst clicked a button, we don't bother if unoptimized stm script needs 10 second, so long as the result is correct.</div><div class=""><br class=""></div><div class="">In a philosophic thinking, I heard that AT&T had UNIX specifically designed for their Control panel, while their Data panel runs separate software (and on separate hardware obviously), while modern systems have powerful CPUs tempting us to squeeze more performance out of it, and SIMD instructions make it even more tempting, I think we'd better resist it when programming something belong to the Control panel per se, but do it in programming something belong to the Data panel. And appears Data panel programs are being shifted to GPUs nowadays, which feels right.</div><div class=""><br class=""></div><div class="">Regards,</div><div class="">Compl</div><div class=""><br class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On 2020-07-30, at 20:10, YueCompl via Haskell-Cafe <<a href="mailto:haskell-cafe@haskell.org" class="">haskell-cafe@haskell.org</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><meta http-equiv="Content-Type" content="text/html; charset=us-ascii" class=""><div style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">Hi Peter,<div class=""><br class=""></div><div class="">Great to hear from you! </div><div class=""><br class=""></div><div class="">For the record tskiplist (and stm-containers together) did improve my situation a great lot with respect to scalability at concurrency/parallelism!</div><div class=""><br class=""></div><div class="">I'm still far from the stage to squeeze last drops of performance, currently I'm just making sure performance wise concerns be reasonable during my PoC in correctness and ergonomics of my HPC architecture (an in-memory graph + out-of-core (mmap) array DBMS powered computation cluster, with shared storage), and after parallelism appears acceptable, I seemingly suffer from serious GC issue at up scaling on process working memory size. I'm suspecting it's because of the added more TVars and/or aggressive circular structures of them in my case, and can not find a way to overcome it by far.</div><div class=""><br class=""></div><div class="">Thanks for your detailed information!</div><div class=""><br class=""></div><div class="">Best regards,</div><div class="">Compl</div><div class=""><br class=""><div class=""><br class=""><blockquote type="cite" class=""><div class="">On 2020-07-30, at 19:19, Peter Robinson <<a href="mailto:pwr@lowerbound.io" class="">pwr@lowerbound.io</a>> wrote:</div><br class="Apple-interchange-newline"><div class=""><div dir="ltr" class=""><div class="gmail_quote"><div class="">Hi Compl,</div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div dir="ltr" class=""><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div class=""><p class="">>+ <span style="color:rgb(200,195,188);font-family:"PT Sans",-apple-system,BlinkMacSystemFont,"Segoe UI",Roboto,Oxygen-Sans,Cantarell,"Helvetica Neue",sans-serif;font-size:17px;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:0.024px;text-align:left;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(25,27,28);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline" class="">This package provides a
proof-of-concept implementation of a skip list in STM</span></p><p class="">This has to mean something but I can't figure out yet.<br class="">
</p><p class="">Dear Peter Robinson, I hope you can see this message and get in
the loop of discussion. </p></div></blockquote></div></div></blockquote><div class=""><br class=""></div><div class=""> The reason for adding this sentence was that tskiplist hasn't been optimized for production use. Later on, I wrote an implementation of a concurrent skip list with atomic operations that performs significantly better, but it's operations work in the IO monad. </div><br class="">I'm surprised to hear that you're getting poor performance even when using the stm-container package, which I believe was meant to be used in production. A while ago, I ran some benchmarks comparing concurrent dictionary data structures (such as stm-container) under various workloads. While STMContainers.Map wasn't as fast as the concurrent-hashtable package, the results indicate that the performance doesn't degrade too much under larger workloads.<br class=""><br class="">You can find these benchmark results here (10^6 randomly generated insertion/deletion/lookup requests distributed among 32 threads): <br class=""><a href="https://lowerbound.io/blog/bench2-32.html" class="">https://lowerbound.io/blog/bench2-32.html</a> <br class="">And some explanations about the benchmarks are here: <br class=""><a href="https://lowerbound.io/blog/2019-10-24_concurrent_hash_table_performance.html" class="">https://lowerbound.io/blog/2019-10-24_concurrent_hash_table_performance.html</a><br class=""> <br class="">One issue that I came across when implementing the tskiplist package was this: If a thread wants to insert some item into the skip list, it needs to search for the entry point by performing readTVar operations starting at the list head. So, on average, a thread will read O(log n) TVars (assuming a skip list of n items) and, if any of these O(log n) TVars are modified by a simultaneously running thread, the STM runtime will observe a (false) conflict and rerun the transaction. It's not clear to me how to resolve this issue without access to something like unreadTVar (see [1]). <br class=""><br class="">Best,<br class="">Peter<br class=""><br class="">[1] UnreadTVar: Extending Haskell Software Transactional Memory for Performance (2007) by Nehir Sonmez , Cristian Perfumo , Srdjan Stipic , Adrian Cristal , Osman S. Unsal , Mateo Valero.<br class=""></div><div class="gmail_quote"><br class=""></div></div>
_______________________________________________<br class="">Haskell-Cafe mailing list<br class="">To (un)subscribe, modify options or view archives go to:<br class=""><a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" class="">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br class="">Only members subscribed via the mailman list are allowed to post.</div></blockquote></div><br class=""></div></div>_______________________________________________<br class="">Haskell-Cafe mailing list<br class="">To (un)subscribe, modify options or view archives go to:<br class=""><a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" class="">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br class="">Only members subscribed via the mailman list are allowed to post.</div></blockquote></div><br class=""></div></div></div>_______________________________________________<br class="">Haskell-Cafe mailing list<br class="">To (un)subscribe, modify options or view archives go to:<br class=""><a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" class="">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br class="">Only members subscribed via the mailman list are allowed to post.</div></blockquote></div><br class=""></div></div>_______________________________________________<br class="">ghc-devs mailing list<br class=""><a href="mailto:ghc-devs@haskell.org" class="">ghc-devs@haskell.org</a><br class="">http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs<br class=""></div></blockquote></div><br class=""></div></body></html>