From haskell-cafe at chrisdone.com Thu Jul 2 08:42:59 2020 From: haskell-cafe at chrisdone.com (chris done) Date: Thu, 02 Jul 2020 09:42:59 +0100 Subject: [Haskell-cafe] =?utf-8?q?Is_there_a_handy_refl_or_refl_generator_?= =?utf-8?q?for_converting_GADT_types=3F?= In-Reply-To: <5CA586FE-C33A-4770-A7C6-CCBC05AD02F5@richarde.dev> References: <0cfde7a8-d6de-4fb9-b62d-a673d0d072d7@www.fastmail.com> <5CA586FE-C33A-4770-A7C6-CCBC05AD02F5@richarde.dev> Message-ID: One thing that does work nicely is record wild cards. I can write f X{..}=X{..}. So for a product type I’m pretty much set. I can also optionally manually update one or more fields if needed. If there was an equivalent of record wild cards for sum types, that would be swell. E.g. f = \case X a -> ... .. -> .. To have GHC fill out the remainder constructors with a simple verbatim restating of “lhs -> lhs”. But it doesn't exactly match up with GHC's normal exhaustiveness checker which goes deeper than the top constructor. It's fine, though, this doesn't happen _that_ often in my code. Cheers, Chris On Sun, Jun 28, 2020, at 11:51 AM, Richard Eisenberg wrote: > I think the general answer to your question is: no, you can't avoid this pattern match. In your particular case, the domain (Global Renamed) is a subset of the range (Global Generated), and so we can imagine a function that just changes the type without any fuss. This would, I'm pretty sure, be safe. But GHC has no notion of this kind of one-way transformation, so you're stuck just doing it via a manual pattern-match. > > I hope this helps! > Richard > >> On Jun 27, 2020, at 3:23 PM, chris done wrote: >> >> Hi all, >> >> I have stages in my compiler that convert e.g. Global Renamed -> Global Generated, etc. where certain constructors are available in certain stages. >> >> data GlobalRef s where >> FromIntegerGlobal :: GlobalRef s >> FromDecimalGlobal :: GlobalRef s >> InstanceGlobal :: !InstanceName -> GlobalRef Resolved >> >> E.g. after type class resolution is done, the InstanceGlobal constructor is available. But instances can't appear in the AST at other stages. >> >> In three stages, I've had to copy/paste this code: >> >> + refl = >> + case name of >> + FromIntegerGlobal -> FromIntegerGlobal >> + FromDecimalGlobal -> FromDecimalGlobal >> >> Because if I just put `name` then GHC will complain that the types are different, which is correct. But a straight-forward pattern match gives GHC the proof that they can be converted. >> >> Is there a handy way already implemented that could derive either this code or this proof for me? >> >> Cheers, >> >> Chris >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From zocca.marco at gmail.com Sat Jul 4 15:29:31 2020 From: zocca.marco at gmail.com (Marco Zocca) Date: Sat, 4 Jul 2020 17:29:31 +0200 Subject: [Haskell-cafe] ekg-core/RTS : bogus GC stats when running on Google Cloud Run Message-ID: Hi all, I have two services deployed on Google cloud infrastructure; Service 1 runs on Compute Engine and Service 2 on Cloud Run and I'd like to log their memory usage via the `ekg-core` library (https://hackage.haskell.org/package/ekg-core-0.1.1.7/docs/System-Metrics.html) (which is just a thin wrapper around GHC.Stats ). 
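For context: the rts.gc.* gauges that ekg-core registers are filled in from GHC.Stats, and GHC only collects those statistics when the program is started with the -T RTS option (supplied here via -with-rtsopts=-T, see the build details below). A minimal sketch that reads roughly the same numbers directly from GHC.Stats, bypassing ekg entirely, might look like this (field names are from the GHC.Stats record API; this is only an illustrative check, not the service code itself):

    import GHC.Stats (GCDetails (..), RTSStats (..), getRTSStats, getRTSStatsEnabled)

    main :: IO ()
    main = do
      enabled <- getRTSStatsEnabled      -- False unless run with +RTS -T
      if not enabled
        then putStrLn "RTS stats are disabled; the GC gauges would stay at 0"
        else do
          s <- getRTSStats
          putStrLn $ "max_live_bytes       = " ++ show (max_live_bytes s)
          putStrLn $ "gcdetails_live_bytes = " ++ show (gcdetails_live_bytes (gc s))
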
The logging bracket is basically this : mems <- newStore registerGcMetrics mems void $ concurrently io (loop mems) where loop ms = do m <- sampleAll ms ... (lookup the gauges from m and log their values) threadDelay dt loop ms I'm very puzzled by this: both rts.gc.current_bytes_used and rts.gc.max_bytes_used gauges return constant 0 in the case of Service 2 (the Cloud Run one), even though I'm using the same sampling/logging functionality and build options for both services. This is about where my knowledge ends; could this behaviour be due to the Google Cloud Run hypervisor ("gRun") implementing certain syscalls in a non-standard way (gRun syscall guide : https://gvisor.dev/docs/user_guide/compatibility/linux/amd64/) ? Thank you for any pointers to guides/manuals/computer wisdom. Details : Both are built with these options : -optl-pthread -optc-Os -threaded -rtsopts -with-rtsopts=-N -with-rtsopts=-T the only difference is that Service 2 has an additional flag -with-rtsopts=-M2G since Cloud Run services must work with 2 GB of memory at most. The container OS in both cases is Debian 10.4 ("Buster"). From zocca.marco at gmail.com Sat Jul 4 19:01:51 2020 From: zocca.marco at gmail.com (Marco Zocca) Date: Sat, 4 Jul 2020 21:01:51 +0200 Subject: [Haskell-cafe] ekg-core/RTS : bogus GC stats when running on Google Cloud Run In-Reply-To: References: Message-ID: Thinking a bit longer about this, this behaviour is perfectly reasonable in the "serverless" model; resources (both CPU and memory) are throttled down to 0 when the service is not processing requests, which is exactly what ekg picks up. Why logs are printed out even outside of requests is still a bit of a mystery, though .. > I'm very puzzled by this: both rts.gc.current_bytes_used and > rts.gc.max_bytes_used gauges return constant 0 in the case of Service > 2 (the Cloud Run one), even though I'm using the same sampling/logging > functionality and build options for both services. From ida.bzowska at gmail.com Tue Jul 7 13:41:12 2020 From: ida.bzowska at gmail.com (Ida Bzowska) Date: Tue, 7 Jul 2020 15:41:12 +0200 Subject: [Haskell-cafe] Haskell Love Conference (the 31st of July & 1st of August, 2020) In-Reply-To: References: <9ed57968297f0008379366660bf4ba273222492b.camel@joachim-breitner.de> Message-ID: Hey, Quick reminder, you have less than 12 hours if you want to submit something (CFP is open until the 8th of July; 0:01 a.m. PDT). Good luck, I keep my fingers crossed for you! λCheers, Ida Bzowska pt., 26 cze 2020 o 13:39 Ida Bzowska napisał(a): > Hi, > > Indeed, that's a very accurate comment :) so we are waiting for 30 minutes > long talks, but if you have a longer presentation in mind, we will try to > be agile and make a schedule that will handle it. The presentation flow > will be based on screen sharing for sure, but this time probably it would > not be zoom (we used it previously > https://www.youtube.com/watch?v=Z0w_pITUTyU). > <3 We are still waiting for submissions (till the 1st of July). <3 > > λCheers, > Ida Bzowska > > > > czw., 25 cze 2020 o 18:25 Joachim Breitner > napisał(a): > >> Hi, >> >> >> Am Freitag, den 19.06.2020, 12:40 +0200 schrieb Ida Bzowska: >> > Every speaker gets an avatar. If this thing will encourage you to take >> part in CFP, I assure you: you will get one! >> >> I can't deny it does… >> >> But the CFP could have a bit more information, e.g. suggested lengths >> of talks, and how you will handle the technicalities – do speakers just >> screen-share with zoom? 
>> >> Cheers, >> Joachim >> >> -- >> Joachim Breitner >> mail at joachim-breitner.de >> http://www.joachim-breitner.de/ >> >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From bernardobruno at gmail.com Tue Jul 7 20:07:41 2020 From: bernardobruno at gmail.com (Bruno Bernardo) Date: Tue, 7 Jul 2020 22:07:41 +0200 Subject: [Haskell-cafe] FMBC 2020 - Call for Participation Message-ID: [ Please distribute, apologies for multiple postings. ] ======================================================================== 2nd Workshop on Formal Methods for Blockchains (FMBC) 2020 - Call for Participation https://fmbc.gitlab.io/2020 July 20 and 21, 2020, Online, 6AM-8AM PDT Co-located with the 32nd International Conference on Computer-Aided Verification (CAV 2020) http://i-cav.org/2020/ --------------------------------------------------------- The FMBC workshop is a forum to identify theoretical and practical approaches of formal methods for Blockchain technology. Topics include, but are not limited to: * Formal models of Blockchain applications or concepts * Formal methods for consensus protocols * Formal methods for Blockchain-specific cryptographic primitives or protocols * Design and implementation of Smart Contract languages * Verification of Smart Contracts The list of lightning talks and conditionally accecpted papers is available on the FMBC 2020 website: https://fmbc.gitlab.io/2020/program.html There will be one keynote by Grigore Rosu, Professor at University of Illinois at Urbana-Champaign, USA and Founder of Runtime Verification. Registration Registration to FMBC 2020 is free but required. It is done through the CAV 2020 registration form: http://i-cav.org/2020/attending/ Please register before *July 10, 2020*. From kolar at fit.vut.cz Wed Jul 8 06:17:41 2020 From: kolar at fit.vut.cz (=?utf-8?B?RHXFoWFuIEtvbMOhxZk=?=) Date: Wed, 08 Jul 2020 08:17:41 +0200 Subject: [Haskell-cafe] Test on identity? Message-ID: <1988112.6UlOsRVAxy@pckolar> Dear Café, I'm trying to build a DAG from a binary tree. I don't think there's a big trouble. Nevertheless, I do even some transformations. Thus, I would like to know it is still a DAG, not adding, accidentally, a node. Is there any way, if I have data like data Ex = Val Int | Add Ex Ex so that I can test that some value Val i === Val i ? I mean, the pointers go to the same data box? I could do that via some IORefs, AFAIK, but I don't think it is feasible. Maybe to tune the algorithm... Best regards, Dusan -------------- next part -------------- An HTML attachment was scrubbed... URL: From simon.jakobi at googlemail.com Wed Jul 8 15:48:40 2020 From: simon.jakobi at googlemail.com (Simon Jakobi) Date: Wed, 8 Jul 2020 17:48:40 +0200 Subject: [Haskell-cafe] Test on identity? In-Reply-To: <1988112.6UlOsRVAxy@pckolar> References: <1988112.6UlOsRVAxy@pckolar> Message-ID: Hi Dusan, containers uses pointer equality in some places: https://github.com/haskell/containers/search?q=ptrEq&unscoped_q=ptrEq I'd suggest to read up on reallyUnsafePtrEquality#, before you rely on it though. Hope that helps! Simon Am Mi., 8. Juli 2020 um 08:18 Uhr schrieb Dušan Kolář : > > Dear Café, > > > I'm trying to build a DAG from a binary tree. I don't think there's a big trouble. Nevertheless, I do even some transformations. Thus, I would like to know it is still a DAG, not adding, accidentally, a node. > > > Is there any way, if I have data like > > > data Ex > > = Val Int > > | Add Ex Ex > > > > so that I can test that some value Val i === Val i ? 
I mean, the pointers go to the same data box? I could do that via some IORefs, AFAIK, but I don't think it is feasible. Maybe to tune the algorithm... > > > Best regards, > > > Dusan > > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. From ietf-dane at dukhovni.org Wed Jul 8 16:13:34 2020 From: ietf-dane at dukhovni.org (Viktor Dukhovni) Date: Wed, 8 Jul 2020 12:13:34 -0400 Subject: [Haskell-cafe] Test on identity? In-Reply-To: <1988112.6UlOsRVAxy@pckolar> References: <1988112.6UlOsRVAxy@pckolar> Message-ID: <20200708161334.GD20025@straasha.imrryr.org> On Wed, Jul 08, 2020 at 08:17:41AM +0200, Dušan Kolář wrote: > Nevertheless, I do even some transformations. Thus, I would like to know it is still a > DAG, not adding, accidentally, a node. > > Is there any way, if I have data like > > data Ex > = Val Int > | Add Ex Ex > > so that I can test that some value Val i === Val i ? I mean, the pointers go to the > same data box? I could do that via some IORefs, AFAIK, but I don't think it is > feasible. Maybe to tune the algorithm... If, for the same "n", two "distinct" leaf nodes "Val n" are possible, in what sense is what you have still a DAG? If there's a difference between: Add / \ / \ / \ v v Val 1 Val 1 and: Add / \ / \ \ / v v Val 1 then perhaps the data model is flawed by failing to capture the distinguishing attributes of distinct leaf objects. And of course you might also have: Add / \ / \ \ / v v Add / \ / \ / \ v v Val 1 Val 2 So the most portable approach would be to assign a unique serial number all the nodes, both "Add", and "Val", and check that there's only path from the root to each distinct node (by serial number). Or, equivalently, a recursive enumeration of all the serial numbers contains no duplicates. -- Viktor. From kolar at fit.vut.cz Wed Jul 8 16:42:07 2020 From: kolar at fit.vut.cz (=?UTF-8?B?RHXFoWFuIEtvbMOhxZk=?=) Date: Wed, 08 Jul 2020 18:42:07 +0200 Subject: [Haskell-cafe] Test on identity? In-Reply-To: <20200708161334.GD20025@straasha.imrryr.org> References: <1988112.6UlOsRVAxy@pckolar> <20200708161334.GD20025@straasha.imrryr.org> Message-ID: <68747679-1581-4BB4-922A-234A5103E550@fit.vut.cz> Well, it makes a difference for me if I have twice the same subtree or sharing one subtree from several places. Later, I add some markers, thus, I know it is already processed or not. Adding unique numbers and counting them makes sense if I know the result. If not then I don't know how to exploit it. But I may be tired too much already. :-( Anyway, probably more imperative style would be a better option. Thanks all, Dušan 8. července 2020 18:13:34 SELČ, Viktor Dukhovni napsal: >On Wed, Jul 08, 2020 at 08:17:41AM +0200, Dušan Kolář wrote: > >> Nevertheless, I do even some transformations. Thus, I would like to >know it is still a >> DAG, not adding, accidentally, a node. >> >> Is there any way, if I have data like >> >> data Ex >> = Val Int >> | Add Ex Ex >> >> so that I can test that some value Val i === Val i ? I mean, the >pointers go to the >> same data box? I could do that via some IORefs, AFAIK, but I don't >think it is >> feasible. Maybe to tune the algorithm... > >If, for the same "n", two "distinct" leaf nodes "Val n" are possible, >in >what sense is what you have still a DAG? 
If there's a difference >between: > > Add > / \ > / \ > / \ > v v > Val 1 Val 1 > >and: > > Add > / \ > / \ > \ / > v v > Val 1 > >then perhaps the data model is flawed by failing to capture the >distinguishing attributes of distinct leaf objects. And of course >you might also have: > > Add > / \ > / \ > \ / > v v > Add > / \ > / \ > / \ > v v > Val 1 Val 2 > >So the most portable approach would be to assign a unique serial number >all the nodes, both "Add", and "Val", and check that there's only path >from the root to each distinct node (by serial number). Or, >equivalently, >a recursive enumeration of all the serial numbers contains no >duplicates. > >-- > Viktor. >_______________________________________________ >Haskell-Cafe mailing list >To (un)subscribe, modify options or view archives go to: >http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jo at durchholz.org Wed Jul 8 17:27:09 2020 From: jo at durchholz.org (Joachim Durchholz) Date: Wed, 8 Jul 2020 19:27:09 +0200 Subject: [Haskell-cafe] Test on identity? In-Reply-To: <68747679-1581-4BB4-922A-234A5103E550@fit.vut.cz> References: <1988112.6UlOsRVAxy@pckolar> <20200708161334.GD20025@straasha.imrryr.org> <68747679-1581-4BB4-922A-234A5103E550@fit.vut.cz> Message-ID: <115f8e36-f6eb-3718-f071-a82c7e0679fd@durchholz.org> Hi Dusan, If you go imperative here, all code that builds on that data structure will be imperative as well, and you'll lose much of what makes Haskell interesting. Try this mental model: Step 1: Two objects are equal if all attributes are equal. Just value equality here, i.e. assuming a language where you access all attributes (reference or direct) through an accessor like a.x, two objects a and b are equal if all accessor chains (e.g. x.y.z) that end in primitive values give you primitive-equality (e.g. a.x.y.z == b.x.y.z, and this works for all valid accessor chains). (As you can see, equality is not a simple concept, and in languages where you have no primitives the definition becomes circular, but it's good enough for the model here.) Step 2: Define "identity" to be "equality under change". I.e. a and b are identical that if you assign to a.x.y.z, the same value will be found in b.x.y.z. This "identity is equality under change" definition captures not just two objects at identical addresses, but also proxies, network objects, files, and whatever there is. Step 3: Realize that if you have an immutable object, there is no relevant difference between equality and identity anymore. (You can make various formal statements about this.) Step 4: So for an immutable object, A B / \ is not just equal but identical to / \ / \ \ / x x x Step 5: You want to be able to have A / \ / \ x' x'' after some updates. That's not a problem! You "update" an object by creating a copy. Nothing prevents your code from creating an A(x',x'') tree when given an A(x,x) tree! This train of thought helped my wrap my mind around some ideas in Haskell; I hope it will help you, and possibly other readers. Everybody feel free to make it even more generally helpful :-) Regards, Jo Am 08.07.20 um 18:42 schrieb Dušan Kolář: > Well, it makes a difference for me if I have twice the same subtree or > sharing one subtree from several places. Later, I add some markers, > thus, I know it is already processed or not. > > Adding unique numbers and counting them makes sense if I know the > result. 
If not then I don't know how to exploit it. But I may be tired > too much already. :-( > > Anyway, probably more imperative style would be a better option. > > Thanks all, > > Dušan > > > 8. července 2020 18:13:34 SELČ, Viktor Dukhovni > napsal: > > On Wed, Jul 08, 2020 at 08:17:41AM +0200, Dušan Kolář wrote: > > Nevertheless, I do even some transformations. Thus, I would like > to know it is still a > DAG, not adding, accidentally, a node. > > Is there any way, if I have data like > > data Ex > = Val Int > | Add Ex Ex > > so that I can test that some value Val i === Val i ? I mean, the > pointers go to the > same data box? I could do that via some IORefs, AFAIK, but I > don't think it is > feasible. Maybe to tune the algorithm... > > > If, for the same "n", two "distinct" leaf nodes "Val n" are possible, in > what sense is what you have still a DAG? If there's a difference > between: > > Add > / \ > / \ > / \ > v v > Val 1 Val 1 > > and: > > Add > / \ > / \ > \ / > v v > Val 1 > > then perhaps the data model is flawed by failing to capture the > distinguishing attributes of distinct leaf objects. And of course > you might also have: > > Add > / \ > / \ > \ / > v v > Add > / \ > / \ > / \ > v v > Val 1 Val 2 > > So the most portable approach would be to assign a unique serial number > all the nodes, both "Add", and "Val", and check that there's only path > from the root to each distinct node (by serial number). Or, equivalently, > a recursive enumeration of all the serial numbers contains no duplicates. > > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. > From olf at aatal-apotheke.de Wed Jul 8 21:26:11 2020 From: olf at aatal-apotheke.de (Olaf Klinke) Date: Wed, 08 Jul 2020 23:26:11 +0200 Subject: [Haskell-cafe] Test on identity? Message-ID: > Dear Café, > > I'm trying to build a DAG from a binary tree. I don't think there's a > big trouble. > Nevertheless, I do even some transformations. Thus, I would like to > know it is still a > DAG, not adding, accidentally, a node. > > Is there any way, if I have data like > > data Ex > = Val Int > | Add Ex Ex > > so that I can test that some value Val i === Val i ? I mean, the > pointers go to the > same data box? I could do that via some IORefs, AFAIK, but I don't > think it is > feasible. Maybe to tune the algorithm... > > Best regards, > > Dusan So the binary tree is a value e :: Ex, right? And the DAG (directed acyclic graph) is an explicit representation of the internal pointer structure in e? Did I understand you right? This sounds like you should represent Ex as a fixed point and then invoke some fixed point magic like catamorphisms. Your transformation might have to be re-written as a cata/ana/hylomorphism, too. Below is a recipe to turn any element of any fixed point type into a graph, that is, into a list of nodes and edges. Of course this will loop if your data was not a finite DAG, e.g. due to self-reference. 
-- make type recursion for your type Ex explicit data ExF x = Val Int | Add x x deriving (Show) instance Functor ExF where fmap f (Val i) = Val i fmap f (Add x y) = Add (f x) (f y) instance Foldable ExF where foldMap _ (Val _) = mempty foldMap f (Add x y) = f x <> f y instance Traversable ExF where traverse f (Val i) = pure (Val i) traverse f (Add x y) = Add <$> (f x) <*> (f y) -- represent Ex via the general -- Fix :: (* -> *) -> * -- See e.g. package data-fix or recursion-schemes -- cataM below taken from the data-fix package type Ex = Fix ExF -- = Fix {unFix :: ExF (Fix ExF)} -- Add () () tells you the node is internal type ExNode = ExF () data GraphElem f = Node Int (f ()) | Edge Int Int instance Show (GraphElem ExF) where show (Node n (Val i)) = show n ++ ":Val " ++ show i show (Node n (Add _ _)) = show n ++ ":Add" show (Edge i j) = show i ++ " -> " ++ show j type Graph = [GraphElem ExF] type GraphM = StateT Int (Writer Graph) structure :: (Traversable f, MonadState Int m, MonadWriter [GraphElem f] m) => f Int -> m Int structure fi = do this <- get tell [Node this (void fi)] traverse (\child -> tell [Edge this child]) fi put (this+1) return this -- depth-first traversal. More generally dag has type -- (Traversable f) => Fix f -> [GraphElem f] -- and the Traversable instance determines the order -- of the traversal. dag :: Ex -> Graph dag = snd . runWriter . flip evalStateT 0 . cataM structure -- Cheers, Olaf From carter.schonwald at gmail.com Wed Jul 8 22:33:44 2020 From: carter.schonwald at gmail.com (Carter Schonwald) Date: Wed, 8 Jul 2020 18:33:44 -0400 Subject: [Haskell-cafe] Linking + zstd on FreeBSD In-Reply-To: References: Message-ID: When binding c libraries, I often prefix a custom/unique Descriptive sequence for All linker visible symbols. So Eg you could do vanmach_zstdlib_ as a prefix On Tue, Jun 30, 2020 at 12:32 PM Jinwoo Lee wrote: > The error message seems to say why. The GHC RTS already has xxhash.c and > zstd also has that file. I'm not sure how to resolve this though. > > > On Mon, Jun 29, 2020 at 7:44 AM Vanessa McHale wrote: > >> I tried linking against zstd within a Vagrant simulating FreeBSD. I get >> the following: >> >> ld: error: duplicate symbol: XXH64 >> >>> defined at xxhash.c >> >>> xxhash.o:(XXH64) in archive >> >> /home/vagrant/.cabal/store/ghc-8.8.3/zstd-0.1.2.0-c8ee757c8e8a7307779ab3cfbc91f3445940e16bbf8f9916f4c90432a0ac499e/lib/libHSzstd-0.1.2.0-c8ee757c8e8a7307779ab3cfbc91f3445940e16bbf8f9916f4c90432a0ac499e.a >> >>> defined at xxhash.c:693 >> (/wrkdirs/usr/ports/lang/ghc/work/ghc-8.8.3/rts/xxhash.c:693) >> >>> RTS.o:(.text.XXH64+0x0) in archive >> /usr/local/lib/ghc-8.8.3/rts/libHSrts.a >> >> ... >> >> I am using GHC version 8.8.3_1 and cabal-install 3.0 installed by 'sudo >> pkg install ...' >> >> (I was trying to run cabal install language-dickinson) >> >> Cheers, >> Vanessa McHale >> >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From vamchale at gmail.com Thu Jul 9 12:29:38 2020 From: vamchale at gmail.com (Vanessa McHale) Date: Thu, 9 Jul 2020 07:29:38 -0500 Subject: [Haskell-cafe] Linking + zstd on FreeBSD In-Reply-To: References: Message-ID: <6002626f-53bc-3292-feec-9e3d1db45a38@gmail.com> Fair enough, I suppose what I don't understand is: why is this only happening on FreeBSD (in Vagrant)? I am able to compile on Linux, Mac, and Windows! Cheers, Vanessa McHale On 7/8/20 5:33 PM, Carter Schonwald wrote: > When binding c libraries, I often prefix a custom/unique Descriptive > sequence  for All linker visible symbols.  > > So Eg you could do vanmach_zstdlib_ as a prefix > > On Tue, Jun 30, 2020 at 12:32 PM Jinwoo Lee > wrote: > > The error message seems to say why. The GHC RTS already has > xxhash.c and zstd also has that file. I'm not sure how to resolve > this though. > > > On Mon, Jun 29, 2020 at 7:44 AM Vanessa McHale > wrote: > > I tried linking against zstd within a Vagrant simulating > FreeBSD. I get > the following: > > ld: error: duplicate symbol: XXH64 > >>> defined at xxhash.c > >>>            xxhash.o:(XXH64) in archive > /home/vagrant/.cabal/store/ghc-8.8.3/zstd-0.1.2.0-c8ee757c8e8a7307779ab3cfbc91f3445940e16bbf8f9916f4c90432a0ac499e/lib/libHSzstd-0.1.2.0-c8ee757c8e8a7307779ab3cfbc91f3445940e16bbf8f9916f4c90432a0ac499e.a > >>> defined at xxhash.c:693 > (/wrkdirs/usr/ports/lang/ghc/work/ghc-8.8.3/rts/xxhash.c:693) > >>>            RTS.o:(.text.XXH64+0x0) in archive > /usr/local/lib/ghc-8.8.3/rts/libHSrts.a > > ... > > I am using GHC version 8.8.3_1 and cabal-install 3.0 installed > by 'sudo > pkg install ...' > > (I was trying to run cabal install language-dickinson) > > Cheers, > Vanessa McHale > > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. > -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 659 bytes Desc: OpenPGP digital signature URL: From carter.schonwald at gmail.com Thu Jul 9 18:16:45 2020 From: carter.schonwald at gmail.com (Carter Schonwald) Date: Thu, 9 Jul 2020 14:16:45 -0400 Subject: [Haskell-cafe] Linking + zstd on FreeBSD In-Reply-To: <6002626f-53bc-3292-feec-9e3d1db45a38@gmail.com> References: <6002626f-53bc-3292-feec-9e3d1db45a38@gmail.com> Message-ID: That’s a really good question! I guess one way to narrow it down is what’s the linker implementation on each env? I think you can arrange to use the same ld impl on bsd and Linux if you want, since both are elf platforms. Are they both doing dynamic or static linking? Are they using the same Ld implementation? It’s certainly strange to have there be a difference in behavior between two x64 elf platforms, On Thu, Jul 9, 2020 at 8:31 AM Vanessa McHale wrote: > Fair enough, I suppose what I don't understand is: why is this only > happening on FreeBSD (in Vagrant)? I am able to compile on Linux, Mac, and > Windows! 
> > Cheers, > Vanessa McHale > On 7/8/20 5:33 PM, Carter Schonwald wrote: > > When binding c libraries, I often prefix a custom/unique Descriptive > sequence for All linker visible symbols. > > So Eg you could do vanmach_zstdlib_ as a prefix > > On Tue, Jun 30, 2020 at 12:32 PM Jinwoo Lee wrote: > >> The error message seems to say why. The GHC RTS already has xxhash.c and >> zstd also has that file. I'm not sure how to resolve this though. >> >> >> On Mon, Jun 29, 2020 at 7:44 AM Vanessa McHale >> wrote: >> >>> I tried linking against zstd within a Vagrant simulating FreeBSD. I get >>> the following: >>> >>> ld: error: duplicate symbol: XXH64 >>> >>> defined at xxhash.c >>> >>> xxhash.o:(XXH64) in archive >>> >>> /home/vagrant/.cabal/store/ghc-8.8.3/zstd-0.1.2.0-c8ee757c8e8a7307779ab3cfbc91f3445940e16bbf8f9916f4c90432a0ac499e/lib/libHSzstd-0.1.2.0-c8ee757c8e8a7307779ab3cfbc91f3445940e16bbf8f9916f4c90432a0ac499e.a >>> >>> defined at xxhash.c:693 >>> (/wrkdirs/usr/ports/lang/ghc/work/ghc-8.8.3/rts/xxhash.c:693) >>> >>> RTS.o:(.text.XXH64+0x0) in archive >>> /usr/local/lib/ghc-8.8.3/rts/libHSrts.a >>> >>> ... >>> >>> I am using GHC version 8.8.3_1 and cabal-install 3.0 installed by 'sudo >>> pkg install ...' >>> >>> (I was trying to run cabal install language-dickinson) >>> >>> Cheers, >>> Vanessa McHale >>> >>> >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed to post. >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From allbery.b at gmail.com Thu Jul 9 18:23:07 2020 From: allbery.b at gmail.com (Brandon Allbery) Date: Thu, 9 Jul 2020 14:23:07 -0400 Subject: [Haskell-cafe] Linking + zstd on FreeBSD In-Reply-To: References: <6002626f-53bc-3292-feec-9e3d1db45a38@gmail.com> Message-ID: I can think of a few more possibilities. Has anyone checked whether the RTS symbol is, say, conditional on non-glibc? On 7/9/20, Carter Schonwald wrote: > That’s a really good question! > > I guess one way to narrow it down is what’s the linker implementation on > each env? I think you can arrange to use the same ld impl on bsd and Linux > if you want, since both are elf platforms. Are they both doing dynamic or > static linking? Are they using the same Ld implementation? It’s certainly > strange to have there be a difference in behavior between two x64 elf > platforms, > > On Thu, Jul 9, 2020 at 8:31 AM Vanessa McHale wrote: > >> Fair enough, I suppose what I don't understand is: why is this only >> happening on FreeBSD (in Vagrant)? I am able to compile on Linux, Mac, >> and >> Windows! >> >> Cheers, >> Vanessa McHale >> On 7/8/20 5:33 PM, Carter Schonwald wrote: >> >> When binding c libraries, I often prefix a custom/unique Descriptive >> sequence for All linker visible symbols. 
>> >> So Eg you could do vanmach_zstdlib_ as a prefix >> >> On Tue, Jun 30, 2020 at 12:32 PM Jinwoo Lee wrote: >> >>> The error message seems to say why. The GHC RTS already has xxhash.c and >>> zstd also has that file. I'm not sure how to resolve this though. >>> >>> >>> On Mon, Jun 29, 2020 at 7:44 AM Vanessa McHale >>> wrote: >>> >>>> I tried linking against zstd within a Vagrant simulating FreeBSD. I get >>>> the following: >>>> >>>> ld: error: duplicate symbol: XXH64 >>>> >>> defined at xxhash.c >>>> >>> xxhash.o:(XXH64) in archive >>>> >>>> /home/vagrant/.cabal/store/ghc-8.8.3/zstd-0.1.2.0-c8ee757c8e8a7307779ab3cfbc91f3445940e16bbf8f9916f4c90432a0ac499e/lib/libHSzstd-0.1.2.0-c8ee757c8e8a7307779ab3cfbc91f3445940e16bbf8f9916f4c90432a0ac499e.a >>>> >>> defined at xxhash.c:693 >>>> (/wrkdirs/usr/ports/lang/ghc/work/ghc-8.8.3/rts/xxhash.c:693) >>>> >>> RTS.o:(.text.XXH64+0x0) in archive >>>> /usr/local/lib/ghc-8.8.3/rts/libHSrts.a >>>> >>>> ... >>>> >>>> I am using GHC version 8.8.3_1 and cabal-install 3.0 installed by 'sudo >>>> pkg install ...' >>>> >>>> (I was trying to run cabal install language-dickinson) >>>> >>>> Cheers, >>>> Vanessa McHale >>>> >>>> >>>> _______________________________________________ >>>> Haskell-Cafe mailing list >>>> To (un)subscribe, modify options or view archives go to: >>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>> Only members subscribed via the mailman list are allowed to post. >>> >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed to post. >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > -- brandon s allbery kf8nh allbery.b at gmail.com From olf at aatal-apotheke.de Thu Jul 9 18:26:31 2020 From: olf at aatal-apotheke.de (Olaf Klinke) Date: Thu, 09 Jul 2020 20:26:31 +0200 Subject: [Haskell-cafe] Test on identity? Message-ID: <8b2bfdcb2babdfe69aa99ca87c66c8cb8b0c039c.camel@aatal-apotheke.de> Joachim Durchholz wrote: > This "identity is equality under change" definition captures not > just > two objects at identical addresses, but also proxies, network > objects, > files, and whatever there is. > > Step 3: Realize that if you have an immutable object, there is no > relevant difference between equality and identity anymore. (You can > make > various formal statements about this.) Is that what is called "extensional equality"? Values a,b :: A are extensionally equal if they behave the same in all contexts. That is, there is no type X and no function f :: A -> X such that f a can be observed to be different from f b, e.g. f a throws an exception and f b does not, or X is in Eq and f a /= f b. Can one write a function (even using reallyUnsafePtrEquality#) that distinguishes the following? 
a = Add (Val 1) (Val 1) b = let v = Val 1 in Add v v I tried: import GHC.Exts peq :: a -> a -> Bool peq x y = I# (reallyUnsafePtrEquality# x y) == 1 f :: Ex -> Bool f (Val _) = False f (Add x y) = peq x y But I get (even when I make the fields of Add strict): peq a a == True f a == False f b == False Olaf From zemyla at gmail.com Thu Jul 9 20:12:47 2020 From: zemyla at gmail.com (Zemyla) Date: Thu, 9 Jul 2020 15:12:47 -0500 Subject: [Haskell-cafe] Test on identity? In-Reply-To: <8b2bfdcb2babdfe69aa99ca87c66c8cb8b0c039c.camel@aatal-apotheke.de> References: <8b2bfdcb2babdfe69aa99ca87c66c8cb8b0c039c.camel@aatal-apotheke.de> Message-ID: A safer way of doing object identity is with System.StableName. On Thu, Jul 9, 2020, 13:27 Olaf Klinke wrote: > Joachim Durchholz wrote: > > This "identity is equality under change" definition captures not > > just > > two objects at identical addresses, but also proxies, network > > objects, > > files, and whatever there is. > > > > Step 3: Realize that if you have an immutable object, there is no > > relevant difference between equality and identity anymore. (You can > > make > > various formal statements about this.) > > Is that what is called "extensional equality"? Values a,b :: A are > extensionally equal if they behave the same in all contexts. That is, > there is no type X and no function f :: A -> X such that f a can be > observed to be different from f b, e.g. f a throws an exception and f b > does not, or X is in Eq and f a /= f b. > Can one write a function (even using reallyUnsafePtrEquality#) that > distinguishes the following? > a = Add (Val 1) (Val 1) > b = let v = Val 1 in Add v v > > I tried: > > import GHC.Exts > peq :: a -> a -> Bool > peq x y = I# (reallyUnsafePtrEquality# x y) == 1 > f :: Ex -> Bool > f (Val _) = False > f (Add x y) = peq x y > > But I get (even when I make the fields of Add strict): > peq a a == True > f a == False > f b == False > > Olaf > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ietf-dane at dukhovni.org Thu Jul 9 20:22:16 2020 From: ietf-dane at dukhovni.org (Viktor Dukhovni) Date: Thu, 9 Jul 2020 16:22:16 -0400 Subject: [Haskell-cafe] Test on identity? In-Reply-To: References: <8b2bfdcb2babdfe69aa99ca87c66c8cb8b0c039c.camel@aatal-apotheke.de> Message-ID: <20200709202216.GO20025@straasha.imrryr.org> On Thu, Jul 09, 2020 at 03:12:47PM -0500, Zemyla wrote: > A safer way of doing object identity is with System.StableName. Minor correction: System.Mem.StableName https://hackage.haskell.org/package/base-4.14.0.0/docs/System-Mem-StableName.html with the internals in: https://hackage.haskell.org/package/base-4.14.0.0/docs/GHC-StableName.html -- Viktor. From carter.schonwald at gmail.com Thu Jul 9 21:44:15 2020 From: carter.schonwald at gmail.com (Carter Schonwald) Date: Thu, 9 Jul 2020 17:44:15 -0400 Subject: [Haskell-cafe] Linking + zstd on FreeBSD In-Reply-To: References: <6002626f-53bc-3292-feec-9e3d1db45a38@gmail.com> Message-ID: That’s a really good point about conditional compilation / cpp in the rts across platforms. On Thu, Jul 9, 2020 at 2:23 PM Brandon Allbery wrote: > I can think of a few more possibilities. Has anyone checked whether > the RTS symbol is, say, conditional on non-glibc? 
> > On 7/9/20, Carter Schonwald wrote: > > That’s a really good question! > > > > I guess one way to narrow it down is what’s the linker implementation on > > each env? I think you can arrange to use the same ld impl on bsd and > Linux > > if you want, since both are elf platforms. Are they both doing dynamic > or > > static linking? Are they using the same Ld implementation? It’s > certainly > > strange to have there be a difference in behavior between two x64 elf > > platforms, > > > > On Thu, Jul 9, 2020 at 8:31 AM Vanessa McHale > wrote: > > > >> Fair enough, I suppose what I don't understand is: why is this only > >> happening on FreeBSD (in Vagrant)? I am able to compile on Linux, Mac, > >> and > >> Windows! > >> > >> Cheers, > >> Vanessa McHale > >> On 7/8/20 5:33 PM, Carter Schonwald wrote: > >> > >> When binding c libraries, I often prefix a custom/unique Descriptive > >> sequence for All linker visible symbols. > >> > >> So Eg you could do vanmach_zstdlib_ as a prefix > >> > >> On Tue, Jun 30, 2020 at 12:32 PM Jinwoo Lee wrote: > >> > >>> The error message seems to say why. The GHC RTS already has xxhash.c > and > >>> zstd also has that file. I'm not sure how to resolve this though. > >>> > >>> > >>> On Mon, Jun 29, 2020 at 7:44 AM Vanessa McHale > >>> wrote: > >>> > >>>> I tried linking against zstd within a Vagrant simulating FreeBSD. I > get > >>>> the following: > >>>> > >>>> ld: error: duplicate symbol: XXH64 > >>>> >>> defined at xxhash.c > >>>> >>> xxhash.o:(XXH64) in archive > >>>> > >>>> > /home/vagrant/.cabal/store/ghc-8.8.3/zstd-0.1.2.0-c8ee757c8e8a7307779ab3cfbc91f3445940e16bbf8f9916f4c90432a0ac499e/lib/libHSzstd-0.1.2.0-c8ee757c8e8a7307779ab3cfbc91f3445940e16bbf8f9916f4c90432a0ac499e.a > >>>> >>> defined at xxhash.c:693 > >>>> (/wrkdirs/usr/ports/lang/ghc/work/ghc-8.8.3/rts/xxhash.c:693) > >>>> >>> RTS.o:(.text.XXH64+0x0) in archive > >>>> /usr/local/lib/ghc-8.8.3/rts/libHSrts.a > >>>> > >>>> ... > >>>> > >>>> I am using GHC version 8.8.3_1 and cabal-install 3.0 installed by > 'sudo > >>>> pkg install ...' > >>>> > >>>> (I was trying to run cabal install language-dickinson) > >>>> > >>>> Cheers, > >>>> Vanessa McHale > >>>> > >>>> > >>>> _______________________________________________ > >>>> Haskell-Cafe mailing list > >>>> To (un)subscribe, modify options or view archives go to: > >>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > >>>> Only members subscribed via the mailman list are allowed to post. > >>> > >>> _______________________________________________ > >>> Haskell-Cafe mailing list > >>> To (un)subscribe, modify options or view archives go to: > >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > >>> Only members subscribed via the mailman list are allowed to post. > >> > >> _______________________________________________ > >> Haskell-Cafe mailing list > >> To (un)subscribe, modify options or view archives go to: > >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > >> Only members subscribed via the mailman list are allowed to post. > > > > > -- > brandon s allbery kf8nh > allbery.b at gmail.com > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carter.schonwald at gmail.com Thu Jul 9 21:46:28 2020 From: carter.schonwald at gmail.com (Carter Schonwald) Date: Thu, 9 Jul 2020 17:46:28 -0400 Subject: [Haskell-cafe] Test on identity? 
In-Reply-To: <20200709202216.GO20025@straasha.imrryr.org> References: <8b2bfdcb2babdfe69aa99ca87c66c8cb8b0c039c.camel@aatal-apotheke.de> <20200709202216.GO20025@straasha.imrryr.org> Message-ID: Stable names are great! https://hackage.haskell.org/package/data-reify And similar packages on hackage are a pure interfsce for them. I think they’re also used in eds ersatz package sortah. On Thu, Jul 9, 2020 at 4:23 PM Viktor Dukhovni wrote: > On Thu, Jul 09, 2020 at 03:12:47PM -0500, Zemyla wrote: > > > A safer way of doing object identity is with System.StableName. > > Minor correction: System.Mem.StableName > > > https://hackage.haskell.org/package/base-4.14.0.0/docs/System-Mem-StableName.html > > with the internals in: > > > https://hackage.haskell.org/package/base-4.14.0.0/docs/GHC-StableName.html > > -- > Viktor. > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jo at durchholz.org Fri Jul 10 06:49:06 2020 From: jo at durchholz.org (Joachim Durchholz) Date: Fri, 10 Jul 2020 08:49:06 +0200 Subject: [Haskell-cafe] Test on identity? In-Reply-To: <8b2bfdcb2babdfe69aa99ca87c66c8cb8b0c039c.camel@aatal-apotheke.de> References: <8b2bfdcb2babdfe69aa99ca87c66c8cb8b0c039c.camel@aatal-apotheke.de> Message-ID: Am 09.07.20 um 20:26 schrieb Olaf Klinke: > Joachim Durchholz wrote: >> This "identity is equality under change" definition captures not >> just >> two objects at identical addresses, but also proxies, network >> objects, >> files, and whatever there is. >> >> Step 3: Realize that if you have an immutable object, there is no >> relevant difference between equality and identity anymore. (You can >> make >> various formal statements about this.) > > Is that what is called "extensional equality"? Values a,b :: A are > extensionally equal if they behave the same in all contexts. To fully treat this, you have to explore some pretty interesting waters. First, "all contexts" can mean pretty different things. E.g. in Haskell (without unsafe* and similar), everything is immutable, so you have contexts that exist in languages with mutability but not in Haskell, and you'd have to define what "all contexts" means for a general term like "extensional equality". Second, "behave the same" means that for all functions, the result it equal. That's a recursive definition. You need fixed-point theory to turn that statement into an actual definition. And that's surprisingly tricky. One fixed point is to say "the largest set of equal objects", which means everything is equal to everything - fits the above definition (applying any function to anything will trivially return results that are equal under this definition), but is obviously not what we wanted. Another one would be to have what we'd intuitively define as equality. Say, in Haskell, the minimum set of equalities that makes different constructor calls unequal. (Plus a handful of clerical definitions to catch special cases like integers.) Another one would be to have what we'd intuitively define as identity. Plus various other fixed points. For example, consider proxies - say, an object that talks to a remote machine to produce its function results. Now if you assume a proxy for integer values, is the proxy equal to an arbitrary integer you may have locally? 
Again, this depends on what fixed point you choose for your equality definition. You can define that as an exception, you consider two objects equal if their functions return equal results, except for those that inspect the proxy-specific state; then proxy and local value are equal. Or you don't make an exception, then the two are not equal. From a mathematical standpoint, either fixed point will satisfy the above recursive definition (definition template, if you will). From a computing standpoint, you'll find that depending on context, you want one or the other! There are different types of proxy objects, and you can have different kinds of equality depending on how you treat them. That multitude of equality functions is pretty useless in a programming context; nobody wants to mentally deal with a gazillion of subtly different equalities! So I belive what one should do in practice is to have converter functions. E.g. one that turns an Int proxy into an Int, merely by stripping the proxy-specific functions. That keeps the special cases to the place where they belong - all those types that have funky special kinds of equality. Mutable data is another case of this. The equality of a mutable object can be defined as identity, and a converter function returns an immutable copy so that equality is what's usually considered "value equality" (equals() in Java). (Languages that do not cleanly separate mutable and immutable types will still have to deal with two equalities, value equality and identity... well, the above is type and language design theory, practical languages are always a set of trade-offs, restricted by the limited knowledge and experience of the language designer. I guess that's why we have so many programming languages.) Regards, Jo From johannes.waldmann at htwk-leipzig.de Fri Jul 10 12:51:43 2020 From: johannes.waldmann at htwk-leipzig.de (Johannes Waldmann) Date: Fri, 10 Jul 2020 14:51:43 +0200 Subject: [Haskell-cafe] Test on identity? Message-ID: > Stable names are great! ... > I think they’re also used in eds ersatz package sortah. Yes. This is the package https://hackage.haskell.org/package/ersatz This is where they are used https://github.com/ekmett/ersatz/blob/master/src/Ersatz/Problem.hs#L149 - J.W. From olf at aatal-apotheke.de Sat Jul 11 14:19:07 2020 From: olf at aatal-apotheke.de (Olaf Klinke) Date: Sat, 11 Jul 2020 16:19:07 +0200 Subject: [Haskell-cafe] Test on identity? Message-ID: <395f7a9196f76468e23f036a36d43405d90a1632.camel@aatal-apotheke.de> > Am 09.07.20 um 20:26 schrieb Olaf Klinke: > > Joachim Durchholz wrote: > >> This "identity is equality under change" definition captures not > >> just > >> two objects at identical addresses, but also proxies, network > >> objects, > >> files, and whatever there is. > >> > >> Step 3: Realize that if you have an immutable object, there is no > >> relevant difference between equality and identity anymore. (You > can > >> make > >> various formal statements about this.) > > > > Is that what is called "extensional equality"? Values a,b :: A are > > extensionally equal if they behave the same in all contexts. > > To fully treat this, you have to explore some pretty interesting > waters. > > First, "all contexts" can mean pretty different things. > E.g. in Haskell (without unsafe* and similar), everything is > immutable, > so you have contexts that exist in languages with mutability but not > in > Haskell, and you'd have to define what "all contexts" means for a > general term like "extensional equality". 
> > Second, "behave the same" means that for all functions, the result it > equal. > That's a recursive definition. You need fixed-point theory to turn > that > statement into an actual definition. > And that's surprisingly tricky. > > One fixed point is to say "the largest set of equal objects", which > means everything is equal to everything - fits the above definition > (applying any function to anything will trivially return results > that > are equal under this definition), but is obviously not what we > wanted. > > Another one would be to have what we'd intuitively define as > equality. > Say, in Haskell, the minimum set of equalities that makes different > constructor calls unequal. (Plus a handful of clerical definitions > to > catch special cases like integers.) > > Another one would be to have what we'd intuitively define as > identity. > > Plus various other fixed points. > > For example, consider proxies - say, an object that talks to a > remote > machine to produce its function results. > Now if you assume a proxy for integer values, is the proxy equal to > an > arbitrary integer you may have locally? > Again, this depends on what fixed point you choose for your equality > definition. You can define that as an exception, you consider two > objects equal if their functions return equal results, except for > those > that inspect the proxy-specific state; then proxy and local value > are > equal. Or you don't make an exception, then the two are not equal. > From a mathematical standpoint, either fixed point will satisfy the > above recursive definition (definition template, if you will). From > a > computing standpoint, you'll find that depending on context, you > want > one or the other! > > There are different types of proxy objects, and you can have > different > kinds of equality depending on how you treat them. > > That multitude of equality functions is pretty useless in a > programming > context; nobody wants to mentally deal with a gazillion of subtly > different equalities! > So I belive what one should do in practice is to have converter > functions. E.g. one that turns an Int proxy into an Int, merely by > stripping the proxy-specific functions. > That keeps the special cases to the place where they belong - all > those > types that have funky special kinds of equality. > Mutable data is another case of this. The equality of a mutable > object > can be defined as identity, and a converter function returns an > immutable copy so that equality is what's usually considered "value > equality" (equals() in Java). > (Languages that do not cleanly separate mutable and immutable types > will > still have to deal with two equalities, value equality and > identity... > well, the above is type and language design theory, practical > languages > are always a set of trade-offs, restricted by the limited knowledge > and > experience of the language designer. I guess that's why we have so > many > programming languages.) > I find it easier to intuitively define what extensional inequality means: As a domain theorist I declare two values unequal if there is an open set (a semi-decidable property) containing one but not the other value. For pure types t (not involving IO) that means there is a function semidecide :: t -> () that returns () for the one value and bottom for the other. I can not say how this would extend to impure types and I tend to agree with you that the notion of equality then depends on the intended semantics. 
Below the semidecision would have type t -> I0 () However, using StableName (thanks, Zemyla and Viktor!) indeed one is able to detect sharing. It works only after forcing the values, though. Apparently StableName does not work on thunks. Olaf ghci> import System.Mem.StableName ghci> import Control.Applicative ghci> data Ex = Val Int | Add Ex Ex deriving (Show) ghci> a = Add (Val 1) (Val 1) ghci> b = let v = Val 1 in Add v v ghci> show a "Add (Val 1) (Val 1)" ghci> show b "Add (Val 1) (Val 1)" ghci> let f :: Ex -> IO Bool; f (Val _) = return False; f (Add x y) = liftA2 eqStableName (makeStableName x) (makeStableName y) ghci> f a False ghci> f b True From kot.tom97 at gmail.com Sat Jul 11 21:26:28 2020 From: kot.tom97 at gmail.com (Tom Westerhout) Date: Sat, 11 Jul 2020 21:26:28 +0000 Subject: [Haskell-cafe] Caching functions compiled with llvm-hs Message-ID: Hello, Suppose I've written some function f :: Foo -> Bar using LLVM IR. Now I'd like to compile it on first invocation with llvm-hs and cache the obtained function pointer. Basically something like this f :: Foo -> Bar f x = fImpl x where fImpl = unsafePerformIO $ flag <- isAlreadyCompiled -- check the cache if flag fetchFunc -- get the compiled code from cache else compileWithLLVM -- build LLVM IR, compile it, and update the cache A very primitive JIT compiler. What is the best way to do this with llvm-hs? All standard examples using LLVM.OrcJIT or LLVM.ExecutionEngine show how to compile a function and then immediately execute it. I can't quite figure out a safe way to keep the FunPtr... Big code bases like Accelerate or JuliaLang do achieve this somehow, but are quite difficult to understand for the uninitiated. Any advice is highly appreciated! Cheers, Tom From lemming at henning-thielemann.de Sat Jul 11 22:08:35 2020 From: lemming at henning-thielemann.de (Henning Thielemann) Date: Sun, 12 Jul 2020 00:08:35 +0200 (CEST) Subject: [Haskell-cafe] Caching functions compiled with llvm-hs In-Reply-To: References: Message-ID: On Sat, 11 Jul 2020, Tom Westerhout wrote: > A very primitive JIT compiler. What is the best way to do this with > llvm-hs? All standard examples using LLVM.OrcJIT or > LLVM.ExecutionEngine show how to compile a function and then > immediately execute it. I can't quite figure out a safe way to keep > the FunPtr... As far as I know, the compiled function is valid as long as the LLVM.ExecutionEngine exists. Thus I would define the following: data JITFunPtr f = JITFunPtr (ForeignPtr ExecutionEngine) (FunPtr f) with DisposeExecutionEngine as finalizer for the ForeignPtr. After every call of the FunPtr function you have to 'touchForeignPtr executionEngine'. Alternatively, you could work with 'f' instead of 'FunPtr f' and add a (touchForeignPtr executionEngine) to the imported 'f'. This is what I do in llvm-tf: http://hackage.haskell.org/package/llvm-tf-9.2/docs/LLVM-ExecutionEngine.html#v:getExecutionFunction From kot.tom97 at gmail.com Sun Jul 12 10:28:06 2020 From: kot.tom97 at gmail.com (Tom Westerhout) Date: Sun, 12 Jul 2020 10:28:06 +0000 Subject: [Haskell-cafe] Caching functions compiled with llvm-hs In-Reply-To: References: Message-ID: On 11/07/2020, Georgi Lyubenov wrote: > Is there something wrong with your idea? 
(other than ideological issues > with unsafePerformIO - I guess then the standard approach would be to use > some State holding your compiled functions or a Reader over an MVar holding > your compiled functions) The part that feels wrong here is that one has to create a new Module for every single function. I always thought of LLVM Modules as kind of compilation units. Or is this an okay-ish approach? > > The only thing you should make sure to do is add a NOINLINE to wherever the > unsafePerformIO is, so that it doesn't get inlined and executed more than > once. Oh I totally forgot! Thank you for reminding! Cheers, Tom From lemming at henning-thielemann.de Sun Jul 12 10:32:27 2020 From: lemming at henning-thielemann.de (Henning Thielemann) Date: Sun, 12 Jul 2020 12:32:27 +0200 (CEST) Subject: [Haskell-cafe] Caching functions compiled with llvm-hs In-Reply-To: References: Message-ID: On Sun, 12 Jul 2020, Tom Westerhout wrote: > On 11/07/2020, Georgi Lyubenov wrote: >> Is there something wrong with your idea? (other than ideological issues >> with unsafePerformIO - I guess then the standard approach would be to >> use some State holding your compiled functions or a Reader over an MVar >> holding your compiled functions) > > The part that feels wrong here is that one has to create a new Module > for every single function. I always thought of LLVM Modules as kind of > compilation units. Right. Your module can contain multiple functions and they are compiled and optimized together by LLVM. But if you want to cache every single function then a one-function-module is fine. From kot.tom97 at gmail.com Sun Jul 12 10:33:58 2020 From: kot.tom97 at gmail.com (Tom Westerhout) Date: Sun, 12 Jul 2020 10:33:58 +0000 Subject: [Haskell-cafe] Caching functions compiled with llvm-hs In-Reply-To: References: Message-ID: On 11/07/2020, Henning Thielemann wrote: > > On Sat, 11 Jul 2020, Tom Westerhout wrote: > >> A very primitive JIT compiler. What is the best way to do this with >> llvm-hs? All standard examples using LLVM.OrcJIT or >> LLVM.ExecutionEngine show how to compile a function and then >> immediately execute it. I can't quite figure out a safe way to keep >> the FunPtr... > > As far as I know, the compiled function is valid as long as the > LLVM.ExecutionEngine exists. Thus I would define the following: > > data JITFunPtr f = JITFunPtr (ForeignPtr ExecutionEngine) (FunPtr f) > > with DisposeExecutionEngine as finalizer for the ForeignPtr. After every > call of the FunPtr function you have to 'touchForeignPtr executionEngine'. > > Alternatively, you could work with 'f' instead of 'FunPtr f' and add a > (touchForeignPtr executionEngine) to the imported 'f'. This is what I do > in llvm-tf: > > http://hackage.haskell.org/package/llvm-tf-9.2/docs/LLVM-ExecutionEngine.html#v:getExecutionFunction > Your getExecutionFunction implementation is indeed quite helpful, thank you! Cheers, Tom From kot.tom97 at gmail.com Sun Jul 12 10:35:24 2020 From: kot.tom97 at gmail.com (Tom Westerhout) Date: Sun, 12 Jul 2020 10:35:24 +0000 Subject: [Haskell-cafe] Caching functions compiled with llvm-hs In-Reply-To: References: Message-ID: On 12/07/2020, Henning Thielemann wrote: > > On Sun, 12 Jul 2020, Tom Westerhout wrote: >> The part that feels wrong here is that one has to create a new Module >> for every single function. I always thought of LLVM Modules as kind of >> compilation units. > > Right. Your module can contain multiple functions and they are compiled > and optimized together by LLVM. 
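Putting the advice from this thread together -- compile lazily on first use, keep whatever owns the compiled code (e.g. the ExecutionEngine) reachable, and mark the unsafePerformIO binding NOINLINE -- a minimal sketch of the caching pattern might look like the following. Here compileWithLLVM, Foo and Bar are only stand-ins for the actual llvm-hs work and types; they are not real llvm-hs names.

    import System.IO.Unsafe (unsafePerformIO)

    -- Placeholder types standing in for the real argument/result types.
    type Foo = Int
    type Bar = Int

    -- Stand-in for the real work: build the LLVM IR, JIT-compile it, wrap the
    -- FunPtr, and keep the execution engine reachable from the returned closure.
    compileWithLLVM :: IO (Foo -> Bar)
    compileWithLLVM = do
      putStrLn "compiling..."   -- should happen at most once
      pure (+ 1)

    -- NOINLINE keeps fImpl a single shared top-level value (a CAF), so
    -- compileWithLLVM runs at most once, on the first call to f, and the
    -- compiled function is simply reused afterwards.
    {-# NOINLINE fImpl #-}
    fImpl :: Foo -> Bar
    fImpl = unsafePerformIO compileWithLLVM

    f :: Foo -> Bar
    f = fImpl
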
But if you want to cache every single > function then a one-function-module is fine. > Ah that settles it then. Thanks a lot! Cheers, Tom From icfp.publicity at googlemail.com Tue Jul 14 19:54:53 2020 From: icfp.publicity at googlemail.com (Sam Tobin-Hochstadt) Date: Tue, 14 Jul 2020 15:54:53 -0400 Subject: [Haskell-cafe] Call for Participation: ICFP 2020 Message-ID: <5f0e0d8d2a20d_93ca21c163f@homer.mail> ===================================================================== Call for Participation ICFP 2020 25th ACM SIGPLAN International Conference on Functional Programming and affiliated events August 23 - August 28, 2020 Online http://icfp20.sigplan.org/ Early Registration until August 8! The ICFP Programming Contest starts on July 17! ===================================================================== ICFP provides a forum for researchers and developers to hear about the latest work on the design, implementations, principles, and uses of functional programming. The conference covers the entire spectrum of work, from practice to theory, including its peripheries. This year, the conference will be a virtual event. All activities will take place online. The ICFP Programming competition will be July 17th through 20th, 2020! The main conference will take place from August 24-26, 2020 during two time bands. The first band will be 9AM-5:30PM New York, and will include both technical and social activities. The second band will repeat (with some variation) the technical program and social activities 12 hours later, 9AM-5:30PM Beijing, the following day. We’re excited to announce our two invited speakers for 2020: Evan Czaplicki, covering the Elm programming language and hard lessons learned on driving adoption of new programming languages; and Audrey Tang, Haskeller and Taiwan’s Digital Minister, on how software developers can contribute to fighting the pandemic. ICFP has officially accepted 37 exciting papers, and (as a fresh experiment this year) there will also be presentations of 8 papers accepted recently to the Journal of Functional Programming. Co-located symposia and workshops will take place the day before and two days immediately after the main conference. Registration is now open. The early registration deadline is August 8th, 2020. Registration is not free, but is significantly lower than usual. Students who are ACM or SIGPLAN members may register for FREE before the early deadline. https://regmaster.com/2020conf/ICFP20/register.php New this year: Attendees will be able to sign-up for the ICFP Mentoring Program (either to be a mentor, receive mentorship or both). * Overview and affiliated events: http://icfp20.sigplan.org/home * Accepted papers: http://icfp20.sigplan.org/track/icfp-2020-papers#event-overview * JFP Talks: https://icfp20.sigplan.org/track/icfp-2020-jfp-talks#event-overview * Registration is available via: https://regmaster.com/2020conf/ICFP20/register.php Early registration ends 8 August, 2020. * Programming contest: https://icfpcontest2020.github.io/ The Programming Contest begins July 17th! 
* Student Research Competition: https://icfp20.sigplan.org/track/icfp-2020-Student-Research-Competition * Follow us on Twitter for the latest news: http://twitter.com/icfp_conference This year, there are 10 events co-located with ICFP: * Erlang Workshop (8/23) * Haskell Implementors' Workshop (8/28) * Haskell Symposium (8/27-8/28) * Higher-Order Programming with Effects (8/23) * miniKanren Workshop (8/27) * ML Family Workshop (8/27) * OCaml Workshop (8/28) * Programming Languages Mentoring Workshop (8/23) * Scheme Workshop (8/28) * Type-Driven Development (8/23) ### ICFP Organizers General Chair: Stephanie Weirich (University of Pennsylvania, USA) Program Chair: Adam Chlipala (MIT, USA) Artifact Evaluation Co-Chairs: Brent Yorgey (Hendrix College, USA) Ben Lippmeier (Ghost Locomotion, Australia) Industrial Relations Chair: Alan Jeffrey (Mozilla Research, USA) Programming Contest Organizer: Igor Lukanin (Kontur, Russia) Publicity and Web Chair: Sam Tobin-Hochstadt (Indiana University, USA) Student Research Competition Chair: Youyou Cong (Tokyo Institute of Technology, Japan) Workshops Co-Chair: Jennifer Hackett (University of Nottingham, UK) Leonidas Lampropoulos (University of Pennsylvania, USA) Video Chair: Leif Andersen (Northeastern University, USA) Student Volunteer Co-Chair: Hanneli Tavante (McGill University, Canada) Victor Lanvin (IRIF, Université Paris Diderot, France) From david.feuer at gmail.com Tue Jul 14 20:02:52 2020 From: david.feuer at gmail.com (David Feuer) Date: Tue, 14 Jul 2020 16:02:52 -0400 Subject: [Haskell-cafe] containers-0.6.3.1 Message-ID: At long last, we have released containers-0.6.3.1. The most important changes in this release are bug fixes for IntMap traversals: * Fix traverse and traverseWithKey for IntMap, which would previously produce invalid IntMaps when the input contained negative keys (Thanks, Felix Paulusma). * Fix the traversal order of various functions for Data.IntMap: traverseWithKey, traverseMaybeWithKey, filterWithKeyA, minimum, maximum, mapAccum, mapAccumWithKey, mapAccumL, mapAccumRWithKey, mergeA (Thanks, Felix Paulusma, Simon Jakobi). These now traverse in key order; previously they would traverse non-negative keys before negative keys. If you traverse any IntMaps, please take note of these changes. We also have several additions to the API: * Add compose for Map and IntMap (Thanks, Alexandre Esteves). * Add alterF for Set and IntSet (Thanks, Simon Jakobi). * Add Data.IntSet.mapMonotonic (Thanks, Javran Cheng). * Add instance Bifoldable Map (Thanks, Joseph C. Sible). Performance improvements of note: * Make (<*) for Data.Sequence incrementally asymptotically optimal (Thanks, David Feuer). This finally completes the task, begun in December 2014, of making all the Applicative methods for sequences asymptotically optimal even when their results are consumed incrementally. Many thanks to Li-Yao Xia and Bertram Felgenhauer for helping to clean up and begin to document this rather tricky code. * Speed up fromList and related functions in Data.IntSet, Data.IntMap and Data.IntMap.Strict (Thanks, Bertram Felgenhauer). * Use count{Leading,Trailing}Zeros in Data.IntSet internals (Thanks, Alex Biehl). There are also numerous documentation improvements and packaging updates. Please see the changelog for full details. 
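For a flavour of the new API additions, a small usage sketch (illustrative only, not part of the release notes; it assumes containers-0.6.3.1, and the names userCity and toggle are made up):

import qualified Data.Map as Map
import qualified Data.Set as Set
import Data.Functor.Identity (Identity (..))

-- compose looks every value of the second map up in the first one:
-- compose :: Ord b => Map b c -> Map a b -> Map a c
userCity :: Map.Map String String
userCity = Map.compose cityOfOffice officeOfUser
  where
    officeOfUser = Map.fromList [("alice", "HQ"), ("bob", "Lab")]
    cityOfOffice = Map.fromList [("HQ", "London"), ("Lab", "Leeds")]

-- alterF generalises insert/delete; with Identity it simply toggles membership
toggle :: Ord a => a -> Set.Set a -> Set.Set a
toggle x = runIdentity . Set.alterF (Identity . not) x

main :: IO ()
main = do
  print userCity                          -- fromList [("alice","London"),("bob","Leeds")]
  print (toggle 3 (Set.fromList [1, 2]))  -- fromList [1,2,3]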
Thanks to all the contributors, The containers team From carter.schonwald at gmail.com Tue Jul 14 23:19:37 2020 From: carter.schonwald at gmail.com (Carter Schonwald) Date: Tue, 14 Jul 2020 19:19:37 -0400 Subject: [Haskell-cafe] containers-0.6.3.1 In-Reply-To: References: Message-ID: Great stuff! On Tue, Jul 14, 2020 at 4:04 PM David Feuer wrote: > At long last, we have released containers-0.6.3.1. The most important > changes in this release are bug fixes for IntMap traversals: > > * Fix traverse and traverseWithKey for IntMap, which would previously > produce invalid IntMaps when the input contained negative keys > (Thanks, Felix Paulusma). > > * Fix the traversal order of various functions for Data.IntMap: > traverseWithKey, traverseMaybeWithKey, filterWithKeyA, minimum, > maximum, mapAccum, mapAccumWithKey, mapAccumL, mapAccumRWithKey, > mergeA (Thanks, Felix Paulusma, Simon Jakobi). These now traverse in > key order; previously they would traverse non-negative keys before > negative keys. > > If you traverse any IntMaps, please take note of these changes. > > We also have several additions to the API: > > * Add compose for Map and IntMap (Thanks, Alexandre Esteves). > > * Add alterF for Set and IntSet (Thanks, Simon Jakobi). > > * Add Data.IntSet.mapMonotonic (Thanks, Javran Cheng). > > * Add instance Bifoldable Map (Thanks, Joseph C. Sible). > > Performance improvements of note: > > * Make (<*) for Data.Sequence incrementally asymptotically optimal > (Thanks, David Feuer). This finally completes the task, begun in > December 2014, of making all the Applicative methods for sequences > asymptotically optimal even when their results are consumed > incrementally. Many thanks to Li-Yao Xia and Bertram Felgenhauer for > helping to clean up and begin to document this rather tricky code. > > * Speed up fromList and related functions in Data.IntSet, Data.IntMap > and Data.IntMap.Strict (Thanks, Bertram Felgenhauer). > > * Use count{Leading,Trailing}Zeros in Data.IntSet internals (Thanks, > Alex Biehl). > > There are also numerous documentation improvements and packaging > updates. Please see the changelog for full details. > > Thanks to all the contributors, > The containers team > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From godzbanebane at gmail.com Wed Jul 15 12:05:19 2020 From: godzbanebane at gmail.com (Georgi Lyubenov) Date: Wed, 15 Jul 2020 15:05:19 +0300 Subject: [Haskell-cafe] How does the RTS "know" about ExitCode Message-ID: Hi! I'm wondering how values stored in ExitCode "get to" the RTS (i.e. actually make the program exit with the given number). As far as I can tell ExitCode is used as an everyday normal-looking exception (which are defined entirely in "library code", by using the raise#/catch# primitives), without any direct link to the RTS. What am I missing? ====== Georgi -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From publicityifl at gmail.com Wed Jul 15 13:35:31 2020 From: publicityifl at gmail.com (Jurriaan Hage) Date: Wed, 15 Jul 2020 09:35:31 -0400 Subject: [Haskell-cafe] Second call for draft papers for IFL 2020 (Implementation and Application of Functional Languages) Message-ID: Hello, Please, find below the second call for draft papers for IFL 2020. Please forward these to anyone you think may be interested. Apologies for any duplicates you may receive. best regards, Jurriaan Hage Publicity Chair of IFL ================================================================================ IFL 2020 32nd Symposium on Implementation and Application of Functional Languages venue: online 2nd - 4th September 2020 https://www.cs.kent.ac.uk/events/2020/ifl20/ ================================================================================ ### Scope The goal of the IFL symposia is to bring together researchers actively engaged in the implementation and application of functional and function-based programming languages. IFL 2020 will be a venue for researchers to present and discuss new ideas and concepts, work in progress, and publication-ripe results related to the implementation and application of functional languages and function-based programming. Topics of interest to IFL include, but are not limited to: - language concepts - type systems, type checking, type inferencing - compilation techniques - staged compilation - run-time function specialisation - run-time code generation - partial evaluation - (abstract) interpretation - meta-programming - generic programming - automatic program generation - array processing - concurrent/parallel programming - concurrent/parallel program execution - embedded systems - web applications - (embedded) domain specific languages - security - novel memory management techniques - run-time profiling performance measurements - debugging and tracing - virtual/abstract machine architectures - validation, verification of functional programs - tools and programming techniques - (industrial) applications ### Post-symposium peer-review Following IFL tradition, IFL 2020 will use a post-symposium review process to produce the formal proceedings. Before the symposium authors submit draft papers. These draft papers will be screened by the program chair to make sure that they are within the scope of IFL. The draft papers will be made available to all participants at the symposium. Each draft paper is presented by one of the authors at the symposium. After the symposium every presenter is invited to submit a full paper, incorporating feedback from discussions at the symposium. Work submitted to IFL may not be simultaneously submitted to other venues; submissions must adhere to ACM SIGPLAN's republication policy. The program committee will evaluate these submissions according to their correctness, novelty, originality, relevance, significance, and clarity, and will thereby determine whether the paper is accepted or rejected for the formal proceedings. We plan to publish these proceedings in the International Conference Proceedings Series of the ACM Digital Library, as in previous years. 
### Important dates Submission deadline of draft papers: 17 August 2020 Notification of acceptance for presentation: 19 August 2020 Registration deadline: 31 August 2020 IFL Symposium: 2-4 September 2020 Submission of papers for proceedings: 7 December 2020 Notification of acceptance: 3 February 2021 Camera-ready version: 15 March 2021 ### Submission details All contributions must be written in English. Papers must use the ACM two columns conference format, which can be found at: http://www.acm.org/publications/proceedings-template ### Peter Landin Prize The Peter Landin Prize is awarded to the best paper presented at the symposium every year. The honoured article is selected by the program committee based on the submissions received for the formal review process. The prize carries a cash award equivalent to 150 Euros. ### Programme committee Kenichi Asai, Ochanomizu University, Japan Olaf Chitil, University of Kent, United Kingdom (chair) Martin Erwig, Oregon State University,United States Daniel Horpacsi, Eotvos Lorand University, Hungary Zhenjiang Hu, Peking University, China Hans-Wolfgang Loidl, Heriot-Watt University, United Kingdom Neil Mitchell, Facebook, UK Marco T. Morazan, Seton Hall University, United States Rinus Plasmeijer, Radboud University, Netherlands Colin Runciman, University of York, United Kingdom Mary Sheeran, Chalmers University of Technology, Sweden Josep Silva, Universitat Politecnica de Valencia, Spain Jurrien Stutterheim, Standard Chartered, Singapore Josef Svenningsson, Facebook, UK Peter Thiemann, University of Freiburg, Germany Kanae Tsushima, National Institute of Informatics, Japan. Marcos Viera, Universidad de la Republica, Montevideo, Uruguay Janis Voigtlander, University of Duisburg-Essen, Germany ### Virtual symposium Because of the Covid-19 pandemic, this year IFL 2020 will be an online event, consisting of paper presentations, discussions and virtual social gatherings. Registered participants can take part from anywhere in the world. ### Acknowledgments This call-for-papers is an adaptation and evolution of content from previous instances of IFL. We are grateful to prior organisers for their work, which is reused here. -------------- next part -------------- An HTML attachment was scrubbed... URL: From lemming at henning-thielemann.de Wed Jul 15 15:12:42 2020 From: lemming at henning-thielemann.de (Henning Thielemann) Date: Wed, 15 Jul 2020 17:12:42 +0200 (CEST) Subject: [Haskell-cafe] setting ISO-8859-1 encoding on Raspbian for GHC Message-ID: On Raspbian Buster I get: pi at raspberrypi:~ $ LANG=de_DE ghc -e 'print System.IO.localeEncoding' UTF-8 pi at raspberrypi:~ $ LANG=de_DE.iso88591 ghc -e 'print System.IO.localeEncoding' UTF-8 pi at raspberrypi:~ $ LANG=de_DE at euro ghc -e 'print System.IO.localeEncoding' UTF-8 Which is not, what I want. On Debian Buster it is correct: $ LANG=de_DE ghc -e 'print System.IO.localeEncoding' ISO-8859-1 How can I check whether this is a GHC problem and has anyone an idea how to fix it? 
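One way to narrow this down is to print the locale-related environment variables next to the encoding GHC actually picked, and, if necessary, to force ISO-8859-1 by hand. A rough diagnostic sketch (an illustration, not a confirmed fix):

import System.Environment (lookupEnv)
import System.IO (localeEncoding, hSetEncoding, stdout, latin1)
import GHC.IO.Encoding (setLocaleEncoding)

main :: IO ()
main = do
  -- LC_ALL (if non-empty) overrides LC_CTYPE, which overrides LANG
  mapM_ (\v -> lookupEnv v >>= \val -> putStrLn (v ++ " = " ++ show val))
        ["LC_ALL", "LC_CTYPE", "LANG"]
  print localeEncoding        -- the encoding GHC selected at startup
  setLocaleEncoding latin1    -- force ISO-8859-1 for Handles opened from now on
  hSetEncoding stdout latin1  -- and for stdout, which already exists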
From ietf-dane at dukhovni.org Wed Jul 15 15:40:18 2020 From: ietf-dane at dukhovni.org (Viktor Dukhovni) Date: Wed, 15 Jul 2020 11:40:18 -0400 Subject: [Haskell-cafe] setting ISO-8859-1 encoding on Raspbian for GHC In-Reply-To: References: Message-ID: <20200715154018.GA59671@straasha.imrryr.org> On Wed, Jul 15, 2020 at 05:12:42PM +0200, Henning Thielemann wrote: > On Raspbian Buster I get: > > pi at raspberrypi:~ $ LANG=de_DE ghc -e 'print System.IO.localeEncoding' > UTF-8 > pi at raspberrypi:~ $ LANG=de_DE.iso88591 ghc -e 'print System.IO.localeEncoding' > UTF-8 > pi at raspberrypi:~ $ LANG=de_DE at euro ghc -e 'print System.IO.localeEncoding' > UTF-8 Do you have any other pertinent environment variables set? In particular, either LC_ALL or LC_CTYPE? What is the output of: $ locale -a | grep de_DE > On Debian Buster it is correct: > > $ LANG=de_DE ghc -e 'print System.IO.localeEncoding' > ISO-8859-1 Do you have any other pertinent environment variables set? In particular, either LC_ALL or LC_CTYPE? What is the output of: $ locale -a | grep de_DE -- Viktor. From lemming at henning-thielemann.de Wed Jul 15 15:56:32 2020 From: lemming at henning-thielemann.de (Henning Thielemann) Date: Wed, 15 Jul 2020 17:56:32 +0200 (CEST) Subject: [Haskell-cafe] setting ISO-8859-1 encoding on Raspbian for GHC In-Reply-To: <20200715154018.GA59671@straasha.imrryr.org> References: <20200715154018.GA59671@straasha.imrryr.org> Message-ID: On Wed, 15 Jul 2020, Viktor Dukhovni wrote: > On Wed, Jul 15, 2020 at 05:12:42PM +0200, Henning Thielemann wrote: > >> On Raspbian Buster I get: >> >> pi at raspberrypi:~ $ LANG=de_DE ghc -e 'print System.IO.localeEncoding' >> UTF-8 >> pi at raspberrypi:~ $ LANG=de_DE.iso88591 ghc -e 'print System.IO.localeEncoding' >> UTF-8 >> pi at raspberrypi:~ $ LANG=de_DE at euro ghc -e 'print System.IO.localeEncoding' >> UTF-8 > > Do you have any other pertinent environment variables set? In > particular, either LC_ALL or LC_CTYPE? Aha: pi at raspberrypi:~ $ echo $LC_ALL de_DE.UTF-8 debian-buster$ echo $LC_ALL Actually, setting LC_ALL to the empty string solves the problem! From ietf-dane at dukhovni.org Wed Jul 15 16:14:24 2020 From: ietf-dane at dukhovni.org (Viktor Dukhovni) Date: Wed, 15 Jul 2020 12:14:24 -0400 Subject: [Haskell-cafe] setting ISO-8859-1 encoding on Raspbian for GHC In-Reply-To: References: <20200715154018.GA59671@straasha.imrryr.org> Message-ID: <20200715161424.GB59671@straasha.imrryr.org> On Wed, Jul 15, 2020 at 05:56:32PM +0200, Henning Thielemann wrote: > >> pi at raspberrypi:~ $ LANG=de_DE.iso88591 ghc -e 'print System.IO.localeEncoding' > >> UTF-8 > > > > Do you have any other pertinent environment variables set? In > > particular, either LC_ALL or LC_CTYPE? > > Aha: > > pi at raspberrypi:~ $ echo $LC_ALL > de_DE.UTF-8 > debian-buster$ echo $LC_ALL > > Actually, setting LC_ALL to the empty string solves the problem! Or better yet, "unset LC_ALL", no point it having an empty setting. On a Fedora 31 system, locale(7) states: If the second argument to setlocale(3) is an empty string, "", for the default locale, it is determined using the following steps: 1. If there is a non-null environment variable LC_ALL, the value of LC_ALL is used. 2. If an environment variable with the same name as one of the categories above exists and is non-null, its value is used for that category. 3. If there is a non-null environment variable LANG, the value of LANG is used. 
Where by "non-null", the author must have meant non-empty, since the value of an environment variable (that has a value) cannot be NULL, but it can be empty. -- Viktor. From donn at avvanta.com Wed Jul 15 16:44:59 2020 From: donn at avvanta.com (Donn Cave) Date: Wed, 15 Jul 2020 09:44:59 -0700 (PDT) Subject: [Haskell-cafe] setting ISO-8859-1 encoding on Raspbian for GHC In-Reply-To: <20200715161424.GB59671@straasha.imrryr.org> References: <20200715154018.GA59671@straasha.imrryr.org><20200715161424.GB59671@straasha.imrryr.org> Message-ID: <20200715164459.940D5276C41@mail.avvanta.com> quoth Viktor Dukhovni ... > Or better yet, "unset LC_ALL", no point it having an empty setting. > On a Fedora 31 system, locale(7) states: > > If the second argument to setlocale(3) is an empty string, "", for > the default locale, it is determined using the following steps: > > 1. If there is a non-null environment variable LC_ALL, the value of > LC_ALL is used. > > 2. If an environment variable with the same name as one of the > categories above exists and is non-null, its value is used for that > category. > > 3. If there is a non-null environment variable LANG, the value of > LANG is used. > > Where by "non-null", the author must have meant non-empty, since the > value of an environment variable (that has a value) cannot be NULL, but > it can be empty. Yes, that's what null means in this context - zero length. I would think the main question is whether other applications are going to be affected by a change to LC_ALL. If that's in doubt, it may be more convenient to apply this as he has done in his examples, on the command line. Donn From lists at utdemir.com Wed Jul 15 21:11:42 2020 From: lists at utdemir.com (Utku Demir) Date: Thu, 16 Jul 2020 09:11:42 +1200 Subject: [Haskell-cafe] How does the RTS "know" about ExitCode In-Reply-To: References: Message-ID: <593c294a-7f81-46e8-ab74-16aba63cc20f@www.fastmail.com> I also didn't know that, so I looked at the source, and found out that GHC has a wrapper[1] around `main` which (besides a few other stuff) is responsible for catching exceptions including the ExitCode and figuring out the appropriate exit code. In order to actually exit the program, `shutdownHaskellAndExit` C function from RTS [2] is called from there which takes the exit code as a parameter. This was good to know, thanks for the question! Utku [1]: https://gitlab.haskell.org/ghc/ghc/-/blob/ae11bdfd98a10266bfc7de9e16b500be220307ac/libraries/base/GHC/TopHandler.hs [2]: https://gitlab.haskell.org/ghc/ghc/-/blob/ae11bdfd98a10266bfc7de9e16b500be220307ac/rts/RtsStartup.c#L554 On Thu, Jul 16, 2020, at 12:05 AM, Georgi Lyubenov wrote: > Hi! > > I'm wondering how values stored in ExitCode "get to" the RTS (i.e. actually make the program exit with the given number). > > As far as I can tell ExitCode is used as an everyday normal-looking exception (which are defined entirely in "library code", by using the raise#/catch# primitives), without any direct link to the RTS. > > What am I missing? > > ====== > Georgi > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From godzbanebane at gmail.com Wed Jul 15 21:38:14 2020 From: godzbanebane at gmail.com (Georgi Lyubenov) Date: Thu, 16 Jul 2020 00:38:14 +0300 Subject: [Haskell-cafe] How does the RTS "know" about ExitCode In-Reply-To: <593c294a-7f81-46e8-ab74-16aba63cc20f@www.fastmail.com> References: <593c294a-7f81-46e8-ab74-16aba63cc20f@www.fastmail.com> Message-ID: Thanks! There is still one missing bit for me though - at what point/in what way does the user's main hook into the TopHandler stuff? (I'm interested in how practical it is to roll your own base, at least in regards to the exceptions mechanism.) ====== Georgi -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben at well-typed.com Wed Jul 15 23:38:31 2020 From: ben at well-typed.com (Ben Gamari) Date: Wed, 15 Jul 2020 19:38:31 -0400 Subject: [Haskell-cafe] [ANNOUNCE] GHC 8.8.4 is now available Message-ID: <87blkgfix7.fsf@smart-cactus.org> Hello everyone, The GHC team is proud to announce the release of GHC 8.8.4. The source distribution, binary distributions, and documentation are available at https://downloads.haskell.org/~ghc/8.8.4 Release notes are also available [1]. This release fixes a handful of issues affecting 8.8.3: - Fixes a bug in process creation on Windows (#17926). Due to this fix we strongly encourage all Windows users to upgrade immediately. - Works around a Linux kernel bug in the implementation of timerfd (#18033) - Fixes a few linking issues affecting ARM - Fixes "missing interface file" error triggered by some uses of Data.Ord.Ordering (#18185) - Fixes an integer overflow in the compact-normal-form import implementation (#16992) - `configure` now accepts a `--enable-numa` flag to enable/disable `numactl` support on Linux. - Fixes potentially lost sharing due to the desugaring of left operator sections (#18151). - Fixes a build-system bug resulting in potential miscompilation by unregisteised compilers (#18024) As always, if anything looks amiss do let us know. Happy compiling! Cheers, - Ben [1] https://downloads.haskell.org/ghc/8.8.4/docs/html/users_guide/8.8.4-notes.html -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 487 bytes Desc: not available URL: From icfp.publicity at googlemail.com Thu Jul 16 13:56:54 2020 From: icfp.publicity at googlemail.com (Sam Tobin-Hochstadt) Date: Thu, 16 Jul 2020 09:56:54 -0400 Subject: [Haskell-cafe] Final Call for Tutorials, Discussions, and Social Events: ICFP 2020 Message-ID: <5f105ca64171e_d5ed294165d@homer.mail> FINAL CALL FOR TUTORIAL, DISCUSSION, AND SOCIAL EVENT PROPOSALS ICFP 2020 25th ACM SIGPLAN International Conference on Functional Programming August 23 - 28, 2020 Virtual https://icfp20.sigplan.org/ The 25th ACM SIGPLAN International Conference on Functional Programming will be held virtually on August 23-28, 2020. ICFP provides a forum for researchers and developers to hear about the latest work on the design, implementations, principles, and uses of functional programming. Proposals are invited for tutorials, lasting approximately 3 hours each, to be presented during ICFP and its co-located workshops and other events. These tutorials are the successor to the CUFP tutorials from previous years, but we also welcome tutorials whose primary audience is researchers rather than practitioners. Tutorials may focus either on a concrete technology or on a theoretical or mathematical tool. 
Ideally, tutorials will have a concrete result, such as "Learn to do X with Y" rather than "Learn language Y". To increase social interaction on the first ICFP virtual conference, this year we invite proposals for social events on topics of broader interest to the PL community. Such events can be panels and discussions (in the lines of the successful #ShutDownPL event), focused discussions (e.g., problem identifications, retrospective analysis, technical demos), social activities (e.g., treasure hunt, bingo, problem solving, artistic challenges). The typical duration of such events ranges from 30 minutes to one hour, but can be of any length. Tutorials may occur before or after ICFP, co-located with the associated workshops, on August 23 or August 27-28. Social events may be scheduled throughout the week. ---------------------------------------------------------------------- Submission details Deadline for submission: July 17th, 2020 Notification of acceptance: July 22nd, 2020 Prospective organizers of tutorials are invited to submit a completed tutorial proposal form in plain text format to the ICFP 2020 workshop co-chairs (Jennifer Hackett and Leonidas Lampropoulos), via email to icfp-workshops-2020 at googlegroups.com by July 17th, 2020. Please note that this is a firm deadline. Organizers will be notified if their event proposal is accepted by July 22nd, 2020. The proposal form is available at: http://www.icfpconference.org/icfp2020-files/icfp20-panel-form.txt http://www.icfpconference.org/icfp2020-files/icfp20-tutorials-form.txt ---------------------------------------------------------------------- Selection committee The proposals will be evaluated by a committee comprising the following members of the ICFP 2020 organizing committee. Tutorials Co-Chair: Jennifer Hackett (University of Nottingham) Tutorials Co-Chair: Leonidas Lampropoulos (University of Maryland) General Chair: Stephanie Weirich (University of Pennsylvania) Program Chair: Adam Chlipala (MIT) ---------------------------------------------------------------------- Further information Any queries should be addressed to the tutorial co-chairs (Jennifer Hackett and Leonidas Lampropoulos), via email to icfp-workshops-2020 at googlegroups.com From me at abn.sh Fri Jul 17 14:49:23 2020 From: me at abn.sh (Alexander Ben Nasrallah) Date: Fri, 17 Jul 2020 16:49:23 +0200 Subject: [Haskell-cafe] [ANNOUNCE] GHC 8.8.4 is now available In-Reply-To: <87blkgfix7.fsf@smart-cactus.org> References: <87blkgfix7.fsf@smart-cactus.org> Message-ID: <20200717144923.GB10699@scherox.fritz.box> On Wed, Jul 15, 2020 at 07:38:31PM -0400, Ben Gamari wrote: > The GHC team is proud to announce the release of GHC 8.8.4. The source > distribution, binary distributions, and documentation are available at > > https://downloads.haskell.org/~ghc/8.8.4 Thanks to the GHC team for your great work. I added a Docker image neosimsim/ghc:8.8.4 to my Docker hub with GHC 8.8.4 installed build with musl and integer-simple to support static linking. The Docker image can be used in build pipelines, e.g. with GitLab. https://hub.docker.com/r/neosimsim/ghc I hope it comes in handy. Cheers, Alex From guthrie at miu.edu Sat Jul 18 12:50:43 2020 From: guthrie at miu.edu (Gregory Guthrie) Date: Sat, 18 Jul 2020 12:50:43 +0000 Subject: [Haskell-cafe] [ANNOUNCE] GHC 8.8.4 is now available Message-ID: I don't see any windows binaries listed; are they there, elsewhere? Lots of Linux (=0.7% desktop market share) versions, but Windows (=78%) only source?. 
---------------------------------------------------------------- -----Original Message----- From: Haskell-Cafe On Behalf Of haskell-cafe- On Wed, Jul 15, 2020 at 07:38:31PM -0400, Ben Gamari wrote: > The GHC team is proud to announce the release of GHC 8.8.4. The source > distribution, binary distributions, and documentation are available at > > https://downloads.haskell.org/~ghc/8.8.4 From fa-ml at ariis.it Sat Jul 18 13:30:58 2020 From: fa-ml at ariis.it (Francesco Ariis) Date: Sat, 18 Jul 2020 15:30:58 +0200 Subject: [Haskell-cafe] [ANNOUNCE] GHC 8.8.4 is now available In-Reply-To: References: Message-ID: <20200718133058.GA11445@extensa> Il 18 luglio 2020 alle 12:50 Gregory Guthrie ha scritto: > I don't see any windows binaries listed; are they there, elsewhere? Not a Windows user, but if I recall correctly as today the way to install Haskell on Windows is via Chocolatey: https://www.haskell.org/platform/windows.html > Lots of Linux (=0.7% desktop market share) versions, but Windows > (=78%) only source?. In the Haskell world the numbers are reversed [1]: Linux-based developers lead the pack. There are comparatively few Windows-based devs and even fewer willing to dedicate their free time to maintain the GHC for Windows build. [1] https://taylor.fausak.me/2019/11/16/haskell-survey-results/#s1q2 From guthrie at miu.edu Sat Jul 18 16:22:43 2020 From: guthrie at miu.edu (Gregory Guthrie) Date: Sat, 18 Jul 2020 16:22:43 +0000 Subject: [Haskell-cafe] [ANNOUNCE] GHC 8.8.4 is now available In-Reply-To: <20200718133058.GA11445@extensa> References: <20200718133058.GA11445@extensa> Message-ID: Thanks - yes, but this is for Platform, not the current GHC. Platform is several versions older than the current GHC being announced below. I am not sure if one can update the GHC, with an older Platform installation? ---------------------------------------------------------------- -----Original Message----- From: Haskell-Cafe On Behalf Of Francesco Ariis Sent: Saturday, July 18, 2020 8:31 AM To: haskell-cafe at haskell.org Subject: Re: [Haskell-cafe] [ANNOUNCE] GHC 8.8.4 is now available Il 18 luglio 2020 alle 12:50 Gregory Guthrie ha scritto: > I don't see any windows binaries listed; are they there, elsewhere? Not a Windows user, but if I recall correctly as today the way to install Haskell on Windows is via Chocolatey: https://www.haskell.org/platform/windows.html > Lots of Linux (=0.7% desktop market share) versions, but Windows > (=78%) only source?. In the Haskell world the numbers are reversed [1]: Linux-based developers lead the pack. There are comparatively few Windows-based devs and even fewer willing to dedicate their free time to maintain the GHC for Windows build. [1] https://taylor.fausak.me/2019/11/16/haskell-survey-results/#s1q2 _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post. From falsifian at falsifian.org Sat Jul 18 17:33:38 2020 From: falsifian at falsifian.org (James Cook) Date: Sat, 18 Jul 2020 17:33:38 +0000 Subject: [Haskell-cafe] What causes this "Ambiguous type variable" message (involving an existential type)? Message-ID: <626b0fa2-af9d-7e8a-5e94-a4fc1a3e3220@falsifian.org> Hi haskell-cafe, I've run into a strange error. It's easy for me to work around, but I still would like to know what causes it. Here's a minimal example. 
{-# LANGUAGE Rank2Types #-} module Example where data T = T (forall n . (Show n) => n) d :: a -> Int d = undefined f :: T -> Int f (T t) = d t when I try to compile it (ghc Example.hs) I see the following error: Example.hs:11:13: error: • Ambiguous type variable ‘a0’ arising from a use of ‘t’ prevents the constraint ‘(Show a0)’ from being solved. Probable fix: use a type annotation to specify what ‘a0’ should be. These potential instances exist: instance Show Ordering -- Defined in ‘GHC.Show’ instance Show Integer -- Defined in ‘GHC.Show’ instance Show a => Show (Maybe a) -- Defined in ‘GHC.Show’ ...plus 22 others ...plus 12 instances involving out-of-scope types (use -fprint-potential-instances to see them all) • In the first argument of ‘d’, namely ‘t’ In the expression: d t In an equation for ‘f’: f (T t) = d t | 11 | f (T t) = d t | ^ My question: what causes this error? This seems backward to me. The way I see it, I've told the compiler that any value of type T is guaranteed to contain a value of a type implementing Show, so there should be no question about solving a constraint involving Show. Moreover, the function "d" doesn't even require its input to implement Show. I'm guessing there's some basic rule about existential types that I'm violating here, but I'm not sure where to look if I want to read about that. I've skimmed https://wiki.haskell.org/Existential_type and didn't find anything, but maybe I skimmed too quickly. (Note: if I use Num or Fractional instead of Show, I don't get the error. I guess it's because of defaulting rules for numeric type classes.) If you're interested in how I ran into this, here's a summary. I have this type Scene g with two record fields. Notice the gl_camera_info field doesn't mention the type parameter g. data Scene g where Scene :: RandomGenD g => { part_ :: Part g , gl_camera_info :: GLCameraInfo } -> Scene g Then later I have an existentially quantified scene passed to the VScene constructor here... data BaseVal = VIint Int | VS String | VScene (forall g . (RandomGenD g) => Scene g) | VIO (ShellState -> IO ShellState) | VError String data Val = Val { val_f :: Val -> Val , val_base :: BaseVal } and later (somewhere inside a "where" clause): f (Val _ (VScene scene)) = base_val (VIO (fork_render (gl_camera_info scene) (RLS.muts scene 5e-2 1e-1))) (I don't think the definitions of base_val or RLS.muts matter here.) The use of gl_camera_info seems to cause the problem in this case. I see Shell.hs:137:79: error: • Ambiguous type variable ‘g0’ arising from a use of ‘scene’ prevents the constraint ‘(RandomGenD g0)’ from being solved. Probable fix: use a type annotation to specify what ‘g0’ should be. These potential instances exist: instance [safe] RandomGenD g => RandomGenD (RList g) -- Defined in ‘RList’ instance [safe] RandomGenD StdGen -- Defined in ‘Rand’ • In the first argument of ‘gl_camera_info’, namely ‘scene’ In the first argument of ‘fork_render’, namely ‘(gl_camera_info scene)’ In the first argument of ‘VIO’, namely ‘(fork_render (gl_camera_info scene) (RLS.muts scene 5e-2 1e-1))’ | 137 | f (Val _ (VScene scene)) = base_val (VIO (fork_render (gl_camera_info scene) (RLS.muts scene 5e-2 1e-1))) | ^^^^^ My planned solution is simply to not use an existential type here. I can just make g a parameter of the BaseVal type. But I'm still curious to understand what went wrong. 
-- James From falsifian at falsifian.org Sat Jul 18 17:49:41 2020 From: falsifian at falsifian.org (James Cook) Date: Sat, 18 Jul 2020 17:49:41 +0000 Subject: [Haskell-cafe] What causes this "Ambiguous type variable" message (involving an existential type)? In-Reply-To: <626b0fa2-af9d-7e8a-5e94-a4fc1a3e3220@falsifian.org> References: <626b0fa2-af9d-7e8a-5e94-a4fc1a3e3220@falsifian.org> Message-ID: On 2020-07-18 5:33 p.m., James Cook wrote: > Hi haskell-cafe, > > I've run into a strange error. It's easy for me to work around, but I > still would like to know what causes it. > > Here's a minimal example. Oops, I just realized I'm not using existential types at all. A value of type T needs to be able to produce *any* instance of show requested, so the error makes sense. Sorry for the noise. I'm resurrecting some old code of mine and misinterpreted what I was doing. -- James From takenobu.hs at gmail.com Sun Jul 19 01:53:50 2020 From: takenobu.hs at gmail.com (Takenobu Tani) Date: Sun, 19 Jul 2020 10:53:50 +0900 Subject: [Haskell-cafe] [ANNOUNCE] GHC 8.8.4 is now available In-Reply-To: References: Message-ID: Hi Gregory, Windows 64-bit binary is here on the list [1]: ghc-8.8.4-x86_64-unknown-mingw32.tar.xz [1]: https://www.haskell.org/ghc/download_ghc_8_8_4.html#windows64 Regards, Takenobu On Sat, Jul 18, 2020 at 9:51 PM Gregory Guthrie wrote: > > I don't see any windows binaries listed; are they there, elsewhere? > > Lots of Linux (=0.7% desktop market share) versions, but Windows (=78%) only source?. > ---------------------------------------------------------------- > -----Original Message----- > From: Haskell-Cafe On Behalf Of haskell-cafe- > On Wed, Jul 15, 2020 at 07:38:31PM -0400, Ben Gamari wrote: > > The GHC team is proud to announce the release of GHC 8.8.4. The source > > distribution, binary distributions, and documentation are available at > > > > https://downloads.haskell.org/~ghc/8.8.4 > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. From guthrie at miu.edu Sun Jul 19 03:13:30 2020 From: guthrie at miu.edu (Gregory Guthrie) Date: Sun, 19 Jul 2020 03:13:30 +0000 Subject: [Haskell-cafe] [ANNOUNCE] GHC 8.8.4 is now available In-Reply-To: References: Message-ID: Thank you - I didn't recognize it!! Very helpful. Can I install this and have it work with Haskell Platform, which installs 8.6.5? ---------------------------------------------------------------- -----Original Message----- Subject: Re: [Haskell-cafe] [ANNOUNCE] GHC 8.8.4 is now available Hi Gregory, Windows 64-bit binary is here on the list [1]: ghc-8.8.4-x86_64-unknown-mingw32.tar.xz [1]: https://www.haskell.org/ghc/download_ghc_8_8_4.html#windows64 Regards, Takenobu From takenobu.hs at gmail.com Sun Jul 19 09:28:21 2020 From: takenobu.hs at gmail.com (Takenobu Tani) Date: Sun, 19 Jul 2020 18:28:21 +0900 Subject: [Haskell-cafe] [ANNOUNCE] GHC 8.8.4 is now available In-Reply-To: References: Message-ID: I'm not sure about the combination of ghc 8.8.4 and haskell-platform 8.6.5 on Windows. Perhaps, it couldn't perform. There are several ways to use ghc 8.8.4 and packages: * Use cabal-install for ghc 8.8.4 [1] * Wait stackage's LTS for ghc-8.8.4 [2] Does anyone have a better way? 
[1]: https://www.haskell.org/cabal/ [2]: https://www.stackage.org/ Regards, Takenobu On Sun, Jul 19, 2020 at 12:13 PM Gregory Guthrie wrote: > > Thank you - I didn't recognize it!! > Very helpful. > > Can I install this and have it work with Haskell Platform, which installs 8.6.5? > ---------------------------------------------------------------- > > -----Original Message----- > Subject: Re: [Haskell-cafe] [ANNOUNCE] GHC 8.8.4 is now available > > Hi Gregory, > > Windows 64-bit binary is here on the list [1]: > ghc-8.8.4-x86_64-unknown-mingw32.tar.xz > > [1]: https://www.haskell.org/ghc/download_ghc_8_8_4.html#windows64 > > Regards, > Takenobu From tom-lists-haskell-cafe-2017 at jaguarpaw.co.uk Mon Jul 20 18:18:11 2020 From: tom-lists-haskell-cafe-2017 at jaguarpaw.co.uk (Tom Ellis) Date: Mon, 20 Jul 2020 19:18:11 +0100 Subject: [Haskell-cafe] GADT/Typeable/existential behaviour that I don't understand Message-ID: <20200720181811.GA28485@cloudinit-builder> I can define the following import Data.Typeable data Foo where Foo :: Typeable x => x -> Foo eq = (\a b -> eqT) :: (Typeable a, Typeable b) => a -> b -> Maybe (a :~: b) and then these expressions work as expected > case Foo "Hello" of Foo x1 -> case eq x1 "" of { Nothing -> "It was not a string"; Just Refl -> "It was a string" } "It was a string" > case Foo 1 of Foo x1 -> case eq x1 "" of { Nothing -> "It was not a string"; Just Refl -> "It was a string" } "It was not a string" But if I omit the 'Nothing' branch (as below) I get "Couldn't match expected type ‘p’ with actual type ‘[Char]’ ‘p’ is untouchable". Can anyone explain why this happens? > case Foo "Hello" of Foo x1 -> case eq x1 "" of { Just Refl -> "It was not a string" } > case Foo 1 of Foo x1 -> case eq x1 "" of { Just Refl -> "It was not a string" } From ida.bzowska at gmail.com Mon Jul 20 18:48:52 2020 From: ida.bzowska at gmail.com (Ida Bzowska) Date: Mon, 20 Jul 2020 20:48:52 +0200 Subject: [Haskell-cafe] Haskell Love Conference (the 31st of July & 1st of August, 2020) In-Reply-To: References: <9ed57968297f0008379366660bf4ba273222492b.camel@joachim-breitner.de> Message-ID: Hi there, We just opened the registration! Grab a ticket (it's free!) https://www.eventbrite.com/e/haskell-love-tickets-113273839102 λCheers, Ida Bzowska wt., 7 lip 2020 o 15:41 Ida Bzowska napisał(a): > Hey, > > Quick reminder, you have less than 12 hours if you want to submit > something (CFP is open until the 8th of July; 0:01 a.m. PDT). Good luck, I > keep my fingers crossed for you! > > λCheers, > Ida Bzowska > > pt., 26 cze 2020 o 13:39 Ida Bzowska napisał(a): > >> Hi, >> >> Indeed, that's a very accurate comment :) so we are waiting for 30 >> minutes long talks, but if you have a longer presentation in mind, we will >> try to be agile and make a schedule that will handle it. The presentation >> flow will be based on screen sharing for sure, but this time probably it >> would not be zoom (we used it previously >> https://www.youtube.com/watch?v=Z0w_pITUTyU). >> <3 We are still waiting for submissions (till the 1st of July). <3 >> >> λCheers, >> Ida Bzowska >> >> >> >> czw., 25 cze 2020 o 18:25 Joachim Breitner >> napisał(a): >> >>> Hi, >>> >>> >>> Am Freitag, den 19.06.2020, 12:40 +0200 schrieb Ida Bzowska: >>> > Every speaker gets an avatar. If this thing will encourage you to take >>> part in CFP, I assure you: you will get one! >>> >>> I can't deny it does… >>> >>> But the CFP could have a bit more information, e.g. 
suggested lengths >>> of talks, and how you will handle the technicalities – do speakers just >>> screen-share with zoom? >>> >>> Cheers, >>> Joachim >>> >>> -- >>> Joachim Breitner >>> mail at joachim-breitner.de >>> http://www.joachim-breitner.de/ >>> >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From adam at well-typed.com Mon Jul 20 20:03:14 2020 From: adam at well-typed.com (Adam Gundry) Date: Mon, 20 Jul 2020 21:03:14 +0100 Subject: [Haskell-cafe] GADT/Typeable/existential behaviour that I don't understand In-Reply-To: <20200720181811.GA28485@cloudinit-builder> References: <20200720181811.GA28485@cloudinit-builder> Message-ID: <1af47882-a904-f74e-aa1b-6ad09e6d7e72@well-typed.com> Hi Tom, The mention of "untouchable" type variables indicates that this is a type inference problem, and indeed, if you add a `:: String` type signature your expression will be accepted. The problem is determining the type of the whole expression, which is what the unification variable `p` stands for. When type-checking pattern-matches on GADTs, the GADT brings new constraints into scope (e.g. your match on `Just Refl` brings into scope a constraint that the type of `x1` must be `String`). However, this means that any constraints that arise under the match cannot be used to solve for unification variables "outside" the match, because in general there may be multiple solutions. In your example, the RHS of the `Just Refl` case leads to a constraint that `p ~ String`, which cannot be solved directly. When there is a `Nothing` case, its RHS also leads to a `p ~ String` constraint, this time not under a GADT pattern match, so `p` gets solved with `String` and type inference succeeds. But in the absence of the `Nothing` case, there is no reason for type inference to pick that solution. In fact, if we consider just case eq x1 "" of { Just Refl -> "It was not a string" } in isolation, and suppose `x1 :: t`, this can be given two incomparable most general types, namely `String` and `t`. So type inference refuses to pick, even though in your case only `String` would work out later, but seeing that requires non-local reasoning about escaped existentials. Hope this helps, Adam On 20/07/2020 19:18, Tom Ellis wrote: > I can define the following > > import Data.Typeable > data Foo where Foo :: Typeable x => x -> Foo > eq = (\a b -> eqT) :: (Typeable a, Typeable b) => a -> b -> Maybe (a :~: b) > > and then these expressions work as expected > > >> case Foo "Hello" of Foo x1 -> case eq x1 "" of { Nothing -> "It was not a string"; Just Refl -> "It was a string" } > > "It was a string" > >> case Foo 1 of Foo x1 -> case eq x1 "" of { Nothing -> "It was not a string"; Just Refl -> "It was a string" } > "It was not a string" > > > But if I omit the 'Nothing' branch (as below) I get "Couldn't match > expected type ‘p’ with actual type ‘[Char]’ ‘p’ is untouchable". > > Can anyone explain why this happens? 
> > >> case Foo "Hello" of Foo x1 -> case eq x1 "" of { Just Refl -> "It was not a string" } >> case Foo 1 of Foo x1 -> case eq x1 "" of { Just Refl -> "It was not a string" } -- Adam Gundry, Haskell Consultant Well-Typed LLP, https://www.well-typed.com/ Registered in England & Wales, OC335890 118 Wymering Mansions, Wymering Road, London W9 2NF, England From tom-lists-haskell-cafe-2017 at jaguarpaw.co.uk Tue Jul 21 07:13:26 2020 From: tom-lists-haskell-cafe-2017 at jaguarpaw.co.uk (Tom Ellis) Date: Tue, 21 Jul 2020 08:13:26 +0100 Subject: [Haskell-cafe] GADT/Typeable/existential behaviour that I don't understand In-Reply-To: <1af47882-a904-f74e-aa1b-6ad09e6d7e72@well-typed.com> References: <20200720181811.GA28485@cloudinit-builder> <1af47882-a904-f74e-aa1b-6ad09e6d7e72@well-typed.com> Message-ID: <20200721071326.GA21436@cloudinit-builder> Thanks, your example is much simpler and clearer, and shows it has nothing to do with the GADT "Foo". But I'm confused about the relevance of `x1 :: t`. In fact the following example is even simpler and clearer and doesn't mention `eq` at all. Could you explain why the existential that comes from matching on Refl means that the return value cannot be inferred as `String`? * Works \case { Just Refl -> "Same"; Nothing -> "Different" } * Does not work \case { Just Refl -> "Same" } On Mon, Jul 20, 2020 at 09:03:14PM +0100, Adam Gundry wrote: > In fact, if we consider just > > case eq x1 "" of { Just Refl -> "It was not a string" } > > in isolation, and suppose `x1 :: t`, this can be given two incomparable > most general types, namely `String` and `t`. So type inference refuses > to pick, even though in your case only `String` would work out later, > but seeing that requires non-local reasoning about escaped existentials. > > On 20/07/2020 19:18, Tom Ellis wrote: > > I can define the following > > > > import Data.Typeable > > data Foo where Foo :: Typeable x => x -> Foo > > eq = (\a b -> eqT) :: (Typeable a, Typeable b) => a -> b -> Maybe (a :~: b) > > > > and then these expressions work as expected > > > > > >> case Foo "Hello" of Foo x1 -> case eq x1 "" of { Nothing -> "It was not a string"; Just Refl -> "It was a string" } > > > > "It was a string" > > > >> case Foo 1 of Foo x1 -> case eq x1 "" of { Nothing -> "It was not a string"; Just Refl -> "It was a string" } > > "It was not a string" > > > > > > But if I omit the 'Nothing' branch (as below) I get "Couldn't match > > expected type ‘p’ with actual type ‘[Char]’ ‘p’ is untouchable". > > > > Can anyone explain why this happens? > > > > > >> case Foo "Hello" of Foo x1 -> case eq x1 "" of { Just Refl -> "It was not a string" } > >> case Foo 1 of Foo x1 -> case eq x1 "" of { Just Refl -> "It was not a string" } > > -- > Adam Gundry, Haskell Consultant > Well-Typed LLP, https://www.well-typed.com/ > > Registered in England & Wales, OC335890 > 118 Wymering Mansions, Wymering Road, London W9 2NF, England > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. 
From adam at well-typed.com Tue Jul 21 08:01:52 2020 From: adam at well-typed.com (Adam Gundry) Date: Tue, 21 Jul 2020 09:01:52 +0100 Subject: [Haskell-cafe] GADT/Typeable/existential behaviour that I don't understand In-Reply-To: <20200721071326.GA21436@cloudinit-builder> References: <20200720181811.GA28485@cloudinit-builder> <1af47882-a904-f74e-aa1b-6ad09e6d7e72@well-typed.com> <20200721071326.GA21436@cloudinit-builder> Message-ID: <4042eb46-d6e0-a47f-0a04-1a2d0005f8a7@well-typed.com> On 21/07/2020 08:13, Tom Ellis wrote: > Thanks, your example is much simpler and clearer, and shows it has > nothing to do with the GADT "Foo". But I'm confused about the > relevance of `x1 :: t`. In fact the following example is even simpler > and clearer and doesn't mention `eq` at all. Could you explain why > the existential that comes from matching on Refl means that the return > value cannot be inferred as `String`? > > * Works > > \case { Just Refl -> "Same"; Nothing -> "Different" } > > * Does not work > > \case { Just Refl -> "Same" } Sure, that's a nice simple example, and shows that the crucial aspect is really the GADT pattern match. Let's recall the definition of type equality (modulo details): data a :~: b where Refl :: (a ~ b) => a :~: b There aren't any existential type variables here, just an equality constraint, which will be "provided" when pattern-matching on `Refl`. In both your examples, type inference determines that the type of the expression must be `Maybe (a :~: b) -> p` for some as-yet-unknown `a`, `b` and `p`. The RHSs of the patterns must then be used to determine `p`. But if all we have is \case { Just Refl -> "Same" } then the pattern-match on `Just Refl` introduces a given constraint `a ~ b` and we need to solve `p ~ String` under that assumption. The presence of the assumption means that simply unifying `p` with `String` isn't necessarily correct (more precisely, it isn't necessarily a unique most general solution). Thus type inference must refrain from unifying them. If the assumption isn't present (as in the `Nothing` case), it can just go ahead and unify. The underlying reason for this restriction is that type inference should return principal types (i.e. every possible type of the expression should be an instance of the inferred type). But with GADTs this isn't always possible. Notice that your second case can be given any of the types Maybe (a :~: b) -> String Maybe (a :~: String) -> a Maybe (String :~: a) -> a so it doesn't have a principal type for type inference to find. But when the `Nothing` branch is present, only the first of these types is possible. Does this make things clearer? Adam > On Mon, Jul 20, 2020 at 09:03:14PM +0100, Adam Gundry wrote: >> In fact, if we consider just >> >> case eq x1 "" of { Just Refl -> "It was not a string" } >> >> in isolation, and suppose `x1 :: t`, this can be given two incomparable >> most general types, namely `String` and `t`. So type inference refuses >> to pick, even though in your case only `String` would work out later, >> but seeing that requires non-local reasoning about escaped existentials. 
>> >> On 20/07/2020 19:18, Tom Ellis wrote: >>> I can define the following >>> >>> import Data.Typeable >>> data Foo where Foo :: Typeable x => x -> Foo >>> eq = (\a b -> eqT) :: (Typeable a, Typeable b) => a -> b -> Maybe (a :~: b) >>> >>> and then these expressions work as expected >>> >>> >>>> case Foo "Hello" of Foo x1 -> case eq x1 "" of { Nothing -> "It was not a string"; Just Refl -> "It was a string" } >>> >>> "It was a string" >>> >>>> case Foo 1 of Foo x1 -> case eq x1 "" of { Nothing -> "It was not a string"; Just Refl -> "It was a string" } >>> "It was not a string" >>> >>> >>> But if I omit the 'Nothing' branch (as below) I get "Couldn't match >>> expected type ‘p’ with actual type ‘[Char]’ ‘p’ is untouchable". >>> >>> Can anyone explain why this happens? >>> >>> >>>> case Foo "Hello" of Foo x1 -> case eq x1 "" of { Just Refl -> "It was not a string" } >>>> case Foo 1 of Foo x1 -> case eq x1 "" of { Just Refl -> "It was not a string" } -- Adam Gundry, Haskell Consultant Well-Typed LLP, https://www.well-typed.com/ Registered in England & Wales, OC335890 118 Wymering Mansions, Wymering Road, London W9 2NF, England From tom-lists-haskell-cafe-2017 at jaguarpaw.co.uk Tue Jul 21 09:19:11 2020 From: tom-lists-haskell-cafe-2017 at jaguarpaw.co.uk (Tom Ellis) Date: Tue, 21 Jul 2020 10:19:11 +0100 Subject: [Haskell-cafe] GADT/Typeable/existential behaviour that I don't understand In-Reply-To: <4042eb46-d6e0-a47f-0a04-1a2d0005f8a7@well-typed.com> References: <20200720181811.GA28485@cloudinit-builder> <1af47882-a904-f74e-aa1b-6ad09e6d7e72@well-typed.com> <20200721071326.GA21436@cloudinit-builder> <4042eb46-d6e0-a47f-0a04-1a2d0005f8a7@well-typed.com> Message-ID: <20200721091911.GB21436@cloudinit-builder> Ah yes! All of the following are valid (i.e. when the type signature is provided explicitly). * (\case { Just Refl -> "Same" }) :: Maybe (a :~: b) -> String * (\case { Just Refl -> "Same" }) :: Maybe (String :~: b) -> b * (\case { Just Refl -> "Same" }) :: Maybe (a :~: String) -> a Furthermore, both of these are valid * -- inferred :: Typeable p => p -> p \b -> case eq "Hello" b of { Just Refl -> "Same"; Nothing -> b } * -- inferred :: Typeable b => b -> String \b -> case eq "Hello" b of { Just Refl -> "Same"; Nothing -> "Different" } So we could fill in the `Nothing` branch with either something of type `b` or something of type `String`. If we omit the `Nothing` branch and the type signature then the type inference engine has no way to know which one we meant! In the earlier examples, `b` was of type `String` so they would work out to the same thing (as was my implicit expectation), but this requires "non-local reasoning", as you mentioned. Parenthetically, I wonder if a "quick look" approach could resolve this particular case, but I see that making it work in general may be impossible. Thanks, Adam, for providing those enlightening examples. When I get far enough away from Hindley-Milner I lose the ability to predict how these things are going to work but your examples give me some useful guidance. Tom On Tue, Jul 21, 2020 at 09:01:52AM +0100, Adam Gundry wrote: [...] > The underlying reason for this restriction is that type inference should > return principal types (i.e. every possible type of the expression > should be an instance of the inferred type). But with GADTs this isn't > always possible. 
Notice that your second case can be given any of the types > > Maybe (a :~: b) -> String > Maybe (a :~: String) -> a > Maybe (String :~: a) -> a > > so it doesn't have a principal type for type inference to find. But when > the `Nothing` branch is present, only the first of these types is possible. > > On 21/07/2020 08:13, Tom Ellis wrote: > > On Mon, Jul 20, 2020 at 09:03:14PM +0100, Adam Gundry wrote: > >> In fact, if we consider just > >> > >> case eq x1 "" of { Just Refl -> "It was not a string" } > >> > >> in isolation, and suppose `x1 :: t`, this can be given two incomparable > >> most general types, namely `String` and `t`. So type inference refuses > >> to pick, even though in your case only `String` would work out later, > >> but seeing that requires non-local reasoning about escaped existentials. > >> > >> On 20/07/2020 19:18, Tom Ellis wrote: > >>> I can define the following > >>> > >>> import Data.Typeable > >>> data Foo where Foo :: Typeable x => x -> Foo > >>> eq = (\a b -> eqT) :: (Typeable a, Typeable b) => a -> b -> Maybe (a :~: b) > >>> > >>> and then these expressions work as expected > >>> > >>> > >>>> case Foo "Hello" of Foo x1 -> case eq x1 "" of { Nothing -> "It was not a string"; Just Refl -> "It was a string" } > >>> > >>> "It was a string" > >>> > >>>> case Foo 1 of Foo x1 -> case eq x1 "" of { Nothing -> "It was not a string"; Just Refl -> "It was a string" } > >>> "It was not a string" > >>> > >>> > >>> But if I omit the 'Nothing' branch (as below) I get "Couldn't match > >>> expected type ‘p’ with actual type ‘[Char]’ ‘p’ is untouchable". > >>> > >>> Can anyone explain why this happens? > >>> > >>> > >>>> case Foo "Hello" of Foo x1 -> case eq x1 "" of { Just Refl -> "It was not a string" } > >>>> case Foo 1 of Foo x1 -> case eq x1 "" of { Just Refl -> "It was not a string" } From dominik.schrempf at gmail.com Wed Jul 22 11:40:39 2020 From: dominik.schrempf at gmail.com (Dominik Schrempf) Date: Wed, 22 Jul 2020 13:40:39 +0200 Subject: [Haskell-cafe] Question about zippers on trees Message-ID: <87imefydzc.fsf@gmail.com> Hello Cafe! I am trying to modify a large 'Data.Tree.Tree'. I managed to modify node labels with specific indices in the form of @[Int]@ as they are defined in, for example, 'Control.Lens.At.Ixed' or 'Lens.Micro.GHC'. However, I also need to 1. modify the node label using information from nearby nodes (e.g., the children); 2. modify the tree structure itself; for example, I may want to change the sub-forest. Basically, I need a lens that focuses not on the node label, but on the node itself. I perceived that this is more difficult. I tried to use 'Control.Zipper'. I can use zippers to achieve point 1, albeit in a complicated way: (1) I need to go downwards to focus the specific node; (2) I need to traverse the children to collect data and save the data somewhere (how? in let bindings?); (3) I then go back upwards and change the node label using the collected data. Even so, I do not really manage to change the actual structure of the tree. I also briefly had a look at plates, but do not manage to use them in a proper way, maybe because the depth of my structures may be several hundred levels. Did you encounter similar problems in the past or could you point me to resources discussing these issues? Thank you! 
Dominik From andrew.thaddeus at gmail.com Wed Jul 22 11:54:19 2020 From: andrew.thaddeus at gmail.com (Andrew Martin) Date: Wed, 22 Jul 2020 07:54:19 -0400 Subject: [Haskell-cafe] Question about zippers on trees In-Reply-To: <87imefydzc.fsf@gmail.com> References: <87imefydzc.fsf@gmail.com> Message-ID: >From containers, Tree is defined as: data Tree a = Node { label :: a , children :: [Tree a] } (I've renamed the record labels.) What is a zipper into such a tree? I think that the [rosezipper]( https://hackage.haskell.org/package/rosezipper-0.2/docs/Data-Tree-Zipper.html ) library gives a good definition. I'll specialized it to rose trees: data TreePos a = Loc { _content :: Tree a -- ^ The currently selected tree. , _before :: [Tree a] -- ^ Forest to the left , _after :: [Tree a] -- ^ Forest to the right , _parents :: [([Tree a], a, [Tree a])] -- ^ Finger to the selected tree } I think that does it. I wouldn't recommend using a library for this kind though. Just define `TreePos` in your code and then write the functions that you happen to need. On Wed, Jul 22, 2020 at 7:41 AM Dominik Schrempf wrote: > Hello Cafe! > > I am trying to modify a large 'Data.Tree.Tree'. I managed to modify node > labels > with specific indices in the form of @[Int]@ as they are defined in, for > example, 'Control.Lens.At.Ixed' or 'Lens.Micro.GHC'. > > However, I also need to > 1. modify the node label using information from nearby nodes (e.g., the > children); > 2. modify the tree structure itself; for example, I may want to change the > sub-forest. > > Basically, I need a lens that focuses not on the node label, but on the > node > itself. I perceived that this is more difficult. > > I tried to use 'Control.Zipper'. I can use zippers to achieve point 1, > albeit in > a complicated way: (1) I need to go downwards to focus the specific node; > (2) I > need to traverse the children to collect data and save the data somewhere > (how? > in let bindings?); (3) I then go back upwards and change the node label > using > the collected data. Even so, I do not really manage to change the actual > structure of the tree. I also briefly had a look at plates, but do not > manage to > use them in a proper way, maybe because the depth of my structures may be > several hundred levels. > > Did you encounter similar problems in the past or could you point me to > resources discussing these issues? > > Thank you! > Dominik > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -- -Andrew Thaddeus Martin -------------- next part -------------- An HTML attachment was scrubbed... URL: From dominik.schrempf at gmail.com Wed Jul 22 12:02:19 2020 From: dominik.schrempf at gmail.com (Dominik Schrempf) Date: Wed, 22 Jul 2020 14:02:19 +0200 Subject: [Haskell-cafe] Question about zippers on trees In-Reply-To: References: <87imefydzc.fsf@gmail.com> Message-ID: <87ft9jycz8.fsf@gmail.com> Thank you for your fast answer. A direct implementation without using a library is interesting, thank you. I refrained from doing that, because I thought that Control.Zipper would actually do this for me. Actually, I was pretty successful with using Control.Zipper to change node labels, but failed doing more complicated stuff. 
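To make sure I understand the suggestion, the handful of functions I would write against your TreePos would be roughly these (an untested sketch, using your constructors positionally and keeping _before nearest-sibling-first):

fromTree :: Tree a -> TreePos a
fromTree t = Loc t [] [] []

-- Step down to the first child, remembering where we came from.
firstChild :: TreePos a -> Maybe (TreePos a)
firstChild (Loc (Node x (c : cs)) ls rs ps) = Just (Loc c [] cs ((ls, x, rs) : ps))
firstChild _ = Nothing

-- Rebuild one level: put the focused tree back among its siblings.
up :: TreePos a -> Maybe (TreePos a)
up (Loc t ls rs ((pls, x, prs) : ps)) = Just (Loc (Node x (reverse ls ++ t : rs)) pls prs ps)
up _ = Nothing

-- My point 2: replace the whole sub-forest at the focus.
setForest :: [Tree a] -> TreePos a -> TreePos a
setForest f (Loc (Node x _) ls rs ps) = Loc (Node x f) ls rs ps

-- My point 1: modify the label using information from the children.
modifyWithChildren :: (a -> [Tree a] -> a) -> TreePos a -> TreePos a
modifyWithChildren g (Loc (Node x cs) ls rs ps) = Loc (Node (g x cs) cs) ls rs ps

Repeatedly applying `up` until it returns Nothing and then taking the _content of the last position should give back the modified tree.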
Andrew Martin writes: > From containers, Tree is defined as: > > data Tree a = Node > { label :: a > , children :: [Tree a] > } > > (I've renamed the record labels.) What is a zipper into such a tree? I think > that the [rosezipper]( > https://hackage.haskell.org/package/rosezipper-0.2/docs/Data-Tree-Zipper.html > ) > library gives a good definition. I'll specialized it to rose trees: > > data TreePos a = Loc > { _content :: Tree a -- ^ The currently selected tree. > , _before :: [Tree a] -- ^ Forest to the left > , _after :: [Tree a] -- ^ Forest to the right > , _parents :: [([Tree a], a, [Tree a])] -- ^ Finger to the selected > tree > } > > I think that does it. I wouldn't recommend using a library for this kind > though. Just define `TreePos` in your code and then write the functions > that you happen to need. > > On Wed, Jul 22, 2020 at 7:41 AM Dominik Schrempf > wrote: > >> Hello Cafe! >> >> I am trying to modify a large 'Data.Tree.Tree'. I managed to modify node >> labels >> with specific indices in the form of @[Int]@ as they are defined in, for >> example, 'Control.Lens.At.Ixed' or 'Lens.Micro.GHC'. >> >> However, I also need to >> 1. modify the node label using information from nearby nodes (e.g., the >> children); >> 2. modify the tree structure itself; for example, I may want to change the >> sub-forest. >> >> Basically, I need a lens that focuses not on the node label, but on the >> node >> itself. I perceived that this is more difficult. >> >> I tried to use 'Control.Zipper'. I can use zippers to achieve point 1, >> albeit in >> a complicated way: (1) I need to go downwards to focus the specific node; >> (2) I >> need to traverse the children to collect data and save the data somewhere >> (how? >> in let bindings?); (3) I then go back upwards and change the node label >> using >> the collected data. Even so, I do not really manage to change the actual >> structure of the tree. I also briefly had a look at plates, but do not >> manage to >> use them in a proper way, maybe because the depth of my structures may be >> several hundred levels. >> >> Did you encounter similar problems in the past or could you point me to >> resources discussing these issues? >> >> Thank you! >> Dominik >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. From andrew.thaddeus at gmail.com Wed Jul 22 12:16:36 2020 From: andrew.thaddeus at gmail.com (Andrew Martin) Date: Wed, 22 Jul 2020 08:16:36 -0400 Subject: [Haskell-cafe] Question about zippers on trees In-Reply-To: <87ft9jycz8.fsf@gmail.com> References: <87imefydzc.fsf@gmail.com> <87ft9jycz8.fsf@gmail.com> Message-ID: It appears that you've already discovered this, but my experience with zippers has been that abstractions that generalize them are cumbersome, make type signatures more confusing, and don't really do that much work for me. Maybe there are other users that don't have this experience, but I prefer a more direct route with less generalization when I'm working with zippers. On Wed, Jul 22, 2020 at 8:02 AM Dominik Schrempf wrote: > Thank you for your fast answer. > > A direct implementation without using a library is interesting, thank you. > I > refrained from doing that, because I thought that Control.Zipper would > actually > do this for me. 
Actually, I was pretty successful with using > Control.Zipper to > change node labels, but failed doing more complicated stuff. > > Andrew Martin writes: > > > From containers, Tree is defined as: > > > > data Tree a = Node > > { label :: a > > , children :: [Tree a] > > } > > > > (I've renamed the record labels.) What is a zipper into such a tree? I > think > > that the [rosezipper]( > > > https://hackage.haskell.org/package/rosezipper-0.2/docs/Data-Tree-Zipper.html > > ) > > library gives a good definition. I'll specialized it to rose trees: > > > > data TreePos a = Loc > > { _content :: Tree a -- ^ The currently selected tree. > > , _before :: [Tree a] -- ^ Forest to the left > > , _after :: [Tree a] -- ^ Forest to the right > > , _parents :: [([Tree a], a, [Tree a])] -- ^ Finger to the > selected > > tree > > } > > > > I think that does it. I wouldn't recommend using a library for this kind > > though. Just define `TreePos` in your code and then write the functions > > that you happen to need. > > > > On Wed, Jul 22, 2020 at 7:41 AM Dominik Schrempf < > dominik.schrempf at gmail.com> > > wrote: > > > >> Hello Cafe! > >> > >> I am trying to modify a large 'Data.Tree.Tree'. I managed to modify node > >> labels > >> with specific indices in the form of @[Int]@ as they are defined in, for > >> example, 'Control.Lens.At.Ixed' or 'Lens.Micro.GHC'. > >> > >> However, I also need to > >> 1. modify the node label using information from nearby nodes (e.g., the > >> children); > >> 2. modify the tree structure itself; for example, I may want to change > the > >> sub-forest. > >> > >> Basically, I need a lens that focuses not on the node label, but on the > >> node > >> itself. I perceived that this is more difficult. > >> > >> I tried to use 'Control.Zipper'. I can use zippers to achieve point 1, > >> albeit in > >> a complicated way: (1) I need to go downwards to focus the specific > node; > >> (2) I > >> need to traverse the children to collect data and save the data > somewhere > >> (how? > >> in let bindings?); (3) I then go back upwards and change the node label > >> using > >> the collected data. Even so, I do not really manage to change the actual > >> structure of the tree. I also briefly had a look at plates, but do not > >> manage to > >> use them in a proper way, maybe because the depth of my structures may > be > >> several hundred levels. > >> > >> Did you encounter similar problems in the past or could you point me to > >> resources discussing these issues? > >> > >> Thank you! > >> Dominik > >> > >> _______________________________________________ > >> Haskell-Cafe mailing list > >> To (un)subscribe, modify options or view archives go to: > >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > >> Only members subscribed via the mailman list are allowed to post. > > -- -Andrew Thaddeus Martin -------------- next part -------------- An HTML attachment was scrubbed... URL: From anka.213 at gmail.com Wed Jul 22 12:59:34 2020 From: anka.213 at gmail.com (=?utf-8?Q?Andreas_K=C3=A4llberg?=) Date: Wed, 22 Jul 2020 14:59:34 +0200 Subject: [Haskell-cafe] Question about zippers on trees In-Reply-To: <87imefydzc.fsf@gmail.com> References: <87imefydzc.fsf@gmail.com> Message-ID: Nice timing, I was just reading Conal Elliott’s take on zippers [1][2] when you sent this question. It is interesting because it produces a concrete representation of the zippers via derivatives, like the manually constructed zippers, while still being completely generic*. 
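For Data.Tree the derivative works out, by hand, to essentially the manually constructed zipper discussed above — roughly this (my own transcription, so the names differ from what the library generates):

import Data.Tree (Tree(..))

-- One layer of context: the hole's position among its siblings,
-- plus the label of the parent node we descended from.
data TreeCtx a = TreeCtx [Tree a] a [Tree a]

-- A zipper is the focused subtree plus the stack of contexts above it.
type TreeZipper a = (Tree a, [TreeCtx a])

-- Plugging the focus back into its contexts recovers the whole tree
-- (left siblings kept nearest-first, hence the reverse).
plug :: TreeZipper a -> Tree a
plug (t, [])                   = t
plug (t, TreeCtx ls x rs : ps) = plug (Node x (reverse ls ++ t : rs), ps)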
I wonder if it is possible to combine the approach with lenses to get the same convenience while still being more concrete and flexible. Regards, Andreas [1]: http://conal.net/blog/posts/another-angle-on-zippers [2]: http://hackage.haskell.org/package/functor-combo-0.3.6/docs/FunctorCombo-ZipperFix.html * The library is sadly not implemented in terms of `GHC.Generics`, even though it uses the same combinators, since it predated it. > 22 Jul 2020 kl. 13:40 skrev Dominik Schrempf : > > Hello Cafe! > > I am trying to modify a large 'Data.Tree.Tree'. I managed to modify node labels > with specific indices in the form of @[Int]@ as they are defined in, for > example, 'Control.Lens.At.Ixed' or 'Lens.Micro.GHC'. > > However, I also need to > 1. modify the node label using information from nearby nodes (e.g., the > children); > 2. modify the tree structure itself; for example, I may want to change the > sub-forest. > > Basically, I need a lens that focuses not on the node label, but on the node > itself. I perceived that this is more difficult. > > I tried to use 'Control.Zipper'. I can use zippers to achieve point 1, albeit in > a complicated way: (1) I need to go downwards to focus the specific node; (2) I > need to traverse the children to collect data and save the data somewhere (how? > in let bindings?); (3) I then go back upwards and change the node label using > the collected data. Even so, I do not really manage to change the actual > structure of the tree. I also briefly had a look at plates, but do not manage to > use them in a proper way, maybe because the depth of my structures may be > several hundred levels. > > Did you encounter similar problems in the past or could you point me to > resources discussing these issues? > > Thank you! > Dominik > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jclites at mac.com Wed Jul 22 14:18:36 2020 From: jclites at mac.com (Jeff Clites) Date: Wed, 22 Jul 2020 07:18:36 -0700 Subject: [Haskell-cafe] Question about zippers on trees In-Reply-To: References: <87imefydzc.fsf@gmail.com> Message-ID: Shouldn’t that be: _before :: [TreePos a] etc.? Jeff > On Jul 22, 2020, at 4:54 AM, Andrew Martin wrote: > > From containers, Tree is defined as: > > data Tree a = Node > { label :: a > , children :: [Tree a] > } > > (I've renamed the record labels.) What is a zipper into such a tree? I think > that the [rosezipper](https://hackage.haskell.org/package/rosezipper-0.2/docs/Data-Tree-Zipper.html) > library gives a good definition. I'll specialized it to rose trees: > > data TreePos a = Loc > { _content :: Tree a -- ^ The currently selected tree. > , _before :: [Tree a] -- ^ Forest to the left > , _after :: [Tree a] -- ^ Forest to the right > , _parents :: [([Tree a], a, [Tree a])] -- ^ Finger to the selected tree > } > > I think that does it. I wouldn't recommend using a library for this kind > though. Just define `TreePos` in your code and then write the functions > that you happen to need. > >> On Wed, Jul 22, 2020 at 7:41 AM Dominik Schrempf wrote: >> Hello Cafe! >> >> I am trying to modify a large 'Data.Tree.Tree'. 
I managed to modify node labels >> with specific indices in the form of @[Int]@ as they are defined in, for >> example, 'Control.Lens.At.Ixed' or 'Lens.Micro.GHC'. >> >> However, I also need to >> 1. modify the node label using information from nearby nodes (e.g., the >> children); >> 2. modify the tree structure itself; for example, I may want to change the >> sub-forest. >> >> Basically, I need a lens that focuses not on the node label, but on the node >> itself. I perceived that this is more difficult. >> >> I tried to use 'Control.Zipper'. I can use zippers to achieve point 1, albeit in >> a complicated way: (1) I need to go downwards to focus the specific node; (2) I >> need to traverse the children to collect data and save the data somewhere (how? >> in let bindings?); (3) I then go back upwards and change the node label using >> the collected data. Even so, I do not really manage to change the actual >> structure of the tree. I also briefly had a look at plates, but do not manage to >> use them in a proper way, maybe because the depth of my structures may be >> several hundred levels. >> >> Did you encounter similar problems in the past or could you point me to >> resources discussing these issues? >> >> Thank you! >> Dominik >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > > > -- > -Andrew Thaddeus Martin > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mgajda at mimuw.edu.pl Wed Jul 22 15:08:14 2020 From: mgajda at mimuw.edu.pl (Michal J Gajda) Date: Wed, 22 Jul 2020 17:08:14 +0200 Subject: [Haskell-cafe] Fwd: Judge for programming language contest? In-Reply-To: References: Message-ID: Hi, I just noticed that in the age of the Cambrian explosion among new programming languages we also have a programming language jam organized this August: https://blog.repl.it/langjam Given that there are many prolific programming language designers and implementors here -- I wonder if we can get a Haskell judge so that advanced type systems can get a fair ruling? (In case you got another round of reviews from traditional PL conferences that described your paper as "badly typeset" and "wrong formatting was intentional", "repetitive" and "hard to understand", or even rude "expect a boring talk".) -- Cheers Michał From mikapoehls99 at gmx.de Wed Jul 22 15:11:29 2020 From: mikapoehls99 at gmx.de (=?UTF-8?Q?Mika_P=c3=b6hls?=) Date: Wed, 22 Jul 2020 17:11:29 +0200 Subject: [Haskell-cafe] ghc profiling Message-ID: Hello everyone, I have a question regarding ghc profiling. The .hp file, created during profiling, displays the memory usage up to a certain point, which seems to be the end of the program's execution (based on the execution time the program usually takes), as expected. But after that an extra sample is created that contains no data (no cost centers; it is just a begin sample followed by an end sample). This sample is also created outside the 0.1 second sampling gaps used before. For example, the last "correct" sample is at 17s and the empty sample is at 40s.
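Schematically, the tail of the .hp file then looks like this (cost center lines elided; times only illustrative):

BEGIN_SAMPLE 17.0
  ... cost center lines with their byte counts ...
END_SAMPLE 17.0
BEGIN_SAMPLE 40.0
END_SAMPLE 40.0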
When plotting the data, this results in a monotonically increasing graph (for all centers) that has its maximum at the last correct sample (17s) followed by a straight line to zero (40s). Does anyone know what leads to this behavior and how it can be prevented? Thanks in advance Mika From dxld at darkboxed.org Wed Jul 22 15:42:44 2020 From: dxld at darkboxed.org (Daniel =?iso-8859-1?Q?Gr=F6ber?=) Date: Wed, 22 Jul 2020 17:42:44 +0200 Subject: [Haskell-cafe] ghc profiling In-Reply-To: References: Message-ID: <20200722154244.GA6916@Eli.clients.dxld.at> Hi, tl;dr: I got a fix for this merged: https://gitlab.haskell.org/ghc/ghc/-/merge_requests/3091 However I don't think this has landed in a GHC release yet. On Wed, Jul 22, 2020 at 05:11:29PM +0200, Mika Pöhls wrote: > But after that an extra sample is created that contains no data (no > cost centers; it is just a begin sample followed by an end sample). This > sample is also created outside the 0.1 second sampling gaps used before. FYI the final sample is created when the RTS is being torn down; that's why it's potentially outside of the sampling interval, but the reason for the time skew is that we were using the wrong timebase when printing that last sample in the RTS. > For example, the last "correct" sample is at 17s and the empty sample is > at 40s. When plotting the data, this results in a monotonically increasing > graph (for all centers) that has its maximum at the last correct sample > (17s) followed by a straight line to zero (40s). The way us GHC devs used to deal with that is to just edit that last sample out of the hp file ;) A quick shell one-liner to do this: tac foo.hp | tail -n+3 | tac > foo.fixed.hp The `tac` reverses the order of lines and `tail -n+3` ignores the first two lines. I'm not sure why the RTS emits this last empty sample to begin with; maybe we should just remove it, since the tools don't seem to care either way. > Does anyone know what leads to this behavior and how it can be prevented? I think you could also try using eventlog profiling instead of the old .hp stuff to work around this if you like, but I'm not very familiar with that either so I can't help there. --Daniel From jeffbrown.the at gmail.com Wed Jul 22 16:44:51 2020 From: jeffbrown.the at gmail.com (Jeffrey Brown) Date: Wed, 22 Jul 2020 11:44:51 -0500 Subject: [Haskell-cafe] Question about zippers on trees In-Reply-To: References: <87imefydzc.fsf@gmail.com> Message-ID: If you want an abstract solution, there's https://hackage.haskell.org/package/lens-3.2/docs/Control-Lens-Zipper.html. On Wed, Jul 22, 2020 at 9:20 AM Jeff Clites via Haskell-Cafe < haskell-cafe at haskell.org> wrote: > Shouldn’t that be: > > _before :: [TreePos a] > > etc.? > > Jeff > > On Jul 22, 2020, at 4:54 AM, Andrew Martin > wrote: > > From containers, Tree is defined as: > > data Tree a = Node > { label :: a > , children :: [Tree a] > } > > (I've renamed the record labels.) What is a zipper into such a tree? I > think > that the [rosezipper]( > https://hackage.haskell.org/package/rosezipper-0.2/docs/Data-Tree-Zipper.html > ) > library gives a good definition. I'll specialized it to rose trees: > > data TreePos a = Loc > { _content :: Tree a -- ^ The currently selected tree. > , _before :: [Tree a] -- ^ Forest to the left > , _after :: [Tree a] -- ^ Forest to the right > , _parents :: [([Tree a], a, [Tree a])] -- ^ Finger to the > selected tree > } > > I think that does it. I wouldn't recommend using a library for this kind > though.
Just define `TreePos` in your code and then write the functions > that you happen to need. > > On Wed, Jul 22, 2020 at 7:41 AM Dominik Schrempf < > dominik.schrempf at gmail.com> wrote: > >> Hello Cafe! >> >> I am trying to modify a large 'Data.Tree.Tree'. I managed to modify node >> labels >> with specific indices in the form of @[Int]@ as they are defined in, for >> example, 'Control.Lens.At.Ixed' or 'Lens.Micro.GHC'. >> >> However, I also need to >> 1. modify the node label using information from nearby nodes (e.g., the >> children); >> 2. modify the tree structure itself; for example, I may want to change the >> sub-forest. >> >> Basically, I need a lens that focuses not on the node label, but on the >> node >> itself. I perceived that this is more difficult. >> >> I tried to use 'Control.Zipper'. I can use zippers to achieve point 1, >> albeit in >> a complicated way: (1) I need to go downwards to focus the specific node; >> (2) I >> need to traverse the children to collect data and save the data somewhere >> (how? >> in let bindings?); (3) I then go back upwards and change the node label >> using >> the collected data. Even so, I do not really manage to change the actual >> structure of the tree. I also briefly had a look at plates, but do not >> manage to >> use them in a proper way, maybe because the depth of my structures may be >> several hundred levels. >> >> Did you encounter similar problems in the past or could you point me to >> resources discussing these issues? >> >> Thank you! >> Dominik >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > > > > -- > -Andrew Thaddeus Martin > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -- Jeff Brown | Jeffrey Benjamin Brown Website | Facebook | LinkedIn (spammy, so I often miss messages here) | Github -------------- next part -------------- An HTML attachment was scrubbed... URL: From olf at aatal-apotheke.de Wed Jul 22 20:10:35 2020 From: olf at aatal-apotheke.de (Olaf Klinke) Date: Wed, 22 Jul 2020 22:10:35 +0200 Subject: [Haskell-cafe] Question about zippers on trees Message-ID: > A direct implementation without using a library is interesting, thank > you. I > refrained from doing that, because I thought that Control.Zipper > would actually > do this for me. Actually, I was pretty successful with using > Control.Zipper to > change node labels, but failed doing more complicated stuff. > Isn't that a strong indicator that zippers are an improper abstraction for your purpose? Perhaps after rolling your own implementation you can more easily discover how to represent the algrorithm as a zipper. 
Olaf From alan.zimm at gmail.com Wed Jul 22 22:39:22 2020 From: alan.zimm at gmail.com (Alan & Kim Zimmerman) Date: Wed, 22 Jul 2020 23:39:22 +0100 Subject: [Haskell-cafe] Question about zippers on trees In-Reply-To: References: Message-ID: I am a bit late to this discussion, but do recall using the rosezipper in HaRe. And I just took a look and recall doing some experiments with zippers, which are at https://github.com/alanz/HaRe/tree/98f390b6e9d48537429863ca890aa853afcd7c79/experiments The actual code I used for the move definition refactoring is at https://github.com/alanz/HaRe/blob/98f390b6e9d48537429863ca890aa853afcd7c79/src/Language/Haskell/Refact/Refactoring/MoveDef.hs#L355 I added a couple of helper functions too. I am sure it is all horrible code, I was learning at the time, and it sort of blew my mind. Alan On Wed, 22 Jul 2020 at 21:11, Olaf Klinke wrote: > > A direct implementation without using a library is interesting, thank > > you. I > > refrained from doing that, because I thought that Control.Zipper > > would actually > > do this for me. Actually, I was pretty successful with using > > Control.Zipper to > > change node labels, but failed doing more complicated stuff. > > > Isn't that a strong indicator that zippers are an improper abstraction > for your purpose? Perhaps after rolling your own implementation you can > more easily discover how to represent the algrorithm as a zipper. > > Olaf > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From compl.yue at icloud.com Thu Jul 23 14:51:21 2020 From: compl.yue at icloud.com (YueCompl) Date: Thu, 23 Jul 2020 22:51:21 +0800 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? Message-ID: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> Hello Cafe, I'm working on an in-memory database. In Client/Server mode I just let each connected client submit remote procedure calls running in its dedicated lightweight thread, modifying TVars in RAM per its business needs; then, in case many clients connect concurrently and try to insert new data, if they trigger a global index (some TVar) update, the throughput drops drastically. I reduced the shared state to a simple int counter in a TVar and got the same symptom. Meanwhile, the parallelism feels okay when there's no hot TVar conflict, or when M is not much greater than N. As an empirical test workload, I have a `+RTS -N10` server process; it handles 10 concurrent clients okay, with ~5x the single-thread throughput. But when handling 20 concurrent clients, each of the 10 CPUs can only be driven to ~10% utilization, and the throughput seems even worse than a single thread. More clients can even drive it to thrashing without much progress. After reading [1] I can understand that pure STM doesn't scale well, and I see it suggests [7] as attractive, planned future work toward that direction. But I can't find concrete libraries or frameworks addressing large-M-over-small-N scenarios; [1] experimented with designated N parallelism, and [7] is rather theoretical for my empirical needs. Can you direct me to some available library implementing the methodology proposed in [7] or other ways tackling this problem?
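(For reference, the stripped-down counter test I mentioned above is essentially just this — a sketch rather than the real server code, with the RPC plumbing removed; it assumes the async package:)

import Control.Concurrent.Async (replicateConcurrently_)
import Control.Concurrent.STM
import Control.Monad (replicateM_)

main :: IO ()
main = do
  counter <- newTVarIO (0 :: Int)
  -- 20 "clients", each bumping the one hot TVar many times
  replicateConcurrently_ 20 $
    replicateM_ 100000 $
      atomically $ modifyTVar' counter (+ 1)
  readTVarIO counter >>= print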
I think the most difficult one is that a transaction should commit with global indices (with possibly unique constraints) atomically updated, and rollback with any violation of constraints, i.e. transactions have to cover global states like indices. Other problems seem more trivial than this. Specifically, [7] states: > It must be emphasized that all of the mechanisms we deploy originate, in one form or another, in the database literature from the 70s and 80s. Our contribution is to adapt these techniques to software transactional memory, providing more effective solutions to important STM problems than prior proposals. I wonder any STM based library has simplified those techniques to be composed right away? I don't really want to implement those mechanisms by myself, rebuilding many wheels from scratch. Best regards, Compl [1] Comparing the performance of concurrent linked-list implementations in Haskell https://simonmar.github.io/bib/papers/concurrent-data.pdf [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages 207–216. ACM Press, 2008. https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf -------------- next part -------------- An HTML attachment was scrubbed... URL: From jo at durchholz.org Thu Jul 23 16:25:12 2020 From: jo at durchholz.org (Joachim Durchholz) Date: Thu, 23 Jul 2020 18:25:12 +0200 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> Message-ID: While I can't contribute any Haskell knowledge, I know that many threads updating the same variable is the worst thing you can do; not only do you create a single bottleneck, if you have your threads running on multiple cores you get CPU pipeline stalls, L1 cache line flushes, and/or complicated cache coherency protocols executed between cores. It's not cheap: each of these mechanisms can take hundreds of CPU cycles, for a CPU that can execute multiple instructions per CPU cycle. Incrementing a global counter is a really hard problem in multithreading... I believe this is the reason why databases typically implement a SEQUENCE mechanism, and these sequences are usually implemented as "whenever a transaction asks for a sequence number, reserve a block of 1,000 numbers for it so it can retrieve 999 additional numbers without the synchronization overhead. This is also why real databases use transactions - these do not just isolate processes from each other's updates, they also allow the DB to let the transaction work on a snapshot and do all the synchronization once, during COMMIT. And, as you just discovered, it's one of the major optimization areas in database engines :-) TL;DR for the bad news: I suspect your problem is just unavoidable However, I see a workaround: delayed index update. Have each index twice: last-known and next-future. last-known is what was created during the last index update. You need an append-only list of records that had an index field update, and all searches that use the index will also have to do a linear search in that list. next-future is built in the background. It takes last-known and the updates from the append-only list, and generates a new index. Once next-future is finished, replace last-known with it. 
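Sketched in Haskell terms just to fix the shape (I said I can't contribute Haskell knowledge, so treat the names and details as made up, not as idiomatic code):

import Control.Concurrent.STM
import Control.Exception (evaluate)
import Data.Map.Strict (Map)
import qualified Data.Map.Strict as Map

data Index k v = Index
  { lastKnown :: TVar (Map k v)   -- snapshot from the last rebuild
  , pending   :: TVar [(k, v)]    -- append-only updates since then
  }

-- Writers only touch the small pending list, not the big index.
record :: Index k v -> k -> v -> STM ()
record idx k v = modifyTVar' (pending idx) ((k, v) :)

-- Readers consult the pending list first, then the snapshot.
lookupIdx :: Ord k => Index k v -> k -> STM (Maybe v)
lookupIdx idx k = do
  recent <- readTVar (pending idx)
  case lookup k recent of
    Just v  -> pure (Just v)
    Nothing -> Map.lookup k <$> readTVar (lastKnown idx)

-- Background rebuild: snapshot, build next-future outside any lock,
-- then swap it in at a single synchronisation point.
rebuild :: Ord k => Index k v -> IO ()
rebuild idx = do
  (upd, old) <- atomically $ (,) <$> readTVar (pending idx) <*> readTVar (lastKnown idx)
  let next = foldr (\(k, v) m -> Map.insert k v m) old upd  -- newest entries win
  _ <- evaluate next  -- do the expensive build here, outside the swap transaction
  atomically $ do
    cur <- readTVar (pending idx)
    writeTVar (lastKnown idx) next
    writeTVar (pending idx) (take (length cur - length upd) cur)  -- keep only newer updates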
You still need to take a global lock while replacing indexes, but you only have to lock the index once instead of for every single update. You'll have to twiddle with parameters such as "at what point do I start a new index build", and you'll have to make sure that your linear list isn't yet another bottleneck (there are lock-free data structures to achieve such a thing, but these are complicated; or you can tell application programmers to try and collect as many updates as possible in a transaction so the number of synchronization points is smaller; however, too-large transactions can generate CPU cache overflows if the collected update data becomes too large, so there's a whole lot of tweaking, studying real performance data, hopefully finding the right set of diagnostic information to collect that allow the DB to automatically choose the right point to do its updates, etc. pp.) TL;DR for the good news: You can coalesce N updates into one and divide the CPU core coordination overhead by a factor of N. You'll increase the bus pressure, so there's tons of fine tuning you can do (or avoid) after getting the first 90% of the speedup. (I'm drawing purely speculative numbers out of my hat here.) Liability: You will want to add transactions and (likely) optimistic locking, if you don't have that already: Transaction boundaries are the natural point for coalescing updates. Regards, Jo From cma at bitemyapp.com Thu Jul 23 16:57:51 2020 From: cma at bitemyapp.com (Christopher Allen) Date: Thu, 23 Jul 2020 11:57:51 -0500 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> Message-ID: It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." You give up fairness when you use STM instead of MVars or equivalent structures. That means a long running transaction might get stampeded by many small ones invalidating it over and over. The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads. The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose. e.g. https://hackage.haskell.org/package/stm-containers https://hackage.haskell.org/package/ttrie It also sounds a bit like your question bumps into Amdahl's Law. All else fails, stop using STM and find something more tuned to your problem space.
Hope this helps, Chris Allen On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe < haskell-cafe at haskell.org> wrote: > Hello Cafe, > > I'm working on an in-memory database, in Client/Server mode I just let > each connected client submit remote procedure call running in its dedicated > lightweight thread, modifying TVars in RAM per its business needs, then in > case many clients connected concurrently and trying to insert new data, if > they are triggering global index (some TVar) update, the throughput would > drop drastically. I reduced the shared state to a simple int counter by > TVar, got same symptom. While the parallelism feels okay when there's no > hot TVar conflicting, or M is not much greater than N. > > As an empirical test workload, I have a `+RTS -N10` server process, it > handles 10 concurrent clients okay, got ~5x of single thread throughput; > but in handling 20 concurrent clients, each of the 10 CPUs can only be > driven to ~10% utilization, the throughput seems even worse than single > thread. More clients can even drive it thrashing without much progressing. > > I can understand that pure STM doesn't scale well after reading [1], and > I see it suggested [7] attractive and planned future work toward that > direction. > > But I can't find certain libraries or frameworks addressing large M over > small N scenarios, [1] experimented with designated N parallelism, and [7] > is rather theoretical to my empirical needs. > > Can you direct me to some available library implementing the methodology > proposed in [7] or other ways tackling this problem? > > I think the most difficult one is that a transaction should commit with > global indices (with possibly unique constraints) atomically updated, and > rollback with any violation of constraints, i.e. transactions have to cover > global states like indices. Other problems seem more trivial than this. > > Specifically, [7] states: > > > It must be emphasized that all of the mechanisms we deploy originate, in > one form or another, in the database literature from the 70s and 80s. Our > contribution is to adapt these techniques to software transactional memory, > providing more effective solutions to important STM problems than prior > proposals. > > I wonder any STM based library has simplified those techniques to be > composed right away? I don't really want to implement those mechanisms by > myself, rebuilding many wheels from scratch. > > Best regards, > Compl > > > [1] Comparing the performance of concurrent linked-list implementations in > Haskell > https://simonmar.github.io/bib/papers/concurrent-data.pdf > > [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for > highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages > 207–216. ACM Press, 2008. > https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -- Chris Allen Currently working on http://haskellbook.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From compl.yue at icloud.com Fri Jul 24 06:11:03 2020 From: compl.yue at icloud.com (Compl Yue) Date: Fri, 24 Jul 2020 14:11:03 +0800 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? 
In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> Message-ID: <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> Thanks Chris, I confess I didn't pay enough attention to STM specialized container libraries by far, I skimmed through the description of stm-containers and ttrie, and feel they would definitely improve my code's performance in case I limit the server's parallelism within hardware capabilities. That may because I'm still prototyping the api and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, only if at low contention (surely there're plenty of CPU cycles to be optimized out in next steps). I model my data after graph model, so most data, even most indices are localized to nodes and edges, those can be manipulated without conflict, that's why I assumed I have a low contention use case since the very beginning - until I found there are still (though minor) needs for global indices to guarantee global uniqueness, I feel faithful with stm-containers/ttrie to implement a more scalable global index data structure, thanks for hinting me. So an evident solution comes into my mind now, is to run the server with a pool of tx processing threads, matching number of CPU cores, client RPC requests then get queued to be executed in some thread from the pool. But I'm really fond of the mechanism of M:N scheduler which solves massive/dynamic concurrency so elegantly. I had some good result with Go in this regard, and see GHC at par in doing this, I don't want to give up this enjoyable machinery. But looked at the stm implementation in GHC, it seems written TVars are exclusively locked during commit of a tx, I suspect this is the culprit when there're large M lightweight threads scheduled upon a small N hardware capabilities, that is when a lightweight thread yield control during an stm transaction commit, the TVars it locked will stay so until it's scheduled again (and again) till it can finish the commit. This way, descheduled threads could hold live threads from progressing. I haven't gone into more details there, but wonder if there can be some improvement for GHC RTS to keep an stm committing thread from descheduled, but seemingly that may impose more starvation potential; or stm can be improved to have its TVar locks preemptable when the owner trec/thread is in descheduled state? Neither should be easy but I'd really love massive lightweight threads doing STM practically well. Best regards, Compl On 2020/7/24 上午12:57, Christopher Allen wrote: > It seems like you know how to run practical tests for tuning thread > count and contention for throughput. Part of the reason you haven't > gotten a super clear answer is "it depends." You give up fairness when > you use STM instead of MVars or equivalent structures. That means a > long running transaction might get stampeded by many small ones > invalidating it over and over. The long-running transaction might > never clear if the small transactions keep moving the cheese. I > mention this because transaction runtime and size and count all affect > throughput and latency. What might be ideal for one pattern of work > might not be ideal for another. Optimizing for overall throughput > might make the contention and fairness problems worse too. I've done > practical tests to optimize this in the past, both for STM in Haskell > and for RDBMS workloads. 
> > The next step is sometimes figuring out whether you really need a data > structure within a single STM container or if perhaps you can break up > your STM container boundaries into zones or regions that roughly map > onto update boundaries. That should make the transactions churn less. > On the outside chance you do need to touch more than one container in > a transaction, well, they compose. > > e.g. https://hackage.haskell.org/package/stm-containers > https://hackage.haskell.org/package/ttrie > > It also sounds a bit like your question bumps into Amdahl's Law a bit. > > All else fails, stop using STM and find something more tuned to your > problem space. > > Hope this helps, > Chris Allen > > > On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe > > wrote: > > Hello Cafe, > > I'm working on an in-memory database, in Client/Server mode I just > let each connected client submit remote procedure call running in > its dedicated lightweight thread, modifying TVars in RAM per its > business needs, then in case many clients connected concurrently > and trying to insert new data, if they are triggering global index > (some TVar) update, the throughput would drop drastically. I > reduced the shared state to a simple int counter by TVar, got same > symptom. While the parallelism feels okay when there's no hot TVar > conflicting, or M is not much greater than N. > > As an empirical test workload, I have a `+RTS -N10` server > process, it handles 10 concurrent clients okay, got ~5x of single > thread throughput; but in handling 20 concurrent clients, each of > the 10 CPUs can only be driven to ~10% utilization, the throughput > seems even worse than single thread. More clients can even drive > it thrashing without much  progressing. > >  I can understand that pure STM doesn't scale well after reading > [1], and I see it suggested [7] attractive and planned future work > toward that direction. > > But I can't find certain libraries or frameworks addressing large > M over small N scenarios, [1] experimented with designated N > parallelism, and [7] is rather theoretical to my empirical needs. > > Can you direct me to some available library implementing the > methodology proposed in [7] or other ways tackling this problem? > > I think the most difficult one is that a transaction should commit > with global indices (with possibly unique constraints) atomically > updated, and rollback with any violation of constraints, i.e. > transactions have to cover global states like indices. Other > problems seem more trivial than this. > > Specifically, [7] states: > > > It must be emphasized that all of the mechanisms we deploy > originate, in one form or another, in the database literature from > the 70s and 80s. Our contribution is to adapt these techniques to > software transactional memory, providing more effective solutions > to important STM problems than prior proposals. > > I wonder any STM based library has simplified those techniques to > be composed right away? I don't really want to implement those > mechanisms by myself, rebuilding many wheels from scratch. > > Best regards, > Compl > > > [1] Comparing the performance of concurrent linked-list > implementations in Haskell > https://simonmar.github.io/bib/papers/concurrent-data.pdf > > [7] M. Herlihy and E. Koskinen. Transactional boosting: a > methodology for highly-concurrent transactional objects. In Proc. > of PPoPP ’08, pages 207–216. ACM Press, 2008. 
> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. > > > > -- > Chris Allen > Currently working on http://haskellbook.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From dominik.schrempf at gmail.com Fri Jul 24 06:34:07 2020 From: dominik.schrempf at gmail.com (Dominik Schrempf) Date: Fri, 24 Jul 2020 08:34:07 +0200 Subject: [Haskell-cafe] Question about zippers on trees In-Reply-To: References: <87imefydzc.fsf@gmail.com> Message-ID: <874kpx4e1s.fsf@gmail.com> Thank you for the replies! A specialized zipper suggested by Andrew Martin does the job quite well! By the way, Control.Lens.Zipper, was factored out of lens into Control.Zipper in newer versions. Best, Dominik Jeffrey Brown writes: > If you want an abstract solution, there's > https://hackage.haskell.org/package/lens-3.2/docs/Control-Lens-Zipper.html. > > > > On Wed, Jul 22, 2020 at 9:20 AM Jeff Clites via Haskell-Cafe < > haskell-cafe at haskell.org> wrote: > >> Shouldn’t that be: >> >> _before :: [TreePos a] >> >> etc.? >> >> Jeff >> >> On Jul 22, 2020, at 4:54 AM, Andrew Martin >> wrote: >> >> From containers, Tree is defined as: >> >> data Tree a = Node >> { label :: a >> , children :: [Tree a] >> } >> >> (I've renamed the record labels.) What is a zipper into such a tree? I >> think >> that the [rosezipper]( >> https://hackage.haskell.org/package/rosezipper-0.2/docs/Data-Tree-Zipper.html >> ) >> library gives a good definition. I'll specialized it to rose trees: >> >> data TreePos a = Loc >> { _content :: Tree a -- ^ The currently selected tree. >> , _before :: [Tree a] -- ^ Forest to the left >> , _after :: [Tree a] -- ^ Forest to the right >> , _parents :: [([Tree a], a, [Tree a])] -- ^ Finger to the >> selected tree >> } >> >> I think that does it. I wouldn't recommend using a library for this kind >> though. Just define `TreePos` in your code and then write the functions >> that you happen to need. >> >> On Wed, Jul 22, 2020 at 7:41 AM Dominik Schrempf < >> dominik.schrempf at gmail.com> wrote: >> >>> Hello Cafe! >>> >>> I am trying to modify a large 'Data.Tree.Tree'. I managed to modify node >>> labels >>> with specific indices in the form of @[Int]@ as they are defined in, for >>> example, 'Control.Lens.At.Ixed' or 'Lens.Micro.GHC'. >>> >>> However, I also need to >>> 1. modify the node label using information from nearby nodes (e.g., the >>> children); >>> 2. modify the tree structure itself; for example, I may want to change the >>> sub-forest. >>> >>> Basically, I need a lens that focuses not on the node label, but on the >>> node >>> itself. I perceived that this is more difficult. >>> >>> I tried to use 'Control.Zipper'. I can use zippers to achieve point 1, >>> albeit in >>> a complicated way: (1) I need to go downwards to focus the specific node; >>> (2) I >>> need to traverse the children to collect data and save the data somewhere >>> (how? >>> in let bindings?); (3) I then go back upwards and change the node label >>> using >>> the collected data. Even so, I do not really manage to change the actual >>> structure of the tree. I also briefly had a look at plates, but do not >>> manage to >>> use them in a proper way, maybe because the depth of my structures may be >>> several hundred levels. 
>>> >>> Did you encounter similar problems in the past or could you point me to >>> resources discussing these issues? >>> >>> Thank you! >>> Dominik >>> >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed to post. >> >> >> >> -- >> -Andrew Thaddeus Martin >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. From profunctor at pm.me Fri Jul 24 11:02:08 2020 From: profunctor at pm.me (Marcin Szamotulski) Date: Fri, 24 Jul 2020 11:02:08 +0000 Subject: [Haskell-cafe] GADT/Typeable/existential behaviour that I don't understand In-Reply-To: <20200721091911.GB21436@cloudinit-builder> References: <20200720181811.GA28485@cloudinit-builder> <1af47882-a904-f74e-aa1b-6ad09e6d7e72@well-typed.com> <20200721071326.GA21436@cloudinit-builder> <4042eb46-d6e0-a47f-0a04-1a2d0005f8a7@well-typed.com> <20200721091911.GB21436@cloudinit-builder> Message-ID: Another interesting case: ``` λ :t \case { Just Refl -> undefined; } \case { Just Refl -> undefined; } :: Maybe (a :~: b) -> p ``` Cheers, Marcin ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Tuesday, July 21, 2020 11:19 AM, Tom Ellis wrote: > Ah yes! All of the following are valid (i.e. when the type signature > is provided explicitly). > > - (\case { Just Refl -> "Same" }) :: Maybe (a :~: b) -> String > > - (\case { Just Refl -> "Same" }) :: Maybe (String :~: b) -> b > > - (\case { Just Refl -> "Same" }) :: Maybe (a :~: String) -> a > > Furthermore, both of these are valid > > - -- inferred :: Typeable p => p -> p > \b -> case eq "Hello" b of { Just Refl -> "Same"; Nothing -> b } > > > - -- inferred :: Typeable b => b -> String > \b -> case eq "Hello" b of { Just Refl -> "Same"; Nothing -> "Different" } > > > So we could fill in the`Nothing` branch with either something of type > `b` or something of type `String`. If we omit the `Nothing` branch > and the type signature then the type inference engine has no way to > know which one we meant! In the earlier examples, `b` was of type > `String` so they would work out to the same thing (as was my implicit > expectation), but this requires "non-local reasoning", as you > mentioned. Parenthetically, I wonder if a "quick look" approach could > resolve this particular case, but I see that making it work in general > may be impossible. > > Thanks, Adam, for providing those enlightening examples. When I get > far enough away from Hindley-Milner I lose the ability to predict how > these things are going to work but your examples give me some useful > guidance. > > Tom > > On Tue, Jul 21, 2020 at 09:01:52AM +0100, Adam Gundry wrote: > [...] > > > The underlying reason for this restriction is that type inference should > > return principal types (i.e. every possible type of the expression > > should be an instance of the inferred type). But with GADTs this isn't > > always possible. 
Notice that your second case can be given any of the types > > > > Maybe (a :~: b) -> String > > Maybe (a :~: String) -> a > > Maybe (String :~: a) -> a > > > > > > so it doesn't have a principal type for type inference to find. But when > > the `Nothing` branch is present, only the first of these types is possible. > > On 21/07/2020 08:13, Tom Ellis wrote: > > > > > On Mon, Jul 20, 2020 at 09:03:14PM +0100, Adam Gundry wrote: > > > > > > > In fact, if we consider just > > > > > > > > case eq x1 "" of { Just Refl -> "It was not a string" } > > > > > > > > > > > > in isolation, and suppose `x1 :: t`, this can be given two incomparable > > > > most general types, namely `String` and `t`. So type inference refuses > > > > to pick, even though in your case only `String` would work out later, > > > > but seeing that requires non-local reasoning about escaped existentials. > > > > On 20/07/2020 19:18, Tom Ellis wrote: > > > > > > > > > I can define the following > > > > > > > > > > import Data.Typeable > > > > > data Foo where Foo :: Typeable x => x -> Foo > > > > > eq = (\\a b -> eqT) :: (Typeable a, Typeable b) => a -> b -> Maybe (a :~: b) > > > > > > > > > > > > > > > and then these expressions work as expected > > > > > > > > > > > case Foo "Hello" of Foo x1 -> case eq x1 "" of { Nothing -> "It was not a string"; Just Refl -> "It was a string" } > > > > > > > > > > "It was a string" > > > > > > > > > > > case Foo 1 of Foo x1 -> case eq x1 "" of { Nothing -> "It was not a string"; Just Refl -> "It was a string" } > > > > > > "It was not a string" > > > > > > > > > > But if I omit the 'Nothing' branch (as below) I get "Couldn't match > > > > > expected type ‘p’ with actual type ‘[Char]’ ‘p’ is untouchable". > > > > > Can anyone explain why this happens? > > > > > > > > > > > case Foo "Hello" of Foo x1 -> case eq x1 "" of { Just Refl -> "It was not a string" } > > > > > > case Foo 1 of Foo x1 -> case eq x1 "" of { Just Refl -> "It was not a string" } > > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 477 bytes Desc: OpenPGP digital signature URL: From fryguybob at gmail.com Fri Jul 24 14:03:42 2020 From: fryguybob at gmail.com (Ryan Yates) Date: Fri, 24 Jul 2020 10:03:42 -0400 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> Message-ID: Hi Compl, Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute then the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other thread, but when it is rescheduled, will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn. 
The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released. To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars. There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference. Ryan On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe < haskell-cafe at haskell.org> wrote: > Thanks Chris, I confess I didn't pay enough attention to STM specialized > container libraries by far, I skimmed through the description of > stm-containers and ttrie, and feel they would definitely improve my code's > performance in case I limit the server's parallelism within hardware > capabilities. That may because I'm still prototyping the api and > infrastructure for correctness, so even `TVar (HashMap k v)` performs okay > for me at the moment, only if at low contention (surely there're plenty of > CPU cycles to be optimized out in next steps). I model my data after graph > model, so most data, even most indices are localized to nodes and edges, > those can be manipulated without conflict, that's why I assumed I have a > low contention use case since the very beginning - until I found there are > still (though minor) needs for global indices to guarantee global > uniqueness, I feel faithful with stm-containers/ttrie to implement a more > scalable global index data structure, thanks for hinting me. > > So an evident solution comes into my mind now, is to run the server with a > pool of tx processing threads, matching number of CPU cores, client RPC > requests then get queued to be executed in some thread from the pool. But > I'm really fond of the mechanism of M:N scheduler which solves > massive/dynamic concurrency so elegantly. I had some good result with Go in > this regard, and see GHC at par in doing this, I don't want to give up this > enjoyable machinery. > > But looked at the stm implementation in GHC, it seems written TVars are > exclusively locked during commit of a tx, I suspect this is the culprit > when there're large M lightweight threads scheduled upon a small N hardware > capabilities, that is when a lightweight thread yield control during an stm > transaction commit, the TVars it locked will stay so until it's scheduled > again (and again) till it can finish the commit. This way, descheduled > threads could hold live threads from progressing. 
I haven't gone into more > details there, but wonder if there can be some improvement for GHC RTS to > keep an stm committing thread from descheduled, but seemingly that may > impose more starvation potential; or stm can be improved to have its TVar > locks preemptable when the owner trec/thread is in descheduled state? > Neither should be easy but I'd really love massive lightweight threads > doing STM practically well. > > Best regards, > > Compl > > > On 2020/7/24 上午12:57, Christopher Allen wrote: > > It seems like you know how to run practical tests for tuning thread count > and contention for throughput. Part of the reason you haven't gotten a > super clear answer is "it depends." You give up fairness when you use STM > instead of MVars or equivalent structures. That means a long running > transaction might get stampeded by many small ones invalidating it over and > over. The long-running transaction might never clear if the small > transactions keep moving the cheese. I mention this because transaction > runtime and size and count all affect throughput and latency. What might be > ideal for one pattern of work might not be ideal for another. Optimizing > for overall throughput might make the contention and fairness problems > worse too. I've done practical tests to optimize this in the past, both for > STM in Haskell and for RDBMS workloads. > > The next step is sometimes figuring out whether you really need a data > structure within a single STM container or if perhaps you can break up your > STM container boundaries into zones or regions that roughly map onto update > boundaries. That should make the transactions churn less. On the outside > chance you do need to touch more than one container in a transaction, well, > they compose. > > e.g. https://hackage.haskell.org/package/stm-containers > https://hackage.haskell.org/package/ttrie > > It also sounds a bit like your question bumps into Amdahl's Law a bit. > > All else fails, stop using STM and find something more tuned to your > problem space. > > Hope this helps, > Chris Allen > > > On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe < > haskell-cafe at haskell.org> wrote: > >> Hello Cafe, >> >> I'm working on an in-memory database, in Client/Server mode I just let >> each connected client submit remote procedure call running in its dedicated >> lightweight thread, modifying TVars in RAM per its business needs, then in >> case many clients connected concurrently and trying to insert new data, if >> they are triggering global index (some TVar) update, the throughput would >> drop drastically. I reduced the shared state to a simple int counter by >> TVar, got same symptom. While the parallelism feels okay when there's no >> hot TVar conflicting, or M is not much greater than N. >> >> As an empirical test workload, I have a `+RTS -N10` server process, it >> handles 10 concurrent clients okay, got ~5x of single thread throughput; >> but in handling 20 concurrent clients, each of the 10 CPUs can only be >> driven to ~10% utilization, the throughput seems even worse than single >> thread. More clients can even drive it thrashing without much progressing. >> >> I can understand that pure STM doesn't scale well after reading [1], and >> I see it suggested [7] attractive and planned future work toward that >> direction. >> >> But I can't find certain libraries or frameworks addressing large M over >> small N scenarios, [1] experimented with designated N parallelism, and [7] >> is rather theoretical to my empirical needs. 
>> >> Can you direct me to some available library implementing the methodology >> proposed in [7] or other ways tackling this problem? >> >> I think the most difficult one is that a transaction should commit with >> global indices (with possibly unique constraints) atomically updated, and >> rollback with any violation of constraints, i.e. transactions have to cover >> global states like indices. Other problems seem more trivial than this. >> >> Specifically, [7] states: >> >> > It must be emphasized that all of the mechanisms we deploy originate, >> in one form or another, in the database literature from the 70s and 80s. >> Our contribution is to adapt these techniques to software transactional >> memory, providing more effective solutions to important STM problems than >> prior proposals. >> >> I wonder any STM based library has simplified those techniques to be >> composed right away? I don't really want to implement those mechanisms by >> myself, rebuilding many wheels from scratch. >> >> Best regards, >> Compl >> >> >> [1] Comparing the performance of concurrent linked-list implementations >> in Haskell >> https://simonmar.github.io/bib/papers/concurrent-data.pdf >> >> [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for >> highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages >> 207–216. ACM Press, 2008. >> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > > > > -- > Chris Allen > Currently working on http://haskellbook.com > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From compl.yue at icloud.com Fri Jul 24 15:22:32 2020 From: compl.yue at icloud.com (Compl Yue) Date: Fri, 24 Jul 2020 23:22:32 +0800 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> Message-ID: <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> Thanks very much for the insightful information Ryan! I'm glad my suspect was wrong about the Haskell scheduler: > The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released. So best effort had already been made in GHC and I just need to cooperate better with its design. Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed? Anyway, I see light with better data structures to improve my situation, let me try them and report back. 
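(To make that concrete, a global path index on top of stm-containers' Map might look roughly like this; a sketch only, with made-up names like `PathIndex` and `registerPath`, assuming the stm-containers 1.x API:)

```
{-# LANGUAGE OverloadedStrings #-}

import Control.Concurrent.STM (STM)
import Data.Text (Text)
import qualified StmContainers.Map as StmMap

-- One shared index; contention is per trie node instead of one big TVar.
type RecordId  = Int                      -- stand-in for the real record type
type PathIndex = StmMap.Map Text RecordId

-- Register a data file path only if it is not already taken.
registerPath :: PathIndex -> Text -> RecordId -> STM (Either Text ())
registerPath idx path rid = do
  existing <- StmMap.lookup path idx
  case existing of
    Just _  -> pure (Left ("duplicate path: " <> path))
    Nothing -> Right <$> StmMap.insert rid path idx
```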
Actually I later changed `TVar (HaskMap k v)` to be `TVar (HashMap k Int)` where the `Int` being array index into `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order preservation semantic of dict entries (like that in Python 3.7+), then it's very hopeful to incorporate stm-containers' Map or ttrie to approach free of contention. Thanks with regards, Compl On 2020/7/24 下午10:03, Ryan Yates wrote: > Hi Compl, > > Having a pool of transaction processing threads can be helpful in a > certain way.  If the body of the transaction takes more time to > execute then the Haskell thread is allowed and it yields, the > suspended thread won't get in the way of other thread, but when it is > rescheduled, will have a low probability of success.  Even worse, it > will probably not discover that it is doomed to failure until commit > time.  If transactions are more likely to reach commit without > yielding, they are more likely to succeed.  If the transactions are > not conflicting, it doesn't make much difference other than cache churn. > > The Haskell capability that is committing a transaction will not yield > to another Haskell thread while it is doing the commit.  The OS thread > may be preempted, but once commit starts the haskell scheduler is not > invoked until after locks are released. > > To get good performance from STM you must pay attention to what TVars > are involved in a commit.  All STM systems are working under the > assumption of low contention, so you want to minimize "false" > conflicts (conflicts that are not essential to the computation).    > Something like `TVar (HashMap k v)` will work pretty well for a low > thread count, but every transaction that writes to that structure will > be in conflict with every other transaction that accesses it.  Pushing > the `TVar` into the nodes of the structure reduces the possibilities > for conflict, while increasing the amount of bookkeeping STM has to > do.  I would like to reduce the cost of that bookkeeping using better > structures, but we need to do so without harming performance in the > low TVar count case.  Right now it is optimized for good cache > performance with a handful of TVars. > > There is another way to play with performance by moving work into and > out of the transaction body.  A transaction body that executes quickly > will reach commit faster.  But it may be delaying work that moves into > another transaction.  Forcing values at the right time can make a big > difference. > > Ryan > > On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe > > wrote: > > Thanks Chris, I confess I didn't pay enough attention to STM > specialized container libraries by far, I skimmed through the > description of stm-containers and ttrie, and feel they would > definitely improve my code's performance in case I limit the > server's parallelism within hardware capabilities. That may > because I'm still prototyping the api and infrastructure for > correctness, so even `TVar (HashMap k v)` performs okay for me at > the moment, only if at low contention (surely there're plenty of > CPU cycles to be optimized out in next steps). 
I model my data > after graph model, so most data, even most indices are localized > to nodes and edges, those can be manipulated without conflict, > that's why I assumed I have a low contention use case since the > very beginning - until I found there are still (though minor) > needs for global indices to guarantee global uniqueness, I feel > faithful with stm-containers/ttrie to implement a more scalable > global index data structure, thanks for hinting me. > > So an evident solution comes into my mind now, is to run the > server with a pool of tx processing threads, matching number of > CPU cores, client RPC requests then get queued to be executed in > some thread from the pool. But I'm really fond of the mechanism of > M:N scheduler which solves massive/dynamic concurrency so > elegantly. I had some good result with Go in this regard, and see > GHC at par in doing this, I don't want to give up this enjoyable > machinery. > > But looked at the stm implementation in GHC, it seems written > TVars are exclusively locked during commit of a tx, I suspect this > is the culprit when there're large M lightweight threads scheduled > upon a small N hardware capabilities, that is when a lightweight > thread yield control during an stm transaction commit, the TVars > it locked will stay so until it's scheduled again (and again) till > it can finish the commit. This way, descheduled threads could hold > live threads from progressing. I haven't gone into more details > there, but wonder if there can be some improvement for GHC RTS to > keep an stm committing thread from descheduled, but seemingly that > may impose more starvation potential; or stm can be improved to > have its TVar locks preemptable when the owner trec/thread is in > descheduled state? Neither should be easy but I'd really love > massive lightweight threads doing STM practically well. > > Best regards, > > Compl > > > On 2020/7/24 上午12:57, Christopher Allen wrote: >> It seems like you know how to run practical tests for tuning >> thread count and contention for throughput. Part of the reason >> you haven't gotten a super clear answer is "it depends." You give >> up fairness when you use STM instead of MVars or equivalent >> structures. That means a long running transaction might get >> stampeded by many small ones invalidating it over and over. The >> long-running transaction might never clear if the small >> transactions keep moving the cheese. I mention this because >> transaction runtime and size and count all affect throughput and >> latency. What might be ideal for one pattern of work might not be >> ideal for another. Optimizing for overall throughput might make >> the contention and fairness problems worse too. I've done >> practical tests to optimize this in the past, both for STM in >> Haskell and for RDBMS workloads. >> >> The next step is sometimes figuring out whether you really need a >> data structure within a single STM container or if perhaps you >> can break up your STM container boundaries into zones or regions >> that roughly map onto update boundaries. That should make the >> transactions churn less. On the outside chance you do need to >> touch more than one container in a transaction, well, they compose. >> >> e.g. https://hackage.haskell.org/package/stm-containers >> https://hackage.haskell.org/package/ttrie >> >> It also sounds a bit like your question bumps into Amdahl's Law a >> bit. >> >> All else fails, stop using STM and find something more tuned to >> your problem space. 
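(One way to picture the "zones or regions" idea above: shard the hot map across a fixed vector of TVars so that writers usually touch different TVars. A hypothetical sketch, not code from this thread:)

```
import Control.Concurrent.STM
import Control.Monad (replicateM)
import Data.Hashable (Hashable, hash)
import qualified Data.HashMap.Strict as HM
import qualified Data.Vector as V

-- A fixed number of independently updated shards; transactions that hit
-- different shards never conflict on the index itself.
newtype Sharded k v = Sharded (V.Vector (TVar (HM.HashMap k v)))

newSharded :: Int -> IO (Sharded k v)
newSharded n = Sharded . V.fromList <$> replicateM n (newTVarIO HM.empty)

shardFor :: Hashable k => Sharded k v -> k -> TVar (HM.HashMap k v)
shardFor (Sharded shards) k = shards V.! (hash k `mod` V.length shards)

insertSharded :: (Eq k, Hashable k) => Sharded k v -> k -> v -> STM ()
insertSharded s k v = modifyTVar' (shardFor s k) (HM.insert k v)

lookupSharded :: (Eq k, Hashable k) => Sharded k v -> k -> STM (Maybe v)
lookupSharded s k = HM.lookup k <$> readTVar (shardFor s k)
```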
>> >> Hope this helps, >> Chris Allen >> >> >> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe >> > wrote: >> >> Hello Cafe, >> >> I'm working on an in-memory database, in Client/Server mode I >> just let each connected client submit remote procedure call >> running in its dedicated lightweight thread, modifying TVars >> in RAM per its business needs, then in case many clients >> connected concurrently and trying to insert new data, if they >> are triggering global index (some TVar) update, the >> throughput would drop drastically. I reduced the shared state >> to a simple int counter by TVar, got same symptom. While the >> parallelism feels okay when there's no hot TVar conflicting, >> or M is not much greater than N. >> >> As an empirical test workload, I have a `+RTS -N10` server >> process, it handles 10 concurrent clients okay, got ~5x of >> single thread throughput; but in handling 20 concurrent >> clients, each of the 10 CPUs can only be driven to ~10% >> utilization, the throughput seems even worse than single >> thread. More clients can even drive it thrashing without much >>  progressing. >> >>  I can understand that pure STM doesn't scale well after >> reading [1], and I see it suggested [7] attractive and >> planned future work toward that direction. >> >> But I can't find certain libraries or frameworks addressing >> large M over small N scenarios, [1] experimented with >> designated N parallelism, and [7] is rather theoretical to my >> empirical needs. >> >> Can you direct me to some available library implementing the >> methodology proposed in [7] or other ways tackling this problem? >> >> I think the most difficult one is that a transaction should >> commit with global indices (with possibly unique constraints) >> atomically updated, and rollback with any violation of >> constraints, i.e. transactions have to cover global states >> like indices. Other problems seem more trivial than this. >> >> Specifically, [7] states: >> >> > It must be emphasized that all of the mechanisms we deploy >> originate, in one form or another, in the database literature >> from the 70s and 80s. Our contribution is to adapt these >> techniques to software transactional memory, providing more >> effective solutions to important STM problems than prior >> proposals. >> >> I wonder any STM based library has simplified those >> techniques to be composed right away? I don't really want to >> implement those mechanisms by myself, rebuilding many wheels >> from scratch. >> >> Best regards, >> Compl >> >> >> [1] Comparing the performance of concurrent linked-list >> implementations in Haskell >> https://simonmar.github.io/bib/papers/concurrent-data.pdf >> >> [7] M. Herlihy and E. Koskinen. Transactional boosting: a >> methodology for highly-concurrent transactional objects. In >> Proc. of PPoPP ’08, pages 207–216. ACM Press, 2008. >> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. 
>> >> >> >> -- >> Chris Allen >> Currently working on http://haskellbook.com > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fryguybob at gmail.com Fri Jul 24 15:46:38 2020 From: fryguybob at gmail.com (Ryan Yates) Date: Fri, 24 Jul 2020 11:46:38 -0400 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> Message-ID: > Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed? Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while commiting, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set and subsequent commits will wake it up if its write set overlaps. I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS. Ryan On Fri, Jul 24, 2020 at 11:22 AM Compl Yue wrote: > Thanks very much for the insightful information Ryan! I'm glad my suspect > was wrong about the Haskell scheduler: > > > The Haskell capability that is committing a transaction will not yield > to another Haskell thread while it is doing the commit. The OS thread may > be preempted, but once commit starts the haskell scheduler is not invoked > until after locks are released. > So best effort had already been made in GHC and I just need to cooperate > better with its design. Then to explain the low CPU utilization (~10%), am > I right to understand it as that upon reading a TVar locked by another > committing tx, a lightweight thread will put itself into `waiting STM` and > descheduled state, so the CPUs can only stay idle as not so many threads > are willing to proceed? > > Anyway, I see light with better data structures to improve my situation, > let me try them and report back. Actually I later changed `TVar (HaskMap k > v)` to be `TVar (HashMap k Int)` where the `Int` being array index into > `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order preservation > semantic of dict entries (like that in Python 3.7+), then it's very hopeful > to incorporate stm-containers' Map or ttrie to approach free of contention. > > Thanks with regards, > > Compl > > > On 2020/7/24 下午10:03, Ryan Yates wrote: > > Hi Compl, > > Having a pool of transaction processing threads can be helpful in a > certain way. 
If the body of the transaction takes more time to execute > then the Haskell thread is allowed and it yields, the suspended thread > won't get in the way of other thread, but when it is rescheduled, will have > a low probability of success. Even worse, it will probably not discover > that it is doomed to failure until commit time. If transactions are more > likely to reach commit without yielding, they are more likely to succeed. > If the transactions are not conflicting, it doesn't make much difference > other than cache churn. > > The Haskell capability that is committing a transaction will not yield to > another Haskell thread while it is doing the commit. The OS thread may be > preempted, but once commit starts the haskell scheduler is not invoked > until after locks are released. > > To get good performance from STM you must pay attention to what TVars are > involved in a commit. All STM systems are working under the assumption of > low contention, so you want to minimize "false" conflicts (conflicts that > are not essential to the computation). Something like `TVar (HashMap k > v)` will work pretty well for a low thread count, but every transaction > that writes to that structure will be in conflict with every other > transaction that accesses it. Pushing the `TVar` into the nodes of the > structure reduces the possibilities for conflict, while increasing the > amount of bookkeeping STM has to do. I would like to reduce the cost of > that bookkeeping using better structures, but we need to do so without > harming performance in the low TVar count case. Right now it is optimized > for good cache performance with a handful of TVars. > > There is another way to play with performance by moving work into and out > of the transaction body. A transaction body that executes quickly will > reach commit faster. But it may be delaying work that moves into another > transaction. Forcing values at the right time can make a big difference. > > Ryan > > On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe < > haskell-cafe at haskell.org> wrote: > >> Thanks Chris, I confess I didn't pay enough attention to STM specialized >> container libraries by far, I skimmed through the description of >> stm-containers and ttrie, and feel they would definitely improve my code's >> performance in case I limit the server's parallelism within hardware >> capabilities. That may because I'm still prototyping the api and >> infrastructure for correctness, so even `TVar (HashMap k v)` performs okay >> for me at the moment, only if at low contention (surely there're plenty of >> CPU cycles to be optimized out in next steps). I model my data after graph >> model, so most data, even most indices are localized to nodes and edges, >> those can be manipulated without conflict, that's why I assumed I have a >> low contention use case since the very beginning - until I found there are >> still (though minor) needs for global indices to guarantee global >> uniqueness, I feel faithful with stm-containers/ttrie to implement a more >> scalable global index data structure, thanks for hinting me. >> >> So an evident solution comes into my mind now, is to run the server with >> a pool of tx processing threads, matching number of CPU cores, client RPC >> requests then get queued to be executed in some thread from the pool. But >> I'm really fond of the mechanism of M:N scheduler which solves >> massive/dynamic concurrency so elegantly. 
I had some good result with Go in >> this regard, and see GHC at par in doing this, I don't want to give up this >> enjoyable machinery. >> >> But looked at the stm implementation in GHC, it seems written TVars are >> exclusively locked during commit of a tx, I suspect this is the culprit >> when there're large M lightweight threads scheduled upon a small N hardware >> capabilities, that is when a lightweight thread yield control during an stm >> transaction commit, the TVars it locked will stay so until it's scheduled >> again (and again) till it can finish the commit. This way, descheduled >> threads could hold live threads from progressing. I haven't gone into more >> details there, but wonder if there can be some improvement for GHC RTS to >> keep an stm committing thread from descheduled, but seemingly that may >> impose more starvation potential; or stm can be improved to have its TVar >> locks preemptable when the owner trec/thread is in descheduled state? >> Neither should be easy but I'd really love massive lightweight threads >> doing STM practically well. >> >> Best regards, >> >> Compl >> >> >> On 2020/7/24 上午12:57, Christopher Allen wrote: >> >> It seems like you know how to run practical tests for tuning thread count >> and contention for throughput. Part of the reason you haven't gotten a >> super clear answer is "it depends." You give up fairness when you use STM >> instead of MVars or equivalent structures. That means a long running >> transaction might get stampeded by many small ones invalidating it over and >> over. The long-running transaction might never clear if the small >> transactions keep moving the cheese. I mention this because transaction >> runtime and size and count all affect throughput and latency. What might be >> ideal for one pattern of work might not be ideal for another. Optimizing >> for overall throughput might make the contention and fairness problems >> worse too. I've done practical tests to optimize this in the past, both for >> STM in Haskell and for RDBMS workloads. >> >> The next step is sometimes figuring out whether you really need a data >> structure within a single STM container or if perhaps you can break up your >> STM container boundaries into zones or regions that roughly map onto update >> boundaries. That should make the transactions churn less. On the outside >> chance you do need to touch more than one container in a transaction, well, >> they compose. >> >> e.g. https://hackage.haskell.org/package/stm-containers >> https://hackage.haskell.org/package/ttrie >> >> It also sounds a bit like your question bumps into Amdahl's Law a bit. >> >> All else fails, stop using STM and find something more tuned to your >> problem space. >> >> Hope this helps, >> Chris Allen >> >> >> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe < >> haskell-cafe at haskell.org> wrote: >> >>> Hello Cafe, >>> >>> I'm working on an in-memory database, in Client/Server mode I just let >>> each connected client submit remote procedure call running in its dedicated >>> lightweight thread, modifying TVars in RAM per its business needs, then in >>> case many clients connected concurrently and trying to insert new data, if >>> they are triggering global index (some TVar) update, the throughput would >>> drop drastically. I reduced the shared state to a simple int counter by >>> TVar, got same symptom. While the parallelism feels okay when there's no >>> hot TVar conflicting, or M is not much greater than N. 
>>> >>> As an empirical test workload, I have a `+RTS -N10` server process, it >>> handles 10 concurrent clients okay, got ~5x of single thread throughput; >>> but in handling 20 concurrent clients, each of the 10 CPUs can only be >>> driven to ~10% utilization, the throughput seems even worse than single >>> thread. More clients can even drive it thrashing without much progressing. >>> >>> I can understand that pure STM doesn't scale well after reading [1], >>> and I see it suggested [7] attractive and planned future work toward that >>> direction. >>> >>> But I can't find certain libraries or frameworks addressing large M over >>> small N scenarios, [1] experimented with designated N parallelism, and [7] >>> is rather theoretical to my empirical needs. >>> >>> Can you direct me to some available library implementing the methodology >>> proposed in [7] or other ways tackling this problem? >>> >>> I think the most difficult one is that a transaction should commit with >>> global indices (with possibly unique constraints) atomically updated, and >>> rollback with any violation of constraints, i.e. transactions have to cover >>> global states like indices. Other problems seem more trivial than this. >>> >>> Specifically, [7] states: >>> >>> > It must be emphasized that all of the mechanisms we deploy originate, >>> in one form or another, in the database literature from the 70s and 80s. >>> Our contribution is to adapt these techniques to software transactional >>> memory, providing more effective solutions to important STM problems than >>> prior proposals. >>> >>> I wonder any STM based library has simplified those techniques to be >>> composed right away? I don't really want to implement those mechanisms by >>> myself, rebuilding many wheels from scratch. >>> >>> Best regards, >>> Compl >>> >>> >>> [1] Comparing the performance of concurrent linked-list implementations >>> in Haskell >>> https://simonmar.github.io/bib/papers/concurrent-data.pdf >>> >>> [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology >>> for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages >>> 207–216. ACM Press, 2008. >>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >>> >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed to post. >> >> >> >> -- >> Chris Allen >> Currently working on http://haskellbook.com >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From compl.yue at icloud.com Fri Jul 24 15:48:44 2020 From: compl.yue at icloud.com (Compl Yue) Date: Fri, 24 Jul 2020 23:48:44 +0800 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? 
In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> Message-ID: <5268f36c-a71b-7ed7-fcb2-c2b4d146ec77@icloud.com> Hi Jo, I think you are totally right about the situation, and I just want to make it clear that, I already chose STM and lightweight threads as GHC implemented them, STM for transactions and optimistic locking, lightweight threads for sophisticated smart scheduling. The global counter is only used to reveal the technical traits of my situation, it's of course not a requirement of my business needs. I'm not hurry for thorough performance optimization at current stage (PoC prototyping not finished yet), as long as the performance is reasonable, but the thrashing behavior really frightened me and I have to take it as a serious concern for the time being. Fortunately it doesn't feel so scary as it first appeared to me, after taking others' suggestions, I'll experiment more with these new information to me and see what will come out. Thanks with regards, Compl On 2020/7/24 上午12:25, Joachim Durchholz wrote: > While I can't contribute any Haskell knowledge, I know that many > threads updating the same variable is the worst thing you can do; not > only do you create a single bottleneck, if you have your threads > running on multiple cores you get CPU pipeline stalls, L1 cache line > flushes, and/or complicated cache coherency protocols executed between > cores. It's not cheap: each of these mechanisms can take hundreds of > CPU cycles, for a CPU that can execute multiple instructions per CPU > cycle. > > Incrementing a global counter is a really hard problem in > multithreading... > > I believe this is the reason why databases typically implement a > SEQUENCE mechanism, and these sequences are usually implemented as > "whenever a transaction asks for a sequence number, reserve a block of > 1,000 numbers for it so it can retrieve 999 additional numbers without > the synchronization overhead. > > This is also why real databases use transactions - these do not just > isolate processes from each other's updates, they also allow the DB to > let the transaction work on a snapshot and do all the synchronization > once, during COMMIT. > And, as you just discovered, it's one of the major optimization areas > in database engines :-) > > TL;DR for the bad news: I suspect your problem is just unavoidable > > However, I see a workaround: delayed index update. > Have each index twice: last-known and next-future. > last-known is what was created during the last index update. You need > an append-only list of records that had an index field update, and all > searches that use the index will also have to do a linear search in > that list. > next-future is built in the background. It takes last-known and the > updates from the append-only list, and generates a new index. Once > next-future is finished, replace last-known with it. > You still need to to a global lock while replacing indexes, but you > don't have to lock the index for every single update but just once. 
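(A rough sketch of that last-known / next-future scheme, with hypothetical names and plain HashMaps just to make the shape visible; an illustration, not code from this thread:)

```
import Control.Concurrent.STM
import Data.Hashable (Hashable)
import Data.List (foldl')
import qualified Data.HashMap.Strict as HM

-- 'lastKnown' is the published index; 'pending' is the append-only update log.
data DelayedIndex k v = DelayedIndex
  { lastKnown :: TVar (HM.HashMap k v)
  , pending   :: TVar [(k, v)]
  }

-- Writers only prepend to the small log; the big map is left untouched.
record :: DelayedIndex k v -> k -> v -> STM ()
record di k v = modifyTVar' (pending di) ((k, v) :)

-- Readers check the recent log first, then the last published index.
lookupDI :: (Eq k, Hashable k) => DelayedIndex k v -> k -> STM (Maybe v)
lookupDI di k = do
  recent <- readTVar (pending di)
  case lookup k recent of
    Just v  -> pure (Just v)
    Nothing -> HM.lookup k <$> readTVar (lastKnown di)

-- A single background thread folds the log into a fresh index outside any
-- transaction, then publishes it with one short commit, trimming only the
-- entries it actually folded in.
rebuild :: (Eq k, Hashable k) => DelayedIndex k v -> IO ()
rebuild di = do
  batch <- readTVarIO (pending di)          -- snapshot, do not clear yet
  old   <- readTVarIO (lastKnown di)
  let new = foldl' (\m (k, v) -> HM.insert k v m) old (reverse batch)
  new `seq` atomically $ do
    writeTVar (lastKnown di) new
    modifyTVar' (pending di) (\cur -> take (length cur - length batch) cur)
```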
> You'll have to twiddle with parameters such as "at what point do I > start a new index build", and you'll have to make sure that your > linear list isn't yet another bottleneck (there are lock-free data > structures to achieve such a thing, but these are complicated; or you > can tell application programmers to try and collect as many updates as > possible in a transaction so the number of synchronization points is > smaller; however, too-large transactions can generate CPU cache > overflows if the collected update data becomes too large, so there's a > whole lot of tweaking, studying real performance data, hopefully > finding the right set of diagnostic information to collect that allow > the DB to automatically choose the right point to do its updates, etc. > pp.) > > TL;DR for the good news: You can coalesce N updates into one and > divide the CPU core coordination overhead by a factor of N. You'll > increase the bus pressure, so there's tons of fine tuning you can do > (or avoid) after getting the first 90% of the speedup. (I'm drawing > purely speculative numbers out of my hat here.) > Liability: You will want to add transactions and (likely) optimistic > locking, if you don't have that already: Transaction boundaries are > the natural point for coalescing updates. > > Regards, > Jo > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. From compl.yue at icloud.com Fri Jul 24 16:35:20 2020 From: compl.yue at icloud.com (Compl Yue) Date: Sat, 25 Jul 2020 00:35:20 +0800 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> Message-ID: <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> I'm not familiar with profiling GHC yet, may need more time to get myself proficient with it. And a bit more details of my test workload for diagnostic: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files, under a data dir within a shared filesystem, then mmap those data files and fill in actual array data. So the db server don't have much computation to perform, but puts the data file path into a global index, which at the same validates its uniqueness. As there are many client processes trying to insert one meta data record concurrently, with my naive implementation, the global index's TVar will almost always in locked state by one client after another, from a queue never fall empty. So if `readTVar` should spinning waiting, I doubt the threads should actually make high CPU utilization, because at any instant of time, all threads except the committing one will be doing that one thing. And I have something in my code to track STM retry like this: ``` -- blocking wait not expected, track stm retries explicitly trackSTM:: Int-> IO(Either() a) trackSTM !rtc = do when -- todo increase the threshold of reporting? 
(rtc > 0) $ do -- trace out the retries so the end users can be aware of them tid <- myThreadId trace ( "🔙\n" <> show callCtx <> "🌀 " <> show tid <> " stm retry #" <> show rtc ) $ return () atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case Nothing -> -- stm failed, do a tracked retry trackSTM (rtc + 1) Just ... -> ... ``` No such trace msg fires during my test, neither in single thread run, nor in runs with pressure. I'm sure this tracing mechanism works, as I can see such traces fire, in case e.g. posting a TMVar to a TQueue for some other thread to fill it, then read the result out, if these 2 ops are composed into a single tx, then of course it's infinite retry loop, and a sequence of such msgs are logged with ever increasing rtc #. So I believe no retry has ever been triggered. What can going on there? On 2020/7/24 下午11:46, Ryan Yates wrote: > > Then to explain the low CPU utilization (~10%), am I right to > understand it as that upon reading a TVar locked by another committing > tx, a lightweight thread will put itself into `waiting STM` and > descheduled state, so the CPUs can only stay idle as not so many > threads are willing to proceed? > > Since the commit happens in finite steps, the expectation is that the > lock will be released very soon.  Given this when the body of a > transaction executes `readTVar` it spins (active CPU!) until the > `TVar` is observed unlocked.  If a lock is observed while commiting, > it immediately starts the transaction again from the beginning.  To > get the behavior of suspending a transaction you have to successfully > commit a transaction that executed `retry`.  Then the transaction is > put on the wakeup lists of its read set and subsequent commits will > wake it up if its write set overlaps. > > I don't think any of these things would explain low CPU utilization.  > You could try running with `perf` and see if lots of time is spent in > some recognizable part of the RTS. > > Ryan > > > On Fri, Jul 24, 2020 at 11:22 AM Compl Yue > wrote: > > Thanks very much for the insightful information Ryan! I'm glad my > suspect was wrong about the Haskell scheduler: > > > The Haskell capability that is committing a transaction will not > yield to another Haskell thread while it is doing the commit.  The > OS thread may be preempted, but once commit starts the haskell > scheduler is not invoked until after locks are released. > > So best effort had already been made in GHC and I just need to > cooperate better with its design. Then to explain the low CPU > utilization (~10%), am I right to understand it as that upon > reading a TVar locked by another committing tx, a lightweight > thread will put itself into `waiting STM` and descheduled state, > so the CPUs can only stay idle as not so many threads are willing > to proceed? > > Anyway, I see light with better data structures to improve my > situation, let me try them and report back. Actually I later > changed `TVar (HaskMap k v)` to be `TVar (HashMap k Int)` where > the `Int` being array index into `TVar (Vector (TVar (Maybe v)))`, > in pursuing insertion order preservation semantic of dict entries > (like that in Python 3.7+), then it's very hopeful to incorporate > stm-containers' Map or ttrie to approach free of contention. > > Thanks with regards, > > Compl > > > On 2020/7/24 下午10:03, Ryan Yates wrote: >> Hi Compl, >> >> Having a pool of transaction processing threads can be helpful in >> a certain way.  
If the body of the transaction takes more time to >> execute then the Haskell thread is allowed and it yields, the >> suspended thread won't get in the way of other thread, but when >> it is rescheduled, will have a low probability of success.  Even >> worse, it will probably not discover that it is doomed to failure >> until commit time.  If transactions are more likely to reach >> commit without yielding, they are more likely to succeed.  If the >> transactions are not conflicting, it doesn't make much difference >> other than cache churn. >> >> The Haskell capability that is committing a transaction will not >> yield to another Haskell thread while it is doing the commit.  >> The OS thread may be preempted, but once commit starts the >> haskell scheduler is not invoked until after locks are released. >> >> To get good performance from STM you must pay attention to what >> TVars are involved in a commit. All STM systems are working under >> the assumption of low contention, so you want to minimize "false" >> conflicts (conflicts that are not essential to the computation).  >>   Something like `TVar (HashMap k v)` will work pretty well for a >> low thread count, but every transaction that writes to that >> structure will be in conflict with every other transaction that >> accesses it.  Pushing the `TVar` into the nodes of the structure >> reduces the possibilities for conflict, while increasing the >> amount of bookkeeping STM has to do.  I would like to reduce the >> cost of that bookkeeping using better structures, but we need to >> do so without harming performance in the low TVar count case.  >> Right now it is optimized for good cache performance with a >> handful of TVars. >> >> There is another way to play with performance by moving work into >> and out of the transaction body.  A transaction body that >> executes quickly will reach commit faster.  But it may be >> delaying work that moves into another transaction.  Forcing >> values at the right time can make a big difference. >> >> Ryan >> >> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe >> > wrote: >> >> Thanks Chris, I confess I didn't pay enough attention to STM >> specialized container libraries by far, I skimmed through the >> description of stm-containers and ttrie, and feel they would >> definitely improve my code's performance in case I limit the >> server's parallelism within hardware capabilities. That may >> because I'm still prototyping the api and infrastructure for >> correctness, so even `TVar (HashMap k v)` performs okay for >> me at the moment, only if at low contention (surely there're >> plenty of CPU cycles to be optimized out in next steps). I >> model my data after graph model, so most data, even most >> indices are localized to nodes and edges, those can be >> manipulated without conflict, that's why I assumed I have a >> low contention use case since the very beginning - until I >> found there are still (though minor) needs for global indices >> to guarantee global uniqueness, I feel faithful with >> stm-containers/ttrie to implement a more scalable global >> index data structure, thanks for hinting me. >> >> So an evident solution comes into my mind now, is to run the >> server with a pool of tx processing threads, matching number >> of CPU cores, client RPC requests then get queued to be >> executed in some thread from the pool. But I'm really fond of >> the mechanism of M:N scheduler which solves massive/dynamic >> concurrency so elegantly. 
I had some good result with Go in >> this regard, and see GHC at par in doing this, I don't want >> to give up this enjoyable machinery. >> >> But looked at the stm implementation in GHC, it seems written >> TVars are exclusively locked during commit of a tx, I suspect >> this is the culprit when there're large M lightweight threads >> scheduled upon a small N hardware capabilities, that is when >> a lightweight thread yield control during an stm transaction >> commit, the TVars it locked will stay so until it's scheduled >> again (and again) till it can finish the commit. This way, >> descheduled threads could hold live threads from progressing. >> I haven't gone into more details there, but wonder if there >> can be some improvement for GHC RTS to keep an stm committing >> thread from descheduled, but seemingly that may impose more >> starvation potential; or stm can be improved to have its TVar >> locks preemptable when the owner trec/thread is in >> descheduled state? Neither should be easy but I'd really love >> massive lightweight threads doing STM practically well. >> >> Best regards, >> >> Compl >> >> >> On 2020/7/24 上午12:57, Christopher Allen wrote: >>> It seems like you know how to run practical tests for tuning >>> thread count and contention for throughput. Part of the >>> reason you haven't gotten a super clear answer is "it >>> depends." You give up fairness when you use STM instead of >>> MVars or equivalent structures. That means a long running >>> transaction might get stampeded by many small ones >>> invalidating it over and over. The long-running transaction >>> might never clear if the small transactions keep moving the >>> cheese. I mention this because transaction runtime and size >>> and count all affect throughput and latency. What might be >>> ideal for one pattern of work might not be ideal for >>> another. Optimizing for overall throughput might make the >>> contention and fairness problems worse too. I've done >>> practical tests to optimize this in the past, both for STM >>> in Haskell and for RDBMS workloads. >>> >>> The next step is sometimes figuring out whether you really >>> need a data structure within a single STM container or if >>> perhaps you can break up your STM container boundaries into >>> zones or regions that roughly map onto update boundaries. >>> That should make the transactions churn less. On the outside >>> chance you do need to touch more than one container in a >>> transaction, well, they compose. >>> >>> e.g. https://hackage.haskell.org/package/stm-containers >>> https://hackage.haskell.org/package/ttrie >>> >>> It also sounds a bit like your question bumps into Amdahl's >>> Law a bit. >>> >>> All else fails, stop using STM and find something more tuned >>> to your problem space. >>> >>> Hope this helps, >>> Chris Allen >>> >>> >>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe >>> > >>> wrote: >>> >>> Hello Cafe, >>> >>> I'm working on an in-memory database, in Client/Server >>> mode I just let each connected client submit remote >>> procedure call running in its dedicated lightweight >>> thread, modifying TVars in RAM per its business needs, >>> then in case many clients connected concurrently and >>> trying to insert new data, if they are triggering global >>> index (some TVar) update, the throughput would drop >>> drastically. I reduced the shared state to a simple int >>> counter by TVar, got same symptom. While the parallelism >>> feels okay when there's no hot TVar conflicting, or M is >>> not much greater than N. 
>>> >>> As an empirical test workload, I have a `+RTS -N10` >>> server process, it handles 10 concurrent clients okay, >>> got ~5x of single thread throughput; but in handling 20 >>> concurrent clients, each of the 10 CPUs can only be >>> driven to ~10% utilization, the throughput seems even >>> worse than single thread. More clients can even drive it >>> thrashing without much  progressing. >>> >>>  I can understand that pure STM doesn't scale well after >>> reading [1], and I see it suggested [7] attractive and >>> planned future work toward that direction. >>> >>> But I can't find certain libraries or frameworks >>> addressing large M over small N scenarios, [1] >>> experimented with designated N parallelism, and [7] is >>> rather theoretical to my empirical needs. >>> >>> Can you direct me to some available library implementing >>> the methodology proposed in [7] or other ways tackling >>> this problem? >>> >>> I think the most difficult one is that a transaction >>> should commit with global indices (with possibly unique >>> constraints) atomically updated, and rollback with any >>> violation of constraints, i.e. transactions have to >>> cover global states like indices. Other problems seem >>> more trivial than this. >>> >>> Specifically, [7] states: >>> >>> > It must be emphasized that all of the mechanisms we >>> deploy originate, in one form or another, in the >>> database literature from the 70s and 80s. Our >>> contribution is to adapt these techniques to software >>> transactional memory, providing more effective solutions >>> to important STM problems than prior proposals. >>> >>> I wonder any STM based library has simplified those >>> techniques to be composed right away? I don't really >>> want to implement those mechanisms by myself, rebuilding >>> many wheels from scratch. >>> >>> Best regards, >>> Compl >>> >>> >>> [1] Comparing the performance of concurrent linked-list >>> implementations in Haskell >>> https://simonmar.github.io/bib/papers/concurrent-data.pdf >>> >>> [7] M. Herlihy and E. Koskinen. Transactional boosting: >>> a methodology for highly-concurrent transactional >>> objects. In Proc. of PPoPP ’08, pages 207–216. ACM >>> Press, 2008. >>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >>> >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed >>> to post. >>> >>> >>> >>> -- >>> Chris Allen >>> Currently working on http://haskellbook.com >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From fryguybob at gmail.com Fri Jul 24 18:02:17 2020 From: fryguybob at gmail.com (Ryan Yates) Date: Fri, 24 Jul 2020 14:02:17 -0400 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? 
In-Reply-To: <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> Message-ID: To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance, changing the size of heap objects can drastically change cache performance and completely different behavior can show up. [^1]: https://en.wikipedia.org/wiki/Perf_(Linux) The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... ) [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence. The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 Ryan On Fri, Jul 24, 2020 at 12:35 PM Compl Yue wrote: > I'm not familiar with profiling GHC yet, may need more time to get myself > proficient with it. > > And a bit more details of my test workload for diagnostic: the db clients > are Python processes from a cluster of worker nodes, consulting the db > server to register some path for data files, under a data dir within a > shared filesystem, then mmap those data files and fill in actual array > data. So the db server don't have much computation to perform, but puts the > data file path into a global index, which at the same validates its > uniqueness. As there are many client processes trying to insert one meta > data record concurrently, with my naive implementation, the global index's > TVar will almost always in locked state by one client after another, from a > queue never fall empty. > > So if `readTVar` should spinning waiting, I doubt the threads should > actually make high CPU utilization, because at any instant of time, all > threads except the committing one will be doing that one thing. > > And I have something in my code to track STM retry like this: > > ``` > -- blocking wait not expected, track stm retries explicitly > trackSTM :: Int -> IO (Either () a) > trackSTM !rtc = do > when -- todo increase the threshold of reporting? 
> (rtc > 0) $ do > -- trace out the retries so the end users can be aware of them > tid <- myThreadId > trace > ( "🔙\n" > <> show callCtx > <> "🌀 " > <> show tid > <> " stm retry #" > <> show rtc > ) > $ return () > atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case > Nothing -> -- stm failed, do a tracked retry > trackSTM (rtc + 1) > Just ... -> ... > > ``` > > No such trace msg fires during my test, neither in single thread run, nor > in runs with pressure. I'm sure this tracing mechanism works, as I can see > such traces fire, in case e.g. posting a TMVar to a TQueue for some other > thread to fill it, then read the result out, if these 2 ops are composed > into a single tx, then of course it's infinite retry loop, and a sequence > of such msgs are logged with ever increasing rtc #. > > So I believe no retry has ever been triggered. > > What can going on there? > > > On 2020/7/24 下午11:46, Ryan Yates wrote: > > > Then to explain the low CPU utilization (~10%), am I right to understand > it as that upon reading a TVar locked by another committing tx, a > lightweight thread will put itself into `waiting STM` and descheduled > state, so the CPUs can only stay idle as not so many threads are willing to > proceed? > > Since the commit happens in finite steps, the expectation is that the lock > will be released very soon. Given this when the body of a transaction > executes `readTVar` it spins (active CPU!) until the `TVar` is observed > unlocked. If a lock is observed while commiting, it immediately starts the > transaction again from the beginning. To get the behavior of suspending a > transaction you have to successfully commit a transaction that executed > `retry`. Then the transaction is put on the wakeup lists of its read set > and subsequent commits will wake it up if its write set overlaps. > > I don't think any of these things would explain low CPU utilization. You > could try running with `perf` and see if lots of time is spent in some > recognizable part of the RTS. > > Ryan > > > On Fri, Jul 24, 2020 at 11:22 AM Compl Yue wrote: > >> Thanks very much for the insightful information Ryan! I'm glad my suspect >> was wrong about the Haskell scheduler: >> >> > The Haskell capability that is committing a transaction will not yield >> to another Haskell thread while it is doing the commit. The OS thread may >> be preempted, but once commit starts the haskell scheduler is not invoked >> until after locks are released. >> So best effort had already been made in GHC and I just need to cooperate >> better with its design. Then to explain the low CPU utilization (~10%), am >> I right to understand it as that upon reading a TVar locked by another >> committing tx, a lightweight thread will put itself into `waiting STM` and >> descheduled state, so the CPUs can only stay idle as not so many threads >> are willing to proceed? >> >> Anyway, I see light with better data structures to improve my situation, >> let me try them and report back. Actually I later changed `TVar (HaskMap k >> v)` to be `TVar (HashMap k Int)` where the `Int` being array index into >> `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order preservation >> semantic of dict entries (like that in Python 3.7+), then it's very hopeful >> to incorporate stm-containers' Map or ttrie to approach free of contention. >> >> Thanks with regards, >> >> Compl >> >> >> On 2020/7/24 下午10:03, Ryan Yates wrote: >> >> Hi Compl, >> >> Having a pool of transaction processing threads can be helpful in a >> certain way. 
If the body of the transaction takes more time to execute >> then the Haskell thread is allowed and it yields, the suspended thread >> won't get in the way of other thread, but when it is rescheduled, will have >> a low probability of success. Even worse, it will probably not discover >> that it is doomed to failure until commit time. If transactions are more >> likely to reach commit without yielding, they are more likely to succeed. >> If the transactions are not conflicting, it doesn't make much difference >> other than cache churn. >> >> The Haskell capability that is committing a transaction will not yield to >> another Haskell thread while it is doing the commit. The OS thread may be >> preempted, but once commit starts the haskell scheduler is not invoked >> until after locks are released. >> >> To get good performance from STM you must pay attention to what TVars are >> involved in a commit. All STM systems are working under the assumption of >> low contention, so you want to minimize "false" conflicts (conflicts that >> are not essential to the computation). Something like `TVar (HashMap k >> v)` will work pretty well for a low thread count, but every transaction >> that writes to that structure will be in conflict with every other >> transaction that accesses it. Pushing the `TVar` into the nodes of the >> structure reduces the possibilities for conflict, while increasing the >> amount of bookkeeping STM has to do. I would like to reduce the cost of >> that bookkeeping using better structures, but we need to do so without >> harming performance in the low TVar count case. Right now it is optimized >> for good cache performance with a handful of TVars. >> >> There is another way to play with performance by moving work into and out >> of the transaction body. A transaction body that executes quickly will >> reach commit faster. But it may be delaying work that moves into another >> transaction. Forcing values at the right time can make a big difference. >> >> Ryan >> >> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe < >> haskell-cafe at haskell.org> wrote: >> >>> Thanks Chris, I confess I didn't pay enough attention to STM specialized >>> container libraries by far, I skimmed through the description of >>> stm-containers and ttrie, and feel they would definitely improve my code's >>> performance in case I limit the server's parallelism within hardware >>> capabilities. That may because I'm still prototyping the api and >>> infrastructure for correctness, so even `TVar (HashMap k v)` performs okay >>> for me at the moment, only if at low contention (surely there're plenty of >>> CPU cycles to be optimized out in next steps). I model my data after graph >>> model, so most data, even most indices are localized to nodes and edges, >>> those can be manipulated without conflict, that's why I assumed I have a >>> low contention use case since the very beginning - until I found there are >>> still (though minor) needs for global indices to guarantee global >>> uniqueness, I feel faithful with stm-containers/ttrie to implement a more >>> scalable global index data structure, thanks for hinting me. >>> >>> So an evident solution comes into my mind now, is to run the server with >>> a pool of tx processing threads, matching number of CPU cores, client RPC >>> requests then get queued to be executed in some thread from the pool. But >>> I'm really fond of the mechanism of M:N scheduler which solves >>> massive/dynamic concurrency so elegantly. 
I had some good result with Go in >>> this regard, and see GHC at par in doing this, I don't want to give up this >>> enjoyable machinery. >>> >>> But looked at the stm implementation in GHC, it seems written TVars are >>> exclusively locked during commit of a tx, I suspect this is the culprit >>> when there're large M lightweight threads scheduled upon a small N hardware >>> capabilities, that is when a lightweight thread yield control during an stm >>> transaction commit, the TVars it locked will stay so until it's scheduled >>> again (and again) till it can finish the commit. This way, descheduled >>> threads could hold live threads from progressing. I haven't gone into more >>> details there, but wonder if there can be some improvement for GHC RTS to >>> keep an stm committing thread from descheduled, but seemingly that may >>> impose more starvation potential; or stm can be improved to have its TVar >>> locks preemptable when the owner trec/thread is in descheduled state? >>> Neither should be easy but I'd really love massive lightweight threads >>> doing STM practically well. >>> >>> Best regards, >>> >>> Compl >>> >>> >>> On 2020/7/24 上午12:57, Christopher Allen wrote: >>> >>> It seems like you know how to run practical tests for tuning thread >>> count and contention for throughput. Part of the reason you haven't gotten >>> a super clear answer is "it depends." You give up fairness when you use STM >>> instead of MVars or equivalent structures. That means a long running >>> transaction might get stampeded by many small ones invalidating it over and >>> over. The long-running transaction might never clear if the small >>> transactions keep moving the cheese. I mention this because transaction >>> runtime and size and count all affect throughput and latency. What might be >>> ideal for one pattern of work might not be ideal for another. Optimizing >>> for overall throughput might make the contention and fairness problems >>> worse too. I've done practical tests to optimize this in the past, both for >>> STM in Haskell and for RDBMS workloads. >>> >>> The next step is sometimes figuring out whether you really need a data >>> structure within a single STM container or if perhaps you can break up your >>> STM container boundaries into zones or regions that roughly map onto update >>> boundaries. That should make the transactions churn less. On the outside >>> chance you do need to touch more than one container in a transaction, well, >>> they compose. >>> >>> e.g. https://hackage.haskell.org/package/stm-containers >>> https://hackage.haskell.org/package/ttrie >>> >>> It also sounds a bit like your question bumps into Amdahl's Law a bit. >>> >>> All else fails, stop using STM and find something more tuned to your >>> problem space. >>> >>> Hope this helps, >>> Chris Allen >>> >>> >>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe < >>> haskell-cafe at haskell.org> wrote: >>> >>>> Hello Cafe, >>>> >>>> I'm working on an in-memory database, in Client/Server mode I just let >>>> each connected client submit remote procedure call running in its dedicated >>>> lightweight thread, modifying TVars in RAM per its business needs, then in >>>> case many clients connected concurrently and trying to insert new data, if >>>> they are triggering global index (some TVar) update, the throughput would >>>> drop drastically. I reduced the shared state to a simple int counter by >>>> TVar, got same symptom. 
While the parallelism feels okay when there's no >>>> hot TVar conflicting, or M is not much greater than N. >>>> >>>> As an empirical test workload, I have a `+RTS -N10` server process, it >>>> handles 10 concurrent clients okay, got ~5x of single thread throughput; >>>> but in handling 20 concurrent clients, each of the 10 CPUs can only be >>>> driven to ~10% utilization, the throughput seems even worse than single >>>> thread. More clients can even drive it thrashing without much progressing. >>>> >>>> I can understand that pure STM doesn't scale well after reading [1], >>>> and I see it suggested [7] attractive and planned future work toward that >>>> direction. >>>> >>>> But I can't find certain libraries or frameworks addressing large M >>>> over small N scenarios, [1] experimented with designated N parallelism, and >>>> [7] is rather theoretical to my empirical needs. >>>> >>>> Can you direct me to some available library implementing the >>>> methodology proposed in [7] or other ways tackling this problem? >>>> >>>> I think the most difficult one is that a transaction should commit with >>>> global indices (with possibly unique constraints) atomically updated, and >>>> rollback with any violation of constraints, i.e. transactions have to cover >>>> global states like indices. Other problems seem more trivial than this. >>>> >>>> Specifically, [7] states: >>>> >>>> > It must be emphasized that all of the mechanisms we deploy originate, >>>> in one form or another, in the database literature from the 70s and 80s. >>>> Our contribution is to adapt these techniques to software transactional >>>> memory, providing more effective solutions to important STM problems than >>>> prior proposals. >>>> >>>> I wonder any STM based library has simplified those techniques to be >>>> composed right away? I don't really want to implement those mechanisms by >>>> myself, rebuilding many wheels from scratch. >>>> >>>> Best regards, >>>> Compl >>>> >>>> >>>> [1] Comparing the performance of concurrent linked-list implementations >>>> in Haskell >>>> https://simonmar.github.io/bib/papers/concurrent-data.pdf >>>> >>>> [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology >>>> for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages >>>> 207–216. ACM Press, 2008. >>>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >>>> >>>> _______________________________________________ >>>> Haskell-Cafe mailing list >>>> To (un)subscribe, modify options or view archives go to: >>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>> Only members subscribed via the mailman list are allowed to post. >>> >>> >>> >>> -- >>> Chris Allen >>> Currently working on http://haskellbook.com >>> >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed to post. >> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From compl.yue at icloud.com Sat Jul 25 06:04:27 2020 From: compl.yue at icloud.com (Compl Yue) Date: Sat, 25 Jul 2020 14:04:27 +0800 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? 
In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> Message-ID: Shame on me for I have neither experienced with `perf`, I'd learn these essential tools soon to put them into good use. It's great to learn about how `orElse` actually works, I did get confused why there are so little retries captured, and now I know. So that little trick should definitely be removed before going production, as it does no much useful things at excessive cost. I put it there to help me understand internal working of stm, now I get even better knowledge ;-) I think a debugger will trap every single abort, isn't it annoying when many aborts would occur? If I'd like to count the number of aborts, ideally accounted per service endpoints, time periods, source modules etc. there some tricks for that? Thanks with best regards, Compl On 2020/7/25 上午2:02, Ryan Yates wrote: > To be clear, I was trying to refer to Linux `perf` [^1].  Sampling > based profiling can do a good job with concurrent and parallel > programs where other methods are problematic.  For instance, >  changing the size of heap objects can drastically change cache > performance and completely different behavior can show up. > > [^1]: https://en.wikipedia.org/wiki/Perf_(Linux) > > The spinning in `readTVar` should always be very short and it > typically shows up as intensive CPU use, though it may not be high > energy use with `pause` in the loop on x86 (looks like we don't have > it [^2], I thought we did, but maybe that was only in some of my code... ) > > [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 > > All that to say, I doubt that you are spending much time spinning (but > it would certainly be interesting to know if you are!  You would see > `perf` attribute a large amount of time to `read_current_value`).  The > amount of code to execute for commit (the time when locks are held) is > always much shorter than it takes to execute the transaction body.  As > you add more conflicting threads this gets worse of course as commits > sequence. > > The code you have will count commits of executions of `retry`.  Note > that `retry` is a user level idea, that is, you are counting user > level *explicit* retries.  This is different from a transaction > failing to commit and starting again. These are invisible to the > user.  Also using your trace will convert `retry` from the efficient > wake on write implementation, to an active retry that will always > attempt again.  We don't have cheap logging of transaction aborts in > GHC, but I have built such logging in my work.  You can observe these > aborts with a debugger by looking for execution of this line: > > https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 > > Ryan > > > > On Fri, Jul 24, 2020 at 12:35 PM Compl Yue > wrote: > > I'm not familiar with profiling GHC yet, may need more time to get > myself proficient with it. > > And a bit more details of my test workload for diagnostic: the db > clients are Python processes from a cluster of worker nodes, > consulting the db server to register some path for data files, > under a data dir within a shared filesystem, then mmap those data > files and fill in actual array data. So the db server don't have > much computation to perform, but puts the data file path into a > global index, which at the same validates its uniqueness. 
As there > are many client processes trying to insert one meta data record > concurrently, with my naive implementation, the global index's > TVar will almost always in locked state by one client after > another, from a queue never fall empty. > > So if `readTVar` should spinning waiting, I doubt the threads > should actually make high CPU utilization, because at any instant > of time, all threads except the committing one will be doing that > one thing. > > And I have something in my code to track STM retry like this: > > ``` > > -- blocking wait not expected, track stm retries explicitly > trackSTM:: Int-> IO(Either() a) > trackSTM !rtc = do > when -- todo increase the threshold of reporting? > (rtc > 0) $ do > -- trace out the retries so the end users can be aware of them > tid <- myThreadId > trace > ( "🔙\n" > <> show callCtx > <> "🌀 " > <> show tid > <> " stm retry #" > <> show rtc > ) > $ return () > atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case > Nothing -> -- stm failed, do a tracked retry > trackSTM (rtc + 1) > Just ... -> ... > > ``` > > No such trace msg fires during my test, neither in single thread > run, nor in runs with pressure. I'm sure this tracing mechanism > works, as I can see such traces fire, in case e.g. posting a TMVar > to a TQueue for some other thread to fill it, then read the result > out, if these 2 ops are composed into a single tx, then of course > it's infinite retry loop, and a sequence of such msgs are logged > with ever increasing rtc #. > > So I believe no retry has ever been triggered. > > What can going on there? > > > On 2020/7/24 下午11:46, Ryan Yates wrote: >> > Then to explain the low CPU utilization (~10%), am I right to >> understand it as that upon reading a TVar locked by another >> committing tx, a lightweight thread will put itself into `waiting >> STM` and descheduled state, so the CPUs can only stay idle as not >> so many threads are willing to proceed? >> >> Since the commit happens in finite steps, the expectation is that >> the lock will be released very soon.  Given this when the body of >> a transaction executes `readTVar` it spins (active CPU!) until >> the `TVar` is observed unlocked.  If a lock is observed while >> commiting, it immediately starts the transaction again from the >> beginning.  To get the behavior of suspending a transaction you >> have to successfully commit a transaction that executed `retry`.  >> Then the transaction is put on the wakeup lists of its read set >> and subsequent commits will wake it up if its write set overlaps. >> >> I don't think any of these things would explain low CPU >> utilization.  You could try running with `perf` and see if lots >> of time is spent in some recognizable part of the RTS. >> >> Ryan >> >> >> On Fri, Jul 24, 2020 at 11:22 AM Compl Yue > > wrote: >> >> Thanks very much for the insightful information Ryan! I'm >> glad my suspect was wrong about the Haskell scheduler: >> >> > The Haskell capability that is committing a transaction >> will not yield to another Haskell thread while it is doing >> the commit.  The OS thread may be preempted, but once commit >> starts the haskell scheduler is not invoked until after locks >> are released. >> >> So best effort had already been made in GHC and I just need >> to cooperate better with its design. 
Then to explain the low >> CPU utilization (~10%), am I right to understand it as that >> upon reading a TVar locked by another committing tx, a >> lightweight thread will put itself into `waiting STM` and >> descheduled state, so the CPUs can only stay idle as not so >> many threads are willing to proceed? >> >> Anyway, I see light with better data structures to improve my >> situation, let me try them and report back. Actually I later >> changed `TVar (HaskMap k v)` to be `TVar (HashMap k Int)` >> where the `Int` being array index into `TVar (Vector (TVar >> (Maybe v)))`, in pursuing insertion order preservation >> semantic of dict entries (like that in Python 3.7+), then >> it's very hopeful to incorporate stm-containers' Map or ttrie >> to approach free of contention. >> >> Thanks with regards, >> >> Compl >> >> >> On 2020/7/24 下午10:03, Ryan Yates wrote: >>> Hi Compl, >>> >>> Having a pool of transaction processing threads can be >>> helpful in a certain way.  If the body of the transaction >>> takes more time to execute then the Haskell thread is >>> allowed and it yields, the suspended thread won't get in the >>> way of other thread, but when it is rescheduled, will have a >>> low probability of success.  Even worse, it will probably >>> not discover that it is doomed to failure until commit >>> time.  If transactions are more likely to reach commit >>> without yielding, they are more likely to succeed. If the >>> transactions are not conflicting, it doesn't make much >>> difference other than cache churn. >>> >>> The Haskell capability that is committing a transaction will >>> not yield to another Haskell thread while it is doing the >>> commit.  The OS thread may be preempted, but once commit >>> starts the haskell scheduler is not invoked until after >>> locks are released. >>> >>> To get good performance from STM you must pay attention to >>> what TVars are involved in a commit.  All STM systems are >>> working under the assumption of low contention, so you want >>> to minimize "false" conflicts (conflicts that are not >>> essential to the computation).    Something like `TVar >>> (HashMap k v)` will work pretty well for a low thread count, >>> but every transaction that writes to that structure will be >>> in conflict with every other transaction that accesses it.  >>> Pushing the `TVar` into the nodes of the structure reduces >>> the possibilities for conflict, while increasing the amount >>> of bookkeeping STM has to do.  I would like to reduce the >>> cost of that bookkeeping using better structures, but we >>> need to do so without harming performance in the low TVar >>> count case.  Right now it is optimized for good cache >>> performance with a handful of TVars. >>> >>> There is another way to play with performance by moving work >>> into and out of the transaction body.  A transaction body >>> that executes quickly will reach commit faster.  But it may >>> be delaying work that moves into another transaction.  >>> Forcing values at the right time can make a big difference. >>> >>> Ryan >>> >>> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe >>> > >>> wrote: >>> >>> Thanks Chris, I confess I didn't pay enough attention to >>> STM specialized container libraries by far, I skimmed >>> through the description of stm-containers and ttrie, and >>> feel they would definitely improve my code's performance >>> in case I limit the server's parallelism within hardware >>> capabilities. 
That may because I'm still prototyping the >>> api and infrastructure for correctness, so even `TVar >>> (HashMap k v)` performs okay for me at the moment, only >>> if at low contention (surely there're plenty of CPU >>> cycles to be optimized out in next steps). I model my >>> data after graph model, so most data, even most indices >>> are localized to nodes and edges, those can be >>> manipulated without conflict, that's why I assumed I >>> have a low contention use case since the very beginning >>> - until I found there are still (though minor) needs for >>> global indices to guarantee global uniqueness, I feel >>> faithful with stm-containers/ttrie to implement a more >>> scalable global index data structure, thanks for hinting me. >>> >>> So an evident solution comes into my mind now, is to run >>> the server with a pool of tx processing threads, >>> matching number of CPU cores, client RPC requests then >>> get queued to be executed in some thread from the pool. >>> But I'm really fond of the mechanism of M:N scheduler >>> which solves massive/dynamic concurrency so elegantly. I >>> had some good result with Go in this regard, and see GHC >>> at par in doing this, I don't want to give up this >>> enjoyable machinery. >>> >>> But looked at the stm implementation in GHC, it seems >>> written TVars are exclusively locked during commit of a >>> tx, I suspect this is the culprit when there're large M >>> lightweight threads scheduled upon a small N hardware >>> capabilities, that is when a lightweight thread yield >>> control during an stm transaction commit, the TVars it >>> locked will stay so until it's scheduled again (and >>> again) till it can finish the commit. This way, >>> descheduled threads could hold live threads from >>> progressing. I haven't gone into more details there, but >>> wonder if there can be some improvement for GHC RTS to >>> keep an stm committing thread from descheduled, but >>> seemingly that may impose more starvation potential; or >>> stm can be improved to have its TVar locks preemptable >>> when the owner trec/thread is in descheduled state? >>> Neither should be easy but I'd really love massive >>> lightweight threads doing STM practically well. >>> >>> Best regards, >>> >>> Compl >>> >>> >>> On 2020/7/24 上午12:57, Christopher Allen wrote: >>>> It seems like you know how to run practical tests for >>>> tuning thread count and contention for throughput. Part >>>> of the reason you haven't gotten a super clear answer >>>> is "it depends." You give up fairness when you use STM >>>> instead of MVars or equivalent structures. That means a >>>> long running transaction might get stampeded by many >>>> small ones invalidating it over and over. The >>>> long-running transaction might never clear if the small >>>> transactions keep moving the cheese. I mention this >>>> because transaction runtime and size and count all >>>> affect throughput and latency. What might be ideal for >>>> one pattern of work might not be ideal for another. >>>> Optimizing for overall throughput might make the >>>> contention and fairness problems worse too. I've done >>>> practical tests to optimize this in the past, both for >>>> STM in Haskell and for RDBMS workloads. >>>> >>>> The next step is sometimes figuring out whether you >>>> really need a data structure within a single STM >>>> container or if perhaps you can break up your STM >>>> container boundaries into zones or regions that roughly >>>> map onto update boundaries. That should make the >>>> transactions churn less. 
On the outside chance you do >>>> need to touch more than one container in a transaction, >>>> well, they compose. >>>> >>>> e.g. https://hackage.haskell.org/package/stm-containers >>>> https://hackage.haskell.org/package/ttrie >>>> >>>> It also sounds a bit like your question bumps into >>>> Amdahl's Law a bit. >>>> >>>> All else fails, stop using STM and find something more >>>> tuned to your problem space. >>>> >>>> Hope this helps, >>>> Chris Allen >>>> >>>> >>>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via >>>> Haskell-Cafe >>> > wrote: >>>> >>>> Hello Cafe, >>>> >>>> I'm working on an in-memory database, in >>>> Client/Server mode I just let each connected client >>>> submit remote procedure call running in its >>>> dedicated lightweight thread, modifying TVars in >>>> RAM per its business needs, then in case many >>>> clients connected concurrently and trying to insert >>>> new data, if they are triggering global index (some >>>> TVar) update, the throughput would drop >>>> drastically. I reduced the shared state to a simple >>>> int counter by TVar, got same symptom. While the >>>> parallelism feels okay when there's no hot TVar >>>> conflicting, or M is not much greater than N. >>>> >>>> As an empirical test workload, I have a `+RTS -N10` >>>> server process, it handles 10 concurrent clients >>>> okay, got ~5x of single thread throughput; but in >>>> handling 20 concurrent clients, each of the 10 CPUs >>>> can only be driven to ~10% utilization, the >>>> throughput seems even worse than single thread. >>>> More clients can even drive it thrashing without >>>> much  progressing. >>>> >>>>  I can understand that pure STM doesn't scale well >>>> after reading [1], and I see it suggested [7] >>>> attractive and planned future work toward that >>>> direction. >>>> >>>> But I can't find certain libraries or frameworks >>>> addressing large M over small N scenarios, [1] >>>> experimented with designated N parallelism, and [7] >>>> is rather theoretical to my empirical needs. >>>> >>>> Can you direct me to some available library >>>> implementing the methodology proposed in [7] or >>>> other ways tackling this problem? >>>> >>>> I think the most difficult one is that a >>>> transaction should commit with global indices (with >>>> possibly unique constraints) atomically updated, >>>> and rollback with any violation of constraints, >>>> i.e. transactions have to cover global states like >>>> indices. Other problems seem more trivial than this. >>>> >>>> Specifically, [7] states: >>>> >>>> > It must be emphasized that all of the mechanisms >>>> we deploy originate, in one form or another, in the >>>> database literature from the 70s and 80s. Our >>>> contribution is to adapt these techniques to >>>> software transactional memory, providing more >>>> effective solutions to important STM problems than >>>> prior proposals. >>>> >>>> I wonder any STM based library has simplified those >>>> techniques to be composed right away? I don't >>>> really want to implement those mechanisms by >>>> myself, rebuilding many wheels from scratch. >>>> >>>> Best regards, >>>> Compl >>>> >>>> >>>> [1] Comparing the performance of concurrent >>>> linked-list implementations in Haskell >>>> https://simonmar.github.io/bib/papers/concurrent-data.pdf >>>> >>>> [7] M. Herlihy and E. Koskinen. Transactional >>>> boosting: a methodology for highly-concurrent >>>> transactional objects. In Proc. of PPoPP ’08, pages >>>> 207–216. ACM Press, 2008. 
>>>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >>>> >>>> _______________________________________________ >>>> Haskell-Cafe mailing list >>>> To (un)subscribe, modify options or view archives >>>> go to: >>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>> Only members subscribed via the mailman list are >>>> allowed to post. >>>> >>>> >>>> >>>> -- >>>> Chris Allen >>>> Currently working on http://haskellbook.com >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed >>> to post. >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From compl.yue at icloud.com Sat Jul 25 07:35:41 2020 From: compl.yue at icloud.com (Compl Yue) Date: Sat, 25 Jul 2020 15:35:41 +0800 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> Message-ID: <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> Dear Cafe, As Chris Allen has suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of traditional HashMap for stm tx processing, under heavy concurrency, yet still with automatic parallelism as GHC implemented them. Then I realized that in addition to hash map (used to implement dicts and scopes), I also need to find a TreeMap replacement data structure to implement the db index. I've been focusing on the uniqueness constraint aspect, but it's still an index, needs to provide range scan api for db clients, so hash map is not sufficient for the index. I see Ryan shared the code benchmarking RBTree with stm in mind: https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree But can't find conclusion or interpretation of that benchmark suite. And here's a followup question: Where are some STM contention optimized data structures, that having keys ordered, with sub-range traversing api ? (of course production ready libraries most desirable) Thanks with regards, Compl On 2020/7/25 下午2:04, Compl Yue via Haskell-Cafe wrote: > > Shame on me for I have neither experienced with `perf`, I'd learn > these essential tools soon to put them into good use. > > It's great to learn about how `orElse` actually works, I did get > confused why there are so little retries captured, and now I know. So > that little trick should definitely be removed before going > production, as it does no much useful things at excessive cost. I put > it there to help me understand internal working of stm, now I get even > better knowledge ;-) > > I think a debugger will trap every single abort, isn't it annoying > when many aborts would occur? If I'd like to count the number of > aborts, ideally accounted per service endpoints, time periods, source > modules etc. there some tricks for that? 
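Coming back to my range-scan question above: since the hash-based STM maps don't give me ordered traversal, the stop-gap I can think of is to keep the ordered spine in a plain `Data.Map` behind a single `TVar`, but follow the earlier advice in this thread and push each value behind its own per-entry `TVar`, so updating an existing entry doesn't conflict with scans or with inserts of unrelated keys. A rough sketch of what I mean (all names here are made up for illustration, and the spine `TVar` is of course still a conflict point for every insert and scan):

```
module OrderedIndex where

import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- Ordered spine in one TVar, values behind per-entry TVars to
-- reduce false conflicts between value updates and range scans.
newtype OrderedIndex k v = OrderedIndex (TVar (M.Map k (TVar v)))

newIndex :: STM (OrderedIndex k v)
newIndex = OrderedIndex <$> newTVar M.empty

-- Returns False if the key is already present, giving the
-- uniqueness constraint discussed in this thread.
insertIdx :: Ord k => OrderedIndex k v -> k -> v -> STM Bool
insertIdx (OrderedIndex tv) k v = do
  m <- readTVar tv
  case M.lookup k m of
    Just _  -> return False
    Nothing -> do
      slot <- newTVar v
      writeTVar tv $! M.insert k slot m
      return True

-- Scan all entries with lo <= key < hi, reading each value slot.
rangeScan :: Ord k => OrderedIndex k v -> k -> k -> STM [(k, v)]
rangeScan (OrderedIndex tv) lo hi = do
  m <- readTVar tv
  let (_, above)   = M.split lo m                     -- keys > lo
      withLo       = maybe id (M.insert lo) (M.lookup lo m) above
      (inRange, _) = M.split hi withLo                -- keys < hi
  mapM (\(k, slot) -> (,) k <$> readTVar slot) (M.toAscList inRange)
```

This only moves the contention off the values, not off the spine, so I'd still be glad to hear of a properly contention-aware ordered structure.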
> > Thanks with best regards, > > Compl > > > On 2020/7/25 上午2:02, Ryan Yates wrote: >> To be clear, I was trying to refer to Linux `perf` [^1].  Sampling >> based profiling can do a good job with concurrent and parallel >> programs where other methods are problematic.  For instance, >>  changing the size of heap objects can drastically change cache >> performance and completely different behavior can show up. >> >> [^1]: https://en.wikipedia.org/wiki/Perf_(Linux) >> >> The spinning in `readTVar` should always be very short and it >> typically shows up as intensive CPU use, though it may not be high >> energy use with `pause` in the loop on x86 (looks like we don't have >> it [^2], I thought we did, but maybe that was only in some of my >> code... ) >> >> [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 >> >> All that to say, I doubt that you are spending much time spinning >> (but it would certainly be interesting to know if you are!  You would >> see `perf` attribute a large amount of time to >> `read_current_value`).  The amount of code to execute for commit (the >> time when locks are held) is always much shorter than it takes to >> execute the transaction body. As you add more conflicting threads >> this gets worse of course as commits sequence. >> >> The code you have will count commits of executions of `retry`.  Note >> that `retry` is a user level idea, that is, you are counting user >> level *explicit* retries.  This is different from a transaction >> failing to commit and starting again.  These are invisible to the >> user.  Also using your trace will convert `retry` from the efficient >> wake on write implementation, to an active retry that will always >> attempt again.  We don't have cheap logging of transaction aborts in >> GHC, but I have built such logging in my work.  You can observe these >> aborts with a debugger by looking for execution of this line: >> >> https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 >> >> Ryan >> >> >> >> On Fri, Jul 24, 2020 at 12:35 PM Compl Yue > > wrote: >> >> I'm not familiar with profiling GHC yet, may need more time to >> get myself proficient with it. >> >> And a bit more details of my test workload for diagnostic: the db >> clients are Python processes from a cluster of worker nodes, >> consulting the db server to register some path for data files, >> under a data dir within a shared filesystem, then mmap those data >> files and fill in actual array data. So the db server don't have >> much computation to perform, but puts the data file path into a >> global index, which at the same validates its uniqueness. As >> there are many client processes trying to insert one meta data >> record concurrently, with my naive implementation, the global >> index's TVar will almost always in locked state by one client >> after another, from a queue never fall empty. >> >> So if `readTVar` should spinning waiting, I doubt the threads >> should actually make high CPU utilization, because at any instant >> of time, all threads except the committing one will be doing that >> one thing. >> >> And I have something in my code to track STM retry like this: >> >> ``` >> >> -- blocking wait not expected, track stm retries explicitly >> trackSTM:: Int-> IO(Either() a) >> trackSTM !rtc = do >> when -- todo increase the threshold of reporting? 
>> (rtc > 0) $ do >> -- trace out the retries so the end users can be aware of them >> tid <- myThreadId >> trace >> ( "🔙\n" >> <> show callCtx >> <> "🌀 " >> <> show tid >> <> " stm retry #" >> <> show rtc >> ) >> $ return () >> atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case >> Nothing -> -- stm failed, do a tracked retry >> trackSTM (rtc + 1) >> Just ... -> ... >> >> ``` >> >> No such trace msg fires during my test, neither in single thread >> run, nor in runs with pressure. I'm sure this tracing mechanism >> works, as I can see such traces fire, in case e.g. posting a >> TMVar to a TQueue for some other thread to fill it, then read the >> result out, if these 2 ops are composed into a single tx, then of >> course it's infinite retry loop, and a sequence of such msgs are >> logged with ever increasing rtc #. >> >> So I believe no retry has ever been triggered. >> >> What can going on there? >> >> >> On 2020/7/24 下午11:46, Ryan Yates wrote: >>> > Then to explain the low CPU utilization (~10%), am I right to >>> understand it as that upon reading a TVar locked by another >>> committing tx, a lightweight thread will put itself into >>> `waiting STM` and descheduled state, so the CPUs can only stay >>> idle as not so many threads are willing to proceed? >>> >>> Since the commit happens in finite steps, the expectation is >>> that the lock will be released very soon.  Given this when the >>> body of a transaction executes `readTVar` it spins (active CPU!) >>> until the `TVar` is observed unlocked.  If a lock is observed >>> while commiting, it immediately starts the transaction again >>> from the beginning.  To get the behavior of suspending a >>> transaction you have to successfully commit a transaction that >>> executed `retry`.  Then the transaction is put on the wakeup >>> lists of its read set and subsequent commits will wake it up if >>> its write set overlaps. >>> >>> I don't think any of these things would explain low CPU >>> utilization.  You could try running with `perf` and see if lots >>> of time is spent in some recognizable part of the RTS. >>> >>> Ryan >>> >>> >>> On Fri, Jul 24, 2020 at 11:22 AM Compl Yue >> > wrote: >>> >>> Thanks very much for the insightful information Ryan! I'm >>> glad my suspect was wrong about the Haskell scheduler: >>> >>> > The Haskell capability that is committing a transaction >>> will not yield to another Haskell thread while it is doing >>> the commit.  The OS thread may be preempted, but once commit >>> starts the haskell scheduler is not invoked until after >>> locks are released. >>> >>> So best effort had already been made in GHC and I just need >>> to cooperate better with its design. Then to explain the low >>> CPU utilization (~10%), am I right to understand it as that >>> upon reading a TVar locked by another committing tx, a >>> lightweight thread will put itself into `waiting STM` and >>> descheduled state, so the CPUs can only stay idle as not so >>> many threads are willing to proceed? >>> >>> Anyway, I see light with better data structures to improve >>> my situation, let me try them and report back. Actually I >>> later changed `TVar (HaskMap k v)` to be `TVar (HashMap k >>> Int)` where the `Int` being array index into `TVar (Vector >>> (TVar (Maybe v)))`, in pursuing insertion order preservation >>> semantic of dict entries (like that in Python 3.7+), then >>> it's very hopeful to incorporate stm-containers' Map or >>> ttrie to approach free of contention. 
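For the unordered part of the global index, the uniqueness check I have in mind on top of stm-containers would look roughly like the sketch below. This is written against the `StmContainers.Map` API of stm-containers 1.x as I read it (note `insert` takes the value before the key there); `PathIndex`, `registerPath` and `DuplicatePath` are made-up names:

```
module Register where

import Control.Concurrent.STM (STM)
import Data.Text (Text)
import qualified StmContainers.Map as StmMap

data DbError = DuplicatePath Text
  deriving (Show)

-- Global index mapping a data file path to its record id.
type PathIndex = StmMap.Map Text Int

-- Register a path, enforcing uniqueness inside one STM transaction.
-- Because the map is a trie of TVars, two inserts with different
-- keys mostly touch different TVars and so rarely conflict.
registerPath :: PathIndex -> Text -> Int -> STM (Either DbError ())
registerPath idx path rid = do
  existing <- StmMap.lookup path idx
  case existing of
    Just _  -> return (Left (DuplicatePath path))
    Nothing -> do
      StmMap.insert rid path idx
      return (Right ())
```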
>>> >>> Thanks with regards, >>> >>> Compl >>> >>> >>> On 2020/7/24 下午10:03, Ryan Yates wrote: >>>> Hi Compl, >>>> >>>> Having a pool of transaction processing threads can be >>>> helpful in a certain way. If the body of the transaction >>>> takes more time to execute then the Haskell thread is >>>> allowed and it yields, the suspended thread won't get in >>>> the way of other thread, but when it is rescheduled, will >>>> have a low probability of success. Even worse, it will >>>> probably not discover that it is doomed to failure until >>>> commit time.  If transactions are more likely to reach >>>> commit without yielding, they are more likely to succeed.  >>>> If the transactions are not conflicting, it doesn't make >>>> much difference other than cache churn. >>>> >>>> The Haskell capability that is committing a transaction >>>> will not yield to another Haskell thread while it is doing >>>> the commit.  The OS thread may be preempted, but once >>>> commit starts the haskell scheduler is not invoked until >>>> after locks are released. >>>> >>>> To get good performance from STM you must pay attention to >>>> what TVars are involved in a commit.  All STM systems are >>>> working under the assumption of low contention, so you want >>>> to minimize "false" conflicts (conflicts that are not >>>> essential to the computation). Something like `TVar >>>> (HashMap k v)` will work pretty well for a low thread >>>> count, but every transaction that writes to that structure >>>> will be in conflict with every other transaction that >>>> accesses it. Pushing the `TVar` into the nodes of the >>>> structure reduces the possibilities for conflict, while >>>> increasing the amount of bookkeeping STM has to do.  I >>>> would like to reduce the cost of that bookkeeping using >>>> better structures, but we need to do so without harming >>>> performance in the low TVar count case.  Right now it is >>>> optimized for good cache performance with a handful of TVars. >>>> >>>> There is another way to play with performance by moving >>>> work into and out of the transaction body.  A transaction >>>> body that executes quickly will reach commit faster.  But >>>> it may be delaying work that moves into another >>>> transaction.  Forcing values at the right time can make a >>>> big difference. >>>> >>>> Ryan >>>> >>>> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe >>>> >>> > wrote: >>>> >>>> Thanks Chris, I confess I didn't pay enough attention >>>> to STM specialized container libraries by far, I >>>> skimmed through the description of stm-containers and >>>> ttrie, and feel they would definitely improve my code's >>>> performance in case I limit the server's parallelism >>>> within hardware capabilities. That may because I'm >>>> still prototyping the api and infrastructure for >>>> correctness, so even `TVar (HashMap k v)` performs okay >>>> for me at the moment, only if at low contention (surely >>>> there're plenty of CPU cycles to be optimized out in >>>> next steps). I model my data after graph model, so most >>>> data, even most indices are localized to nodes and >>>> edges, those can be manipulated without conflict, >>>> that's why I assumed I have a low contention use case >>>> since the very beginning - until I found there are >>>> still (though minor) needs for global indices to >>>> guarantee global uniqueness, I feel faithful with >>>> stm-containers/ttrie to implement a more scalable >>>> global index data structure, thanks for hinting me. 
>>>> >>>> So an evident solution comes into my mind now, is to >>>> run the server with a pool of tx processing threads, >>>> matching number of CPU cores, client RPC requests then >>>> get queued to be executed in some thread from the pool. >>>> But I'm really fond of the mechanism of M:N scheduler >>>> which solves massive/dynamic concurrency so elegantly. >>>> I had some good result with Go in this regard, and see >>>> GHC at par in doing this, I don't want to give up this >>>> enjoyable machinery. >>>> >>>> But looked at the stm implementation in GHC, it seems >>>> written TVars are exclusively locked during commit of a >>>> tx, I suspect this is the culprit when there're large M >>>> lightweight threads scheduled upon a small N hardware >>>> capabilities, that is when a lightweight thread yield >>>> control during an stm transaction commit, the TVars it >>>> locked will stay so until it's scheduled again (and >>>> again) till it can finish the commit. This way, >>>> descheduled threads could hold live threads from >>>> progressing. I haven't gone into more details there, >>>> but wonder if there can be some improvement for GHC RTS >>>> to keep an stm committing thread from descheduled, but >>>> seemingly that may impose more starvation potential; or >>>> stm can be improved to have its TVar locks preemptable >>>> when the owner trec/thread is in descheduled state? >>>> Neither should be easy but I'd really love massive >>>> lightweight threads doing STM practically well. >>>> >>>> Best regards, >>>> >>>> Compl >>>> >>>> >>>> On 2020/7/24 上午12:57, Christopher Allen wrote: >>>>> It seems like you know how to run practical tests for >>>>> tuning thread count and contention for throughput. >>>>> Part of the reason you haven't gotten a super clear >>>>> answer is "it depends." You give up fairness when you >>>>> use STM instead of MVars or equivalent structures. >>>>> That means a long running transaction might get >>>>> stampeded by many small ones invalidating it over and >>>>> over. The long-running transaction might never clear >>>>> if the small transactions keep moving the cheese. I >>>>> mention this because transaction runtime and size and >>>>> count all affect throughput and latency. What might be >>>>> ideal for one pattern of work might not be ideal for >>>>> another. Optimizing for overall throughput might make >>>>> the contention and fairness problems worse too. I've >>>>> done practical tests to optimize this in the past, >>>>> both for STM in Haskell and for RDBMS workloads. >>>>> >>>>> The next step is sometimes figuring out whether you >>>>> really need a data structure within a single STM >>>>> container or if perhaps you can break up your STM >>>>> container boundaries into zones or regions that >>>>> roughly map onto update boundaries. That should make >>>>> the transactions churn less. On the outside chance you >>>>> do need to touch more than one container in a >>>>> transaction, well, they compose. >>>>> >>>>> e.g. https://hackage.haskell.org/package/stm-containers >>>>> https://hackage.haskell.org/package/ttrie >>>>> >>>>> It also sounds a bit like your question bumps into >>>>> Amdahl's Law a bit. >>>>> >>>>> All else fails, stop using STM and find something more >>>>> tuned to your problem space. 
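If the pool-of-transaction-threads idea quoted above is worth an experiment, one cheap way to try it without giving up the M:N scheduler is to keep one lightweight worker per capability and feed it from a `TBQueue`, so an RPC handler thread only enqueues its transaction instead of running it directly. A rough sketch of what I mean (names invented for illustration):

```
module TxPool where

import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- Each job is a complete transaction; the caller bakes any result
-- delivery (e.g. a TMVar it waits on) into the action itself.
type Job = STM ()

newtype TxPool = TxPool (TBQueue Job)

-- One worker green thread per capability, so at most N transactions
-- are in flight at once, while any number of client handler threads
-- may be blocked on submit.
startPool :: IO TxPool
startPool = do
  n <- getNumCapabilities
  q <- newTBQueueIO 1024
  replicateM_ n $ forkIO $ forever $ do
    job <- atomically (readTBQueue q)
    atomically job
  return (TxPool q)

-- Called from an RPC handler thread: enqueue and wait for the result.
submit :: TxPool -> STM a -> IO a
submit (TxPool q) act = do
  resultVar <- newEmptyTMVarIO
  atomically $ writeTBQueue q (act >>= putTMVar resultVar)
  atomically (takeTMVar resultVar)
```

One caveat I can already see: if a job executes `retry` and blocks, it stalls its worker, so only transactions that always run to completion should go through such a pool.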
>>>>> >>>>> Hope this helps, >>>>> Chris Allen >>>>> >>>>> >>>>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via >>>>> Haskell-Cafe >>>> > wrote: >>>>> >>>>> Hello Cafe, >>>>> >>>>> I'm working on an in-memory database, in >>>>> Client/Server mode I just let each connected >>>>> client submit remote procedure call running in its >>>>> dedicated lightweight thread, modifying TVars in >>>>> RAM per its business needs, then in case many >>>>> clients connected concurrently and trying to >>>>> insert new data, if they are triggering global >>>>> index (some TVar) update, the throughput would >>>>> drop drastically. I reduced the shared state to a >>>>> simple int counter by TVar, got same symptom. >>>>> While the parallelism feels okay when there's no >>>>> hot TVar conflicting, or M is not much greater than N. >>>>> >>>>> As an empirical test workload, I have a `+RTS >>>>> -N10` server process, it handles 10 concurrent >>>>> clients okay, got ~5x of single thread throughput; >>>>> but in handling 20 concurrent clients, each of the >>>>> 10 CPUs can only be driven to ~10% utilization, >>>>> the throughput seems even worse than single >>>>> thread. More clients can even drive it thrashing >>>>> without much  progressing. >>>>> >>>>>  I can understand that pure STM doesn't scale well >>>>> after reading [1], and I see it suggested [7] >>>>> attractive and planned future work toward that >>>>> direction. >>>>> >>>>> But I can't find certain libraries or frameworks >>>>> addressing large M over small N scenarios, [1] >>>>> experimented with designated N parallelism, and >>>>> [7] is rather theoretical to my empirical needs. >>>>> >>>>> Can you direct me to some available library >>>>> implementing the methodology proposed in [7] or >>>>> other ways tackling this problem? >>>>> >>>>> I think the most difficult one is that a >>>>> transaction should commit with global indices >>>>> (with possibly unique constraints) atomically >>>>> updated, and rollback with any violation of >>>>> constraints, i.e. transactions have to cover >>>>> global states like indices. Other problems seem >>>>> more trivial than this. >>>>> >>>>> Specifically, [7] states: >>>>> >>>>> > It must be emphasized that all of the mechanisms >>>>> we deploy originate, in one form or another, in >>>>> the database literature from the 70s and 80s. Our >>>>> contribution is to adapt these techniques to >>>>> software transactional memory, providing more >>>>> effective solutions to important STM problems than >>>>> prior proposals. >>>>> >>>>> I wonder any STM based library has simplified >>>>> those techniques to be composed right away? I >>>>> don't really want to implement those mechanisms by >>>>> myself, rebuilding many wheels from scratch. >>>>> >>>>> Best regards, >>>>> Compl >>>>> >>>>> >>>>> [1] Comparing the performance of concurrent >>>>> linked-list implementations in Haskell >>>>> https://simonmar.github.io/bib/papers/concurrent-data.pdf >>>>> >>>>> [7] M. Herlihy and E. Koskinen. Transactional >>>>> boosting: a methodology for highly-concurrent >>>>> transactional objects. In Proc. of PPoPP ’08, >>>>> pages 207–216. ACM Press, 2008. >>>>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >>>>> >>>>> _______________________________________________ >>>>> Haskell-Cafe mailing list >>>>> To (un)subscribe, modify options or view archives >>>>> go to: >>>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>>> Only members subscribed via the mailman list are >>>>> allowed to post. 
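On the reduced test case with the single hot `Int` counter quoted above: the contention there is inherent to having one TVar, but if the total only needs to be read occasionally, striping the counter over several TVars and summing on read is a classic way to spread the conflicts. A small sketch (stripe count and names chosen arbitrarily):

```
module StripedCounter where

import Control.Concurrent.STM
import qualified Data.Vector as V

newtype StripedCounter = StripedCounter (V.Vector (TVar Int))

newCounter :: Int -> IO StripedCounter
newCounter stripes =
  StripedCounter <$> V.replicateM stripes (newTVarIO 0)

-- Bump one stripe, picked e.g. from a hash of the session id,
-- so concurrent writers mostly touch different TVars.
incr :: StripedCounter -> Int -> STM ()
incr (StripedCounter v) key =
  modifyTVar' (v V.! (key `mod` V.length v)) (+ 1)

-- Reading still touches every stripe, so keep reads rare or put
-- them in their own transaction.
total :: StripedCounter -> STM Int
total (StripedCounter v) = sum <$> mapM readTVar (V.toList v)
```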
>>>>> >>>>> >>>>> >>>>> -- >>>>> Chris Allen >>>>> Currently working on http://haskellbook.com >>>> _______________________________________________ >>>> Haskell-Cafe mailing list >>>> To (un)subscribe, modify options or view archives go to: >>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>> Only members subscribed via the mailman list are >>>> allowed to post. >>>> > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From fryguybob at gmail.com Sat Jul 25 13:48:30 2020 From: fryguybob at gmail.com (Ryan Yates) Date: Sat, 25 Jul 2020 09:48:30 -0400 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> Message-ID: I have never done it, but I think you can make GDB count the times a breakpoint is hit using conditional breakpoints. Someone else may know of better tools. On Sat, Jul 25, 2020 at 2:04 AM Compl Yue wrote: > Shame on me for I have neither experienced with `perf`, I'd learn these > essential tools soon to put them into good use. > > It's great to learn about how `orElse` actually works, I did get confused > why there are so little retries captured, and now I know. So that little > trick should definitely be removed before going production, as it does no > much useful things at excessive cost. I put it there to help me understand > internal working of stm, now I get even better knowledge ;-) > > I think a debugger will trap every single abort, isn't it annoying when > many aborts would occur? If I'd like to count the number of aborts, ideally > accounted per service endpoints, time periods, source modules etc. there > some tricks for that? > > Thanks with best regards, > > Compl > > > On 2020/7/25 上午2:02, Ryan Yates wrote: > > To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based > profiling can do a good job with concurrent and parallel programs where > other methods are problematic. For instance, > changing the size of heap objects can drastically change cache > performance and completely different behavior can show up. > > [^1]: https://en.wikipedia.org/wiki/Perf_(Linux) > > The spinning in `readTVar` should always be very short and it typically > shows up as intensive CPU use, though it may not be high energy use with > `pause` in the loop on x86 (looks like we don't have it [^2], I thought we > did, but maybe that was only in some of my code... ) > > [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 > > All that to say, I doubt that you are spending much time spinning (but it > would certainly be interesting to know if you are! You would see `perf` > attribute a large amount of time to `read_current_value`). The amount of > code to execute for commit (the time when locks are held) is always much > shorter than it takes to execute the transaction body. As you add more > conflicting threads this gets worse of course as commits sequence. > > The code you have will count commits of executions of `retry`. 
Note that > `retry` is a user level idea, that is, you are counting user level > *explicit* retries. This is different from a transaction failing to commit > and starting again. These are invisible to the user. Also using your > trace will convert `retry` from the efficient wake on write implementation, > to an active retry that will always attempt again. We don't have cheap > logging of transaction aborts in GHC, but I have built such logging in my > work. You can observe these aborts with a debugger by looking for > execution of this line: > > https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 > > Ryan > > > > On Fri, Jul 24, 2020 at 12:35 PM Compl Yue wrote: > >> I'm not familiar with profiling GHC yet, may need more time to get myself >> proficient with it. >> >> And a bit more details of my test workload for diagnostic: the db clients >> are Python processes from a cluster of worker nodes, consulting the db >> server to register some path for data files, under a data dir within a >> shared filesystem, then mmap those data files and fill in actual array >> data. So the db server don't have much computation to perform, but puts the >> data file path into a global index, which at the same validates its >> uniqueness. As there are many client processes trying to insert one meta >> data record concurrently, with my naive implementation, the global index's >> TVar will almost always in locked state by one client after another, from a >> queue never fall empty. >> >> So if `readTVar` should spinning waiting, I doubt the threads should >> actually make high CPU utilization, because at any instant of time, all >> threads except the committing one will be doing that one thing. >> >> And I have something in my code to track STM retry like this: >> >> ``` >> -- blocking wait not expected, track stm retries explicitly >> trackSTM :: Int -> IO (Either () a) >> trackSTM !rtc = do >> when -- todo increase the threshold of reporting? >> (rtc > 0) $ do >> -- trace out the retries so the end users can be aware of them >> tid <- myThreadId >> trace >> ( "🔙\n" >> <> show callCtx >> <> "🌀 " >> <> show tid >> <> " stm retry #" >> <> show rtc >> ) >> $ return () >> atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case >> Nothing -> -- stm failed, do a tracked retry >> trackSTM (rtc + 1) >> Just ... -> ... >> >> ``` >> >> No such trace msg fires during my test, neither in single thread run, nor >> in runs with pressure. I'm sure this tracing mechanism works, as I can see >> such traces fire, in case e.g. posting a TMVar to a TQueue for some other >> thread to fill it, then read the result out, if these 2 ops are composed >> into a single tx, then of course it's infinite retry loop, and a sequence >> of such msgs are logged with ever increasing rtc #. >> >> So I believe no retry has ever been triggered. >> >> What can going on there? >> >> >> On 2020/7/24 下午11:46, Ryan Yates wrote: >> >> > Then to explain the low CPU utilization (~10%), am I right to >> understand it as that upon reading a TVar locked by another committing tx, >> a lightweight thread will put itself into `waiting STM` and descheduled >> state, so the CPUs can only stay idle as not so many threads are willing to >> proceed? >> >> Since the commit happens in finite steps, the expectation is that the >> lock will be released very soon. Given this when the body of a transaction >> executes `readTVar` it spins (active CPU!) until the `TVar` is observed >> unlocked. 
If a lock is observed while commiting, it immediately starts the >> transaction again from the beginning. To get the behavior of suspending a >> transaction you have to successfully commit a transaction that executed >> `retry`. Then the transaction is put on the wakeup lists of its read set >> and subsequent commits will wake it up if its write set overlaps. >> >> I don't think any of these things would explain low CPU utilization. You >> could try running with `perf` and see if lots of time is spent in some >> recognizable part of the RTS. >> >> Ryan >> >> >> On Fri, Jul 24, 2020 at 11:22 AM Compl Yue wrote: >> >>> Thanks very much for the insightful information Ryan! I'm glad my >>> suspect was wrong about the Haskell scheduler: >>> >>> > The Haskell capability that is committing a transaction will not yield >>> to another Haskell thread while it is doing the commit. The OS thread may >>> be preempted, but once commit starts the haskell scheduler is not invoked >>> until after locks are released. >>> So best effort had already been made in GHC and I just need to cooperate >>> better with its design. Then to explain the low CPU utilization (~10%), am >>> I right to understand it as that upon reading a TVar locked by another >>> committing tx, a lightweight thread will put itself into `waiting STM` and >>> descheduled state, so the CPUs can only stay idle as not so many threads >>> are willing to proceed? >>> >>> Anyway, I see light with better data structures to improve my situation, >>> let me try them and report back. Actually I later changed `TVar (HaskMap k >>> v)` to be `TVar (HashMap k Int)` where the `Int` being array index into >>> `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order preservation >>> semantic of dict entries (like that in Python 3.7+), then it's very hopeful >>> to incorporate stm-containers' Map or ttrie to approach free of contention. >>> >>> Thanks with regards, >>> >>> Compl >>> >>> >>> On 2020/7/24 下午10:03, Ryan Yates wrote: >>> >>> Hi Compl, >>> >>> Having a pool of transaction processing threads can be helpful in a >>> certain way. If the body of the transaction takes more time to execute >>> then the Haskell thread is allowed and it yields, the suspended thread >>> won't get in the way of other thread, but when it is rescheduled, will have >>> a low probability of success. Even worse, it will probably not discover >>> that it is doomed to failure until commit time. If transactions are more >>> likely to reach commit without yielding, they are more likely to succeed. >>> If the transactions are not conflicting, it doesn't make much difference >>> other than cache churn. >>> >>> The Haskell capability that is committing a transaction will not yield >>> to another Haskell thread while it is doing the commit. The OS thread may >>> be preempted, but once commit starts the haskell scheduler is not invoked >>> until after locks are released. >>> >>> To get good performance from STM you must pay attention to what TVars >>> are involved in a commit. All STM systems are working under the assumption >>> of low contention, so you want to minimize "false" conflicts (conflicts >>> that are not essential to the computation). Something like `TVar >>> (HashMap k v)` will work pretty well for a low thread count, but every >>> transaction that writes to that structure will be in conflict with every >>> other transaction that accesses it. 
Pushing the `TVar` into the nodes of >>> the structure reduces the possibilities for conflict, while increasing the >>> amount of bookkeeping STM has to do. I would like to reduce the cost of >>> that bookkeeping using better structures, but we need to do so without >>> harming performance in the low TVar count case. Right now it is optimized >>> for good cache performance with a handful of TVars. >>> >>> There is another way to play with performance by moving work into and >>> out of the transaction body. A transaction body that executes quickly will >>> reach commit faster. But it may be delaying work that moves into another >>> transaction. Forcing values at the right time can make a big difference. >>> >>> Ryan >>> >>> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe < >>> haskell-cafe at haskell.org> wrote: >>> >>>> Thanks Chris, I confess I didn't pay enough attention to STM >>>> specialized container libraries by far, I skimmed through the description >>>> of stm-containers and ttrie, and feel they would definitely improve my >>>> code's performance in case I limit the server's parallelism within hardware >>>> capabilities. That may because I'm still prototyping the api and >>>> infrastructure for correctness, so even `TVar (HashMap k v)` performs okay >>>> for me at the moment, only if at low contention (surely there're plenty of >>>> CPU cycles to be optimized out in next steps). I model my data after graph >>>> model, so most data, even most indices are localized to nodes and edges, >>>> those can be manipulated without conflict, that's why I assumed I have a >>>> low contention use case since the very beginning - until I found there are >>>> still (though minor) needs for global indices to guarantee global >>>> uniqueness, I feel faithful with stm-containers/ttrie to implement a more >>>> scalable global index data structure, thanks for hinting me. >>>> >>>> So an evident solution comes into my mind now, is to run the server >>>> with a pool of tx processing threads, matching number of CPU cores, client >>>> RPC requests then get queued to be executed in some thread from the pool. >>>> But I'm really fond of the mechanism of M:N scheduler which solves >>>> massive/dynamic concurrency so elegantly. I had some good result with Go in >>>> this regard, and see GHC at par in doing this, I don't want to give up this >>>> enjoyable machinery. >>>> >>>> But looked at the stm implementation in GHC, it seems written TVars are >>>> exclusively locked during commit of a tx, I suspect this is the culprit >>>> when there're large M lightweight threads scheduled upon a small N hardware >>>> capabilities, that is when a lightweight thread yield control during an stm >>>> transaction commit, the TVars it locked will stay so until it's scheduled >>>> again (and again) till it can finish the commit. This way, descheduled >>>> threads could hold live threads from progressing. I haven't gone into more >>>> details there, but wonder if there can be some improvement for GHC RTS to >>>> keep an stm committing thread from descheduled, but seemingly that may >>>> impose more starvation potential; or stm can be improved to have its TVar >>>> locks preemptable when the owner trec/thread is in descheduled state? >>>> Neither should be easy but I'd really love massive lightweight threads >>>> doing STM practically well. 
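To make the advice quoted above about "pushing the `TVar` into the nodes of the structure" concrete, here is a minimal sketch (not code from this thread; the type and function names are made up) using only the `stm` and `containers` packages. With the coarse layout every write invalidates every other transaction that touched the index; with the finer layout an update to an existing key only touches that key's own `TVar`:

```
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- Coarse-grained: the whole index lives behind a single TVar, so every
-- writer conflicts with every reader/writer of the index.
type CoarseIndex k v = TVar (M.Map k v)

insertCoarse :: Ord k => CoarseIndex k v -> k -> v -> STM ()
insertCoarse idx k v = modifyTVar' idx (M.insert k v)

-- Finer-grained: each entry carries its own TVar, so updating an
-- existing key no longer conflicts with transactions on other keys;
-- only inserting or deleting keys still contends on the outer TVar.
type FineIndex k v = TVar (M.Map k (TVar v))

updateFine :: Ord k => FineIndex k v -> k -> v -> STM Bool
updateFine idx k newV = do
  m <- readTVar idx
  case M.lookup k m of
    Nothing   -> pure False
    Just slot -> writeTVar slot newV >> pure True
```

The trade-off Ryan mentions applies here too: the finer layout creates many more TVars for STM to track per transaction, which is exactly the extra bookkeeping cost being discussed.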
>>>> >>>> Best regards, >>>> >>>> Compl >>>> >>>> >>>> On 2020/7/24 上午12:57, Christopher Allen wrote: >>>> >>>> It seems like you know how to run practical tests for tuning thread >>>> count and contention for throughput. Part of the reason you haven't gotten >>>> a super clear answer is "it depends." You give up fairness when you use STM >>>> instead of MVars or equivalent structures. That means a long running >>>> transaction might get stampeded by many small ones invalidating it over and >>>> over. The long-running transaction might never clear if the small >>>> transactions keep moving the cheese. I mention this because transaction >>>> runtime and size and count all affect throughput and latency. What might be >>>> ideal for one pattern of work might not be ideal for another. Optimizing >>>> for overall throughput might make the contention and fairness problems >>>> worse too. I've done practical tests to optimize this in the past, both for >>>> STM in Haskell and for RDBMS workloads. >>>> >>>> The next step is sometimes figuring out whether you really need a data >>>> structure within a single STM container or if perhaps you can break up your >>>> STM container boundaries into zones or regions that roughly map onto update >>>> boundaries. That should make the transactions churn less. On the outside >>>> chance you do need to touch more than one container in a transaction, well, >>>> they compose. >>>> >>>> e.g. https://hackage.haskell.org/package/stm-containers >>>> https://hackage.haskell.org/package/ttrie >>>> >>>> It also sounds a bit like your question bumps into Amdahl's Law a bit. >>>> >>>> All else fails, stop using STM and find something more tuned to your >>>> problem space. >>>> >>>> Hope this helps, >>>> Chris Allen >>>> >>>> >>>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe < >>>> haskell-cafe at haskell.org> wrote: >>>> >>>>> Hello Cafe, >>>>> >>>>> I'm working on an in-memory database, in Client/Server mode I just let >>>>> each connected client submit remote procedure call running in its dedicated >>>>> lightweight thread, modifying TVars in RAM per its business needs, then in >>>>> case many clients connected concurrently and trying to insert new data, if >>>>> they are triggering global index (some TVar) update, the throughput would >>>>> drop drastically. I reduced the shared state to a simple int counter by >>>>> TVar, got same symptom. While the parallelism feels okay when there's no >>>>> hot TVar conflicting, or M is not much greater than N. >>>>> >>>>> As an empirical test workload, I have a `+RTS -N10` server process, it >>>>> handles 10 concurrent clients okay, got ~5x of single thread throughput; >>>>> but in handling 20 concurrent clients, each of the 10 CPUs can only be >>>>> driven to ~10% utilization, the throughput seems even worse than single >>>>> thread. More clients can even drive it thrashing without much progressing. >>>>> >>>>> I can understand that pure STM doesn't scale well after reading [1], >>>>> and I see it suggested [7] attractive and planned future work toward that >>>>> direction. >>>>> >>>>> But I can't find certain libraries or frameworks addressing large M >>>>> over small N scenarios, [1] experimented with designated N parallelism, and >>>>> [7] is rather theoretical to my empirical needs. >>>>> >>>>> Can you direct me to some available library implementing the >>>>> methodology proposed in [7] or other ways tackling this problem? 
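One cheap way to act on Chris Allen's suggestion quoted above, of breaking a single STM container into zones or regions, is to shard the index by key hash so that inserts of unrelated keys usually land on different `TVar`s. A rough sketch of mine (names invented; assumes the `stm`, `containers`, `hashable` and `vector` packages), not a production design:

```
import Control.Concurrent.STM
import Data.Hashable (Hashable, hash)
import qualified Data.Map.Strict as M
import qualified Data.Vector as V

-- A fixed number of independently updated shards; a transaction that
-- inserts key k only touches shard (hash k `mod` n).
newtype ShardedIndex k v = ShardedIndex (V.Vector (TVar (M.Map k v)))

newShardedIndex :: Int -> IO (ShardedIndex k v)
newShardedIndex n = ShardedIndex <$> V.replicateM n (newTVarIO M.empty)

insertSharded :: (Hashable k, Ord k) => ShardedIndex k v -> k -> v -> STM ()
insertSharded (ShardedIndex shards) k v =
  modifyTVar' (shards V.! (hash k `mod` V.length shards)) (M.insert k v)
```

A uniqueness check for one key only needs the shard that key hashes to, but any operation that must scan the whole index will read every shard and conflict with all of them.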
>>>>> >>>>> I think the most difficult one is that a transaction should commit >>>>> with global indices (with possibly unique constraints) atomically updated, >>>>> and rollback with any violation of constraints, i.e. transactions have to >>>>> cover global states like indices. Other problems seem more trivial than >>>>> this. >>>>> >>>>> Specifically, [7] states: >>>>> >>>>> > It must be emphasized that all of the mechanisms we deploy >>>>> originate, in one form or another, in the database literature from the 70s >>>>> and 80s. Our contribution is to adapt these techniques to software >>>>> transactional memory, providing more effective solutions to important STM >>>>> problems than prior proposals. >>>>> >>>>> I wonder any STM based library has simplified those techniques to be >>>>> composed right away? I don't really want to implement those mechanisms by >>>>> myself, rebuilding many wheels from scratch. >>>>> >>>>> Best regards, >>>>> Compl >>>>> >>>>> >>>>> [1] Comparing the performance of concurrent linked-list >>>>> implementations in Haskell >>>>> https://simonmar.github.io/bib/papers/concurrent-data.pdf >>>>> >>>>> [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology >>>>> for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages >>>>> 207–216. ACM Press, 2008. >>>>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >>>>> >>>>> _______________________________________________ >>>>> Haskell-Cafe mailing list >>>>> To (un)subscribe, modify options or view archives go to: >>>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>>> Only members subscribed via the mailman list are allowed to post. >>>> >>>> >>>> >>>> -- >>>> Chris Allen >>>> Currently working on http://haskellbook.com >>>> >>>> _______________________________________________ >>>> Haskell-Cafe mailing list >>>> To (un)subscribe, modify options or view archives go to: >>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>> Only members subscribed via the mailman list are allowed to post. >>> >>> -------------- next part -------------- An HTML attachment was scrubbed... URL: From fryguybob at gmail.com Sat Jul 25 14:07:10 2020 From: fryguybob at gmail.com (Ryan Yates) Date: Sat, 25 Jul 2020 10:07:10 -0400 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> Message-ID: Unfortunately my STM benchmarks are rather disorganized. The most relevant paper using them is: Leveraging hardware TM in Haskell (PPoPP '19) https://dl.acm.org/doi/10.1145/3293883.3295711 Or my thesis: https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931 The PPoPP benchmarks are on a branch (or the releases tab on github): https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited. 
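For readers who want a quick local experiment before digging into those benchmark suites, a throwaway micro-benchmark along the lines being discussed (my sketch, not taken from the repositories above) already shows the contention cliff. Build with `ghc -O2 -threaded -rtsopts`, run with `+RTS -N -s`, and compare the shared and private cases:

```
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar
import Control.Concurrent.STM
import Control.Monad (forM_, replicateM, replicateM_)

-- Each worker bumps its assigned counter `iters` times, one small
-- transaction per bump.
bump :: Int -> TVar Int -> IO ()
bump iters tv = replicateM_ iters (atomically (modifyTVar' tv (+ 1)))

-- shared = True : all workers hammer one TVar (every commit conflicts).
-- shared = False: one private TVar per worker (no conflicts at all).
run :: Bool -> Int -> Int -> IO ()
run shared nThreads iters = do
  tvs <- if shared
           then replicate nThreads <$> newTVarIO 0
           else replicateM nThreads (newTVarIO 0)
  done <- newEmptyMVar
  forM_ tvs $ \tv -> forkIO (bump iters tv >> putMVar done ())
  replicateM_ nThreads (takeMVar done)

main :: IO ()
main = run True 10 100000   -- flip to False for the uncontended baseline
```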
Ryan On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe < haskell-cafe at haskell.org> wrote: > Dear Cafe, > > As Chris Allen has suggested, I learned that > https://hackage.haskell.org/package/stm-containers and > https://hackage.haskell.org/package/ttrie can help a lot when used in > place of traditional HashMap for stm tx processing, under heavy > concurrency, yet still with automatic parallelism as GHC implemented them. > Then I realized that in addition to hash map (used to implement dicts and > scopes), I also need to find a TreeMap replacement data structure to > implement the db index. I've been focusing on the uniqueness constraint > aspect, but it's still an index, needs to provide range scan api for db > clients, so hash map is not sufficient for the index. > > I see Ryan shared the code benchmarking RBTree with stm in mind: > > > https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput > > > https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree > > But can't find conclusion or interpretation of that benchmark suite. And > here's a followup question: > > > Where are some STM contention optimized data structures, that having keys > ordered, with sub-range traversing api ? > > (of course production ready libraries most desirable) > > > Thanks with regards, > > Compl > > > On 2020/7/25 下午2:04, Compl Yue via Haskell-Cafe wrote: > > Shame on me for I have neither experienced with `perf`, I'd learn these > essential tools soon to put them into good use. > > It's great to learn about how `orElse` actually works, I did get confused > why there are so little retries captured, and now I know. So that little > trick should definitely be removed before going production, as it does no > much useful things at excessive cost. I put it there to help me understand > internal working of stm, now I get even better knowledge ;-) > > I think a debugger will trap every single abort, isn't it annoying when > many aborts would occur? If I'd like to count the number of aborts, ideally > accounted per service endpoints, time periods, source modules etc. there > some tricks for that? > > Thanks with best regards, > > Compl > > > On 2020/7/25 上午2:02, Ryan Yates wrote: > > To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based > profiling can do a good job with concurrent and parallel programs where > other methods are problematic. For instance, > changing the size of heap objects can drastically change cache > performance and completely different behavior can show up. > > [^1]: https://en.wikipedia.org/wiki/Perf_(Linux) > > The spinning in `readTVar` should always be very short and it typically > shows up as intensive CPU use, though it may not be high energy use with > `pause` in the loop on x86 (looks like we don't have it [^2], I thought we > did, but maybe that was only in some of my code... ) > > [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 > > All that to say, I doubt that you are spending much time spinning (but it > would certainly be interesting to know if you are! You would see `perf` > attribute a large amount of time to `read_current_value`). The amount of > code to execute for commit (the time when locks are held) is always much > shorter than it takes to execute the transaction body. As you add more > conflicting threads this gets worse of course as commits sequence. > > The code you have will count commits of executions of `retry`. 
Note that > `retry` is a user level idea, that is, you are counting user level > *explicit* retries. This is different from a transaction failing to commit > and starting again. These are invisible to the user. Also using your > trace will convert `retry` from the efficient wake on write implementation, > to an active retry that will always attempt again. We don't have cheap > logging of transaction aborts in GHC, but I have built such logging in my > work. You can observe these aborts with a debugger by looking for > execution of this line: > > https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 > > Ryan > > > > On Fri, Jul 24, 2020 at 12:35 PM Compl Yue wrote: > >> I'm not familiar with profiling GHC yet, may need more time to get myself >> proficient with it. >> >> And a bit more details of my test workload for diagnostic: the db clients >> are Python processes from a cluster of worker nodes, consulting the db >> server to register some path for data files, under a data dir within a >> shared filesystem, then mmap those data files and fill in actual array >> data. So the db server don't have much computation to perform, but puts the >> data file path into a global index, which at the same validates its >> uniqueness. As there are many client processes trying to insert one meta >> data record concurrently, with my naive implementation, the global index's >> TVar will almost always in locked state by one client after another, from a >> queue never fall empty. >> >> So if `readTVar` should spinning waiting, I doubt the threads should >> actually make high CPU utilization, because at any instant of time, all >> threads except the committing one will be doing that one thing. >> >> And I have something in my code to track STM retry like this: >> >> ``` >> -- blocking wait not expected, track stm retries explicitly >> trackSTM :: Int -> IO (Either () a) >> trackSTM !rtc = do >> when -- todo increase the threshold of reporting? >> (rtc > 0) $ do >> -- trace out the retries so the end users can be aware of them >> tid <- myThreadId >> trace >> ( "🔙\n" >> <> show callCtx >> <> "🌀 " >> <> show tid >> <> " stm retry #" >> <> show rtc >> ) >> $ return () >> atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case >> Nothing -> -- stm failed, do a tracked retry >> trackSTM (rtc + 1) >> Just ... -> ... >> >> ``` >> >> No such trace msg fires during my test, neither in single thread run, nor >> in runs with pressure. I'm sure this tracing mechanism works, as I can see >> such traces fire, in case e.g. posting a TMVar to a TQueue for some other >> thread to fill it, then read the result out, if these 2 ops are composed >> into a single tx, then of course it's infinite retry loop, and a sequence >> of such msgs are logged with ever increasing rtc #. >> >> So I believe no retry has ever been triggered. >> >> What can going on there? >> >> >> On 2020/7/24 下午11:46, Ryan Yates wrote: >> >> > Then to explain the low CPU utilization (~10%), am I right to >> understand it as that upon reading a TVar locked by another committing tx, >> a lightweight thread will put itself into `waiting STM` and descheduled >> state, so the CPUs can only stay idle as not so many threads are willing to >> proceed? >> >> Since the commit happens in finite steps, the expectation is that the >> lock will be released very soon. Given this when the body of a transaction >> executes `readTVar` it spins (active CPU!) until the `TVar` is observed >> unlocked. 
If a lock is observed while commiting, it immediately starts the >> transaction again from the beginning. To get the behavior of suspending a >> transaction you have to successfully commit a transaction that executed >> `retry`. Then the transaction is put on the wakeup lists of its read set >> and subsequent commits will wake it up if its write set overlaps. >> >> I don't think any of these things would explain low CPU utilization. You >> could try running with `perf` and see if lots of time is spent in some >> recognizable part of the RTS. >> >> Ryan >> >> >> On Fri, Jul 24, 2020 at 11:22 AM Compl Yue wrote: >> >>> Thanks very much for the insightful information Ryan! I'm glad my >>> suspect was wrong about the Haskell scheduler: >>> >>> > The Haskell capability that is committing a transaction will not yield >>> to another Haskell thread while it is doing the commit. The OS thread may >>> be preempted, but once commit starts the haskell scheduler is not invoked >>> until after locks are released. >>> So best effort had already been made in GHC and I just need to cooperate >>> better with its design. Then to explain the low CPU utilization (~10%), am >>> I right to understand it as that upon reading a TVar locked by another >>> committing tx, a lightweight thread will put itself into `waiting STM` and >>> descheduled state, so the CPUs can only stay idle as not so many threads >>> are willing to proceed? >>> >>> Anyway, I see light with better data structures to improve my situation, >>> let me try them and report back. Actually I later changed `TVar (HaskMap k >>> v)` to be `TVar (HashMap k Int)` where the `Int` being array index into >>> `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order preservation >>> semantic of dict entries (like that in Python 3.7+), then it's very hopeful >>> to incorporate stm-containers' Map or ttrie to approach free of contention. >>> >>> Thanks with regards, >>> >>> Compl >>> >>> >>> On 2020/7/24 下午10:03, Ryan Yates wrote: >>> >>> Hi Compl, >>> >>> Having a pool of transaction processing threads can be helpful in a >>> certain way. If the body of the transaction takes more time to execute >>> then the Haskell thread is allowed and it yields, the suspended thread >>> won't get in the way of other thread, but when it is rescheduled, will have >>> a low probability of success. Even worse, it will probably not discover >>> that it is doomed to failure until commit time. If transactions are more >>> likely to reach commit without yielding, they are more likely to succeed. >>> If the transactions are not conflicting, it doesn't make much difference >>> other than cache churn. >>> >>> The Haskell capability that is committing a transaction will not yield >>> to another Haskell thread while it is doing the commit. The OS thread may >>> be preempted, but once commit starts the haskell scheduler is not invoked >>> until after locks are released. >>> >>> To get good performance from STM you must pay attention to what TVars >>> are involved in a commit. All STM systems are working under the assumption >>> of low contention, so you want to minimize "false" conflicts (conflicts >>> that are not essential to the computation). Something like `TVar >>> (HashMap k v)` will work pretty well for a low thread count, but every >>> transaction that writes to that structure will be in conflict with every >>> other transaction that accesses it. 
Pushing the `TVar` into the nodes of >>> the structure reduces the possibilities for conflict, while increasing the >>> amount of bookkeeping STM has to do. I would like to reduce the cost of >>> that bookkeeping using better structures, but we need to do so without >>> harming performance in the low TVar count case. Right now it is optimized >>> for good cache performance with a handful of TVars. >>> >>> There is another way to play with performance by moving work into and >>> out of the transaction body. A transaction body that executes quickly will >>> reach commit faster. But it may be delaying work that moves into another >>> transaction. Forcing values at the right time can make a big difference. >>> >>> Ryan >>> >>> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe < >>> haskell-cafe at haskell.org> wrote: >>> >>>> Thanks Chris, I confess I didn't pay enough attention to STM >>>> specialized container libraries by far, I skimmed through the description >>>> of stm-containers and ttrie, and feel they would definitely improve my >>>> code's performance in case I limit the server's parallelism within hardware >>>> capabilities. That may because I'm still prototyping the api and >>>> infrastructure for correctness, so even `TVar (HashMap k v)` performs okay >>>> for me at the moment, only if at low contention (surely there're plenty of >>>> CPU cycles to be optimized out in next steps). I model my data after graph >>>> model, so most data, even most indices are localized to nodes and edges, >>>> those can be manipulated without conflict, that's why I assumed I have a >>>> low contention use case since the very beginning - until I found there are >>>> still (though minor) needs for global indices to guarantee global >>>> uniqueness, I feel faithful with stm-containers/ttrie to implement a more >>>> scalable global index data structure, thanks for hinting me. >>>> >>>> So an evident solution comes into my mind now, is to run the server >>>> with a pool of tx processing threads, matching number of CPU cores, client >>>> RPC requests then get queued to be executed in some thread from the pool. >>>> But I'm really fond of the mechanism of M:N scheduler which solves >>>> massive/dynamic concurrency so elegantly. I had some good result with Go in >>>> this regard, and see GHC at par in doing this, I don't want to give up this >>>> enjoyable machinery. >>>> >>>> But looked at the stm implementation in GHC, it seems written TVars are >>>> exclusively locked during commit of a tx, I suspect this is the culprit >>>> when there're large M lightweight threads scheduled upon a small N hardware >>>> capabilities, that is when a lightweight thread yield control during an stm >>>> transaction commit, the TVars it locked will stay so until it's scheduled >>>> again (and again) till it can finish the commit. This way, descheduled >>>> threads could hold live threads from progressing. I haven't gone into more >>>> details there, but wonder if there can be some improvement for GHC RTS to >>>> keep an stm committing thread from descheduled, but seemingly that may >>>> impose more starvation potential; or stm can be improved to have its TVar >>>> locks preemptable when the owner trec/thread is in descheduled state? >>>> Neither should be easy but I'd really love massive lightweight threads >>>> doing STM practically well. 
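The closing point above about moving work out of the transaction body and forcing values at the right time can be shown in a few lines (a sketch of mine; the helper name is invented). Note that `evaluate` only forces to weak head normal form; deeper structures would need something like `force` from the `deepseq` package:

```
import Control.Concurrent.STM
import Control.Exception (evaluate)
import qualified Data.Map.Strict as M

-- Pay the cost of computing the value before entering STM, so the
-- transaction body only writes a pointer and reaches commit quickly.
insertPrecomputed :: Ord k => TVar (M.Map k v) -> k -> v -> IO ()
insertPrecomputed idx k v = do
  v' <- evaluate v
  atomically (modifyTVar' idx (M.insert k v'))
```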
>>>> >>>> Best regards, >>>> >>>> Compl >>>> >>>> >>>> On 2020/7/24 上午12:57, Christopher Allen wrote: >>>> >>>> It seems like you know how to run practical tests for tuning thread >>>> count and contention for throughput. Part of the reason you haven't gotten >>>> a super clear answer is "it depends." You give up fairness when you use STM >>>> instead of MVars or equivalent structures. That means a long running >>>> transaction might get stampeded by many small ones invalidating it over and >>>> over. The long-running transaction might never clear if the small >>>> transactions keep moving the cheese. I mention this because transaction >>>> runtime and size and count all affect throughput and latency. What might be >>>> ideal for one pattern of work might not be ideal for another. Optimizing >>>> for overall throughput might make the contention and fairness problems >>>> worse too. I've done practical tests to optimize this in the past, both for >>>> STM in Haskell and for RDBMS workloads. >>>> >>>> The next step is sometimes figuring out whether you really need a data >>>> structure within a single STM container or if perhaps you can break up your >>>> STM container boundaries into zones or regions that roughly map onto update >>>> boundaries. That should make the transactions churn less. On the outside >>>> chance you do need to touch more than one container in a transaction, well, >>>> they compose. >>>> >>>> e.g. https://hackage.haskell.org/package/stm-containers >>>> https://hackage.haskell.org/package/ttrie >>>> >>>> It also sounds a bit like your question bumps into Amdahl's Law a bit. >>>> >>>> All else fails, stop using STM and find something more tuned to your >>>> problem space. >>>> >>>> Hope this helps, >>>> Chris Allen >>>> >>>> >>>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe < >>>> haskell-cafe at haskell.org> wrote: >>>> >>>>> Hello Cafe, >>>>> >>>>> I'm working on an in-memory database, in Client/Server mode I just let >>>>> each connected client submit remote procedure call running in its dedicated >>>>> lightweight thread, modifying TVars in RAM per its business needs, then in >>>>> case many clients connected concurrently and trying to insert new data, if >>>>> they are triggering global index (some TVar) update, the throughput would >>>>> drop drastically. I reduced the shared state to a simple int counter by >>>>> TVar, got same symptom. While the parallelism feels okay when there's no >>>>> hot TVar conflicting, or M is not much greater than N. >>>>> >>>>> As an empirical test workload, I have a `+RTS -N10` server process, it >>>>> handles 10 concurrent clients okay, got ~5x of single thread throughput; >>>>> but in handling 20 concurrent clients, each of the 10 CPUs can only be >>>>> driven to ~10% utilization, the throughput seems even worse than single >>>>> thread. More clients can even drive it thrashing without much progressing. >>>>> >>>>> I can understand that pure STM doesn't scale well after reading [1], >>>>> and I see it suggested [7] attractive and planned future work toward that >>>>> direction. >>>>> >>>>> But I can't find certain libraries or frameworks addressing large M >>>>> over small N scenarios, [1] experimented with designated N parallelism, and >>>>> [7] is rather theoretical to my empirical needs. >>>>> >>>>> Can you direct me to some available library implementing the >>>>> methodology proposed in [7] or other ways tackling this problem? 
>>>>> >>>>> I think the most difficult one is that a transaction should commit >>>>> with global indices (with possibly unique constraints) atomically updated, >>>>> and rollback with any violation of constraints, i.e. transactions have to >>>>> cover global states like indices. Other problems seem more trivial than >>>>> this. >>>>> >>>>> Specifically, [7] states: >>>>> >>>>> > It must be emphasized that all of the mechanisms we deploy >>>>> originate, in one form or another, in the database literature from the 70s >>>>> and 80s. Our contribution is to adapt these techniques to software >>>>> transactional memory, providing more effective solutions to important STM >>>>> problems than prior proposals. >>>>> >>>>> I wonder any STM based library has simplified those techniques to be >>>>> composed right away? I don't really want to implement those mechanisms by >>>>> myself, rebuilding many wheels from scratch. >>>>> >>>>> Best regards, >>>>> Compl >>>>> >>>>> >>>>> [1] Comparing the performance of concurrent linked-list >>>>> implementations in Haskell >>>>> https://simonmar.github.io/bib/papers/concurrent-data.pdf >>>>> >>>>> [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology >>>>> for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages >>>>> 207–216. ACM Press, 2008. >>>>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >>>>> >>>>> _______________________________________________ >>>>> Haskell-Cafe mailing list >>>>> To (un)subscribe, modify options or view archives go to: >>>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>>> Only members subscribed via the mailman list are allowed to post. >>>> >>>> >>>> >>>> -- >>>> Chris Allen >>>> Currently working on http://haskellbook.com >>>> >>>> _______________________________________________ >>>> Haskell-Cafe mailing list >>>> To (un)subscribe, modify options or view archives go to: >>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>> Only members subscribed via the mailman list are allowed to post. >>> >>> > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to:http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From johannes.waldmann at htwk-leipzig.de Sat Jul 25 18:17:04 2020 From: johannes.waldmann at htwk-leipzig.de (Johannes Waldmann) Date: Sat, 25 Jul 2020 20:17:04 +0200 Subject: [Haskell-cafe] how to get live info on thread scheduling? Message-ID: <6a7d0c8f-d7da-f45e-7082-48123fccdbcd@htwk-leipzig.de> Dear Cafe, With Control.Concurrent, I would want to know, for a ThreadId, the total (up to now) (wallclock) time it was running/paused by the RTS scheduler. I guess that the scheduler's actions can be written to the eventlog for later analysis. Can I access them from inside a running program? I guess this is https://gitlab.haskell.org/ghc/ghc/-/wikis/event-log/live-monitoring , what is the status? - J. 
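This does not answer the live, in-process part of the question, but the offline workflow already works today: emit user events around the regions of interest, run with the eventlog enabled, and read the per-thread RunThread/StopThread spans afterwards in ThreadScope or with the ghc-events package. A small sketch (mine), using only `Debug.Trace` from base:

```
import Control.Concurrent (forkIO, myThreadId, threadDelay)
import Control.Monad (forM_)
import Debug.Trace (traceEventIO, traceMarkerIO)

worker :: Int -> IO ()
worker n = do
  tid <- myThreadId
  traceEventIO ("worker " ++ show n ++ " start on " ++ show tid)
  threadDelay (n * 100000)          -- stand-in for real work
  traceEventIO ("worker " ++ show n ++ " done")

main :: IO ()
main = do
  traceMarkerIO "spawning workers"
  forM_ [1 .. 4] (forkIO . worker)
  threadDelay 1000000

-- Build with `-eventlog -threaded -rtsopts`, run with `+RTS -l -N`,
-- then open the resulting .eventlog file in ThreadScope.
```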
From compl.yue at icloud.com Mon Jul 27 06:58:04 2020 From: compl.yue at icloud.com (YueCompl) Date: Mon, 27 Jul 2020 14:58:04 +0800 Subject: [Haskell-cafe] How to understand the 320-byte heap footprint of UUID ? Message-ID: <64E8B76F-DD77-4495-8920-F79D1664152C@icloud.com> Hello Cafe, I'm about to introduce UUID into my code, and see https://github.com/haskell-hvr/uuid/issues/24 stating: > Currently, UUID is represented as data UUID = UUID {-# UNPACK #-} !Word32 {-# UNPACK #-} !Word32 {-# UNPACK #-} !Word32 {-# UNPACK #-} !Word32 > However, this suboptimal for 64bit archs (where GHC currently stores this a 320-byte Heap object); ... According to https://wiki.haskell.org/GHC/Memory_Footprint I can understand each evaluated `Word32` on 64-bit hardware can take 2 words - 16 bytes, and given they are unpacked and strict, I think one whole UUID record should just take 64 bytes plus a few words, which is far less than 320 bytes. So how comes the 320 bytes? Thanks with regards, Compl -------------- next part -------------- An HTML attachment was scrubbed... URL: From oleg.grenrus at iki.fi Mon Jul 27 16:21:58 2020 From: oleg.grenrus at iki.fi (Oleg Grenrus) Date: Mon, 27 Jul 2020 19:21:58 +0300 Subject: [Haskell-cafe] How to understand the 320-byte heap footprint of UUID ? In-Reply-To: <64E8B76F-DD77-4495-8920-F79D1664152C@icloud.com> References: <64E8B76F-DD77-4495-8920-F79D1664152C@icloud.com> Message-ID: <252ca95c-815b-c323-b128-4ea35b9202aa@iki.fi> TL;DR bits, not bytes. It meant to say 320 bit. 4 * 64 (Each Word32 is stored as Word64) + one 64bit header. 5 * 64 = 320. It could be just 3 * 64 = 192. - Oleg On 27.7.2020 9.58, YueCompl via Haskell-Cafe wrote: > Hello Cafe, > > I'm about to introduce UUID into my code, and > see https://github.com/haskell-hvr/uuid/issues/24 stating: > > > Currently, |UUID| is represented as > data UUID = UUID > {-# UNPACK #-} !Word32 > {-# UNPACK #-} !Word32 > {-# UNPACK #-} !Word32 > {-# UNPACK #-} !Word32 > > > However, this suboptimal for 64bit archs (where GHC currently stores > this a 320-byte Heap object); ... > > > According to https://wiki.haskell.org/GHC/Memory_Footprint I can > understand each evaluated `Word32` on 64-bit hardware can take 2 words > - 16 bytes, and given they are unpacked and strict, I think one whole > UUID record should just take 64 bytes plus a few words, which is far > less than 320 bytes. So how comes the 320 bytes? > > Thanks with regards, > Compl > > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From compl.yue at gmail.com Tue Jul 28 05:51:11 2020 From: compl.yue at gmail.com (Compl Yue) Date: Tue, 28 Jul 2020 13:51:11 +0800 Subject: [Haskell-cafe] How to understand the 320-byte heap footprint of UUID ? In-Reply-To: <252ca95c-815b-c323-b128-4ea35b9202aa@iki.fi> References: <64E8B76F-DD77-4495-8920-F79D1664152C@icloud.com> <252ca95c-815b-c323-b128-4ea35b9202aa@iki.fi> Message-ID: <48A3A443-F006-42E6-A59A-6502F6B81AE6@gmail.com> I see, thanks! It's a relief, that huge overhead (as I wrongly perceived) really made me uncomfortable. > On 2020-07-28, at 00:21, Oleg Grenrus wrote: > > TL;DR bits, not bytes. > > It meant to say 320 bit. > > 4 * 64 (Each Word32 is stored as Word64) + one 64bit header. > > 5 * 64 = 320. 
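Spelling that arithmetic out against the declaration (my annotation, not text from the linked issue): on a 64-bit GHC an `UNPACK`ed `Word32` field still occupies a full machine word, so the object is a header word plus four payload words.

```
import Data.Word (Word32)

-- 64-bit heap layout, one machine word per line:
--   word 0: object header
--   word 1: field 1 (Word32, padded to a full word)
--   word 2: field 2
--   word 3: field 3
--   word 4: field 4
-- 5 words * 64 bit = 320 bit, i.e. 40 bytes per UUID heap object.
data UUID = UUID
  {-# UNPACK #-} !Word32
  {-# UNPACK #-} !Word32
  {-# UNPACK #-} !Word32
  {-# UNPACK #-} !Word32
```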
> > It could be just 3 * 64 = 192. > > - Oleg > > On 27.7.2020 9.58, YueCompl via Haskell-Cafe wrote: >> Hello Cafe, >> >> I'm about to introduce UUID into my code, and see https://github.com/haskell-hvr/uuid/issues/24 stating: >> >> > Currently, UUID is represented as >> data UUID = UUID >> {-# UNPACK #-} !Word32 >> {-# UNPACK #-} !Word32 >> {-# UNPACK #-} !Word32 >> {-# UNPACK #-} !Word32 >> > However, this suboptimal for 64bit archs (where GHC currently stores this a 320-byte Heap object); ... >> >> >> According to https://wiki.haskell.org/GHC/Memory_Footprint I can understand each evaluated `Word32` on 64-bit hardware can take 2 words - 16 bytes, and given they are unpacked and strict, I think one whole UUID record should just take 64 bytes plus a few words, which is far less than 320 bytes. So how comes the 320 bytes? >> >> Thanks with regards, >> Compl >> >> >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From leuschner at cp-med.com Tue Jul 28 09:23:48 2020 From: leuschner at cp-med.com (David Leuschner) Date: Tue, 28 Jul 2020 11:23:48 +0200 Subject: [Haskell-cafe] Job: Medical Solutions with Haskell in Freiburg (German speaker required) Message-ID: Dear Haskellers, our team is very happily and successfully using Haskell in production for over 10 years. We're currently hiring german-speaking/learning, full-time, on-site Haskell developers to work on user-friendly web- and mobile applications for doctors, nurses, patients and all other people involved. We have a very experienced team and we're looking for new-comers that are eager to learn as well as we're looking for experienced Haskell developers. We'd love to hear from you! Please look at our full announcement on Reddit: https://www.reddit.com/r/haskell/comments/hy8tuv/medical_solutions_with_haskelltypescript_in/ Kind regards, David -- David Leuschner Entwicklungsleitung Lohmann & Birkner Software Solutions GmbH Alt-Reinickendorf 25 13407 Berlin Email: leuschner at cp-med.com Web: http://www.checkpad.de Lohmann und Birkner Software Solutions GmbH Geschaeftsfuehrer: Dr. Ruediger Lohmann Handelsregister: Amtsgericht Berlin-Charlottenburg HRB 130806 B -------------- next part -------------- An HTML attachment was scrubbed... URL: From james.faure at epitech.eu Tue Jul 28 19:43:21 2020 From: james.faure at epitech.eu (james faure) Date: Tue, 28 Jul 2020 19:43:21 +0000 Subject: [Haskell-cafe] Subtyping CoC Message-ID: Hello, I am working on a subtyping calculus of constructions: https://github.com/jfaure/Nimzo, based on Algebraic Subtyping [1] The goal is to leverage the synthesis of subtyping with the CoC for general purpose programming, both for the usual correctness guarantees, but additionally exceptional type inference and powerful optimizations by using subtyping relations on dependent types. 
I spent the last year researching the theory and practice for this, and with the compiler at now just over 3000 lines of Haskell, I feel like I am no longer advancing as quickly as I would like, so if anyone is interested in helping create this language of the future, please get in touch ! [1]: https://www.cs.tufts.edu/~nr/cs257/archive/stephen-dolan/thesis.pdf James Faure (Discord: J4#0303) -------------- next part -------------- An HTML attachment was scrubbed... URL: From carter.schonwald at gmail.com Tue Jul 28 21:09:06 2020 From: carter.schonwald at gmail.com (Carter Schonwald) Date: Tue, 28 Jul 2020 17:09:06 -0400 Subject: [Haskell-cafe] Subtyping CoC In-Reply-To: References: Message-ID: Cool! Looks like there’s a lot of interesting ideas here. Thanks for sharing with the community! On Tue, Jul 28, 2020 at 3:44 PM james faure wrote: > Hello, > I am working on a subtyping calculus of constructions: > https://github.com/jfaure/Nimzo, based on Algebraic Subtyping [1] > > The goal is to leverage the synthesis of subtyping with the CoC for > general purpose programming, both for the usual correctness guarantees, but > additionally exceptional type inference and powerful optimizations by using > subtyping relations on dependent types. > > I spent the last year researching the theory and practice for this, and > with the compiler at now just over 3000 lines of Haskell, I feel like I am > no longer advancing as quickly as I would like, so if anyone is interested > in helping create this language of the future, please get in touch ! > > [1]: https://www.cs.tufts.edu/~nr/cs257/archive/stephen-dolan/thesis.pdf > > James Faure (Discord: J4#0303) > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From compl.yue at icloud.com Wed Jul 29 14:23:49 2020 From: compl.yue at icloud.com (YueCompl) Date: Wed, 29 Jul 2020 22:23:49 +0800 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> Message-ID: Hi Cafe and Ryan, I tried Map/Set from stm-containers and TSkipList (added range scan api against its internal data structure) from http://hackage.haskell.org/package/tskiplist , with them I've got quite improved at scalability on concurrency. But unfortunately then I hit another wall at single thread scalability over working memory size, I suspect it's because massively more TVars (those being pointers per se) are introduced by those "contention-free" data structures, they need to mutate separate pointers concurrently in avoiding contentions anyway, but such pointer-intensive heap seems imposing extraordinary pressure to GHC's garbage collector, that GC will dominate CPU utilization with poor business progress. 
For example in my test, I use `+RTS -H2g` for the Haskell server process, so GC is not triggered until after a while, then spin off 3 Python client to insert new records concurrently, in the first stage each Python process happily taking ~90% CPU filling (through local mmap) the arrays allocated from the server and logs of success scroll quickly, while the server process utilizes only 30~40% CPU to serve those 3 clients (insert meta data records into unique indices merely); then the client processes' CPU utilization drop drastically once Haskell server process' private memory reached around 2gb, i.e. GC started engaging, the server process's CPU utilization quickly approaches ~300%, while all client processes' drop to 0% for most of the time, and occasionally burst a tiny while with some log output showing progress. And I disable parallel GC lately, enabling parallel GC only makes it worse. If I comment out the code updating the indices (those creating many TVars), the overall throughput only drop slowly as more data are inserted, the parallelism feels steady even after the server process' private memory takes several GBs. I didn't expect this, but appears to me that GC of GHC is really not good at handling massive number of pointers in the heap, while those pointers are essential to reduce contention (and maybe expensive data copying too) at heavy parallelism/concurrency. Btw I tried `+RTS -xn` with GHC 8.10.1 too, no obvious different behavior compared to 8.8.3; and also tried tweaking GC related RTS options a bit, including increasing -G up to 10, no much difference too. I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime ... Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html in searching about the symptoms, and don't feel likely to convert my DB managed data into immutable types thus to fit into Compact Regions, not quite likely a live in-mem database instance can do. So seems there are good reasons no successful DBMS, at least in-memory ones have been written in Haskell. Best regards, Compl > On 2020-07-25, at 22:07, Ryan Yates wrote: > > Unfortunately my STM benchmarks are rather disorganized. The most relevant paper using them is: > > Leveraging hardware TM in Haskell (PPoPP '19) > https://dl.acm.org/doi/10.1145/3293883.3295711 > > Or my thesis: > https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931 > > The PPoPP benchmarks are on a branch (or the releases tab on github): > https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src > > > All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited. > > Ryan > > > On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe > wrote: > Dear Cafe, > > As Chris Allen has suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of traditional HashMap for stm tx processing, under heavy concurrency, yet still with automatic parallelism as GHC implemented them. Then I realized that in addition to hash map (used to implement dicts and scopes), I also need to find a TreeMap replacement data structure to implement the db index. 
I've been focusing on the uniqueness constraint aspect, but it's still an index, needs to provide range scan api for db clients, so hash map is not sufficient for the index. > > I see Ryan shared the code benchmarking RBTree with stm in mind: > > https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput > > https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree > But can't find conclusion or interpretation of that benchmark suite. And here's a followup question: > > > > Where are some STM contention optimized data structures, that having keys ordered, with sub-range traversing api ? > > (of course production ready libraries most desirable) > > > > Thanks with regards, > > Compl > > > > On 2020/7/25 下午2:04, Compl Yue via Haskell-Cafe wrote: >> Shame on me for I have neither experienced with `perf`, I'd learn these essential tools soon to put them into good use. >> >> It's great to learn about how `orElse` actually works, I did get confused why there are so little retries captured, and now I know. So that little trick should definitely be removed before going production, as it does no much useful things at excessive cost. I put it there to help me understand internal working of stm, now I get even better knowledge ;-) >> >> I think a debugger will trap every single abort, isn't it annoying when many aborts would occur? If I'd like to count the number of aborts, ideally accounted per service endpoints, time periods, source modules etc. there some tricks for that? >> >> Thanks with best regards, >> >> Compl >> >> >> >> On 2020/7/25 上午2:02, Ryan Yates wrote: >>> To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance, >>> changing the size of heap objects can drastically change cache performance and completely different behavior can show up. >>> >>> [^1]: https://en.wikipedia.org/wiki/Perf_(Linux) >>> >>> The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... ) >>> >>> [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 >>> >>> All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence. >>> >>> The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. 
You can observe these aborts with a debugger by looking for execution of this line: >>> >>> https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 >>> >>> Ryan >>> >>> >>> >>> On Fri, Jul 24, 2020 at 12:35 PM Compl Yue > wrote: >>> I'm not familiar with profiling GHC yet, may need more time to get myself proficient with it. >>> >>> And a bit more details of my test workload for diagnostic: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files, under a data dir within a shared filesystem, then mmap those data files and fill in actual array data. So the db server don't have much computation to perform, but puts the data file path into a global index, which at the same validates its uniqueness. As there are many client processes trying to insert one meta data record concurrently, with my naive implementation, the global index's TVar will almost always in locked state by one client after another, from a queue never fall empty. >>> >>> So if `readTVar` should spinning waiting, I doubt the threads should actually make high CPU utilization, because at any instant of time, all threads except the committing one will be doing that one thing. >>> >>> And I have something in my code to track STM retry like this: >>> >>> ``` >>> >>> >>> -- blocking wait not expected, track stm retries explicitly >>> trackSTM :: Int -> IO (Either () a) >>> trackSTM !rtc = do >>> >>> when -- todo increase the threshold of reporting? >>> (rtc > 0) $ do >>> -- trace out the retries so the end users can be aware of them >>> tid <- myThreadId >>> trace >>> ( "🔙\n" >>> <> show callCtx >>> <> "🌀 " >>> <> show tid >>> <> " stm retry #" >>> <> show rtc >>> ) >>> $ return () >>> >>> atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case >>> Nothing -> -- stm failed, do a tracked retry >>> trackSTM (rtc + 1) >>> Just ... -> ... >>> >>> ``` >>> >>> No such trace msg fires during my test, neither in single thread run, nor in runs with pressure. I'm sure this tracing mechanism works, as I can see such traces fire, in case e.g. posting a TMVar to a TQueue for some other thread to fill it, then read the result out, if these 2 ops are composed into a single tx, then of course it's infinite retry loop, and a sequence of such msgs are logged with ever increasing rtc #. >>> >>> So I believe no retry has ever been triggered. >>> >>> What can going on there? >>> >>> >>> >>> On 2020/7/24 下午11:46, Ryan Yates wrote: >>>> > Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed? >>>> >>>> Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while commiting, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set and subsequent commits will wake it up if its write set overlaps. >>>> >>>> I don't think any of these things would explain low CPU utilization. 
You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS. >>>> >>>> Ryan >>>> >>>> >>>> On Fri, Jul 24, 2020 at 11:22 AM Compl Yue > wrote: >>>> Thanks very much for the insightful information Ryan! I'm glad my suspect was wrong about the Haskell scheduler: >>>> >>>> > The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released. >>>> >>>> So best effort had already been made in GHC and I just need to cooperate better with its design. Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed? >>>> >>>> Anyway, I see light with better data structures to improve my situation, let me try them and report back. Actually I later changed `TVar (HaskMap k v)` to be `TVar (HashMap k Int)` where the `Int` being array index into `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order preservation semantic of dict entries (like that in Python 3.7+), then it's very hopeful to incorporate stm-containers' Map or ttrie to approach free of contention. >>>> Thanks with regards, >>>> >>>> Compl >>>> >>>> >>>> >>>> On 2020/7/24 下午10:03, Ryan Yates wrote: >>>>> Hi Compl, >>>>> >>>>> Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute then the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other thread, but when it is rescheduled, will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn. >>>>> >>>>> The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released. >>>>> >>>>> To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars. >>>>> >>>>> There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. 
Forcing values at the right time can make a big difference. >>>>> >>>>> Ryan >>>>> >>>>> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe > wrote: >>>>> Thanks Chris, I confess I didn't pay enough attention to STM specialized container libraries by far, I skimmed through the description of stm-containers and ttrie, and feel they would definitely improve my code's performance in case I limit the server's parallelism within hardware capabilities. That may because I'm still prototyping the api and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, only if at low contention (surely there're plenty of CPU cycles to be optimized out in next steps). I model my data after graph model, so most data, even most indices are localized to nodes and edges, those can be manipulated without conflict, that's why I assumed I have a low contention use case since the very beginning - until I found there are still (though minor) needs for global indices to guarantee global uniqueness, I feel faithful with stm-containers/ttrie to implement a more scalable global index data structure, thanks for hinting me. >>>>> >>>>> So an evident solution comes into my mind now, is to run the server with a pool of tx processing threads, matching number of CPU cores, client RPC requests then get queued to be executed in some thread from the pool. But I'm really fond of the mechanism of M:N scheduler which solves massive/dynamic concurrency so elegantly. I had some good result with Go in this regard, and see GHC at par in doing this, I don't want to give up this enjoyable machinery. >>>>> >>>>> But looked at the stm implementation in GHC, it seems written TVars are exclusively locked during commit of a tx, I suspect this is the culprit when there're large M lightweight threads scheduled upon a small N hardware capabilities, that is when a lightweight thread yield control during an stm transaction commit, the TVars it locked will stay so until it's scheduled again (and again) till it can finish the commit. This way, descheduled threads could hold live threads from progressing. I haven't gone into more details there, but wonder if there can be some improvement for GHC RTS to keep an stm committing thread from descheduled, but seemingly that may impose more starvation potential; or stm can be improved to have its TVar locks preemptable when the owner trec/thread is in descheduled state? Neither should be easy but I'd really love massive lightweight threads doing STM practically well. >>>>> >>>>> Best regards, >>>>> >>>>> Compl >>>>> >>>>> >>>>> >>>>> On 2020/7/24 上午12:57, Christopher Allen wrote: >>>>>> It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." You give up fairness when you use STM instead of MVars or equivalent structures. That means a long running transaction might get stampeded by many small ones invalidating it over and over. The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads. 
>>>>>> >>>>>> The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose. >>>>>> >>>>>> e.g. https://hackage.haskell.org/package/stm-containers >>>>>> https://hackage.haskell.org/package/ttrie >>>>>> >>>>>> It also sounds a bit like your question bumps into Amdahl's Law a bit. >>>>>> >>>>>> All else fails, stop using STM and find something more tuned to your problem space. >>>>>> >>>>>> Hope this helps, >>>>>> Chris Allen >>>>>> >>>>>> >>>>>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe > wrote: >>>>>> Hello Cafe, >>>>>> >>>>>> I'm working on an in-memory database, in Client/Server mode I just let each connected client submit remote procedure call running in its dedicated lightweight thread, modifying TVars in RAM per its business needs, then in case many clients connected concurrently and trying to insert new data, if they are triggering global index (some TVar) update, the throughput would drop drastically. I reduced the shared state to a simple int counter by TVar, got same symptom. While the parallelism feels okay when there's no hot TVar conflicting, or M is not much greater than N. >>>>>> >>>>>> As an empirical test workload, I have a `+RTS -N10` server process, it handles 10 concurrent clients okay, got ~5x of single thread throughput; but in handling 20 concurrent clients, each of the 10 CPUs can only be driven to ~10% utilization, the throughput seems even worse than single thread. More clients can even drive it thrashing without much progressing. >>>>>> >>>>>> I can understand that pure STM doesn't scale well after reading [1], and I see it suggested [7] attractive and planned future work toward that direction. >>>>>> >>>>>> But I can't find certain libraries or frameworks addressing large M over small N scenarios, [1] experimented with designated N parallelism, and [7] is rather theoretical to my empirical needs. >>>>>> >>>>>> Can you direct me to some available library implementing the methodology proposed in [7] or other ways tackling this problem? >>>>>> >>>>>> I think the most difficult one is that a transaction should commit with global indices (with possibly unique constraints) atomically updated, and rollback with any violation of constraints, i.e. transactions have to cover global states like indices. Other problems seem more trivial than this. >>>>>> >>>>>> Specifically, [7] states: >>>>>> >>>>>> > It must be emphasized that all of the mechanisms we deploy originate, in one form or another, in the database literature from the 70s and 80s. Our contribution is to adapt these techniques to software transactional memory, providing more effective solutions to important STM problems than prior proposals. >>>>>> >>>>>> I wonder any STM based library has simplified those techniques to be composed right away? I don't really want to implement those mechanisms by myself, rebuilding many wheels from scratch. >>>>>> >>>>>> Best regards, >>>>>> Compl >>>>>> >>>>>> >>>>>> [1] Comparing the performance of concurrent linked-list implementations in Haskell >>>>>> https://simonmar.github.io/bib/papers/concurrent-data.pdf >>>>>> >>>>>> [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In Proc. 
of PPoPP ’08, pages 207–216. ACM Press, 2008. >>>>>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >>>>>> >>>>>> _______________________________________________ >>>>>> Haskell-Cafe mailing list >>>>>> To (un)subscribe, modify options or view archives go to: >>>>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>>>> Only members subscribed via the mailman list are allowed to post. >>>>>> >>>>>> >>>>>> -- >>>>>> Chris Allen >>>>>> Currently working on http://haskellbook.com _______________________________________________ >>>>> Haskell-Cafe mailing list >>>>> To (un)subscribe, modify options or view archives go to: >>>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>>> Only members subscribed via the mailman list are allowed to post. >> >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jo at durchholz.org Wed Jul 29 17:37:10 2020 From: jo at durchholz.org (Joachim Durchholz) Date: Wed, 29 Jul 2020 19:37:10 +0200 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: <5268f36c-a71b-7ed7-fcb2-c2b4d146ec77@icloud.com> References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <5268f36c-a71b-7ed7-fcb2-c2b4d146ec77@icloud.com> Message-ID: <31d90d33-2a21-18ee-a358-48c330efd184@durchholz.org> Am 24.07.20 um 17:48 schrieb Compl Yue via Haskell-Cafe: > The global > counter is only used to reveal the technical traits of my situation, > it's of course not a requirement of my business needs. Given the other discussion here, I'm not sure if it's really relevant to your situation, but that stats counter could indeed be causing lock contention. Which means your numbers may be skewed, and you may be drawing wrong conclusions - which is actually commonplace in benchmarking. Two things you could do: 1) Leave the global counter out and see whether the running times vary. There's still a chance that while the overall running time is the same, the code might now be hitting a different bottleneck. Or maybe the counter isn't the bottleneck but it would become one once you have done the other optimizations. So that experiment is cheap but gives you no more than a preliminary result. 2) Let each thread collect its own statistics, and coalesce into the global counter only once in a while. (Vary the "once in a while" determination and see whether it changes anything.) Just my 2c from the sideline. 
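For what it's worth, here is a rough sketch of what I mean by (2) — hypothetical names, untested, just to show the shape of it: each worker keeps a purely local count and only touches the shared TVar once per batch.

```
{-# LANGUAGE BangPatterns #-}
import Control.Concurrent.STM
import Control.Monad (when)

-- How many local increments to accumulate before touching the
-- shared counter. Tune this; it trades freshness for contention.
flushEvery :: Int
flushEvery = 1000

-- Each worker runs its own copy of this loop. 'step' does one unit
-- of real work and reports whether there is more to do.
workerLoop :: TVar Int -> IO Bool -> IO ()
workerLoop global step = go 0
  where
    go :: Int -> IO ()
    go !local = do
      more <- step
      let local' = local + 1
      if local' >= flushEvery || not more
        then do
          -- Coalesce the batched count into the global counter.
          atomically $ modifyTVar' global (+ local')
          when more (go 0)
        else go local'
```

If the throughput numbers change noticeably with batching like this, the counter really was part of the bottleneck.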
Regards, Jo From charukiewicz at protonmail.com Wed Jul 29 17:38:16 2020 From: charukiewicz at protonmail.com (Christian Charukiewicz) Date: Wed, 29 Jul 2020 17:38:16 +0000 Subject: [Haskell-cafe] [ANN] isbn - ISBN Validation and Manipulation Message-ID: Hello Haskell Cafe, I wanted to share my first ever Haskell package: isbn https://hackage.haskell.org/package/isbn The package is motivated by my need to validate ISBNs (the unique identifier associated with every book published since 1970) in a Haskell application I am building. I published isbn as a back in May but yesterday I made some improvements the API and I think it is now ready to share as v1.1.0.0. I have been using Haskell commercially for a few years, and have made several contributions to various packages, but as mentioned, this is my first time authoring and publishing a package. If anyone has any feedback, I would be happy to hear it. Thank you, Christian Charukiewicz From fryguybob at gmail.com Wed Jul 29 19:40:58 2020 From: fryguybob at gmail.com (Ryan Yates) Date: Wed, 29 Jul 2020 15:40:58 -0400 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> Message-ID: Hi Compl, There is a lot of overhead with TVars. My thesis work addresses this by incorporating mutable constructor fields with STM. I would like to get all that into GHC as soon as I can :D. I haven't looked closely at the `tskiplist` package, I'll take a look and see if I see any potential issues. There was some recent work on concurrent B-tree that may be interesting to try. Ryan On Wed, Jul 29, 2020 at 10:24 AM YueCompl wrote: > Hi Cafe and Ryan, > > I tried Map/Set from stm-containers and TSkipList (added range scan api > against its internal data structure) from > http://hackage.haskell.org/package/tskiplist , with them I've got quite > improved at scalability on concurrency. > > But unfortunately then I hit another wall at single thread scalability > over working memory size, I suspect it's because massively more TVars > (those being pointers per se) are introduced by those "contention-free" > data structures, they need to mutate separate pointers concurrently in > avoiding contentions anyway, but such pointer-intensive heap seems imposing > extraordinary pressure to GHC's garbage collector, that GC will dominate > CPU utilization with poor business progress. > > For example in my test, I use `+RTS -H2g` for the Haskell server process, > so GC is not triggered until after a while, then spin off 3 Python client > to insert new records concurrently, in the first stage each Python process > happily taking ~90% CPU filling (through local mmap) the arrays allocated > from the server and logs of success scroll quickly, while the server > process utilizes only 30~40% CPU to serve those 3 clients (insert meta data > records into unique indices merely); then the client processes' CPU > utilization drop drastically once Haskell server process' private memory > reached around 2gb, i.e. 
GC started engaging, the server process's CPU > utilization quickly approaches ~300%, while all client processes' drop to > 0% for most of the time, and occasionally burst a tiny while with some log > output showing progress. And I disable parallel GC lately, enabling > parallel GC only makes it worse. > > If I comment out the code updating the indices (those creating many > TVars), the overall throughput only drop slowly as more data are inserted, > the parallelism feels steady even after the server process' private memory > takes several GBs. > > I didn't expect this, but appears to me that GC of GHC is really not good > at handling massive number of pointers in the heap, while those pointers > are essential to reduce contention (and maybe expensive data copying too) > at heavy parallelism/concurrency. > > Btw I tried `+RTS -xn` with GHC 8.10.1 too, no obvious different behavior > compared to 8.8.3; and also tried tweaking GC related RTS options a bit, > including increasing -G up to 10, no much difference too. > > I feel hopeless at the moment, wondering if I'll have to rewrite this > in-memory db in Go/Rust or some other runtime ... > > Btw I read > https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html in > searching about the symptoms, and don't feel likely to convert my DB > managed data into immutable types thus to fit into Compact Regions, not > quite likely a live in-mem database instance can do. > > So seems there are good reasons no successful DBMS, at least in-memory > ones have been written in Haskell. > > Best regards, > Compl > > > On 2020-07-25, at 22:07, Ryan Yates wrote: > > Unfortunately my STM benchmarks are rather disorganized. The most > relevant paper using them is: > > Leveraging hardware TM in Haskell (PPoPP '19) > https://dl.acm.org/doi/10.1145/3293883.3295711 > > Or my thesis: > > https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931 > > > The PPoPP benchmarks are on a branch (or the releases tab on github): > > https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src > > > > All that to say, without an implementation of mutable constructor fields > (which I'm working on getting into GHC) the scaling is limited. > > Ryan > > > On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe < > haskell-cafe at haskell.org> wrote: > >> Dear Cafe, >> >> As Chris Allen has suggested, I learned that >> https://hackage.haskell.org/package/stm-containers and >> https://hackage.haskell.org/package/ttrie can help a lot when used in >> place of traditional HashMap for stm tx processing, under heavy >> concurrency, yet still with automatic parallelism as GHC implemented them. >> Then I realized that in addition to hash map (used to implement dicts and >> scopes), I also need to find a TreeMap replacement data structure to >> implement the db index. I've been focusing on the uniqueness constraint >> aspect, but it's still an index, needs to provide range scan api for db >> clients, so hash map is not sufficient for the index. >> >> I see Ryan shared the code benchmarking RBTree with stm in mind: >> >> >> https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput >> >> >> https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree >> >> But can't find conclusion or interpretation of that benchmark suite. 
And >> here's a followup question: >> >> >> Where are some STM contention optimized data structures, that having keys >> ordered, with sub-range traversing api ? >> >> (of course production ready libraries most desirable) >> >> >> Thanks with regards, >> >> Compl >> >> >> On 2020/7/25 下午2:04, Compl Yue via Haskell-Cafe wrote: >> >> Shame on me for I have neither experienced with `perf`, I'd learn these >> essential tools soon to put them into good use. >> >> It's great to learn about how `orElse` actually works, I did get confused >> why there are so little retries captured, and now I know. So that little >> trick should definitely be removed before going production, as it does no >> much useful things at excessive cost. I put it there to help me understand >> internal working of stm, now I get even better knowledge ;-) >> >> I think a debugger will trap every single abort, isn't it annoying when >> many aborts would occur? If I'd like to count the number of aborts, ideally >> accounted per service endpoints, time periods, source modules etc. there >> some tricks for that? >> >> Thanks with best regards, >> >> Compl >> >> >> On 2020/7/25 上午2:02, Ryan Yates wrote: >> >> To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based >> profiling can do a good job with concurrent and parallel programs where >> other methods are problematic. For instance, >> changing the size of heap objects can drastically change cache >> performance and completely different behavior can show up. >> >> [^1]: https://en.wikipedia.org/wiki/Perf_(Linux) >> >> The spinning in `readTVar` should always be very short and it typically >> shows up as intensive CPU use, though it may not be high energy use with >> `pause` in the loop on x86 (looks like we don't have it [^2], I thought we >> did, but maybe that was only in some of my code... ) >> >> [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 >> >> All that to say, I doubt that you are spending much time spinning (but it >> would certainly be interesting to know if you are! You would see `perf` >> attribute a large amount of time to `read_current_value`). The amount of >> code to execute for commit (the time when locks are held) is always much >> shorter than it takes to execute the transaction body. As you add more >> conflicting threads this gets worse of course as commits sequence. >> >> The code you have will count commits of executions of `retry`. Note that >> `retry` is a user level idea, that is, you are counting user level >> *explicit* retries. This is different from a transaction failing to commit >> and starting again. These are invisible to the user. Also using your >> trace will convert `retry` from the efficient wake on write implementation, >> to an active retry that will always attempt again. We don't have cheap >> logging of transaction aborts in GHC, but I have built such logging in my >> work. You can observe these aborts with a debugger by looking for >> execution of this line: >> >> https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 >> >> Ryan >> >> >> >> On Fri, Jul 24, 2020 at 12:35 PM Compl Yue wrote: >> >>> I'm not familiar with profiling GHC yet, may need more time to get >>> myself proficient with it. >>> >>> And a bit more details of my test workload for diagnostic: the db >>> clients are Python processes from a cluster of worker nodes, consulting the >>> db server to register some path for data files, under a data dir within a >>> shared filesystem, then mmap those data files and fill in actual array >>> data. 
So the db server don't have much computation to perform, but puts the >>> data file path into a global index, which at the same validates its >>> uniqueness. As there are many client processes trying to insert one meta >>> data record concurrently, with my naive implementation, the global index's >>> TVar will almost always in locked state by one client after another, from a >>> queue never fall empty. >>> >>> So if `readTVar` should spinning waiting, I doubt the threads should >>> actually make high CPU utilization, because at any instant of time, all >>> threads except the committing one will be doing that one thing. >>> >>> And I have something in my code to track STM retry like this: >>> >>> ``` >>> -- blocking wait not expected, track stm retries explicitly >>> trackSTM :: Int -> IO (Either () a) >>> trackSTM !rtc = do >>> when -- todo increase the threshold of reporting? >>> (rtc > 0) $ do >>> -- trace out the retries so the end users can be aware of them >>> tid <- myThreadId >>> trace >>> ( "🔙\n" >>> <> show callCtx >>> <> "🌀 " >>> <> show tid >>> <> " stm retry #" >>> <> show rtc >>> ) >>> $ return () >>> atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case >>> Nothing -> -- stm failed, do a tracked retry >>> trackSTM (rtc + 1) >>> Just ... -> ... >>> >>> ``` >>> >>> No such trace msg fires during my test, neither in single thread run, >>> nor in runs with pressure. I'm sure this tracing mechanism works, as I can >>> see such traces fire, in case e.g. posting a TMVar to a TQueue for some >>> other thread to fill it, then read the result out, if these 2 ops are >>> composed into a single tx, then of course it's infinite retry loop, and a >>> sequence of such msgs are logged with ever increasing rtc #. >>> >>> So I believe no retry has ever been triggered. >>> >>> What can going on there? >>> >>> >>> On 2020/7/24 下午11:46, Ryan Yates wrote: >>> >>> > Then to explain the low CPU utilization (~10%), am I right to >>> understand it as that upon reading a TVar locked by another committing tx, >>> a lightweight thread will put itself into `waiting STM` and descheduled >>> state, so the CPUs can only stay idle as not so many threads are willing to >>> proceed? >>> >>> Since the commit happens in finite steps, the expectation is that the >>> lock will be released very soon. Given this when the body of a transaction >>> executes `readTVar` it spins (active CPU!) until the `TVar` is observed >>> unlocked. If a lock is observed while commiting, it immediately starts the >>> transaction again from the beginning. To get the behavior of suspending a >>> transaction you have to successfully commit a transaction that executed >>> `retry`. Then the transaction is put on the wakeup lists of its read set >>> and subsequent commits will wake it up if its write set overlaps. >>> >>> I don't think any of these things would explain low CPU utilization. >>> You could try running with `perf` and see if lots of time is spent in some >>> recognizable part of the RTS. >>> >>> Ryan >>> >>> >>> On Fri, Jul 24, 2020 at 11:22 AM Compl Yue wrote: >>> >>>> Thanks very much for the insightful information Ryan! I'm glad my >>>> suspect was wrong about the Haskell scheduler: >>>> >>>> > The Haskell capability that is committing a transaction will not >>>> yield to another Haskell thread while it is doing the commit. The OS >>>> thread may be preempted, but once commit starts the haskell scheduler is >>>> not invoked until after locks are released. 
>>>> So best effort had already been made in GHC and I just need to >>>> cooperate better with its design. Then to explain the low CPU utilization >>>> (~10%), am I right to understand it as that upon reading a TVar locked by >>>> another committing tx, a lightweight thread will put itself into `waiting >>>> STM` and descheduled state, so the CPUs can only stay idle as not so many >>>> threads are willing to proceed? >>>> >>>> Anyway, I see light with better data structures to improve my >>>> situation, let me try them and report back. Actually I later changed `TVar >>>> (HaskMap k v)` to be `TVar (HashMap k Int)` where the `Int` being array >>>> index into `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order >>>> preservation semantic of dict entries (like that in Python 3.7+), then it's >>>> very hopeful to incorporate stm-containers' Map or ttrie to approach free >>>> of contention. >>>> >>>> Thanks with regards, >>>> >>>> Compl >>>> >>>> >>>> On 2020/7/24 下午10:03, Ryan Yates wrote: >>>> >>>> Hi Compl, >>>> >>>> Having a pool of transaction processing threads can be helpful in a >>>> certain way. If the body of the transaction takes more time to execute >>>> then the Haskell thread is allowed and it yields, the suspended thread >>>> won't get in the way of other thread, but when it is rescheduled, will have >>>> a low probability of success. Even worse, it will probably not discover >>>> that it is doomed to failure until commit time. If transactions are more >>>> likely to reach commit without yielding, they are more likely to succeed. >>>> If the transactions are not conflicting, it doesn't make much difference >>>> other than cache churn. >>>> >>>> The Haskell capability that is committing a transaction will not yield >>>> to another Haskell thread while it is doing the commit. The OS thread may >>>> be preempted, but once commit starts the haskell scheduler is not invoked >>>> until after locks are released. >>>> >>>> To get good performance from STM you must pay attention to what TVars >>>> are involved in a commit. All STM systems are working under the assumption >>>> of low contention, so you want to minimize "false" conflicts (conflicts >>>> that are not essential to the computation). Something like `TVar >>>> (HashMap k v)` will work pretty well for a low thread count, but every >>>> transaction that writes to that structure will be in conflict with every >>>> other transaction that accesses it. Pushing the `TVar` into the nodes of >>>> the structure reduces the possibilities for conflict, while increasing the >>>> amount of bookkeeping STM has to do. I would like to reduce the cost of >>>> that bookkeeping using better structures, but we need to do so without >>>> harming performance in the low TVar count case. Right now it is optimized >>>> for good cache performance with a handful of TVars. >>>> >>>> There is another way to play with performance by moving work into and >>>> out of the transaction body. A transaction body that executes quickly will >>>> reach commit faster. But it may be delaying work that moves into another >>>> transaction. Forcing values at the right time can make a big difference. 
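A minimal sketch of that last idea (hypothetical names only): the expensive evaluation is forced before entering the transaction, so the STM body — and hence the commit window — stays short.

```
import Control.Concurrent.STM
import Control.DeepSeq (NFData, force)
import Control.Exception (evaluate)

-- Sketch only: pay the evaluation cost of the value *before* the
-- transaction, so the STM body just reads, checks and writes.
insertPrecomputed :: NFData v => TVar [(k, v)] -> k -> v -> IO ()
insertPrecomputed var k v = do
  v' <- evaluate (force v)        -- forced outside the transaction
  atomically $ do
    kvs <- readTVar var
    writeTVar var ((k, v') : kvs) -- short commit window
```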
>>>> >>>> Ryan >>>> >>>> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe < >>>> haskell-cafe at haskell.org> wrote: >>>> >>>>> Thanks Chris, I confess I didn't pay enough attention to STM >>>>> specialized container libraries by far, I skimmed through the description >>>>> of stm-containers and ttrie, and feel they would definitely improve my >>>>> code's performance in case I limit the server's parallelism within hardware >>>>> capabilities. That may because I'm still prototyping the api and >>>>> infrastructure for correctness, so even `TVar (HashMap k v)` performs okay >>>>> for me at the moment, only if at low contention (surely there're plenty of >>>>> CPU cycles to be optimized out in next steps). I model my data after graph >>>>> model, so most data, even most indices are localized to nodes and edges, >>>>> those can be manipulated without conflict, that's why I assumed I have a >>>>> low contention use case since the very beginning - until I found there are >>>>> still (though minor) needs for global indices to guarantee global >>>>> uniqueness, I feel faithful with stm-containers/ttrie to implement a more >>>>> scalable global index data structure, thanks for hinting me. >>>>> >>>>> So an evident solution comes into my mind now, is to run the server >>>>> with a pool of tx processing threads, matching number of CPU cores, client >>>>> RPC requests then get queued to be executed in some thread from the pool. >>>>> But I'm really fond of the mechanism of M:N scheduler which solves >>>>> massive/dynamic concurrency so elegantly. I had some good result with Go in >>>>> this regard, and see GHC at par in doing this, I don't want to give up this >>>>> enjoyable machinery. >>>>> >>>>> But looked at the stm implementation in GHC, it seems written TVars >>>>> are exclusively locked during commit of a tx, I suspect this is the culprit >>>>> when there're large M lightweight threads scheduled upon a small N hardware >>>>> capabilities, that is when a lightweight thread yield control during an stm >>>>> transaction commit, the TVars it locked will stay so until it's scheduled >>>>> again (and again) till it can finish the commit. This way, descheduled >>>>> threads could hold live threads from progressing. I haven't gone into more >>>>> details there, but wonder if there can be some improvement for GHC RTS to >>>>> keep an stm committing thread from descheduled, but seemingly that may >>>>> impose more starvation potential; or stm can be improved to have its TVar >>>>> locks preemptable when the owner trec/thread is in descheduled state? >>>>> Neither should be easy but I'd really love massive lightweight threads >>>>> doing STM practically well. >>>>> >>>>> Best regards, >>>>> >>>>> Compl >>>>> >>>>> >>>>> On 2020/7/24 上午12:57, Christopher Allen wrote: >>>>> >>>>> It seems like you know how to run practical tests for tuning thread >>>>> count and contention for throughput. Part of the reason you haven't gotten >>>>> a super clear answer is "it depends." You give up fairness when you use STM >>>>> instead of MVars or equivalent structures. That means a long running >>>>> transaction might get stampeded by many small ones invalidating it over and >>>>> over. The long-running transaction might never clear if the small >>>>> transactions keep moving the cheese. I mention this because transaction >>>>> runtime and size and count all affect throughput and latency. What might be >>>>> ideal for one pattern of work might not be ideal for another. 
Optimizing >>>>> for overall throughput might make the contention and fairness problems >>>>> worse too. I've done practical tests to optimize this in the past, both for >>>>> STM in Haskell and for RDBMS workloads. >>>>> >>>>> The next step is sometimes figuring out whether you really need a data >>>>> structure within a single STM container or if perhaps you can break up your >>>>> STM container boundaries into zones or regions that roughly map onto update >>>>> boundaries. That should make the transactions churn less. On the outside >>>>> chance you do need to touch more than one container in a transaction, well, >>>>> they compose. >>>>> >>>>> e.g. https://hackage.haskell.org/package/stm-containers >>>>> https://hackage.haskell.org/package/ttrie >>>>> >>>>> It also sounds a bit like your question bumps into Amdahl's Law a bit. >>>>> >>>>> All else fails, stop using STM and find something more tuned to your >>>>> problem space. >>>>> >>>>> Hope this helps, >>>>> Chris Allen >>>>> >>>>> >>>>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe < >>>>> haskell-cafe at haskell.org> wrote: >>>>> >>>>>> Hello Cafe, >>>>>> >>>>>> I'm working on an in-memory database, in Client/Server mode I just >>>>>> let each connected client submit remote procedure call running in its >>>>>> dedicated lightweight thread, modifying TVars in RAM per its business >>>>>> needs, then in case many clients connected concurrently and trying to >>>>>> insert new data, if they are triggering global index (some TVar) update, >>>>>> the throughput would drop drastically. I reduced the shared state to a >>>>>> simple int counter by TVar, got same symptom. While the parallelism feels >>>>>> okay when there's no hot TVar conflicting, or M is not much greater than N. >>>>>> >>>>>> As an empirical test workload, I have a `+RTS -N10` server process, >>>>>> it handles 10 concurrent clients okay, got ~5x of single thread throughput; >>>>>> but in handling 20 concurrent clients, each of the 10 CPUs can only be >>>>>> driven to ~10% utilization, the throughput seems even worse than single >>>>>> thread. More clients can even drive it thrashing without much progressing. >>>>>> >>>>>> I can understand that pure STM doesn't scale well after reading [1], >>>>>> and I see it suggested [7] attractive and planned future work toward that >>>>>> direction. >>>>>> >>>>>> But I can't find certain libraries or frameworks addressing large M >>>>>> over small N scenarios, [1] experimented with designated N parallelism, and >>>>>> [7] is rather theoretical to my empirical needs. >>>>>> >>>>>> Can you direct me to some available library implementing the >>>>>> methodology proposed in [7] or other ways tackling this problem? >>>>>> >>>>>> I think the most difficult one is that a transaction should commit >>>>>> with global indices (with possibly unique constraints) atomically updated, >>>>>> and rollback with any violation of constraints, i.e. transactions have to >>>>>> cover global states like indices. Other problems seem more trivial than >>>>>> this. >>>>>> >>>>>> Specifically, [7] states: >>>>>> >>>>>> > It must be emphasized that all of the mechanisms we deploy >>>>>> originate, in one form or another, in the database literature from the 70s >>>>>> and 80s. Our contribution is to adapt these techniques to software >>>>>> transactional memory, providing more effective solutions to important STM >>>>>> problems than prior proposals. 
>>>>>> >>>>>> I wonder any STM based library has simplified those techniques to be >>>>>> composed right away? I don't really want to implement those mechanisms by >>>>>> myself, rebuilding many wheels from scratch. >>>>>> >>>>>> Best regards, >>>>>> Compl >>>>>> >>>>>> >>>>>> [1] Comparing the performance of concurrent linked-list >>>>>> implementations in Haskell >>>>>> https://simonmar.github.io/bib/papers/concurrent-data.pdf >>>>>> >>>>>> [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology >>>>>> for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages >>>>>> 207–216. ACM Press, 2008. >>>>>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >>>>>> >>>>>> _______________________________________________ >>>>>> Haskell-Cafe mailing list >>>>>> To (un)subscribe, modify options or view archives go to: >>>>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>>>> Only members subscribed via the mailman list are allowed to post. >>>>> >>>>> >>>>> >>>>> -- >>>>> Chris Allen >>>>> Currently working on http://haskellbook.com >>>>> >>>>> _______________________________________________ >>>>> Haskell-Cafe mailing list >>>>> To (un)subscribe, modify options or view archives go to: >>>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>>> Only members subscribed via the mailman list are allowed to post. >>>> >>>> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to:http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From simonpj at microsoft.com Wed Jul 29 20:57:37 2020 From: simonpj at microsoft.com (Simon Peyton Jones) Date: Wed, 29 Jul 2020 20:57:37 +0000 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> Message-ID: Compl’s problem is (apparently) that execution becomes dominated by GC. That doesn’t sound like a constant-factor overhead from TVars, no matter how efficient (or otherwise) they are. It sounds more like a space leak to me; perhaps you need some strict evaluation or something. My point is only: before re-engineering STM it would make sense to get a much more detailed insight into what is actually happening, and where the space and time is going. We have tools to do this (heap profiling, Threadscope, …) but I know they need some skill and insight to use well. But we don’t have nearly enough insight to draw meaningful conclusions yet. 
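(To illustrate the strict-evaluation remark — a purely hypothetical fragment, not Compl's actual code: the usual first step is to write fully-evaluated values into the TVar, e.g. with modifyTVar' rather than modifyTVar, so a busy index doesn't accumulate a chain of thunks for the GC to chase. Deeply nested structures may need a deeper force than WHNF.)

```
import Control.Concurrent.STM
import qualified Data.HashMap.Strict as HM

-- Hypothetical index type, for illustration only.
type Index = TVar (HM.HashMap String Int)

-- Lazy update: each call can leave an un-forced 'HM.insert'
-- application in the TVar, and the old maps stay reachable.
addEntryLazy :: Index -> String -> Int -> STM ()
addEntryLazy idx k v = modifyTVar idx (HM.insert k v)

-- Strict update: the new map is forced (to WHNF) before it is
-- written back, so superseded versions can be collected promptly.
addEntryStrict :: Index -> String -> Int -> STM ()
addEntryStrict idx k v = modifyTVar' idx (HM.insert k v)
```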
Maybe someone with experience of performance debugging might feel able to help Compl? Simon From: Haskell-Cafe On Behalf Of Ryan Yates Sent: 29 July 2020 20:41 To: YueCompl Cc: Haskell Cafe Subject: Re: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? Hi Compl, There is a lot of overhead with TVars. My thesis work addresses this by incorporating mutable constructor fields with STM. I would like to get all that into GHC as soon as I can :D. I haven't looked closely at the `tskiplist` package, I'll take a look and see if I see any potential issues. There was some recent work on concurrent B-tree that may be interesting to try. Ryan On Wed, Jul 29, 2020 at 10:24 AM YueCompl > wrote: Hi Cafe and Ryan, I tried Map/Set from stm-containers and TSkipList (added range scan api against its internal data structure) from http://hackage.haskell.org/package/tskiplist , with them I've got quite improved at scalability on concurrency. But unfortunately then I hit another wall at single thread scalability over working memory size, I suspect it's because massively more TVars (those being pointers per se) are introduced by those "contention-free" data structures, they need to mutate separate pointers concurrently in avoiding contentions anyway, but such pointer-intensive heap seems imposing extraordinary pressure to GHC's garbage collector, that GC will dominate CPU utilization with poor business progress. For example in my test, I use `+RTS -H2g` for the Haskell server process, so GC is not triggered until after a while, then spin off 3 Python client to insert new records concurrently, in the first stage each Python process happily taking ~90% CPU filling (through local mmap) the arrays allocated from the server and logs of success scroll quickly, while the server process utilizes only 30~40% CPU to serve those 3 clients (insert meta data records into unique indices merely); then the client processes' CPU utilization drop drastically once Haskell server process' private memory reached around 2gb, i.e. GC started engaging, the server process's CPU utilization quickly approaches ~300%, while all client processes' drop to 0% for most of the time, and occasionally burst a tiny while with some log output showing progress. And I disable parallel GC lately, enabling parallel GC only makes it worse. If I comment out the code updating the indices (those creating many TVars), the overall throughput only drop slowly as more data are inserted, the parallelism feels steady even after the server process' private memory takes several GBs. I didn't expect this, but appears to me that GC of GHC is really not good at handling massive number of pointers in the heap, while those pointers are essential to reduce contention (and maybe expensive data copying too) at heavy parallelism/concurrency. Btw I tried `+RTS -xn` with GHC 8.10.1 too, no obvious different behavior compared to 8.8.3; and also tried tweaking GC related RTS options a bit, including increasing -G up to 10, no much difference too. I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime ... Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html in searching about the symptoms, and don't feel likely to convert my DB managed data into immutable types thus to fit into Compact Regions, not quite likely a live in-mem database instance can do. 
So seems there are good reasons no successful DBMS, at least in-memory ones have been written in Haskell. Best regards, Compl On 2020-07-25, at 22:07, Ryan Yates > wrote: Unfortunately my STM benchmarks are rather disorganized. The most relevant paper using them is: Leveraging hardware TM in Haskell (PPoPP '19) https://dl.acm.org/doi/10.1145/3293883.3295711 Or my thesis: https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931 The PPoPP benchmarks are on a branch (or the releases tab on github): https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited. Ryan On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe > wrote: Dear Cafe, As Chris Allen has suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of traditional HashMap for stm tx processing, under heavy concurrency, yet still with automatic parallelism as GHC implemented them. Then I realized that in addition to hash map (used to implement dicts and scopes), I also need to find a TreeMap replacement data structure to implement the db index. I've been focusing on the uniqueness constraint aspect, but it's still an index, needs to provide range scan api for db clients, so hash map is not sufficient for the index. I see Ryan shared the code benchmarking RBTree with stm in mind: https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree But can't find conclusion or interpretation of that benchmark suite. And here's a followup question: Where are some STM contention optimized data structures, that having keys ordered, with sub-range traversing api ? (of course production ready libraries most desirable) Thanks with regards, Compl On 2020/7/25 下午2:04, Compl Yue via Haskell-Cafe wrote: Shame on me for I have neither experienced with `perf`, I'd learn these essential tools soon to put them into good use. It's great to learn about how `orElse` actually works, I did get confused why there are so little retries captured, and now I know. So that little trick should definitely be removed before going production, as it does no much useful things at excessive cost. I put it there to help me understand internal working of stm, now I get even better knowledge ;-) I think a debugger will trap every single abort, isn't it annoying when many aborts would occur? If I'd like to count the number of aborts, ideally accounted per service endpoints, time periods, source modules etc. there some tricks for that? Thanks with best regards, Compl On 2020/7/25 上午2:02, Ryan Yates wrote: To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance, changing the size of heap objects can drastically change cache performance and completely different behavior can show up. [^1]: https://en.wikipedia.org/wiki/Perf_(Linux) The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... 
) [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence. The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 Ryan On Fri, Jul 24, 2020 at 12:35 PM Compl Yue > wrote: I'm not familiar with profiling GHC yet, may need more time to get myself proficient with it. And a bit more details of my test workload for diagnostic: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files, under a data dir within a shared filesystem, then mmap those data files and fill in actual array data. So the db server don't have much computation to perform, but puts the data file path into a global index, which at the same validates its uniqueness. As there are many client processes trying to insert one meta data record concurrently, with my naive implementation, the global index's TVar will almost always in locked state by one client after another, from a queue never fall empty. So if `readTVar` should spinning waiting, I doubt the threads should actually make high CPU utilization, because at any instant of time, all threads except the committing one will be doing that one thing. And I have something in my code to track STM retry like this: ``` -- blocking wait not expected, track stm retries explicitly trackSTM :: Int -> IO (Either () a) trackSTM !rtc = do when -- todo increase the threshold of reporting? (rtc > 0) $ do -- trace out the retries so the end users can be aware of them tid <- myThreadId trace ( "🔙\n" <> show callCtx <> "🌀 " <> show tid <> " stm retry #" <> show rtc ) $ return () atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case Nothing -> -- stm failed, do a tracked retry trackSTM (rtc + 1) Just ... -> ... ``` No such trace msg fires during my test, neither in single thread run, nor in runs with pressure. I'm sure this tracing mechanism works, as I can see such traces fire, in case e.g. posting a TMVar to a TQueue for some other thread to fill it, then read the result out, if these 2 ops are composed into a single tx, then of course it's infinite retry loop, and a sequence of such msgs are logged with ever increasing rtc #. So I believe no retry has ever been triggered. What can going on there? 
On 2020/7/24 下午11:46, Ryan Yates wrote: > Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed? Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while commiting, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set and subsequent commits will wake it up if its write set overlaps. I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS. Ryan On Fri, Jul 24, 2020 at 11:22 AM Compl Yue > wrote: Thanks very much for the insightful information Ryan! I'm glad my suspect was wrong about the Haskell scheduler: > The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released. So best effort had already been made in GHC and I just need to cooperate better with its design. Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed? Anyway, I see light with better data structures to improve my situation, let me try them and report back. Actually I later changed `TVar (HaskMap k v)` to be `TVar (HashMap k Int)` where the `Int` being array index into `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order preservation semantic of dict entries (like that in Python 3.7+), then it's very hopeful to incorporate stm-containers' Map or ttrie to approach free of contention. Thanks with regards, Compl On 2020/7/24 下午10:03, Ryan Yates wrote: Hi Compl, Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute then the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other thread, but when it is rescheduled, will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn. The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released. To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). 
Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars. There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference. Ryan On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe > wrote: Thanks Chris, I confess I didn't pay enough attention to STM specialized container libraries by far, I skimmed through the description of stm-containers and ttrie, and feel they would definitely improve my code's performance in case I limit the server's parallelism within hardware capabilities. That may because I'm still prototyping the api and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, only if at low contention (surely there're plenty of CPU cycles to be optimized out in next steps). I model my data after graph model, so most data, even most indices are localized to nodes and edges, those can be manipulated without conflict, that's why I assumed I have a low contention use case since the very beginning - until I found there are still (though minor) needs for global indices to guarantee global uniqueness, I feel faithful with stm-containers/ttrie to implement a more scalable global index data structure, thanks for hinting me. So an evident solution comes into my mind now, is to run the server with a pool of tx processing threads, matching number of CPU cores, client RPC requests then get queued to be executed in some thread from the pool. But I'm really fond of the mechanism of M:N scheduler which solves massive/dynamic concurrency so elegantly. I had some good result with Go in this regard, and see GHC at par in doing this, I don't want to give up this enjoyable machinery. But looked at the stm implementation in GHC, it seems written TVars are exclusively locked during commit of a tx, I suspect this is the culprit when there're large M lightweight threads scheduled upon a small N hardware capabilities, that is when a lightweight thread yield control during an stm transaction commit, the TVars it locked will stay so until it's scheduled again (and again) till it can finish the commit. This way, descheduled threads could hold live threads from progressing. I haven't gone into more details there, but wonder if there can be some improvement for GHC RTS to keep an stm committing thread from descheduled, but seemingly that may impose more starvation potential; or stm can be improved to have its TVar locks preemptable when the owner trec/thread is in descheduled state? Neither should be easy but I'd really love massive lightweight threads doing STM practically well. Best regards, Compl On 2020/7/24 上午12:57, Christopher Allen wrote: It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." 
You give up fairness when you use STM instead of MVars or equivalent structures. That means a long running transaction might get stampeded by many small ones invalidating it over and over. The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads. The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose. e.g. https://hackage.haskell.org/package/stm-containers https://hackage.haskell.org/package/ttrie It also sounds a bit like your question bumps into Amdahl's Law a bit. All else fails, stop using STM and find something more tuned to your problem space. Hope this helps, Chris Allen On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe > wrote: Hello Cafe, I'm working on an in-memory database, in Client/Server mode I just let each connected client submit remote procedure call running in its dedicated lightweight thread, modifying TVars in RAM per its business needs, then in case many clients connected concurrently and trying to insert new data, if they are triggering global index (some TVar) update, the throughput would drop drastically. I reduced the shared state to a simple int counter by TVar, got same symptom. While the parallelism feels okay when there's no hot TVar conflicting, or M is not much greater than N. As an empirical test workload, I have a `+RTS -N10` server process, it handles 10 concurrent clients okay, got ~5x of single thread throughput; but in handling 20 concurrent clients, each of the 10 CPUs can only be driven to ~10% utilization, the throughput seems even worse than single thread. More clients can even drive it thrashing without much progressing. I can understand that pure STM doesn't scale well after reading [1], and I see it suggested [7] attractive and planned future work toward that direction. But I can't find certain libraries or frameworks addressing large M over small N scenarios, [1] experimented with designated N parallelism, and [7] is rather theoretical to my empirical needs. Can you direct me to some available library implementing the methodology proposed in [7] or other ways tackling this problem? I think the most difficult one is that a transaction should commit with global indices (with possibly unique constraints) atomically updated, and rollback with any violation of constraints, i.e. transactions have to cover global states like indices. Other problems seem more trivial than this. Specifically, [7] states: > It must be emphasized that all of the mechanisms we deploy originate, in one form or another, in the database literature from the 70s and 80s. Our contribution is to adapt these techniques to software transactional memory, providing more effective solutions to important STM problems than prior proposals. I wonder any STM based library has simplified those techniques to be composed right away? 
I don't really want to implement those mechanisms by myself, rebuilding many wheels from scratch. Best regards, Compl [1] Comparing the performance of concurrent linked-list implementations in Haskell https://simonmar.github.io/bib/papers/concurrent-data.pdf [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages 207–216. ACM Press, 2008. https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post. -- Chris Allen Currently working on http://haskellbook.com _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post. _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post. _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post. _______________________________________________ Haskell-Cafe mailing list To (un)subscribe, modify options or view archives go to: http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From fryguybob at gmail.com Thu Jul 30 02:05:14 2020 From: fryguybob at gmail.com (Ryan Yates) Date: Wed, 29 Jul 2020 22:05:14 -0400 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> Message-ID: Simon, I certainly want to help get to the bottom of the performance issue at hand :D. Sorry if my reply was misleading. The constant factor overhead of pushing `TVar`s into the internal structure may be pressuring unacceptable GC behavior to happen sooner. My impression was that given the same size problem performance loss shifted from synchronization to GC. Compl, I'm not aware of mutable heap objects being problematic in particular for GHC's GC. There are lots of special cases to handle them of course. I have successfully written Haskell programs that get good performance from the GC with the dominant fraction of heap objects being mutable. I looked a little more at `TSkipList` and one tricky aspect of an STM based skip list is how to manage randomness. In `TSkipList`'s code there is the following comment: -- | Returns a randomly chosen level. Used for inserting new elements. /O(1)./-- For performance reasons, this function uses 'unsafePerformIO' to access the-- random number generator. 
(It would be possible to store the random number-- generator in a 'TVar' and thus be able to access it safely from within the-- STM monad. This, however, might cause high contention among threads.) chooseLevel :: TSkipList k a -> Int This level is chosen on insertion to determine the height of the node. When writing my own STM skiplist I found that the details in unsafely accessing randomness had a significant impact on performance. We went with an unboxed array of PCG states that had an entry for each capability giving constant memory overhead in the number of capabilities. `TSkipList` uses `newStdGen` which involves allocation and synchronization. Again, I'm not pointing this out to say that this is the entirety of the issue you are encountering, rather, I do think the `TSkipList` library could be improved to allocate much less. Others can speak to how to tell where the time is going in GC (my knowledge of this is likely out of date). Ryan On Wed, Jul 29, 2020 at 4:57 PM Simon Peyton Jones wrote: > Compl’s problem is (apparently) that execution becomes dominated by GC. > That doesn’t sound like a constant-factor overhead from TVars, no matter > how efficient (or otherwise) they are. It sounds more like a space leak to > me; perhaps you need some strict evaluation or something. > > > > My point is only: before re-engineering STM it would make sense to get a > much more detailed insight into what is actually happening, and where the > space and time is going. We have tools to do this (heap profiling, > Threadscope, …) but I know they need some skill and insight to use well. > But we don’t have nearly enough insight to draw meaningful conclusions yet. > > > > Maybe someone with experience of performance debugging might feel able to > help Compl? > > > > Simon > > > > *From:* Haskell-Cafe *On Behalf Of *Ryan > Yates > *Sent:* 29 July 2020 20:41 > *To:* YueCompl > *Cc:* Haskell Cafe > *Subject:* Re: [Haskell-cafe] STM friendly TreeMap (or similar with range > scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of > STM threads, with hot TVar updates? > > > > Hi Compl, > > > > There is a lot of overhead with TVars. My thesis work addresses this by > incorporating mutable constructor fields with STM. I would like to get all > that into GHC as soon as I can :D. I haven't looked closely at the > `tskiplist` package, I'll take a look and see if I see any potential > issues. There was some recent work on concurrent B-tree that may be > interesting to try. > > > > Ryan > > > > On Wed, Jul 29, 2020 at 10:24 AM YueCompl wrote: > > Hi Cafe and Ryan, > > > > I tried Map/Set from stm-containers and TSkipList (added range scan api > against its internal data structure) from > http://hackage.haskell.org/package/tskiplist > , > with them I've got quite improved at scalability on concurrency. > > > > But unfortunately then I hit another wall at single thread scalability > over working memory size, I suspect it's because massively more TVars > (those being pointers per se) are introduced by those "contention-free" > data structures, they need to mutate separate pointers concurrently in > avoiding contentions anyway, but such pointer-intensive heap seems imposing > extraordinary pressure to GHC's garbage collector, that GC will dominate > CPU utilization with poor business progress. 
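One cheap way to confirm that it is GC time rather than lock contention eating the CPU is to ask the RTS directly. A minimal sketch, assuming the server is started with `+RTS -T` so the stats are populated (field names are from GHC.Stats in recent GHC releases):

```
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)
import Text.Printf (printf)

-- Print how CPU time split between mutator and GC, plus a couple of heap
-- figures; call this periodically from a monitoring thread.
reportGcShare :: IO ()
reportGcShare = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "RTS stats not enabled; run with +RTS -T"
    else do
      s <- getRTSStats
      let gc  = fromIntegral (gc_cpu_ns s)      :: Double
          mut = fromIntegral (mutator_cpu_ns s) :: Double
          pct = if gc + mut == 0 then 0 else 100 * gc / (gc + mut)
      printf "GC: %.1f%% of CPU, %d major GCs, max live: %d bytes\n"
             pct (major_gcs s) (max_live_bytes s)
```

A persistently high GC share together with a growing max_live_bytes would point at retention (the space-leak suspicion voiced above) rather than at STM itself.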
> > > > For example in my test, I use `+RTS -H2g` for the Haskell server process, > so GC is not triggered until after a while, then spin off 3 Python client > to insert new records concurrently, in the first stage each Python process > happily taking ~90% CPU filling (through local mmap) the arrays allocated > from the server and logs of success scroll quickly, while the server > process utilizes only 30~40% CPU to serve those 3 clients (insert meta data > records into unique indices merely); then the client processes' CPU > utilization drop drastically once Haskell server process' private memory > reached around 2gb, i.e. GC started engaging, the server process's CPU > utilization quickly approaches ~300%, while all client processes' drop to > 0% for most of the time, and occasionally burst a tiny while with some log > output showing progress. And I disable parallel GC lately, enabling > parallel GC only makes it worse. > > > > If I comment out the code updating the indices (those creating many > TVars), the overall throughput only drop slowly as more data are inserted, > the parallelism feels steady even after the server process' private memory > takes several GBs. > > > > I didn't expect this, but appears to me that GC of GHC is really not good > at handling massive number of pointers in the heap, while those pointers > are essential to reduce contention (and maybe expensive data copying too) > at heavy parallelism/concurrency. > > > > Btw I tried `+RTS -xn` with GHC 8.10.1 too, no obvious different behavior > compared to 8.8.3; and also tried tweaking GC related RTS options a bit, > including increasing -G up to 10, no much difference too. > > > > I feel hopeless at the moment, wondering if I'll have to rewrite this > in-memory db in Go/Rust or some other runtime ... > > > > Btw I read > https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html > in > searching about the symptoms, and don't feel likely to convert my DB > managed data into immutable types thus to fit into Compact Regions, not > quite likely a live in-mem database instance can do. > > > > So seems there are good reasons no successful DBMS, at least in-memory > ones have been written in Haskell. > > > > Best regards, > > Compl > > > > > > On 2020-07-25, at 22:07, Ryan Yates wrote: > > > > Unfortunately my STM benchmarks are rather disorganized. The most > relevant paper using them is: > > > > Leveraging hardware TM in Haskell (PPoPP '19) > > https://dl.acm.org/doi/10.1145/3293883.3295711 > > > > > Or my thesis: > > > https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931 > > > > > > The PPoPP benchmarks are on a branch (or the releases tab on github): > > > https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src > > > > > > > > All that to say, without an implementation of mutable constructor fields > (which I'm working on getting into GHC) the scaling is limited. > > > > Ryan > > > > > > On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe < > haskell-cafe at haskell.org> wrote: > > Dear Cafe, > > As Chris Allen has suggested, I learned that > https://hackage.haskell.org/package/stm-containers > > and https://hackage.haskell.org/package/ttrie > > can help a lot when used in place of traditional HashMap for stm tx > processing, under heavy concurrency, yet still with automatic parallelism > as GHC implemented them. 
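Whichever container backs each index, the updates still compose into one transaction, which is what the earlier requirement needs: all global indices commit together, or nothing is written. A small sketch with plain TVar-of-Map indices for brevity (the same shape works with the stm-containers maps):

```
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- Update two global indices in one atomic transaction: either both
-- inserts happen, or (on a uniqueness violation in either index)
-- nothing is written at all.
insertBoth :: (Ord a, Ord b)
           => TVar (M.Map a v) -> a
           -> TVar (M.Map b v) -> b
           -> v
           -> STM Bool
insertBoth ixA ka ixB kb v = do
  ma <- readTVar ixA
  mb <- readTVar ixB
  if M.member ka ma || M.member kb mb
    then pure False          -- constraint violated, nothing gets written
    else do
      writeTVar ixA (M.insert ka v ma)
      writeTVar ixB (M.insert kb v mb)
      pure True
```

Run it as `atomically (insertBoth ...)`; on the `False` branch nothing has been written, so the rollback is simply the absence of writes.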
Then I realized that in addition to hash map (used > to implement dicts and scopes), I also need to find a TreeMap replacement > data structure to implement the db index. I've been focusing on the > uniqueness constraint aspect, but it's still an index, needs to provide > range scan api for db clients, so hash map is not sufficient for the index. > > I see Ryan shared the code benchmarking RBTree with stm in mind: > > > https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput > > > > https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree > > > But can't find conclusion or interpretation of that benchmark suite. And > here's a followup question: > > > > Where are some STM contention optimized data structures, that having keys > ordered, with sub-range traversing api ? > > (of course production ready libraries most desirable) > > > > Thanks with regards, > > Compl > > > > On 2020/7/25 下午2:04, Compl Yue via Haskell-Cafe wrote: > > Shame on me for I have neither experienced with `perf`, I'd learn these > essential tools soon to put them into good use. > > It's great to learn about how `orElse` actually works, I did get confused > why there are so little retries captured, and now I know. So that little > trick should definitely be removed before going production, as it does no > much useful things at excessive cost. I put it there to help me understand > internal working of stm, now I get even better knowledge ;-) > > I think a debugger will trap every single abort, isn't it annoying when > many aborts would occur? If I'd like to count the number of aborts, ideally > accounted per service endpoints, time periods, source modules etc. there > some tricks for that? > > Thanks with best regards, > > Compl > > > > On 2020/7/25 上午2:02, Ryan Yates wrote: > > To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based > profiling can do a good job with concurrent and parallel programs where > other methods are problematic. For instance, > > changing the size of heap objects can drastically change cache > performance and completely different behavior can show up. > > > > [^1]: https://en.wikipedia.org/wiki/Perf_(Linux) > > > > > The spinning in `readTVar` should always be very short and it typically > shows up as intensive CPU use, though it may not be high energy use with > `pause` in the loop on x86 (looks like we don't have it [^2], I thought we > did, but maybe that was only in some of my code... ) > > > > [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 > > > > > > All that to say, I doubt that you are spending much time spinning (but it > would certainly be interesting to know if you are! You would see `perf` > attribute a large amount of time to `read_current_value`). The amount of > code to execute for commit (the time when locks are held) is always much > shorter than it takes to execute the transaction body. As you add more > conflicting threads this gets worse of course as commits sequence. > > > > The code you have will count commits of executions of `retry`. Note that > `retry` is a user level idea, that is, you are counting user level > *explicit* retries. This is different from a transaction failing to commit > and starting again. These are invisible to the user. Also using your > trace will convert `retry` from the efficient wake on write implementation, > to an active retry that will always attempt again. We don't have cheap > logging of transaction aborts in GHC, but I have built such logging in my > work. 
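To illustrate the `retry` point above: a plain blocking read parks the thread until one of the TVars it read is written, while the `orElse`/`return Nothing` wrapping turns every would-be retry into a successfully committed transaction plus a fresh attempt. A minimal sketch of the two shapes, using a queue only for brevity:

```
import Control.Concurrent.STM

-- Efficient wake-on-write: the thread sleeps inside `retry` (hidden in
-- readTQueue) until the queue's TVars change.
blockingGet :: TQueue a -> IO a
blockingGet q = atomically (readTQueue q)

-- Active polling: each empty attempt commits with Nothing and the caller
-- loops, re-running the whole transaction from scratch.
pollingGet :: TQueue a -> IO a
pollingGet q = do
  r <- atomically ((Just <$> readTQueue q) `orElse` pure Nothing)
  case r of
    Just x  -> pure x
    Nothing -> pollingGet q
```

Both do the same work when an item is available; they differ only in what happens while the queue is empty.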
You can observe these aborts with a debugger by looking for > execution of this line: > > > > https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 > > > > > Ryan > > > > > > > > On Fri, Jul 24, 2020 at 12:35 PM Compl Yue wrote: > > I'm not familiar with profiling GHC yet, may need more time to get myself > proficient with it. > > And a bit more details of my test workload for diagnostic: the db clients > are Python processes from a cluster of worker nodes, consulting the db > server to register some path for data files, under a data dir within a > shared filesystem, then mmap those data files and fill in actual array > data. So the db server don't have much computation to perform, but puts the > data file path into a global index, which at the same validates its > uniqueness. As there are many client processes trying to insert one meta > data record concurrently, with my naive implementation, the global index's > TVar will almost always in locked state by one client after another, from a > queue never fall empty. > > So if `readTVar` should spinning waiting, I doubt the threads should > actually make high CPU utilization, because at any instant of time, all > threads except the committing one will be doing that one thing. > > And I have something in my code to track STM retry like this: > > ``` > > -- blocking wait not expected, track stm retries explicitly > > trackSTM :: Int -> IO (Either () a) > > trackSTM !rtc = do > > when -- todo increase the threshold of reporting? > > (rtc > 0) $ do > > -- trace out the retries so the end users can be aware of them > > tid <- myThreadId > > trace > > ( "🔙\n" > > <> show callCtx > > <> "🌀 " > > <> show tid > > <> " stm retry #" > > <> show rtc > > ) > > $ return () > > atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case > > Nothing -> -- stm failed, do a tracked retry > > trackSTM (rtc + 1) > > Just ... -> ... > > ``` > > No such trace msg fires during my test, neither in single thread run, nor > in runs with pressure. I'm sure this tracing mechanism works, as I can see > such traces fire, in case e.g. posting a TMVar to a TQueue for some other > thread to fill it, then read the result out, if these 2 ops are composed > into a single tx, then of course it's infinite retry loop, and a sequence > of such msgs are logged with ever increasing rtc #. > > So I believe no retry has ever been triggered. > > What can going on there? > > > > On 2020/7/24 下午11:46, Ryan Yates wrote: > > > Then to explain the low CPU utilization (~10%), am I right to understand > it as that upon reading a TVar locked by another committing tx, a > lightweight thread will put itself into `waiting STM` and descheduled > state, so the CPUs can only stay idle as not so many threads are willing to > proceed? > > > > Since the commit happens in finite steps, the expectation is that the lock > will be released very soon. Given this when the body of a transaction > executes `readTVar` it spins (active CPU!) until the `TVar` is observed > unlocked. If a lock is observed while commiting, it immediately starts the > transaction again from the beginning. To get the behavior of suspending a > transaction you have to successfully commit a transaction that executed > `retry`. Then the transaction is put on the wakeup lists of its read set > and subsequent commits will wake it up if its write set overlaps. > > > > I don't think any of these things would explain low CPU utilization. 
You > could try running with `perf` and see if lots of time is spent in some > recognizable part of the RTS. > > > > Ryan > > > > > > On Fri, Jul 24, 2020 at 11:22 AM Compl Yue wrote: > > Thanks very much for the insightful information Ryan! I'm glad my suspect > was wrong about the Haskell scheduler: > > > The Haskell capability that is committing a transaction will not yield > to another Haskell thread while it is doing the commit. The OS thread may > be preempted, but once commit starts the haskell scheduler is not invoked > until after locks are released. > > So best effort had already been made in GHC and I just need to cooperate > better with its design. Then to explain the low CPU utilization (~10%), am > I right to understand it as that upon reading a TVar locked by another > committing tx, a lightweight thread will put itself into `waiting STM` and > descheduled state, so the CPUs can only stay idle as not so many threads > are willing to proceed? > > > > Anyway, I see light with better data structures to improve my situation, > let me try them and report back. Actually I later changed `TVar (HaskMap k > v)` to be `TVar (HashMap k Int)` where the `Int` being array index into > `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order preservation > semantic of dict entries (like that in Python 3.7+), then it's very hopeful > to incorporate stm-containers' Map or ttrie to approach free of contention. > > Thanks with regards, > > Compl > > > > On 2020/7/24 下午10:03, Ryan Yates wrote: > > Hi Compl, > > > > Having a pool of transaction processing threads can be helpful in a > certain way. If the body of the transaction takes more time to execute > then the Haskell thread is allowed and it yields, the suspended thread > won't get in the way of other thread, but when it is rescheduled, will have > a low probability of success. Even worse, it will probably not discover > that it is doomed to failure until commit time. If transactions are more > likely to reach commit without yielding, they are more likely to succeed. > If the transactions are not conflicting, it doesn't make much difference > other than cache churn. > > > > The Haskell capability that is committing a transaction will not yield to > another Haskell thread while it is doing the commit. The OS thread may be > preempted, but once commit starts the haskell scheduler is not invoked > until after locks are released. > > > > To get good performance from STM you must pay attention to what TVars are > involved in a commit. All STM systems are working under the assumption of > low contention, so you want to minimize "false" conflicts (conflicts that > are not essential to the computation). Something like `TVar (HashMap k > v)` will work pretty well for a low thread count, but every transaction > that writes to that structure will be in conflict with every other > transaction that accesses it. Pushing the `TVar` into the nodes of the > structure reduces the possibilities for conflict, while increasing the > amount of bookkeeping STM has to do. I would like to reduce the cost of > that bookkeeping using better structures, but we need to do so without > harming performance in the low TVar count case. Right now it is optimized > for good cache performance with a handful of TVars. > > > > There is another way to play with performance by moving work into and out > of the transaction body. A transaction body that executes quickly will > reach commit faster. But it may be delaying work that moves into another > transaction. 
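One concrete way to keep the transaction body short is to force the expensive value before entering `atomically`, so the commit window covers only the reads and writes. A sketch, with a made-up `insertForced` name:

```
import Control.Concurrent.STM
import Control.Exception (evaluate)
import qualified Data.Map.Strict as M

-- Pay for building the (possibly expensive) value outside the
-- transaction; the STM body then only reads, inserts and commits.
insertForced :: Ord k => TVar (M.Map k v) -> k -> v -> IO ()
insertForced var k v = do
  v' <- evaluate v          -- force to WHNF before the transaction starts
  atomically $ modifyTVar' var (M.insert k v')
```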
Forcing values at the right time can make a big difference. > > > > Ryan > > > > On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe < > haskell-cafe at haskell.org> wrote: > > Thanks Chris, I confess I didn't pay enough attention to STM specialized > container libraries by far, I skimmed through the description of > stm-containers and ttrie, and feel they would definitely improve my code's > performance in case I limit the server's parallelism within hardware > capabilities. That may because I'm still prototyping the api and > infrastructure for correctness, so even `TVar (HashMap k v)` performs okay > for me at the moment, only if at low contention (surely there're plenty of > CPU cycles to be optimized out in next steps). I model my data after graph > model, so most data, even most indices are localized to nodes and edges, > those can be manipulated without conflict, that's why I assumed I have a > low contention use case since the very beginning - until I found there are > still (though minor) needs for global indices to guarantee global > uniqueness, I feel faithful with stm-containers/ttrie to implement a more > scalable global index data structure, thanks for hinting me. > > So an evident solution comes into my mind now, is to run the server with a > pool of tx processing threads, matching number of CPU cores, client RPC > requests then get queued to be executed in some thread from the pool. But > I'm really fond of the mechanism of M:N scheduler which solves > massive/dynamic concurrency so elegantly. I had some good result with Go in > this regard, and see GHC at par in doing this, I don't want to give up this > enjoyable machinery. > > But looked at the stm implementation in GHC, it seems written TVars are > exclusively locked during commit of a tx, I suspect this is the culprit > when there're large M lightweight threads scheduled upon a small N hardware > capabilities, that is when a lightweight thread yield control during an stm > transaction commit, the TVars it locked will stay so until it's scheduled > again (and again) till it can finish the commit. This way, descheduled > threads could hold live threads from progressing. I haven't gone into more > details there, but wonder if there can be some improvement for GHC RTS to > keep an stm committing thread from descheduled, but seemingly that may > impose more starvation potential; or stm can be improved to have its TVar > locks preemptable when the owner trec/thread is in descheduled state? > Neither should be easy but I'd really love massive lightweight threads > doing STM practically well. > > Best regards, > > Compl > > > > On 2020/7/24 上午12:57, Christopher Allen wrote: > > It seems like you know how to run practical tests for tuning thread count > and contention for throughput. Part of the reason you haven't gotten a > super clear answer is "it depends." You give up fairness when you use STM > instead of MVars or equivalent structures. That means a long running > transaction might get stampeded by many small ones invalidating it over and > over. The long-running transaction might never clear if the small > transactions keep moving the cheese. I mention this because transaction > runtime and size and count all affect throughput and latency. What might be > ideal for one pattern of work might not be ideal for another. Optimizing > for overall throughput might make the contention and fairness problems > worse too. 
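The "pool of tx processing threads" idea mentioned above fits in a few lines around a shared job queue. A rough sketch, with made-up names and no tuning:

```
import Control.Concurrent (forkIO, getNumCapabilities)
import Control.Concurrent.STM
import Control.Monad (forever, replicateM_)

-- One long-lived worker per capability drains a queue of STM jobs, so at
-- most N transactions compete for hot TVars at any moment, no matter how
-- many clients are connected.
startTxPool :: TQueue (STM ()) -> IO ()
startTxPool jobs = do
  n <- getNumCapabilities
  replicateM_ n $ forkIO $ forever $ do
    job <- atomically (readTQueue jobs)
    atomically job

-- Clients enqueue work instead of running transactions in their own threads.
submit :: TQueue (STM ()) -> STM () -> IO ()
submit jobs job = atomically (writeTQueue jobs job)
```

The network handlers keep the M:N green-thread model and only enqueue work, while the number of transactions actually contending on the hot TVars is capped at the worker count.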
I've done practical tests to optimize this in the past, both for > STM in Haskell and for RDBMS workloads. > > > > The next step is sometimes figuring out whether you really need a data > structure within a single STM container or if perhaps you can break up your > STM container boundaries into zones or regions that roughly map onto update > boundaries. That should make the transactions churn less. On the outside > chance you do need to touch more than one container in a transaction, well, > they compose. > > > > e.g. https://hackage.haskell.org/package/stm-containers > > > https://hackage.haskell.org/package/ttrie > > > > > It also sounds a bit like your question bumps into Amdahl's Law a bit. > > > > All else fails, stop using STM and find something more tuned to your > problem space. > > > > Hope this helps, > > Chris Allen > > > > > > On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe < > haskell-cafe at haskell.org> wrote: > > Hello Cafe, > > > > I'm working on an in-memory database, in Client/Server mode I just let > each connected client submit remote procedure call running in its dedicated > lightweight thread, modifying TVars in RAM per its business needs, then in > case many clients connected concurrently and trying to insert new data, if > they are triggering global index (some TVar) update, the throughput would > drop drastically. I reduced the shared state to a simple int counter by > TVar, got same symptom. While the parallelism feels okay when there's no > hot TVar conflicting, or M is not much greater than N. > > > > As an empirical test workload, I have a `+RTS -N10` server process, it > handles 10 concurrent clients okay, got ~5x of single thread throughput; > but in handling 20 concurrent clients, each of the 10 CPUs can only be > driven to ~10% utilization, the throughput seems even worse than single > thread. More clients can even drive it thrashing without much progressing. > > > > I can understand that pure STM doesn't scale well after reading [1], and > I see it suggested [7] attractive and planned future work toward that > direction. > > > > But I can't find certain libraries or frameworks addressing large M over > small N scenarios, [1] experimented with designated N parallelism, and [7] > is rather theoretical to my empirical needs. > > > > Can you direct me to some available library implementing the methodology > proposed in [7] or other ways tackling this problem? > > > > I think the most difficult one is that a transaction should commit with > global indices (with possibly unique constraints) atomically updated, and > rollback with any violation of constraints, i.e. transactions have to cover > global states like indices. Other problems seem more trivial than this. > > > > Specifically, [7] states: > > > > > It must be emphasized that all of the mechanisms we deploy originate, in > one form or another, in the database literature from the 70s and 80s. Our > contribution is to adapt these techniques to software transactional memory, > providing more effective solutions to important STM problems than prior > proposals. > > > > I wonder any STM based library has simplified those techniques to be > composed right away? I don't really want to implement those mechanisms by > myself, rebuilding many wheels from scratch. > > > > Best regards, > > Compl > > > > > > [1] Comparing the performance of concurrent linked-list implementations in > Haskell > > https://simonmar.github.io/bib/papers/concurrent-data.pdf > > > > > [7] M. Herlihy and E. Koskinen. 
Transactional boosting: a methodology for > highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages > 207–216. ACM Press, 2008. > > https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf > > > > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > > Only members subscribed via the mailman list are allowed to post. > > > > > -- > > Chris Allen > > Currently working on http://haskellbook.com > > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > > Only members subscribed via the mailman list are allowed to post. > > > > _______________________________________________ > > Haskell-Cafe mailing list > > To (un)subscribe, modify options or view archives go to: > > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > > Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > > Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > > Only members subscribed via the mailman list are allowed to post. > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From compl.yue at icloud.com Thu Jul 30 04:30:14 2020 From: compl.yue at icloud.com (Compl Yue) Date: Thu, 30 Jul 2020 12:30:14 +0800 Subject: [Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: <31d90d33-2a21-18ee-a358-48c330efd184@durchholz.org> References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <5268f36c-a71b-7ed7-fcb2-c2b4d146ec77@icloud.com> <31d90d33-2a21-18ee-a358-48c330efd184@durchholz.org> Message-ID: <96c7e15b-2e2f-5322-e738-502410c5e530@icloud.com> Hi Jo, Thanks anyway and FYI the global counter originally served as a source for unique entity id, then later I have replaced it with UUID from uuid package, seems not a problem since then. Regards, Compl On 2020/7/30 上午1:37, Joachim Durchholz wrote: > Am 24.07.20 um 17:48 schrieb Compl Yue via Haskell-Cafe: >> The global counter is only used to reveal the technical traits of my >> situation, it's of course not a requirement of my business needs. > > Given the other discussion here, I'm not sure if it's really relevant > to your situation, but that stats counter could indeed be causing lock > contention. Which means your numbers may be skewed, and you may be > drawing wrong conclusions - which is actually commonplace in > benchmarking. > > Two things you could do: > 1) Leave the global counter out and see whether the running times > vary. There's still a chance that while the overall running time is > the same, the code might now be hitting a different bottleneck. Or > maybe the counter isn't the bottleneck but it would become one once > you have done the other optimizations. So that experiment is cheap but > gives you no more than a preliminary result. 
> 2) Let each thread collect its own statistics, and coalesce into the > global counter only once in a while. (Vary the "once in a while" > determination and see whether it changes anything.) > > Just my 2c from the sideline. > > Regards, > Jo > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. From compl.yue at icloud.com Thu Jul 30 05:31:38 2020 From: compl.yue at icloud.com (Compl Yue) Date: Thu, 30 Jul 2020 13:31:38 +0800 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> Message-ID: <59bd91ed-a25e-d87a-1d22-3f72e0f80828@icloud.com> Thanks Ryan, and I'm honored to get Simon's attention. I did have some worry about package tskiplist, that its github repository seems withdrawn, I emailed the maintainer Peter Robinson lately but have gotten no response by far. What particularly worrying me is the 1st sentence of the Readme has changed from 1.0.0 to 1.0.1 (which is current) as: > - This package provides an implementation of a skip list in STM. >+ This package provides a proof-of-concept implementation of a skip list in STM This has to mean something but I can't figure out yet. Dear Peter Robinson, I hope you can see this message and get in the loop of discussion. Despite that, I don't think overhead of TVar itself the most serious issue in my situation, as before GC engagement, there are as many TVars being allocated and updated without stuck at business progressing. And now I realize what presuring GC in my situation is not only the large number of pointers (TVars), and at the same time, they form many circular structures, that might be nightmare for a GC. As I model my data after graph model, in my test workload, there are many FeatureSet instances each being an entity/node object, then there are many Feature instances per FeatureSet object, each Feature instance being an unary relationship/edge object, with a reference attribute (via TVar) pointing to the FeatureSet object it belongs to, circular structures form because I maintain an index at each FeatureSet object, sorted by weight etc., but ultimately pointing back (via TVar) to all Feature objects belonging to the set. I'm still curious why the new non-moving GC in 8.10.1 still don't get obvious business progressing in my situation. I tested it on my Mac yesterday and there I don't know how to see how CPU time is distributed over threads within a process, I'll further test it with some Linux boxes to try understand it better. Best regards, Compl On 2020/7/30 上午10:05, Ryan Yates wrote: > Simon, I certainly want to help get to the bottom of the performance > issue at hand :D.  Sorry if my reply was misleading.  The constant > factor overhead of pushing `TVar`s into the internal structure may be > pressuring unacceptable GC behavior to happen sooner.  My impression > was that given the same size problem performance loss shifted from > synchronization to GC. 
> > Compl, I'm not aware of mutable heap objects being problematic in > particular for GHC's GC.  There are lots of special cases to handle > them of course.  I have successfully written Haskell programs that get > good performance from the GC with the dominant fraction of heap > objects being mutable.  I looked a little more at `TSkipList` and one > tricky aspect of an STM based skip list is how to manage randomness.  > In `TSkipList`'s code there is the following comment: > > -- | Returns a randomly chosen level. Used for inserting new elements. > /O(1)./ > -- For performance reasons, this function uses 'unsafePerformIO' to > access the > -- random number generator. (It would be possible to store the random > number > -- generator in a 'TVar' and thus be able to access it safely from > within the > -- STM monad. This, however, might cause high contention among threads.) > chooseLevel :: TSkipList k a -> Int > > This level is chosen on insertion to determine the height of the > node.  When writing my own STM skiplist I found that the details in > unsafely accessing randomness had a significant impact on > performance.  We went with an unboxed array of PCG states that had an > entry for each capability giving constant memory overhead in the > number of capabilities.  `TSkipList` uses `newStdGen` which involves > allocation and synchronization. > > Again, I'm not pointing this out to say that this is the entirety of > the issue you are encountering, rather, I do think the `TSkipList` > library could be improved to allocate much less.  Others can speak to > how to tell where the time is going in GC (my knowledge of this is > likely out of date). > > Ryan > > > On Wed, Jul 29, 2020 at 4:57 PM Simon Peyton Jones > > wrote: > > Compl’s problem is (apparently) that execution becomes dominated > by GC.  That doesn’t sound like a constant-factor overhead from > TVars, no matter how efficient (or otherwise) they are.  It sounds > more like a space leak to me; perhaps you need some strict > evaluation or something. > > My point is only: before re-engineering STM it would make sense to > get a much more detailed insight into what is actually happening, > and where the space and time is going.  We have tools to do this > (heap profiling, Threadscope, …) but I know they need some skill > and insight to use well.  But we don’t have nearly enough insight > to draw meaningful conclusions yet. > > Maybe someone with experience of performance debugging might feel > able to help Compl? > > Simon > > *From:*Haskell-Cafe > *On Behalf Of *Ryan Yates > *Sent:* 29 July 2020 20:41 > *To:* YueCompl > > *Cc:* Haskell Cafe > > *Subject:* Re: [Haskell-cafe] STM friendly TreeMap (or similar > with range scan api) ? WAS: Best ways to achieve throughput, for > large M:N ratio of STM threads, with hot TVar updates? > > Hi Compl, > > There is a lot of overhead with TVars.  My thesis work addresses > this by incorporating mutable constructor fields with STM.  I > would like to get all that into GHC as soon as I can :D.  I > haven't looked closely at the `tskiplist` package, I'll take a > look and see if I see any potential issues.  There was some recent > work on concurrent B-tree that may be interesting to try. 
> > Ryan > > On Wed, Jul 29, 2020 at 10:24 AM YueCompl > wrote: > > Hi Cafe and Ryan, > > I tried Map/Set from stm-containers and TSkipList (added range > scan api against its internal data structure) from > http://hackage.haskell.org/package/tskiplist >  , > with them I've got quite improved at scalability on concurrency. > > But unfortunately then I hit another wall at single thread > scalability over working memory size, I suspect it's because > massively more TVars (those being pointers per se) are > introduced by those "contention-free" data structures, they > need to mutate separate pointers concurrently in avoiding > contentions anyway, but such pointer-intensive heap seems > imposing extraordinary pressure to GHC's garbage collector, > that GC will dominate CPU utilization with poor business > progress. > > For example in my test, I use `+RTS -H2g` for the Haskell > server process, so GC is not triggered until after a while, > then spin off 3 Python client to insert new records > concurrently, in the first stage each Python process happily > taking ~90% CPU filling (through local mmap) the arrays > allocated from the server and logs of success scroll quickly, > while the server process utilizes only 30~40% CPU to serve > those 3 clients (insert meta data records into unique indices > merely); then the client processes' CPU utilization drop > drastically once Haskell server process' private memory > reached around 2gb, i.e. GC started engaging, the server > process's CPU utilization quickly approaches ~300%, while all > client processes' drop to 0% for most of the time, and > occasionally burst a tiny while with some log output showing > progress. And I disable parallel GC lately, enabling parallel > GC only makes it worse. > > If I comment out the code updating the indices (those creating > many TVars), the overall throughput only drop slowly as more > data are inserted, the parallelism feels steady even after the > server process' private memory takes several GBs. > > I didn't expect this, but appears to me that GC of GHC is > really not good at handling massive number of pointers in the > heap, while those pointers are essential to reduce contention > (and maybe expensive data copying too) at heavy > parallelism/concurrency. > > Btw I tried `+RTS -xn` with GHC 8.10.1 too, no obvious > different behavior compared to 8.8.3; and also tried tweaking > GC related RTS options a bit, including increasing -G up to > 10, no much difference too. > > I feel hopeless at the moment, wondering if I'll have to > rewrite this in-memory db in Go/Rust or some other runtime ... > > Btw I read > https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html >  in > searching about the symptoms, and don't feel likely to convert > my DB managed data into immutable types thus to fit into > Compact Regions, not quite likely a live in-mem database > instance can do. > > So seems there are good reasons no successful DBMS, at least > in-memory ones have been written in Haskell. > > Best regards, > > Compl > > > > On 2020-07-25, at 22:07, Ryan Yates > wrote: > > Unfortunately my STM benchmarks are rather disorganized.  
> The most relevant paper using them is: > > Leveraging hardware TM in Haskell (PPoPP '19) > > https://dl.acm.org/doi/10.1145/3293883.3295711 > > > Or my thesis: > > https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931 > > > >  The PPoPP benchmarks are on a branch (or the releases tab > on github): > > https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src > > > >  All that to say, without an implementation of mutable > constructor fields (which I'm working on getting into GHC) > the scaling is limited. > > Ryan > > On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe > > wrote: > > Dear Cafe, > > As Chris Allen has suggested, I learned that > https://hackage.haskell.org/package/stm-containers > > and https://hackage.haskell.org/package/ttrie > > can help a lot when used in place of traditional > HashMap for stm tx processing, under heavy > concurrency, yet still with automatic parallelism as > GHC implemented them. Then I realized that in addition > to hash map (used to implement dicts and scopes), I > also need to find a TreeMap replacement data structure > to implement the db index. I've been focusing on the > uniqueness constraint aspect, but it's still an index, > needs to provide range scan api for db clients, so > hash map is not sufficient for the index. > > I see Ryan shared the code benchmarking RBTree with > stm in mind: > > https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput > > > > https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree > > > But can't find conclusion or interpretation of that > benchmark suite. And here's a followup question: > > Where are some STM contention optimized data > structures, that having keys ordered, with sub-range > traversing api ? > > (of course production ready libraries most desirable) > > Thanks with regards, > > Compl > > On 2020/7/25 下午2:04, Compl Yue via Haskell-Cafe wrote: > > Shame on me for I have neither experienced with > `perf`, I'd learn these essential tools soon to > put them into good use. > > It's great to learn about how `orElse` actually > works, I did get confused why there are so little > retries captured, and now I know. So that little > trick should definitely be removed before going > production, as it does no much useful things at > excessive cost. I put it there to help me > understand internal working of stm, now I get even > better knowledge ;-) > > I think a debugger will trap every single abort, > isn't it annoying when many aborts would occur? If > I'd like to count the number of aborts, ideally > accounted per service endpoints, time periods, > source modules etc. there some tricks for that? > > Thanks with best regards, > > Compl > > On 2020/7/25 上午2:02, Ryan Yates wrote: > > To be clear, I was trying to refer to Linux > `perf` [^1]. Sampling based profiling can do a > good job with concurrent and parallel programs > where other methods are problematic.  For > instance, > >  changing the size of heap objects can > drastically change cache performance and > completely different behavior can show up. > > [^1]: > https://en.wikipedia.org/wiki/Perf_(Linux) > > > The spinning in `readTVar` should always be > very short and it typically shows up as > intensive CPU use, though it may not be high > energy use with `pause` in the loop on x86 > (looks like we don't have it [^2], I thought > we did, but maybe that was only in some of my > code... 
) > > [^2]: > https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 > > > > All that to say, I doubt that you are spending > much time spinning (but it would certainly be > interesting to know if you are!  You would see > `perf` attribute a large amount of time to > `read_current_value`). The amount of code to > execute for commit (the time when locks are > held) is always much shorter than it takes to > execute the transaction body.  As you add more > conflicting threads this gets worse of course > as commits sequence. > > The code you have will count commits of > executions of `retry`.  Note that `retry` is a > user level idea, that is, you are counting > user level *explicit* retries.  This is > different from a transaction failing to commit > and starting again.  These are invisible to > the user. Also using your trace will convert > `retry` from the efficient wake on write > implementation, to an active retry that will > always attempt again.  We don't have cheap > logging of transaction aborts in GHC, but I > have built such logging in my work.  You can > observe these aborts with a debugger by > looking for execution of this line: > > https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 > > > Ryan > > On Fri, Jul 24, 2020 at 12:35 PM Compl Yue > > wrote: > > I'm not familiar with profiling GHC yet, > may need more time to get myself > proficient with it. > > And a bit more details of my test workload > for diagnostic: the db clients are Python > processes from a cluster of worker nodes, > consulting the db server to register some > path for data files, under a data dir > within a shared filesystem, then mmap > those data files and fill in actual array > data. So the db server don't have much > computation to perform, but puts the data > file path into a global index, which at > the same validates its uniqueness. As > there are many client processes trying to > insert one meta data record concurrently, > with my naive implementation, the global > index's TVar will almost always in locked > state by one client after another, from a > queue never fall empty. > > So if `readTVar` should spinning waiting, > I doubt the threads should actually make > high CPU utilization, because at any > instant of time, all threads except the > committing one will be doing that one thing. > > And I have something in my code to track > STM retry like this: > > ``` > > -- blocking wait not expected, track stm > retries explicitly > > trackSTM:: Int-> IO(Either() a) > > trackSTM !rtc = do > > when -- todo increase the threshold of > reporting? > > (rtc > 0) $ do > > -- trace out the retries so the end users > can be aware of them > > tid <- myThreadId > > trace > > ( "🔙\n" > > <> show callCtx > > <> "🌀" > > <> show tid > > <> " stm retry #" > > <> show rtc > > ) > > $ return () > > atomically ((Just <$> stmJob) `orElse` > return Nothing) >>= \case > > Nothing -> -- stm failed, do a tracked retry > > trackSTM (rtc + 1) > > Just ... -> ... > > ``` > > No such trace msg fires during my test, > neither in single thread run, nor in runs > with pressure. I'm sure this tracing > mechanism works, as I can see such traces > fire, in case e.g. posting a TMVar to a > TQueue for some other thread to fill it, > then read the result out, if these 2 ops > are composed into a single tx, then of > course it's infinite retry loop, and a > sequence of such msgs are logged with ever > increasing rtc #. > > So I believe no retry has ever been triggered. > > What can going on there? 
> > On 2020/7/24 下午11:46, Ryan Yates wrote: > > > Then to explain the low CPU > utilization (~10%), am I right to > understand it as that upon reading a > TVar locked by another committing tx, > a lightweight thread will put itself > into `waiting STM` and descheduled > state, so the CPUs can only stay idle > as not so many threads are willing to > proceed? > > Since the commit happens in finite > steps, the expectation is that the > lock will be released very soon. Given > this when the body of a transaction > executes `readTVar` it spins (active > CPU!) until the `TVar` is observed > unlocked.  If a lock is observed while > commiting, it immediately starts the > transaction again from the beginning.  > To get the behavior of suspending a > transaction you have to successfully > commit a transaction that executed > `retry`.  Then the transaction is put > on the wakeup lists of its read set > and subsequent commits will wake it up > if its write set overlaps. > > I don't think any of these things > would explain low CPU utilization. You > could try running with `perf` and see > if lots of time is spent in some > recognizable part of the RTS. > > Ryan > > On Fri, Jul 24, 2020 at 11:22 AM Compl > Yue > wrote: > > Thanks very much for the > insightful information Ryan! I'm > glad my suspect was wrong about > the Haskell scheduler: > > > The Haskell capability that is > committing a transaction will not > yield to another Haskell thread > while it is doing the commit.  The > OS thread may be preempted, but > once commit starts the haskell > scheduler is not invoked until > after locks are released. > > So best effort had already been > made in GHC and I just need to > cooperate better with its design. > Then to explain the low CPU > utilization (~10%), am I right to > understand it as that upon reading > a TVar locked by another > committing tx, a lightweight > thread will put itself into > `waiting STM` and descheduled > state, so the CPUs can only stay > idle as not so many threads are > willing to proceed? > > Anyway, I see light with better > data structures to improve my > situation, let me try them and > report back. Actually I later > changed `TVar (HaskMap k v)` to be > `TVar (HashMap k Int)` where the > `Int` being array index into `TVar > (Vector (TVar (Maybe v)))`, in > pursuing insertion order > preservation semantic of dict > entries (like that in Python > 3.7+), then it's very hopeful to > incorporate stm-containers' Map or > ttrie to approach free of contention. > > Thanks with regards, > > Compl > > On 2020/7/24 下午10:03, Ryan Yates > wrote: > > Hi Compl, > > Having a pool of > transaction processing threads > can be helpful in a certain > way. If the body of the > transaction takes more time to > execute then the Haskell > thread is allowed and it > yields, the suspended thread > won't get in the way of other > thread, but when it is > rescheduled, will have a low > probability of success.  Even > worse, it will probably not > discover that it is doomed to > failure until commit time.  If > transactions are more likely > to reach commit without > yielding, they are more likely > to succeed.  If the > transactions are not > conflicting, it doesn't make > much difference other than > cache churn. > > The Haskell capability that is > committing a transaction will > not yield to another Haskell > thread while it is doing the > commit.  The OS thread may be > preempted, but once commit > starts the haskell scheduler > is not invoked until after > locks are released. 
> > To get good performance from > STM you must pay attention to > what TVars are involved in a > commit.  All STM systems are > working under the assumption > of low contention, so you want > to minimize "false" conflicts > (conflicts that are not > essential to the computation). >   Something like `TVar > (HashMap k v)` will work > pretty well for a low thread > count, but every transaction > that writes to that structure > will be in conflict with every > other transaction that > accesses it.  Pushing the > `TVar` into the nodes of the > structure reduces the > possibilities for conflict, > while increasing the amount of > bookkeeping STM has to do.  I > would like to reduce the cost > of that bookkeeping using > better structures, but we need > to do so without harming > performance in the low TVar > count case. Right now it is > optimized for good cache > performance with a handful of > TVars. > > There is another way to play > with performance by moving > work into and out of the > transaction body.  A > transaction body that executes > quickly will reach commit > faster.  But it may be > delaying work that moves into > another transaction. Forcing > values at the right time can > make a big difference. > > Ryan > > On Fri, Jul 24, 2020 at 2:14 > AM Compl Yue via Haskell-Cafe > > > wrote: > > Thanks Chris, I confess I > didn't pay enough > attention to STM > specialized container > libraries by far, I > skimmed through the > description of > stm-containers and ttrie, > and feel they would > definitely improve my > code's performance in case > I limit the server's > parallelism within > hardware capabilities. > That may because I'm still > prototyping the api and > infrastructure for > correctness, so even `TVar > (HashMap k v)` performs > okay for me at the moment, > only if at low contention > (surely there're plenty of > CPU cycles to be optimized > out in next steps). I > model my data after graph > model, so most data, even > most indices are localized > to nodes and edges, those > can be manipulated without > conflict, that's why I > assumed I have a low > contention use case since > the very beginning - until > I found there are still > (though minor) needs for > global indices to > guarantee global > uniqueness, I feel > faithful with > stm-containers/ttrie to > implement a more scalable > global index data > structure, thanks for > hinting me. > > So an evident solution > comes into my mind now, is > to run the server with a > pool of tx processing > threads, matching number > of CPU cores, client RPC > requests then get queued > to be executed in some > thread from the pool. But > I'm really fond of the > mechanism of M:N scheduler > which solves > massive/dynamic > concurrency so elegantly. > I had some good result > with Go in this regard, > and see GHC at par in > doing this, I don't want > to give up this enjoyable > machinery. > > But looked at the stm > implementation in GHC, it > seems written TVars are > exclusively locked during > commit of a tx, I suspect > this is the culprit when > there're large M > lightweight threads > scheduled upon a small N > hardware capabilities, > that is when a lightweight > thread yield control > during an stm transaction > commit, the TVars it > locked will stay so until > it's scheduled again (and > again) till it can finish > the commit. This way, > descheduled threads could > hold live threads from > progressing. 
I haven't > gone into more details > there, but wonder if there > can be some improvement > for GHC RTS to keep an stm > committing thread from > descheduled, but seemingly > that may impose more > starvation potential; or > stm can be improved to > have its TVar locks > preemptable when the owner > trec/thread is in > descheduled state? Neither > should be easy but I'd > really love massive > lightweight threads doing > STM practically well. > > Best regards, > > Compl > > On 2020/7/24 上午12:57, > Christopher Allen wrote: > > It seems like you know > how to run practical > tests for tuning > thread count and > contention for > throughput. Part of > the reason you haven't > gotten a super clear > answer is "it > depends." You give up > fairness when you use > STM instead of MVars > or equivalent > structures. That means > a long running > transaction might get > stampeded by many > small ones > invalidating it over > and over. The > long-running > transaction might > never clear if > the small transactions > keep moving the > cheese. I mention this > because transaction > runtime and size and > count all affect > throughput and > latency. What might be > ideal for one pattern > of work might not be > ideal for another. > Optimizing for overall > throughput might make > the contention and > fairness problems > worse too. I've done > practical tests to > optimize this in the > past, both for STM in > Haskell and for RDBMS > workloads. > > The next step is > sometimes figuring out > whether you really > need a data structure > within a single STM > container or if > perhaps you can break > up your STM container > boundaries into zones > or regions that > roughly map onto > update boundaries. > That should make the > transactions churn > less. On the outside > chance you do need to > touch more than one > container in a > transaction, well, > they compose. > > e.g. > https://hackage.haskell.org/package/stm-containers > > > https://hackage.haskell.org/package/ttrie > > > It also sounds a bit > like your question > bumps into Amdahl's > Law a bit. > > All else fails, stop > using STM and find > something more tuned > to your problem space. > > Hope this helps, > > Chris Allen > > On Thu, Jul 23, 2020 > at 9:53 AM YueCompl > via Haskell-Cafe > > > wrote: > > Hello Cafe, > > I'm working on an > in-memory > database, in > Client/Server mode > I just let each > connected client > submit remote > procedure call > running in its > dedicated > lightweight > thread, modifying > TVars in RAM per > its business > needs, then in > case many clients > connected > concurrently and > trying to insert > new data, if they > are triggering > global index (some > TVar) update, the > throughput would > drop drastically. > I reduced the > shared state to a > simple int counter > by TVar, got same > symptom. While the > parallelism feels > okay when there's > no hot TVar > conflicting, or M > is not much > greater than N. > > As an empirical > test workload, I > have a `+RTS -N10` > server process, it > handles 10 > concurrent clients > okay, got ~5x of > single thread > throughput; but in > handling 20 > concurrent > clients, each of > the 10 CPUs can > only be driven to > ~10% utilization, > the throughput > seems even worse > than single > thread. More > clients can even > drive it thrashing > without much >  progressing. > >  I can understand > that pure STM > doesn't scale well > after reading [1], > and I see it > suggested [7] > attractive and > planned future > work toward that > direction. 
> > But I can't find > certain libraries > or frameworks > addressing large M > over small N > scenarios, [1] > experimented with > designated N > parallelism, and > [7] is rather > theoretical to my > empirical needs. > > Can you direct me > to some available > library > implementing the > methodology > proposed in [7] or > other ways > tackling this problem? > > I think the most > difficult one is > that a transaction > should commit with > global indices > (with possibly > unique > constraints) > atomically > updated, and > rollback with any > violation of > constraints, i.e. > transactions have > to cover global > states like > indices. Other > problems seem more > trivial than this. > > Specifically, [7] > states: > > > It must be > emphasized that > all of the > mechanisms we > deploy originate, > in one form or > another, in the > database > literature from > the 70s and 80s. > Our contribution > is to adapt these > techniques to > software > transactional > memory, providing > more effective > solutions to > important STM > problems than > prior proposals. > > I wonder any STM > based library has > simplified those > techniques to be > composed right > away? I don't > really want to > implement those > mechanisms by > myself, rebuilding > many wheels from > scratch. > > Best regards, > > Compl > > [1] Comparing the > performance of > concurrent > linked-list > implementations in > Haskell > > https://simonmar.github.io/bib/papers/concurrent-data.pdf > > > [7] M. Herlihy and > E. Koskinen. > Transactional > boosting: a > methodology for > highly-concurrent > transactional > objects. In Proc. > of PPoPP ’08, > pages 207–216. ACM > Press, 2008. > > https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf > > > _______________________________________________ > Haskell-Cafe > mailing list > To (un)subscribe, > modify options or > view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > > Only members > subscribed via the > mailman list are > allowed to post. > > > -- > > Chris Allen > > Currently working on > http://haskellbook.com > > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify > options or view archives > go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > > Only members subscribed > via the mailman list are > allowed to post. > > _______________________________________________ > > Haskell-Cafe mailing list > > To (un)subscribe, modify options or view archives go to: > > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > > Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > > Only members subscribed via the mailman list are > allowed to post. > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > > Only members subscribed via the mailman list are allowed > to post. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jo at durchholz.org Thu Jul 30 07:24:51 2020 From: jo at durchholz.org (Joachim Durchholz) Date: Thu, 30 Jul 2020 09:24:51 +0200 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? 
 WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?
In-Reply-To: <59bd91ed-a25e-d87a-1d22-3f72e0f80828@icloud.com>
References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com>
 <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com>
 <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com>
 <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com>
 <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com>
 <59bd91ed-a25e-d87a-1d22-3f72e0f80828@icloud.com>
Message-ID: 

Am 30.07.20 um 07:31 schrieb Compl Yue via Haskell-Cafe:
> And now I realize what pressuring GC in my situation is not only the large number of pointers (TVars), and at the same time, they form many circular structures, that might be nightmare for a GC.

Cycles are relevant only for reference-counting collectors.
As far as I understand http://downloads.haskell.org/~ghc/latest/docs/html/users_guide/runtime_control.html, GHC offers only tracing collectors, and cycles are irrelevant there.

> I'm still curious why the new non-moving GC in 8.10.1 still don't get obvious business progressing in my situation. I tested it on my Mac yesterday and there I don't know how to see how CPU time is distributed over threads within a process, I'll further test it with some Linux boxes to try understand it better.

Hmm... can GHC's memory management fragment?
If that's the case, you may be seeing GC trying to find free blocks in fragmented memory, and having to re-run the GC cycle to free a block so there's enough contiguous memory.
It's a bit of a stretch, but it can happen, and testing that hypothesis would be relatively quick: Run the program with moving GC, observe running time and if it's still slow, check if the GC is actually eating CPU, or if it's merely waiting for other threads to respond to the stop-the-world signal.

Regards,
Jo

From compl.yue at icloud.com Thu Jul 30 08:00:04 2020
From: compl.yue at icloud.com (YueCompl)
Date: Thu, 30 Jul 2020 16:00:04 +0800
Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ?
 WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?
In-Reply-To: 
References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com>
 <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com>
 <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com>
 <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com>
 <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com>
 <59bd91ed-a25e-d87a-1d22-3f72e0f80828@icloud.com>
Message-ID: 

Update: the nonmoving GC does make a difference.

I think I couldn't observe it before because I set the heap rather large with -H2g, while generation 0 is still collected by the old moving GC, which has difficulty handling such a large, hazardous heap. After realizing just now that the nonmoving GC only works on the oldest generation, I tested again with `+RTS -H16m -A4m`, with and without `-xn`:

Without -xn (old moving GC in effect), throughput degrades fast and business progress stops at ~200MB of server RSS.

With -xn (new nonmoving GC in effect), server RSS can burst to ~350MB, then throughput degrades relatively more slowly until RSS reaches ~1GB, after which business yield barely progresses. RSS can keep growing, with occasional bursts of business yield, until ~3.3GB, at which point it gets totally stuck.

Regards,
Compl

> On 2020-07-30, at 13:31, Compl Yue via Haskell-Cafe wrote:
> 
> Thanks Ryan, and I'm honored to get Simon's attention.
> > I did have some worry about package tskiplist, that its github repository seems withdrawn, I emailed the maintainer Peter Robinson lately but have gotten no response by far. What particularly worrying me is the 1st sentence of the Readme has changed from 1.0.0 to 1.0.1 (which is current) as: > > > - This package provides an implementation of a skip list in STM. > > >+ This package provides a proof-of-concept implementation of a skip list in STM > > This has to mean something but I can't figure out yet. > > Dear Peter Robinson, I hope you can see this message and get in the loop of discussion. > > Despite that, I don't think overhead of TVar itself the most serious issue in my situation, as before GC engagement, there are as many TVars being allocated and updated without stuck at business progressing. And now I realize what presuring GC in my situation is not only the large number of pointers (TVars), and at the same time, they form many circular structures, that might be nightmare for a GC. As I model my data after graph model, in my test workload, there are many FeatureSet instances each being an entity/node object, then there are many Feature instances per FeatureSet object, each Feature instance being an unary relationship/edge object, with a reference attribute (via TVar) pointing to the FeatureSet object it belongs to, circular structures form because I maintain an index at each FeatureSet object, sorted by weight etc., but ultimately pointing back (via TVar) to all Feature objects belonging to the set. > > I'm still curious why the new non-moving GC in 8.10.1 still don't get obvious business progressing in my situation. I tested it on my Mac yesterday and there I don't know how to see how CPU time is distributed over threads within a process, I'll further test it with some Linux boxes to try understand it better. > > Best regards, > > Compl > > > > On 2020/7/30 上午10:05, Ryan Yates wrote: >> Simon, I certainly want to help get to the bottom of the performance issue at hand :D. Sorry if my reply was misleading. The constant factor overhead of pushing `TVar`s into the internal structure may be pressuring unacceptable GC behavior to happen sooner. My impression was that given the same size problem performance loss shifted from synchronization to GC. >> >> Compl, I'm not aware of mutable heap objects being problematic in particular for GHC's GC. There are lots of special cases to handle them of course. I have successfully written Haskell programs that get good performance from the GC with the dominant fraction of heap objects being mutable. I looked a little more at `TSkipList` and one tricky aspect of an STM based skip list is how to manage randomness. In `TSkipList`'s code there is the following comment: >> >> -- | Returns a randomly chosen level. Used for inserting new elements. /O(1)./ >> <>-- For performance reasons, this function uses 'unsafePerformIO' to access the >> <>-- random number generator. (It would be possible to store the random number >> <>-- generator in a 'TVar' and thus be able to access it safely from within the >> <>-- STM monad. This, however, might cause high contention among threads.) >> chooseLevel :: TSkipList k a -> Int >> >> This level is chosen on insertion to determine the height of the node. When writing my own STM skiplist I found that the details in unsafely accessing randomness had a significant impact on performance. 
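(For a concrete picture of the per-capability approach described next, here is a minimal sketch — hypothetical names such as `PerCapGens`/`chooseLevelIO`, and the plain `random` package instead of PCG; one generator slot per capability so threads on different capabilities never touch the same state:)

```haskell
import Control.Concurrent (getNumCapabilities, myThreadId, threadCapability)
import Data.IORef
import qualified Data.Vector as V
import System.Random (StdGen, newStdGen, randomR)

-- One generator per capability; no TVar and no shared lock on the hot path.
newtype PerCapGens = PerCapGens (V.Vector (IORef StdGen))

newPerCapGens :: IO PerCapGens
newPerCapGens = do
  n <- getNumCapabilities
  PerCapGens . V.fromList <$> mapM (const (newStdGen >>= newIORef)) [1 .. n]

-- Geometric level choice for a new skip-list node, in the spirit of
-- chooseLevel, but reading only this capability's own generator.
chooseLevelIO :: PerCapGens -> Int -> IO Int
chooseLevelIO (PerCapGens gens) maxLevel = do
  (cap, _) <- threadCapability =<< myThreadId
  let ref = gens V.! (cap `mod` V.length gens)
      flipCoin = atomicModifyIORef' ref $ \g ->
        let (heads, g') = randomR (False, True) g in (g', heads)
      go lvl | lvl >= maxLevel = pure lvl
             | otherwise = flipCoin >>= \h -> if h then go (lvl + 1) else pure lvl
  go 1
```

(`newStdGen` shares a single global generator under the hood, so it is only called once per capability at setup here.)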
We went with an unboxed array of PCG states that had an entry for each capability giving constant memory overhead in the number of capabilities. `TSkipList` uses `newStdGen` which involves allocation and synchronization. >> >> Again, I'm not pointing this out to say that this is the entirety of the issue you are encountering, rather, I do think the `TSkipList` library could be improved to allocate much less. Others can speak to how to tell where the time is going in GC (my knowledge of this is likely out of date). >> >> Ryan >> >> >> On Wed, Jul 29, 2020 at 4:57 PM Simon Peyton Jones > wrote: >> Compl’s problem is (apparently) that execution becomes dominated by GC. That doesn’t sound like a constant-factor overhead from TVars, no matter how efficient (or otherwise) they are. It sounds more like a space leak to me; perhaps you need some strict evaluation or something. >> >> >> My point is only: before re-engineering STM it would make sense to get a much more detailed insight into what is actually happening, and where the space and time is going. We have tools to do this (heap profiling, Threadscope, …) but I know they need some skill and insight to use well. But we don’t have nearly enough insight to draw meaningful conclusions yet. >> >> >> Maybe someone with experience of performance debugging might feel able to help Compl? >> >> >> Simon >> >> >> From: Haskell-Cafe > On Behalf Of Ryan Yates >> Sent: 29 July 2020 20:41 >> To: YueCompl > >> Cc: Haskell Cafe > >> Subject: Re: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? >> >> >> Hi Compl, >> >> >> There is a lot of overhead with TVars. My thesis work addresses this by incorporating mutable constructor fields with STM. I would like to get all that into GHC as soon as I can :D. I haven't looked closely at the `tskiplist` package, I'll take a look and see if I see any potential issues. There was some recent work on concurrent B-tree that may be interesting to try. >> >> >> Ryan >> >> >> On Wed, Jul 29, 2020 at 10:24 AM YueCompl > wrote: >> >> Hi Cafe and Ryan, >> >> >> I tried Map/Set from stm-containers and TSkipList (added range scan api against its internal data structure) from http://hackage.haskell.org/package/tskiplist , with them I've got quite improved at scalability on concurrency. >> >> >> But unfortunately then I hit another wall at single thread scalability over working memory size, I suspect it's because massively more TVars (those being pointers per se) are introduced by those "contention-free" data structures, they need to mutate separate pointers concurrently in avoiding contentions anyway, but such pointer-intensive heap seems imposing extraordinary pressure to GHC's garbage collector, that GC will dominate CPU utilization with poor business progress. >> >> >> For example in my test, I use `+RTS -H2g` for the Haskell server process, so GC is not triggered until after a while, then spin off 3 Python client to insert new records concurrently, in the first stage each Python process happily taking ~90% CPU filling (through local mmap) the arrays allocated from the server and logs of success scroll quickly, while the server process utilizes only 30~40% CPU to serve those 3 clients (insert meta data records into unique indices merely); then the client processes' CPU utilization drop drastically once Haskell server process' private memory reached around 2gb, i.e. 
GC started engaging, the server process's CPU utilization quickly approaches ~300%, while all client processes' drop to 0% for most of the time, and occasionally burst a tiny while with some log output showing progress. And I disable parallel GC lately, enabling parallel GC only makes it worse. >> >> >> If I comment out the code updating the indices (those creating many TVars), the overall throughput only drop slowly as more data are inserted, the parallelism feels steady even after the server process' private memory takes several GBs. >> >> >> I didn't expect this, but appears to me that GC of GHC is really not good at handling massive number of pointers in the heap, while those pointers are essential to reduce contention (and maybe expensive data copying too) at heavy parallelism/concurrency. >> >> >> Btw I tried `+RTS -xn` with GHC 8.10.1 too, no obvious different behavior compared to 8.8.3; and also tried tweaking GC related RTS options a bit, including increasing -G up to 10, no much difference too. >> >> >> I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime ... >> >> >> Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html in searching about the symptoms, and don't feel likely to convert my DB managed data into immutable types thus to fit into Compact Regions, not quite likely a live in-mem database instance can do. >> >> >> So seems there are good reasons no successful DBMS, at least in-memory ones have been written in Haskell. >> >> >> Best regards, >> >> Compl >> >> >> >> >> >> On 2020-07-25, at 22:07, Ryan Yates > wrote: >> >> >> Unfortunately my STM benchmarks are rather disorganized. The most relevant paper using them is: >> >> >> Leveraging hardware TM in Haskell (PPoPP '19) >> >> https://dl.acm.org/doi/10.1145/3293883.3295711 >> >> Or my thesis: >> >> https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931 >> >> >> The PPoPP benchmarks are on a branch (or the releases tab on github): >> >> https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src >> >> >> >> All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited. >> >> >> Ryan >> >> >> >> On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe > wrote: >> >> Dear Cafe, >> >> As Chris Allen has suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of traditional HashMap for stm tx processing, under heavy concurrency, yet still with automatic parallelism as GHC implemented them. Then I realized that in addition to hash map (used to implement dicts and scopes), I also need to find a TreeMap replacement data structure to implement the db index. I've been focusing on the uniqueness constraint aspect, but it's still an index, needs to provide range scan api for db clients, so hash map is not sufficient for the index. >> >> I see Ryan shared the code benchmarking RBTree with stm in mind: >> >> https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput >> https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree >> But can't find conclusion or interpretation of that benchmark suite. 
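(For reference, the interface wanted here is easy to state over a single ordered map behind one TVar — a minimal sketch with hypothetical names, `OrderedIndex`/`scanRange`; the hard part, as this whole thread shows, is that the one TVar holding it then becomes the hot spot under concurrent writers:)

```haskell
import Control.Concurrent.STM
import qualified Data.Map.Strict as M

-- The naive shape of such an index: one ordered map behind one TVar.
-- Range scans are trivial; the price is that every update conflicts
-- with every other update on the same TVar.
type OrderedIndex k v = TVar (M.Map k v)

insertKey :: Ord k => k -> v -> OrderedIndex k v -> STM ()
insertKey k v idx = modifyTVar' idx (M.insert k v)

-- All entries with lo <= key < hi, in ascending key order (assumes lo < hi).
scanRange :: Ord k => k -> k -> OrderedIndex k v -> STM [(k, v)]
scanRange lo hi idx = do
  m <- readTVar idx
  let atLo        = maybe [] (\v -> [(lo, v)]) (M.lookup lo m)
      (_, gtLo)   = M.split lo m    -- keys strictly greater than lo
      (inside, _) = M.split hi gtLo -- of those, keys strictly less than hi
  pure (atLo ++ M.toAscList inside)
```

Sharding such an index across several TVars restores write scalability for uniqueness checks, but an ordered sub-range scan then has to visit every shard — which is exactly why an STM-friendly ordered container with finer-grained internal TVars is being sought.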
And here's a followup question: >> >> >> Where are some STM contention optimized data structures, that having keys ordered, with sub-range traversing api ? >> >> (of course production ready libraries most desirable) >> >> >> Thanks with regards, >> >> Compl >> >> >> On 2020/7/25 下午2:04, Compl Yue via Haskell-Cafe wrote: >> >> Shame on me for I have neither experienced with `perf`, I'd learn these essential tools soon to put them into good use. >> >> It's great to learn about how `orElse` actually works, I did get confused why there are so little retries captured, and now I know. So that little trick should definitely be removed before going production, as it does no much useful things at excessive cost. I put it there to help me understand internal working of stm, now I get even better knowledge ;-) >> >> I think a debugger will trap every single abort, isn't it annoying when many aborts would occur? If I'd like to count the number of aborts, ideally accounted per service endpoints, time periods, source modules etc. there some tricks for that? >> >> Thanks with best regards, >> >> Compl >> >> >> On 2020/7/25 上午2:02, Ryan Yates wrote: >> >> To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance, >> >> changing the size of heap objects can drastically change cache performance and completely different behavior can show up. >> >> >> [^1]: https://en.wikipedia.org/wiki/Perf_(Linux) >> >> The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... ) >> >> >> [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 >> >> >> All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence. >> >> >> The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line: >> >> >> https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 >> >> Ryan >> >> >> >> >> On Fri, Jul 24, 2020 at 12:35 PM Compl Yue > wrote: >> >> I'm not familiar with profiling GHC yet, may need more time to get myself proficient with it. >> >> And a bit more details of my test workload for diagnostic: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files, under a data dir within a shared filesystem, then mmap those data files and fill in actual array data. 
So the db server don't have much computation to perform, but puts the data file path into a global index, which at the same validates its uniqueness. As there are many client processes trying to insert one meta data record concurrently, with my naive implementation, the global index's TVar will almost always in locked state by one client after another, from a queue never fall empty. >> >> So if `readTVar` should spinning waiting, I doubt the threads should actually make high CPU utilization, because at any instant of time, all threads except the committing one will be doing that one thing. >> >> And I have something in my code to track STM retry like this: >> >> ``` >> >> -- blocking wait not expected, track stm retries explicitly >> >> trackSTM :: Int -> IO (Either () a) >> >> trackSTM !rtc = do >> >> when -- todo increase the threshold of reporting? >> >> (rtc > 0) $ do >> >> -- trace out the retries so the end users can be aware of them >> >> tid <- myThreadId >> >> trace >> >> ( "🔙\n" >> >> <> show callCtx >> >> <> "🌀 " >> >> <> show tid >> >> <> " stm retry #" >> >> <> show rtc >> >> ) >> >> $ return () >> >> atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case >> >> Nothing -> -- stm failed, do a tracked retry >> >> trackSTM (rtc + 1) >> >> Just ... -> ... >> >> ``` >> >> No such trace msg fires during my test, neither in single thread run, nor in runs with pressure. I'm sure this tracing mechanism works, as I can see such traces fire, in case e.g. posting a TMVar to a TQueue for some other thread to fill it, then read the result out, if these 2 ops are composed into a single tx, then of course it's infinite retry loop, and a sequence of such msgs are logged with ever increasing rtc #. >> >> So I believe no retry has ever been triggered. >> >> What can going on there? >> >> >> On 2020/7/24 下午11:46, Ryan Yates wrote: >> >> > Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed? >> >> >> Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while commiting, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set and subsequent commits will wake it up if its write set overlaps. >> >> >> I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS. >> >> >> Ryan >> >> >> >> On Fri, Jul 24, 2020 at 11:22 AM Compl Yue > wrote: >> >> Thanks very much for the insightful information Ryan! I'm glad my suspect was wrong about the Haskell scheduler: >> >> > The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released. >> >> So best effort had already been made in GHC and I just need to cooperate better with its design. 
Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed? >> >> >> Anyway, I see light with better data structures to improve my situation, let me try them and report back. Actually I later changed `TVar (HaskMap k v)` to be `TVar (HashMap k Int)` where the `Int` being array index into `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order preservation semantic of dict entries (like that in Python 3.7+), then it's very hopeful to incorporate stm-containers' Map or ttrie to approach free of contention. >> >> Thanks with regards, >> >> Compl >> >> >> On 2020/7/24 下午10:03, Ryan Yates wrote: >> >> Hi Compl, >> >> >> Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute then the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other thread, but when it is rescheduled, will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn. >> >> >> The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released. >> >> >> To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars. >> >> >> There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference. >> >> >> Ryan >> >> >> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe > wrote: >> >> Thanks Chris, I confess I didn't pay enough attention to STM specialized container libraries by far, I skimmed through the description of stm-containers and ttrie, and feel they would definitely improve my code's performance in case I limit the server's parallelism within hardware capabilities. That may because I'm still prototyping the api and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, only if at low contention (surely there're plenty of CPU cycles to be optimized out in next steps). 
I model my data after graph model, so most data, even most indices are localized to nodes and edges, those can be manipulated without conflict, that's why I assumed I have a low contention use case since the very beginning - until I found there are still (though minor) needs for global indices to guarantee global uniqueness, I feel faithful with stm-containers/ttrie to implement a more scalable global index data structure, thanks for hinting me. >> >> So an evident solution comes into my mind now, is to run the server with a pool of tx processing threads, matching number of CPU cores, client RPC requests then get queued to be executed in some thread from the pool. But I'm really fond of the mechanism of M:N scheduler which solves massive/dynamic concurrency so elegantly. I had some good result with Go in this regard, and see GHC at par in doing this, I don't want to give up this enjoyable machinery. >> >> But looked at the stm implementation in GHC, it seems written TVars are exclusively locked during commit of a tx, I suspect this is the culprit when there're large M lightweight threads scheduled upon a small N hardware capabilities, that is when a lightweight thread yield control during an stm transaction commit, the TVars it locked will stay so until it's scheduled again (and again) till it can finish the commit. This way, descheduled threads could hold live threads from progressing. I haven't gone into more details there, but wonder if there can be some improvement for GHC RTS to keep an stm committing thread from descheduled, but seemingly that may impose more starvation potential; or stm can be improved to have its TVar locks preemptable when the owner trec/thread is in descheduled state? Neither should be easy but I'd really love massive lightweight threads doing STM practically well. >> >> Best regards, >> >> Compl >> >> >> On 2020/7/24 上午12:57, Christopher Allen wrote: >> >> It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." You give up fairness when you use STM instead of MVars or equivalent structures. That means a long running transaction might get stampeded by many small ones invalidating it over and over. The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads. >> >> >> The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose. >> >> >> e.g. https://hackage.haskell.org/package/stm-containers >> https://hackage.haskell.org/package/ttrie >> >> It also sounds a bit like your question bumps into Amdahl's Law a bit. >> >> >> All else fails, stop using STM and find something more tuned to your problem space. 
>> >> >> Hope this helps, >> >> Chris Allen >> >> >> >> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe > wrote: >> >> Hello Cafe, >> >> >> I'm working on an in-memory database, in Client/Server mode I just let each connected client submit remote procedure call running in its dedicated lightweight thread, modifying TVars in RAM per its business needs, then in case many clients connected concurrently and trying to insert new data, if they are triggering global index (some TVar) update, the throughput would drop drastically. I reduced the shared state to a simple int counter by TVar, got same symptom. While the parallelism feels okay when there's no hot TVar conflicting, or M is not much greater than N. >> >> >> As an empirical test workload, I have a `+RTS -N10` server process, it handles 10 concurrent clients okay, got ~5x of single thread throughput; but in handling 20 concurrent clients, each of the 10 CPUs can only be driven to ~10% utilization, the throughput seems even worse than single thread. More clients can even drive it thrashing without much progressing. >> >> >> I can understand that pure STM doesn't scale well after reading [1], and I see it suggested [7] attractive and planned future work toward that direction. >> >> >> But I can't find certain libraries or frameworks addressing large M over small N scenarios, [1] experimented with designated N parallelism, and [7] is rather theoretical to my empirical needs. >> >> >> Can you direct me to some available library implementing the methodology proposed in [7] or other ways tackling this problem? >> >> >> I think the most difficult one is that a transaction should commit with global indices (with possibly unique constraints) atomically updated, and rollback with any violation of constraints, i.e. transactions have to cover global states like indices. Other problems seem more trivial than this. >> >> >> Specifically, [7] states: >> >> >> > It must be emphasized that all of the mechanisms we deploy originate, in one form or another, in the database literature from the 70s and 80s. Our contribution is to adapt these techniques to software transactional memory, providing more effective solutions to important STM problems than prior proposals. >> >> >> I wonder any STM based library has simplified those techniques to be composed right away? I don't really want to implement those mechanisms by myself, rebuilding many wheels from scratch. >> >> >> Best regards, >> >> Compl >> >> >> >> [1] Comparing the performance of concurrent linked-list implementations in Haskell >> >> https://simonmar.github.io/bib/papers/concurrent-data.pdf >> >> [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages 207–216. ACM Press, 2008. >> >> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. >> >> >> >> >> -- >> >> Chris Allen >> >> Currently working on http://haskellbook.com >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. 
>> >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. >> >> > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From compl.yue at icloud.com Thu Jul 30 08:27:27 2020 From: compl.yue at icloud.com (YueCompl) Date: Thu, 30 Jul 2020 16:27:27 +0800 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> <59bd91ed-a25e-d87a-1d22-3f72e0f80828@icloud.com> Message-ID: <9E832D24-E05D-4265-BDB2-D64685D683B3@icloud.com> Jo, I have some updates wrt nonmoving GC in another post to the list just now. And per my understanding, GHC's GC doesn't seek free segments within a heap, it instead will copy all live objects to a new heap after then swap the new heap to be the live one, so I assume memory (address space) fragmentation doesn't make much trouble for a GHC process, as for other runtimes. I suspect the difficulty resides in the detection of circular/cyclic circumstances wrt live data structures within the old heap, especially the circles form with arbitrary number of pointers of indirection. If the GC has to perform some dict lookup to decide if an object has been copied to new heap, that's O(n*log(n)) complexity in best case, where n is number of live objects in the heap. To efficiently copy circular structures, one optimization I can imagine is to have a `new ptr` field in every heap object, then in copying another object with a pointer to one object, the `new ptr` can be read out and if not nil, assign the pointer field on another object' in the new heap to that value and it's done; or copy one object' to the new heap, and update the field on one object in the old heap pointing to the new heap. But I don't know details of GHC GC and can't imagine even feasibility of this technique. And even the new nonmoving GC may have similar difficulty to jump out of a circle when following pointers. 
Regards, Compl > On 2020-07-30, at 15:24, Joachim Durchholz wrote: > > Am 30.07.20 um 07:31 schrieb Compl Yue via Haskell-Cafe: >> And now I realize what presuring GC in my situation is not only the large number of pointers (TVars), and at the same time, they form many circular structures, that might be nightmare for a GC. > > Cycles are relevant only for reference-counting collectors. > As far as I understand http://downloads.haskell.org/~ghc/latest/docs/html/users_guide/runtime_control.html, GHC offers only tracing collectors, and cycles are irrelevant there. > >> I'm still curious why the new non-moving GC in 8.10.1 still don't get obvious business progressing in my situation. I tested it on my Mac yesterday and there I don't know how to see how CPU time is distributed over threads within a process, I'll further test it with some Linux boxes to try understand it better. > > Hmm... can GHC's memory management fragment? > If that's the case, you may be seeing GC trying to find free blocks in fragmented memory, and having to re-run the GC cycle to free a block so there's enough contiguous memory. > It's a bit of a stretch, but it can happen, and testing that hypothesis would be relatively quick: Run the program with moving GC, observe running time and if it's still slow, check if the GC is actually eating CPU, or if it's merely waiting for other threads to respond to the stop-the-world signal. > > Regards, > Jo > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. From merijn at inconsistent.nl Thu Jul 30 08:32:26 2020 From: merijn at inconsistent.nl (Merijn Verstraaten) Date: Thu, 30 Jul 2020 10:32:26 +0200 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> <59bd91ed-a25e-d87a-1d22-3f72e0f80828@icloud.com> Message-ID: <2F5A676A-B698-4B25-90A5-EC8FCFAAA11D@inconsistent.nl> What I haven't seen anyone mention/ask yet is: Are you using the threaded runtime? (Presumably yes) And are you using high numbers of capabilities? (Like +RTS -N), because that will enable parallel GC, which has notoriously poor behaviour with default settings and high numbers of capabilities? I've seen 2 order of magnitude speedups in my own code by disabling the parallel GC in the threaded runtime. Cheers, Merijn > On 30 Jul 2020, at 10:00, YueCompl via Haskell-Cafe wrote: > > Update: nonmoving GC does make differences > > I think couldn't observe it because I set the heap -H2g rather large, and generation 0 are still collected by old moving GC which having difficulty in handling the large hazard heap. 
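(A cheap way to check, from inside the process, whether it is really the GC that eats the CPU — and whether `-qg`, `-qn` or `-xn` change that — is to poll GHC.Stats. A minimal sketch with a hypothetical name, `startGcReporter`; it only works when the program is started with `+RTS -T`, which makes the RTS collect these statistics:)

```haskell
import Control.Concurrent (forkIO, threadDelay)
import Control.Monad (forever, void, when)
import GHC.Stats

-- Report every 10 seconds how CPU time splits between mutator and GC.
startGcReporter :: IO ()
startGcReporter = do
  enabled <- getRTSStatsEnabled          -- True only when run with +RTS -T
  when enabled $ void $ forkIO $ forever $ do
    s <- getRTSStats
    let pct a b = if b == 0 then 0 else 100 * fromIntegral a / fromIntegral b :: Double
    putStrLn $ "major GCs: " <> show (major_gcs s)
            <> ", max live bytes: " <> show (max_live_bytes s)
            <> ", GC share of CPU: " <> show (pct (gc_cpu_ns s) (cpu_ns s)) <> "%"
    threadDelay 10000000
```

If the GC share stays high even with the parallel GC disabled (`-qg`) and a small allocation area, that points at the volume of live, pointer-heavy data rather than at collector parallelism.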
After I realize just now that nonmoving GC only works against oldest generation, I tested it again with `+RTS -H16m -A4m` with and without `-xn`, then: > > Without -xn (old moving GC in effect), the throughput degrades fast and stop business progressing at ~200MB of server RSS > > With -xn (new nonmvoing GC in effect), server RSS can burst to ~350MB, then throughput degrades relative slower, until RSS reached ~1GB, after then barely progressing at business yielding. But RSS can keep growing with occasional burst fashioned business yield, until ~3.3GB then it totally stuck. > > Regards, > Compl > > >> On 2020-07-30, at 13:31, Compl Yue via Haskell-Cafe wrote: >> >> Thanks Ryan, and I'm honored to get Simon's attention. >> >> I did have some worry about package tskiplist, that its github repository seems withdrawn, I emailed the maintainer Peter Robinson lately but have gotten no response by far. What particularly worrying me is the 1st sentence of the Readme has changed from 1.0.0 to 1.0.1 (which is current) as: >> >> > - This package provides an implementation of a skip list in STM. >> >> >+ This package provides a proof-of-concept implementation of a skip list in STM >> >> This has to mean something but I can't figure out yet. >> >> Dear Peter Robinson, I hope you can see this message and get in the loop of discussion. >> >> Despite that, I don't think overhead of TVar itself the most serious issue in my situation, as before GC engagement, there are as many TVars being allocated and updated without stuck at business progressing. And now I realize what presuring GC in my situation is not only the large number of pointers (TVars), and at the same time, they form many circular structures, that might be nightmare for a GC. As I model my data after graph model, in my test workload, there are many FeatureSet instances each being an entity/node object, then there are many Feature instances per FeatureSet object, each Feature instance being an unary relationship/edge object, with a reference attribute (via TVar) pointing to the FeatureSet object it belongs to, circular structures form because I maintain an index at each FeatureSet object, sorted by weight etc., but ultimately pointing back (via TVar) to all Feature objects belonging to the set. >> >> I'm still curious why the new non-moving GC in 8.10.1 still don't get obvious business progressing in my situation. I tested it on my Mac yesterday and there I don't know how to see how CPU time is distributed over threads within a process, I'll further test it with some Linux boxes to try understand it better. >> >> Best regards, >> >> Compl >> >> >> >> On 2020/7/30 上午10:05, Ryan Yates wrote: >>> Simon, I certainly want to help get to the bottom of the performance issue at hand :D. Sorry if my reply was misleading. The constant factor overhead of pushing `TVar`s into the internal structure may be pressuring unacceptable GC behavior to happen sooner. My impression was that given the same size problem performance loss shifted from synchronization to GC. >>> >>> Compl, I'm not aware of mutable heap objects being problematic in particular for GHC's GC. There are lots of special cases to handle them of course. I have successfully written Haskell programs that get good performance from the GC with the dominant fraction of heap objects being mutable. I looked a little more at `TSkipList` and one tricky aspect of an STM based skip list is how to manage randomness. 
In `TSkipList`'s code there is the following comment: >>> >>> -- | Returns a randomly chosen level. Used for inserting new elements. /O(1)./ >>> -- For performance reasons, this function uses 'unsafePerformIO' to access the >>> -- random number generator. (It would be possible to store the random number >>> -- generator in a 'TVar' and thus be able to access it safely from within the >>> -- STM monad. This, however, might cause high contention among threads.) >>> chooseLevel :: TSkipList k a -> Int >>> >>> This level is chosen on insertion to determine the height of the node. When writing my own STM skiplist I found that the details in unsafely accessing randomness had a significant impact on performance. We went with an unboxed array of PCG states that had an entry for each capability giving constant memory overhead in the number of capabilities. `TSkipList` uses `newStdGen` which involves allocation and synchronization. >>> >>> Again, I'm not pointing this out to say that this is the entirety of the issue you are encountering, rather, I do think the `TSkipList` library could be improved to allocate much less. Others can speak to how to tell where the time is going in GC (my knowledge of this is likely out of date). >>> >>> Ryan >>> >>> >>> On Wed, Jul 29, 2020 at 4:57 PM Simon Peyton Jones wrote: >>> Compl’s problem is (apparently) that execution becomes dominated by GC. That doesn’t sound like a constant-factor overhead from TVars, no matter how efficient (or otherwise) they are. It sounds more like a space leak to me; perhaps you need some strict evaluation or something. >>> >>> >>> My point is only: before re-engineering STM it would make sense to get a much more detailed insight into what is actually happening, and where the space and time is going. We have tools to do this (heap profiling, Threadscope, …) but I know they need some skill and insight to use well. But we don’t have nearly enough insight to draw meaningful conclusions yet. >>> >>> >>> Maybe someone with experience of performance debugging might feel able to help Compl? >>> >>> >>> Simon >>> >>> >>> From: Haskell-Cafe On Behalf Of Ryan Yates >>> Sent: 29 July 2020 20:41 >>> To: YueCompl >>> Cc: Haskell Cafe >>> Subject: Re: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? >>> >>> >>> Hi Compl, >>> >>> >>> There is a lot of overhead with TVars. My thesis work addresses this by incorporating mutable constructor fields with STM. I would like to get all that into GHC as soon as I can :D. I haven't looked closely at the `tskiplist` package, I'll take a look and see if I see any potential issues. There was some recent work on concurrent B-tree that may be interesting to try. >>> >>> >>> Ryan >>> >>> >>> On Wed, Jul 29, 2020 at 10:24 AM YueCompl wrote: >>> >>> Hi Cafe and Ryan, >>> >>> >>> I tried Map/Set from stm-containers and TSkipList (added range scan api against its internal data structure) from http://hackage.haskell.org/package/tskiplist , with them I've got quite improved at scalability on concurrency. 
>>> >>> >>> But unfortunately then I hit another wall at single thread scalability over working memory size, I suspect it's because massively more TVars (those being pointers per se) are introduced by those "contention-free" data structures, they need to mutate separate pointers concurrently in avoiding contentions anyway, but such pointer-intensive heap seems imposing extraordinary pressure to GHC's garbage collector, that GC will dominate CPU utilization with poor business progress. >>> >>> >>> For example in my test, I use `+RTS -H2g` for the Haskell server process, so GC is not triggered until after a while, then spin off 3 Python client to insert new records concurrently, in the first stage each Python process happily taking ~90% CPU filling (through local mmap) the arrays allocated from the server and logs of success scroll quickly, while the server process utilizes only 30~40% CPU to serve those 3 clients (insert meta data records into unique indices merely); then the client processes' CPU utilization drop drastically once Haskell server process' private memory reached around 2gb, i.e. GC started engaging, the server process's CPU utilization quickly approaches ~300%, while all client processes' drop to 0% for most of the time, and occasionally burst a tiny while with some log output showing progress. And I disable parallel GC lately, enabling parallel GC only makes it worse. >>> >>> >>> If I comment out the code updating the indices (those creating many TVars), the overall throughput only drop slowly as more data are inserted, the parallelism feels steady even after the server process' private memory takes several GBs. >>> >>> >>> I didn't expect this, but appears to me that GC of GHC is really not good at handling massive number of pointers in the heap, while those pointers are essential to reduce contention (and maybe expensive data copying too) at heavy parallelism/concurrency. >>> >>> >>> Btw I tried `+RTS -xn` with GHC 8.10.1 too, no obvious different behavior compared to 8.8.3; and also tried tweaking GC related RTS options a bit, including increasing -G up to 10, no much difference too. >>> >>> >>> I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime ... >>> >>> >>> Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html in searching about the symptoms, and don't feel likely to convert my DB managed data into immutable types thus to fit into Compact Regions, not quite likely a live in-mem database instance can do. >>> >>> >>> So seems there are good reasons no successful DBMS, at least in-memory ones have been written in Haskell. >>> >>> >>> Best regards, >>> >>> Compl >>> >>> >>> >>> >>> >>> On 2020-07-25, at 22:07, Ryan Yates wrote: >>> >>> >>> Unfortunately my STM benchmarks are rather disorganized. The most relevant paper using them is: >>> >>> >>> Leveraging hardware TM in Haskell (PPoPP '19) >>> >>> https://dl.acm.org/doi/10.1145/3293883.3295711 >>> >>> >>> Or my thesis: >>> >>> https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931 >>> >>> >>> The PPoPP benchmarks are on a branch (or the releases tab on github): >>> >>> https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src >>> >>> >>> >>> All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited. 
>>> >>> >>> Ryan >>> >>> >>> >>> On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe wrote: >>> >>> Dear Cafe, >>> >>> As Chris Allen has suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of traditional HashMap for stm tx processing, under heavy concurrency, yet still with automatic parallelism as GHC implemented them. Then I realized that in addition to hash map (used to implement dicts and scopes), I also need to find a TreeMap replacement data structure to implement the db index. I've been focusing on the uniqueness constraint aspect, but it's still an index, needs to provide range scan api for db clients, so hash map is not sufficient for the index. >>> >>> I see Ryan shared the code benchmarking RBTree with stm in mind: >>> >>> https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput >>> >>> https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree >>> >>> But can't find conclusion or interpretation of that benchmark suite. And here's a followup question: >>> >>> >>> Where are some STM contention optimized data structures, that having keys ordered, with sub-range traversing api ? >>> >>> (of course production ready libraries most desirable) >>> >>> >>> Thanks with regards, >>> >>> Compl >>> >>> >>> On 2020/7/25 下午2:04, Compl Yue via Haskell-Cafe wrote: >>> >>> Shame on me for I have neither experienced with `perf`, I'd learn these essential tools soon to put them into good use. >>> >>> It's great to learn about how `orElse` actually works, I did get confused why there are so little retries captured, and now I know. So that little trick should definitely be removed before going production, as it does no much useful things at excessive cost. I put it there to help me understand internal working of stm, now I get even better knowledge ;-) >>> >>> I think a debugger will trap every single abort, isn't it annoying when many aborts would occur? If I'd like to count the number of aborts, ideally accounted per service endpoints, time periods, source modules etc. there some tricks for that? >>> >>> Thanks with best regards, >>> >>> Compl >>> >>> >>> On 2020/7/25 上午2:02, Ryan Yates wrote: >>> >>> To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance, >>> >>> changing the size of heap objects can drastically change cache performance and completely different behavior can show up. >>> >>> >>> [^1]: https://en.wikipedia.org/wiki/Perf_(Linux) >>> >>> >>> The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... ) >>> >>> >>> [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 >>> >>> >>> All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence. >>> >>> >>> The code you have will count commits of executions of `retry`. 
Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line: >>> >>> >>> https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 >>> >>> >>> Ryan >>> >>> >>> >>> >>> On Fri, Jul 24, 2020 at 12:35 PM Compl Yue wrote: >>> >>> I'm not familiar with profiling GHC yet, may need more time to get myself proficient with it. >>> >>> And a bit more details of my test workload for diagnostic: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files, under a data dir within a shared filesystem, then mmap those data files and fill in actual array data. So the db server don't have much computation to perform, but puts the data file path into a global index, which at the same validates its uniqueness. As there are many client processes trying to insert one meta data record concurrently, with my naive implementation, the global index's TVar will almost always in locked state by one client after another, from a queue never fall empty. >>> >>> So if `readTVar` should spinning waiting, I doubt the threads should actually make high CPU utilization, because at any instant of time, all threads except the committing one will be doing that one thing. >>> >>> And I have something in my code to track STM retry like this: >>> >>> ``` >>> >>> -- blocking wait not expected, track stm retries explicitly >>> >>> trackSTM :: Int -> IO (Either () a) >>> >>> trackSTM !rtc = do >>> >>> when -- todo increase the threshold of reporting? >>> >>> (rtc > 0) $ do >>> >>> -- trace out the retries so the end users can be aware of them >>> >>> tid <- myThreadId >>> >>> trace >>> >>> ( "🔙\n" >>> >>> <> show callCtx >>> >>> <> "🌀 " >>> >>> <> show tid >>> >>> <> " stm retry #" >>> >>> <> show rtc >>> >>> ) >>> >>> $ return () >>> >>> atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case >>> >>> Nothing -> -- stm failed, do a tracked retry >>> >>> trackSTM (rtc + 1) >>> >>> Just ... -> ... >>> >>> ``` >>> >>> No such trace msg fires during my test, neither in single thread run, nor in runs with pressure. I'm sure this tracing mechanism works, as I can see such traces fire, in case e.g. posting a TMVar to a TQueue for some other thread to fill it, then read the result out, if these 2 ops are composed into a single tx, then of course it's infinite retry loop, and a sequence of such msgs are logged with ever increasing rtc #. >>> >>> So I believe no retry has ever been triggered. >>> >>> What can going on there? >>> >>> >>> On 2020/7/24 下午11:46, Ryan Yates wrote: >>> >>> > Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed? >>> >>> >>> Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. 
If a lock is observed while commiting, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set and subsequent commits will wake it up if its write set overlaps. >>> >>> >>> I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS. >>> >>> >>> Ryan >>> >>> >>> >>> On Fri, Jul 24, 2020 at 11:22 AM Compl Yue wrote: >>> >>> Thanks very much for the insightful information Ryan! I'm glad my suspect was wrong about the Haskell scheduler: >>> >>> > The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released. >>> >>> So best effort had already been made in GHC and I just need to cooperate better with its design. Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed? >>> >>> >>> Anyway, I see light with better data structures to improve my situation, let me try them and report back. Actually I later changed `TVar (HaskMap k v)` to be `TVar (HashMap k Int)` where the `Int` being array index into `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order preservation semantic of dict entries (like that in Python 3.7+), then it's very hopeful to incorporate stm-containers' Map or ttrie to approach free of contention. >>> >>> Thanks with regards, >>> >>> Compl >>> >>> >>> On 2020/7/24 下午10:03, Ryan Yates wrote: >>> >>> Hi Compl, >>> >>> >>> Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute then the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other thread, but when it is rescheduled, will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn. >>> >>> >>> The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released. >>> >>> >>> To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. 
I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars. >>> >>> >>> There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference. >>> >>> >>> Ryan >>> >>> >>> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe wrote: >>> >>> Thanks Chris, I confess I didn't pay enough attention to STM specialized container libraries by far, I skimmed through the description of stm-containers and ttrie, and feel they would definitely improve my code's performance in case I limit the server's parallelism within hardware capabilities. That may because I'm still prototyping the api and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, only if at low contention (surely there're plenty of CPU cycles to be optimized out in next steps). I model my data after graph model, so most data, even most indices are localized to nodes and edges, those can be manipulated without conflict, that's why I assumed I have a low contention use case since the very beginning - until I found there are still (though minor) needs for global indices to guarantee global uniqueness, I feel faithful with stm-containers/ttrie to implement a more scalable global index data structure, thanks for hinting me. >>> >>> So an evident solution comes into my mind now, is to run the server with a pool of tx processing threads, matching number of CPU cores, client RPC requests then get queued to be executed in some thread from the pool. But I'm really fond of the mechanism of M:N scheduler which solves massive/dynamic concurrency so elegantly. I had some good result with Go in this regard, and see GHC at par in doing this, I don't want to give up this enjoyable machinery. >>> >>> But looked at the stm implementation in GHC, it seems written TVars are exclusively locked during commit of a tx, I suspect this is the culprit when there're large M lightweight threads scheduled upon a small N hardware capabilities, that is when a lightweight thread yield control during an stm transaction commit, the TVars it locked will stay so until it's scheduled again (and again) till it can finish the commit. This way, descheduled threads could hold live threads from progressing. I haven't gone into more details there, but wonder if there can be some improvement for GHC RTS to keep an stm committing thread from descheduled, but seemingly that may impose more starvation potential; or stm can be improved to have its TVar locks preemptable when the owner trec/thread is in descheduled state? Neither should be easy but I'd really love massive lightweight threads doing STM practically well. >>> >>> Best regards, >>> >>> Compl >>> >>> >>> On 2020/7/24 上午12:57, Christopher Allen wrote: >>> >>> It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." You give up fairness when you use STM instead of MVars or equivalent structures. That means a long running transaction might get stampeded by many small ones invalidating it over and over. 
The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads. >>> >>> >>> The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose. >>> >>> >>> e.g. https://hackage.haskell.org/package/stm-containers >>> >>> https://hackage.haskell.org/package/ttrie >>> >>> >>> It also sounds a bit like your question bumps into Amdahl's Law a bit. >>> >>> >>> All else fails, stop using STM and find something more tuned to your problem space. >>> >>> >>> Hope this helps, >>> >>> Chris Allen >>> >>> >>> >>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe wrote: >>> >>> Hello Cafe, >>> >>> >>> I'm working on an in-memory database, in Client/Server mode I just let each connected client submit remote procedure call running in its dedicated lightweight thread, modifying TVars in RAM per its business needs, then in case many clients connected concurrently and trying to insert new data, if they are triggering global index (some TVar) update, the throughput would drop drastically. I reduced the shared state to a simple int counter by TVar, got same symptom. While the parallelism feels okay when there's no hot TVar conflicting, or M is not much greater than N. >>> >>> >>> As an empirical test workload, I have a `+RTS -N10` server process, it handles 10 concurrent clients okay, got ~5x of single thread throughput; but in handling 20 concurrent clients, each of the 10 CPUs can only be driven to ~10% utilization, the throughput seems even worse than single thread. More clients can even drive it thrashing without much progressing. >>> >>> >>> I can understand that pure STM doesn't scale well after reading [1], and I see it suggested [7] attractive and planned future work toward that direction. >>> >>> >>> But I can't find certain libraries or frameworks addressing large M over small N scenarios, [1] experimented with designated N parallelism, and [7] is rather theoretical to my empirical needs. >>> >>> >>> Can you direct me to some available library implementing the methodology proposed in [7] or other ways tackling this problem? >>> >>> >>> I think the most difficult one is that a transaction should commit with global indices (with possibly unique constraints) atomically updated, and rollback with any violation of constraints, i.e. transactions have to cover global states like indices. Other problems seem more trivial than this. >>> >>> >>> Specifically, [7] states: >>> >>> >>> > It must be emphasized that all of the mechanisms we deploy originate, in one form or another, in the database literature from the 70s and 80s. Our contribution is to adapt these techniques to software transactional memory, providing more effective solutions to important STM problems than prior proposals. >>> >>> >>> I wonder any STM based library has simplified those techniques to be composed right away? 
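(For concreteness, a rough sketch of the plain-STM shape of that requirement — the global unique index updated in the same transaction that inserts the record, with violations detected before commit. Types such as `DataFilePath` and `RecordId` are hypothetical stand-ins, and the container is stm-containers' `StmContainers.Map`; this is not code from any library mentioned in the thread:)

```
{-# LANGUAGE LambdaCase #-}
import           Control.Concurrent.STM
import           Data.Text              (Text)
import qualified StmContainers.Map      as StmMap

-- Hypothetical stand-ins for the real schema.
type DataFilePath = Text
type RecordId     = Int

newtype UniqueViolation = UniqueViolation DataFilePath
  deriving Show

-- The global index: data file path -> id of the record that registered it.
type PathIndex = StmMap.Map DataFilePath RecordId

-- Registers the path only if it is not already taken.  It runs inside the
-- same transaction as the rest of the insert, so either everything commits
-- or nothing does.
registerPath :: PathIndex -> DataFilePath -> RecordId -> STM (Either UniqueViolation ())
registerPath idx path rid =
  StmMap.lookup path idx >>= \case
    Just _  -> pure (Left (UniqueViolation path))
    Nothing -> Right <$> StmMap.insert rid path idx
```

Because this is ordinary `STM`, it composes with whatever other `TVar` writes the surrounding transaction performs; what a boosted design in the spirit of [7] would add is avoiding the contention on the index itself.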
I don't really want to implement those mechanisms by myself, rebuilding many wheels from scratch. >>> >>> >>> Best regards, >>> >>> Compl >>> >>> >>> >>> [1] Comparing the performance of concurrent linked-list implementations in Haskell >>> >>> https://simonmar.github.io/bib/papers/concurrent-data.pdf >>> >>> >>> [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages 207–216. ACM Press, 2008. >>> >>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >>> >>> >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed to post. >>> >>> >>> >>> >>> -- >>> >>> Chris Allen >>> >>> Currently working on http://haskellbook.com >>> >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed to post. >>> >>> >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed to post. >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed to post. >>> >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed to post. >>> >>> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 833 bytes Desc: Message signed with OpenPGP URL: From compl.yue at icloud.com Thu Jul 30 08:59:24 2020 From: compl.yue at icloud.com (YueCompl) Date: Thu, 30 Jul 2020 16:59:24 +0800 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? 
In-Reply-To: <2F5A676A-B698-4B25-90A5-EC8FCFAAA11D@inconsistent.nl> References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> <59bd91ed-a25e-d87a-1d22-3f72e0f80828@icloud.com> <2F5A676A-B698-4B25-90A5-EC8FCFAAA11D@inconsistent.nl> Message-ID: Hi Merijn, Yes I always use -threaded even for single thread test. I did my tests with `+RTS -N10 -A128m -qg -I0` by default, and tinkered with `-qn5 -qb1 -qg1`, `-G3`, `-G5`, even `-G10` and some slightly tuned combinations, all with no apparent improvement. And yes I've discovered parallel GC terribly affecting the throughput (easily thrashing with high number of concurrent driving clients, inducing high portion of kernel CPU utilization with little business progress), so lately I prefer to disable it `-qg` or at least limit number of participant capabilities with `-qn1` ~ `-qn5`. Btw, it feels like once an RTS option is added by `-with-rtsopts=` at compile time, the same option can not be overridden from command line, I had thought command line `+RTS xx` will always take highest precedence and override other sources (I see env var GHCRTS documented but haven't used it yet), but appears compile time `-with-rtsopts=` is final, so I lately compile only with ghc-options: -Wall -threaded -rtsopts in the `executable` section of my .cabal file, and test various RTS options on command line per each run. Thanks with regards, Compl > On 2020-07-30, at 16:32, Merijn Verstraaten wrote: > > What I haven't seen anyone mention/ask yet is: Are you using the threaded runtime? (Presumably yes) And are you using high numbers of capabilities? (Like +RTS -N), because that will enable parallel GC, which has notoriously poor behaviour with default settings and high numbers of capabilities? > > I've seen 2 order of magnitude speedups in my own code by disabling the parallel GC in the threaded runtime. > > Cheers, > Merijn > >> On 30 Jul 2020, at 10:00, YueCompl via Haskell-Cafe wrote: >> >> Update: nonmoving GC does make differences >> >> I think couldn't observe it because I set the heap -H2g rather large, and generation 0 are still collected by old moving GC which having difficulty in handling the large hazard heap. After I realize just now that nonmoving GC only works against oldest generation, I tested it again with `+RTS -H16m -A4m` with and without `-xn`, then: >> >> Without -xn (old moving GC in effect), the throughput degrades fast and stop business progressing at ~200MB of server RSS >> >> With -xn (new nonmvoing GC in effect), server RSS can burst to ~350MB, then throughput degrades relative slower, until RSS reached ~1GB, after then barely progressing at business yielding. But RSS can keep growing with occasional burst fashioned business yield, until ~3.3GB then it totally stuck. >> >> Regards, >> Compl >> >> >>> On 2020-07-30, at 13:31, Compl Yue via Haskell-Cafe wrote: >>> >>> Thanks Ryan, and I'm honored to get Simon's attention. >>> >>> I did have some worry about package tskiplist, that its github repository seems withdrawn, I emailed the maintainer Peter Robinson lately but have gotten no response by far. What particularly worrying me is the 1st sentence of the Readme has changed from 1.0.0 to 1.0.1 (which is current) as: >>> >>>> - This package provides an implementation of a skip list in STM. 
>>> >>>>+ This package provides a proof-of-concept implementation of a skip list in STM >>> >>> This has to mean something but I can't figure out yet. >>> >>> Dear Peter Robinson, I hope you can see this message and get in the loop of discussion. >>> >>> Despite that, I don't think the overhead of TVars themselves is the most serious issue in my situation: before GC kicks in, just as many TVars are being allocated and updated without business progress getting stuck. And now I realize that what is pressuring the GC in my situation is not only the large number of pointers (TVars), but also that they form many circular structures, which might be a nightmare for a GC. As I model my data after a graph model, in my test workload there are many FeatureSet instances, each being an entity/node object, and many Feature instances per FeatureSet object, each Feature instance being a unary relationship/edge object with a reference attribute (via TVar) pointing to the FeatureSet object it belongs to; circular structures form because I maintain an index at each FeatureSet object, sorted by weight etc., that ultimately points back (via TVar) to all Feature objects belonging to the set. >>> >>> I'm still curious why the new non-moving GC in 8.10.1 still doesn't yield obvious business progress in my situation. I tested it on my Mac yesterday, where I don't know how to see how CPU time is distributed over the threads within a process; I'll further test it on some Linux boxes to try to understand it better. >>> >>> Best regards, >>> >>> Compl >>> >>> >>> >>> On 2020/7/30 上午10:05, Ryan Yates wrote: >>>> Simon, I certainly want to help get to the bottom of the performance issue at hand :D. Sorry if my reply was misleading. The constant factor overhead of pushing `TVar`s into the internal structure may be pressuring unacceptable GC behavior to happen sooner. My impression was that given the same size problem performance loss shifted from synchronization to GC. >>>> >>>> Compl, I'm not aware of mutable heap objects being problematic in particular for GHC's GC. There are lots of special cases to handle them of course. I have successfully written Haskell programs that get good performance from the GC with the dominant fraction of heap objects being mutable. I looked a little more at `TSkipList` and one tricky aspect of an STM based skip list is how to manage randomness. In `TSkipList`'s code there is the following comment: >>>> >>>> -- | Returns a randomly chosen level. Used for inserting new elements. /O(1)./ >>>> -- For performance reasons, this function uses 'unsafePerformIO' to access the >>>> -- random number generator. (It would be possible to store the random number >>>> -- generator in a 'TVar' and thus be able to access it safely from within the >>>> -- STM monad. This, however, might cause high contention among threads.) >>>> chooseLevel :: TSkipList k a -> Int >>>> >>>> This level is chosen on insertion to determine the height of the node. When writing my own STM skip list I found that the details of unsafely accessing randomness had a significant impact on performance. We went with an unboxed array of PCG states that had an entry for each capability, giving constant memory overhead in the number of capabilities. `TSkipList` uses `newStdGen`, which involves allocation and synchronization. >>>> >>>> Again, I'm not pointing this out to say that this is the entirety of the issue you are encountering; rather, I do think the `TSkipList` library could be improved to allocate much less.
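(A rough sketch of that per-capability generator idea — not Ryan's actual implementation, and using a simple xorshift step where his used PCG — just to show the allocation-free shape:)

```
import           Control.Concurrent          (getNumCapabilities, myThreadId,
                                              threadCapability)
import           Data.Bits                   (countTrailingZeros, shiftL, shiftR, xor)
import           Data.Word                   (Word64)
import qualified Data.Vector.Unboxed.Mutable as VUM

-- One unboxed generator state per capability: no allocation per call and no
-- shared state for the capabilities to contend on.
newtype LevelGens = LevelGens (VUM.IOVector Word64)

newLevelGens :: IO LevelGens
newLevelGens = do
  n  <- getNumCapabilities
  gs <- VUM.new n
  -- Arbitrary non-zero seeds; xorshift never leaves zero once seeded non-zero.
  mapM_ (\i -> VUM.write gs i (fromIntegral i * 2654435761 + 1)) [0 .. n - 1]
  pure (LevelGens gs)

-- xorshift64: a cheap, allocation-free pseudo-random step.
step :: Word64 -> Word64
step s0 =
  let s1 = s0 `xor` (s0 `shiftL` 13)
      s2 = s1 `xor` (s1 `shiftR` 7)
  in  s2 `xor` (s2 `shiftL` 17)

-- Geometric level choice: count trailing zero bits, capped at maxLevel.
-- A race between two threads sharing a capability is harmless here, since
-- the state is only a source of randomness.
chooseLevel :: LevelGens -> Int -> IO Int
chooseLevel (LevelGens gs) maxLevel = do
  (cap, _) <- threadCapability =<< myThreadId
  let i = cap `mod` VUM.length gs   -- guard against -N changing at run time
  s <- VUM.read gs i
  let s' = step s
  VUM.write gs i s'
  pure (min maxLevel (countTrailingZeros s' + 1))
```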
Others can speak to how to tell where the time is going in GC (my knowledge of this is likely out of date). >>>> >>>> Ryan >>>> >>>> >>>> On Wed, Jul 29, 2020 at 4:57 PM Simon Peyton Jones wrote: >>>> Compl’s problem is (apparently) that execution becomes dominated by GC. That doesn’t sound like a constant-factor overhead from TVars, no matter how efficient (or otherwise) they are. It sounds more like a space leak to me; perhaps you need some strict evaluation or something. >>>> >>>> >>>> My point is only: before re-engineering STM it would make sense to get a much more detailed insight into what is actually happening, and where the space and time is going. We have tools to do this (heap profiling, Threadscope, …) but I know they need some skill and insight to use well. But we don’t have nearly enough insight to draw meaningful conclusions yet. >>>> >>>> >>>> Maybe someone with experience of performance debugging might feel able to help Compl? >>>> >>>> >>>> Simon >>>> >>>> >>>> From: Haskell-Cafe On Behalf Of Ryan Yates >>>> Sent: 29 July 2020 20:41 >>>> To: YueCompl >>>> Cc: Haskell Cafe >>>> Subject: Re: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? >>>> >>>> >>>> Hi Compl, >>>> >>>> >>>> There is a lot of overhead with TVars. My thesis work addresses this by incorporating mutable constructor fields with STM. I would like to get all that into GHC as soon as I can :D. I haven't looked closely at the `tskiplist` package, I'll take a look and see if I see any potential issues. There was some recent work on concurrent B-tree that may be interesting to try. >>>> >>>> >>>> Ryan >>>> >>>> >>>> On Wed, Jul 29, 2020 at 10:24 AM YueCompl wrote: >>>> >>>> Hi Cafe and Ryan, >>>> >>>> >>>> I tried Map/Set from stm-containers and TSkipList (added range scan api against its internal data structure) from http://hackage.haskell.org/package/tskiplist , with them I've got quite improved at scalability on concurrency. >>>> >>>> >>>> But unfortunately then I hit another wall at single thread scalability over working memory size, I suspect it's because massively more TVars (those being pointers per se) are introduced by those "contention-free" data structures, they need to mutate separate pointers concurrently in avoiding contentions anyway, but such pointer-intensive heap seems imposing extraordinary pressure to GHC's garbage collector, that GC will dominate CPU utilization with poor business progress. >>>> >>>> >>>> For example in my test, I use `+RTS -H2g` for the Haskell server process, so GC is not triggered until after a while, then spin off 3 Python client to insert new records concurrently, in the first stage each Python process happily taking ~90% CPU filling (through local mmap) the arrays allocated from the server and logs of success scroll quickly, while the server process utilizes only 30~40% CPU to serve those 3 clients (insert meta data records into unique indices merely); then the client processes' CPU utilization drop drastically once Haskell server process' private memory reached around 2gb, i.e. GC started engaging, the server process's CPU utilization quickly approaches ~300%, while all client processes' drop to 0% for most of the time, and occasionally burst a tiny while with some log output showing progress. And I disable parallel GC lately, enabling parallel GC only makes it worse. 
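(One relatively cheap way to see how much of the CPU time is going to GC rather than to the mutator, short of a full heap profile: sample `GHC.Stats` from a background thread. A small sketch — it needs the process to run with `+RTS -T`, and it is not taken from the actual server code:)

```
import           Control.Concurrent (forkIO, threadDelay)
import           Control.Monad      (forever, when)
import           GHC.Stats          (RTSStats (..), getRTSStats, getRTSStatsEnabled)
import           Text.Printf        (printf)

-- Periodically log how CPU time splits between mutator and GC, plus a couple
-- of heap figures.  Only does anything when the stats are enabled (+RTS -T).
logGcShare :: IO ()
logGcShare = do
  enabled <- getRTSStatsEnabled
  when enabled $ do
    _ <- forkIO $ forever $ do
      s <- getRTSStats
      let gc  = fromIntegral (gc_cpu_ns s)      :: Double
          mut = fromIntegral (mutator_cpu_ns s) :: Double
          pct = if gc + mut > 0 then 100 * gc / (gc + mut) else 0
      printf "GC share of CPU: %.1f%%  major GCs: %d  max live: %d MB\n"
             pct (major_gcs s) (max_live_bytes s `div` (1024 * 1024))
      threadDelay 5000000            -- every 5 seconds
    pure ()
```

If the logged GC share climbs toward 100% as the heap grows, that corroborates the picture above, and heap profiling or ThreadScope can then show which structures are retaining the memory.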
>>>> >>>> >>>> If I comment out the code updating the indices (those creating many TVars), the overall throughput only drop slowly as more data are inserted, the parallelism feels steady even after the server process' private memory takes several GBs. >>>> >>>> >>>> I didn't expect this, but appears to me that GC of GHC is really not good at handling massive number of pointers in the heap, while those pointers are essential to reduce contention (and maybe expensive data copying too) at heavy parallelism/concurrency. >>>> >>>> >>>> Btw I tried `+RTS -xn` with GHC 8.10.1 too, no obvious different behavior compared to 8.8.3; and also tried tweaking GC related RTS options a bit, including increasing -G up to 10, no much difference too. >>>> >>>> >>>> I feel hopeless at the moment, wondering if I'll have to rewrite this in-memory db in Go/Rust or some other runtime ... >>>> >>>> >>>> Btw I read https://tech.channable.com/posts/2020-04-07-lessons-in-managing-haskell-memory.html in searching about the symptoms, and don't feel likely to convert my DB managed data into immutable types thus to fit into Compact Regions, not quite likely a live in-mem database instance can do. >>>> >>>> >>>> So seems there are good reasons no successful DBMS, at least in-memory ones have been written in Haskell. >>>> >>>> >>>> Best regards, >>>> >>>> Compl >>>> >>>> >>>> >>>> >>>> >>>> On 2020-07-25, at 22:07, Ryan Yates wrote: >>>> >>>> >>>> Unfortunately my STM benchmarks are rather disorganized. The most relevant paper using them is: >>>> >>>> >>>> Leveraging hardware TM in Haskell (PPoPP '19) >>>> >>>> https://dl.acm.org/doi/10.1145/3293883.3295711 >>>> >>>> >>>> Or my thesis: >>>> >>>> https://urresearch.rochester.edu/institutionalPublicationPublicView.action?institutionalItemId=34931 >>>> >>>> >>>> The PPoPP benchmarks are on a branch (or the releases tab on github): >>>> >>>> https://github.com/fryguybob/ghc-stm-benchmarks/tree/wip/mutable-fields/benchmarks/PPoPP2019/src >>>> >>>> >>>> >>>> All that to say, without an implementation of mutable constructor fields (which I'm working on getting into GHC) the scaling is limited. >>>> >>>> >>>> Ryan >>>> >>>> >>>> >>>> On Sat, Jul 25, 2020 at 3:45 AM Compl Yue via Haskell-Cafe wrote: >>>> >>>> Dear Cafe, >>>> >>>> As Chris Allen has suggested, I learned that https://hackage.haskell.org/package/stm-containers and https://hackage.haskell.org/package/ttrie can help a lot when used in place of traditional HashMap for stm tx processing, under heavy concurrency, yet still with automatic parallelism as GHC implemented them. Then I realized that in addition to hash map (used to implement dicts and scopes), I also need to find a TreeMap replacement data structure to implement the db index. I've been focusing on the uniqueness constraint aspect, but it's still an index, needs to provide range scan api for db clients, so hash map is not sufficient for the index. >>>> >>>> I see Ryan shared the code benchmarking RBTree with stm in mind: >>>> >>>> https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree-Throughput >>>> >>>> https://github.com/fryguybob/ghc-stm-benchmarks/tree/master/benchmarks/RBTree >>>> >>>> But can't find conclusion or interpretation of that benchmark suite. And here's a followup question: >>>> >>>> >>>> Where are some STM contention optimized data structures, that having keys ordered, with sub-range traversing api ? 
>>>> >>>> (of course production ready libraries most desirable) >>>> >>>> >>>> Thanks with regards, >>>> >>>> Compl >>>> >>>> >>>> On 2020/7/25 下午2:04, Compl Yue via Haskell-Cafe wrote: >>>> >>>> Shame on me for I have neither experienced with `perf`, I'd learn these essential tools soon to put them into good use. >>>> >>>> It's great to learn about how `orElse` actually works, I did get confused why there are so little retries captured, and now I know. So that little trick should definitely be removed before going production, as it does no much useful things at excessive cost. I put it there to help me understand internal working of stm, now I get even better knowledge ;-) >>>> >>>> I think a debugger will trap every single abort, isn't it annoying when many aborts would occur? If I'd like to count the number of aborts, ideally accounted per service endpoints, time periods, source modules etc. there some tricks for that? >>>> >>>> Thanks with best regards, >>>> >>>> Compl >>>> >>>> >>>> On 2020/7/25 上午2:02, Ryan Yates wrote: >>>> >>>> To be clear, I was trying to refer to Linux `perf` [^1]. Sampling based profiling can do a good job with concurrent and parallel programs where other methods are problematic. For instance, >>>> >>>> changing the size of heap objects can drastically change cache performance and completely different behavior can show up. >>>> >>>> >>>> [^1]: https://en.wikipedia.org/wiki/Perf_(Linux) >>>> >>>> >>>> The spinning in `readTVar` should always be very short and it typically shows up as intensive CPU use, though it may not be high energy use with `pause` in the loop on x86 (looks like we don't have it [^2], I thought we did, but maybe that was only in some of my code... ) >>>> >>>> >>>> [^2]: https://github.com/ghc/ghc/blob/master/rts/STM.c#L1275 >>>> >>>> >>>> All that to say, I doubt that you are spending much time spinning (but it would certainly be interesting to know if you are! You would see `perf` attribute a large amount of time to `read_current_value`). The amount of code to execute for commit (the time when locks are held) is always much shorter than it takes to execute the transaction body. As you add more conflicting threads this gets worse of course as commits sequence. >>>> >>>> >>>> The code you have will count commits of executions of `retry`. Note that `retry` is a user level idea, that is, you are counting user level *explicit* retries. This is different from a transaction failing to commit and starting again. These are invisible to the user. Also using your trace will convert `retry` from the efficient wake on write implementation, to an active retry that will always attempt again. We don't have cheap logging of transaction aborts in GHC, but I have built such logging in my work. You can observe these aborts with a debugger by looking for execution of this line: >>>> >>>> >>>> https://github.com/ghc/ghc/blob/master/rts/STM.c#L1123 >>>> >>>> >>>> Ryan >>>> >>>> >>>> >>>> >>>> On Fri, Jul 24, 2020 at 12:35 PM Compl Yue wrote: >>>> >>>> I'm not familiar with profiling GHC yet, may need more time to get myself proficient with it. >>>> >>>> And a bit more details of my test workload for diagnostic: the db clients are Python processes from a cluster of worker nodes, consulting the db server to register some path for data files, under a data dir within a shared filesystem, then mmap those data files and fill in actual array data. 
So the db server don't have much computation to perform, but puts the data file path into a global index, which at the same validates its uniqueness. As there are many client processes trying to insert one meta data record concurrently, with my naive implementation, the global index's TVar will almost always in locked state by one client after another, from a queue never fall empty. >>>> >>>> So if `readTVar` should spinning waiting, I doubt the threads should actually make high CPU utilization, because at any instant of time, all threads except the committing one will be doing that one thing. >>>> >>>> And I have something in my code to track STM retry like this: >>>> >>>> ``` >>>> >>>> -- blocking wait not expected, track stm retries explicitly >>>> >>>> trackSTM :: Int -> IO (Either () a) >>>> >>>> trackSTM !rtc = do >>>> >>>> when -- todo increase the threshold of reporting? >>>> >>>> (rtc > 0) $ do >>>> >>>> -- trace out the retries so the end users can be aware of them >>>> >>>> tid <- myThreadId >>>> >>>> trace >>>> >>>> ( "🔙\n" >>>> >>>> <> show callCtx >>>> >>>> <> "🌀 " >>>> >>>> <> show tid >>>> >>>> <> " stm retry #" >>>> >>>> <> show rtc >>>> >>>> ) >>>> >>>> $ return () >>>> >>>> atomically ((Just <$> stmJob) `orElse` return Nothing) >>= \case >>>> >>>> Nothing -> -- stm failed, do a tracked retry >>>> >>>> trackSTM (rtc + 1) >>>> >>>> Just ... -> ... >>>> >>>> ``` >>>> >>>> No such trace msg fires during my test, neither in single thread run, nor in runs with pressure. I'm sure this tracing mechanism works, as I can see such traces fire, in case e.g. posting a TMVar to a TQueue for some other thread to fill it, then read the result out, if these 2 ops are composed into a single tx, then of course it's infinite retry loop, and a sequence of such msgs are logged with ever increasing rtc #. >>>> >>>> So I believe no retry has ever been triggered. >>>> >>>> What can going on there? >>>> >>>> >>>> On 2020/7/24 下午11:46, Ryan Yates wrote: >>>> >>>>> Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed? >>>> >>>> >>>> Since the commit happens in finite steps, the expectation is that the lock will be released very soon. Given this when the body of a transaction executes `readTVar` it spins (active CPU!) until the `TVar` is observed unlocked. If a lock is observed while commiting, it immediately starts the transaction again from the beginning. To get the behavior of suspending a transaction you have to successfully commit a transaction that executed `retry`. Then the transaction is put on the wakeup lists of its read set and subsequent commits will wake it up if its write set overlaps. >>>> >>>> >>>> I don't think any of these things would explain low CPU utilization. You could try running with `perf` and see if lots of time is spent in some recognizable part of the RTS. >>>> >>>> >>>> Ryan >>>> >>>> >>>> >>>> On Fri, Jul 24, 2020 at 11:22 AM Compl Yue wrote: >>>> >>>> Thanks very much for the insightful information Ryan! I'm glad my suspect was wrong about the Haskell scheduler: >>>> >>>>> The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released. 
>>>> >>>> So best effort had already been made in GHC and I just need to cooperate better with its design. Then to explain the low CPU utilization (~10%), am I right to understand it as that upon reading a TVar locked by another committing tx, a lightweight thread will put itself into `waiting STM` and descheduled state, so the CPUs can only stay idle as not so many threads are willing to proceed? >>>> >>>> >>>> Anyway, I see light with better data structures to improve my situation, let me try them and report back. Actually I later changed `TVar (HaskMap k v)` to be `TVar (HashMap k Int)` where the `Int` being array index into `TVar (Vector (TVar (Maybe v)))`, in pursuing insertion order preservation semantic of dict entries (like that in Python 3.7+), then it's very hopeful to incorporate stm-containers' Map or ttrie to approach free of contention. >>>> >>>> Thanks with regards, >>>> >>>> Compl >>>> >>>> >>>> On 2020/7/24 下午10:03, Ryan Yates wrote: >>>> >>>> Hi Compl, >>>> >>>> >>>> Having a pool of transaction processing threads can be helpful in a certain way. If the body of the transaction takes more time to execute then the Haskell thread is allowed and it yields, the suspended thread won't get in the way of other thread, but when it is rescheduled, will have a low probability of success. Even worse, it will probably not discover that it is doomed to failure until commit time. If transactions are more likely to reach commit without yielding, they are more likely to succeed. If the transactions are not conflicting, it doesn't make much difference other than cache churn. >>>> >>>> >>>> The Haskell capability that is committing a transaction will not yield to another Haskell thread while it is doing the commit. The OS thread may be preempted, but once commit starts the haskell scheduler is not invoked until after locks are released. >>>> >>>> >>>> To get good performance from STM you must pay attention to what TVars are involved in a commit. All STM systems are working under the assumption of low contention, so you want to minimize "false" conflicts (conflicts that are not essential to the computation). Something like `TVar (HashMap k v)` will work pretty well for a low thread count, but every transaction that writes to that structure will be in conflict with every other transaction that accesses it. Pushing the `TVar` into the nodes of the structure reduces the possibilities for conflict, while increasing the amount of bookkeeping STM has to do. I would like to reduce the cost of that bookkeeping using better structures, but we need to do so without harming performance in the low TVar count case. Right now it is optimized for good cache performance with a handful of TVars. >>>> >>>> >>>> There is another way to play with performance by moving work into and out of the transaction body. A transaction body that executes quickly will reach commit faster. But it may be delaying work that moves into another transaction. Forcing values at the right time can make a big difference. >>>> >>>> >>>> Ryan >>>> >>>> >>>> On Fri, Jul 24, 2020 at 2:14 AM Compl Yue via Haskell-Cafe wrote: >>>> >>>> Thanks Chris, I confess I didn't pay enough attention to STM specialized container libraries by far, I skimmed through the description of stm-containers and ttrie, and feel they would definitely improve my code's performance in case I limit the server's parallelism within hardware capabilities. 
That may because I'm still prototyping the api and infrastructure for correctness, so even `TVar (HashMap k v)` performs okay for me at the moment, only if at low contention (surely there're plenty of CPU cycles to be optimized out in next steps). I model my data after graph model, so most data, even most indices are localized to nodes and edges, those can be manipulated without conflict, that's why I assumed I have a low contention use case since the very beginning - until I found there are still (though minor) needs for global indices to guarantee global uniqueness, I feel faithful with stm-containers/ttrie to implement a more scalable global index data structure, thanks for hinting me. >>>> >>>> So an evident solution comes into my mind now, is to run the server with a pool of tx processing threads, matching number of CPU cores, client RPC requests then get queued to be executed in some thread from the pool. But I'm really fond of the mechanism of M:N scheduler which solves massive/dynamic concurrency so elegantly. I had some good result with Go in this regard, and see GHC at par in doing this, I don't want to give up this enjoyable machinery. >>>> >>>> But looked at the stm implementation in GHC, it seems written TVars are exclusively locked during commit of a tx, I suspect this is the culprit when there're large M lightweight threads scheduled upon a small N hardware capabilities, that is when a lightweight thread yield control during an stm transaction commit, the TVars it locked will stay so until it's scheduled again (and again) till it can finish the commit. This way, descheduled threads could hold live threads from progressing. I haven't gone into more details there, but wonder if there can be some improvement for GHC RTS to keep an stm committing thread from descheduled, but seemingly that may impose more starvation potential; or stm can be improved to have its TVar locks preemptable when the owner trec/thread is in descheduled state? Neither should be easy but I'd really love massive lightweight threads doing STM practically well. >>>> >>>> Best regards, >>>> >>>> Compl >>>> >>>> >>>> On 2020/7/24 上午12:57, Christopher Allen wrote: >>>> >>>> It seems like you know how to run practical tests for tuning thread count and contention for throughput. Part of the reason you haven't gotten a super clear answer is "it depends." You give up fairness when you use STM instead of MVars or equivalent structures. That means a long running transaction might get stampeded by many small ones invalidating it over and over. The long-running transaction might never clear if the small transactions keep moving the cheese. I mention this because transaction runtime and size and count all affect throughput and latency. What might be ideal for one pattern of work might not be ideal for another. Optimizing for overall throughput might make the contention and fairness problems worse too. I've done practical tests to optimize this in the past, both for STM in Haskell and for RDBMS workloads. >>>> >>>> >>>> The next step is sometimes figuring out whether you really need a data structure within a single STM container or if perhaps you can break up your STM container boundaries into zones or regions that roughly map onto update boundaries. That should make the transactions churn less. On the outside chance you do need to touch more than one container in a transaction, well, they compose. >>>> >>>> >>>> e.g. 
https://hackage.haskell.org/package/stm-containers >>>> >>>> https://hackage.haskell.org/package/ttrie >>>> >>>> >>>> It also sounds a bit like your question bumps into Amdahl's Law a bit. >>>> >>>> >>>> All else fails, stop using STM and find something more tuned to your problem space. >>>> >>>> >>>> Hope this helps, >>>> >>>> Chris Allen >>>> >>>> >>>> >>>> On Thu, Jul 23, 2020 at 9:53 AM YueCompl via Haskell-Cafe wrote: >>>> >>>> Hello Cafe, >>>> >>>> >>>> I'm working on an in-memory database, in Client/Server mode I just let each connected client submit remote procedure call running in its dedicated lightweight thread, modifying TVars in RAM per its business needs, then in case many clients connected concurrently and trying to insert new data, if they are triggering global index (some TVar) update, the throughput would drop drastically. I reduced the shared state to a simple int counter by TVar, got same symptom. While the parallelism feels okay when there's no hot TVar conflicting, or M is not much greater than N. >>>> >>>> >>>> As an empirical test workload, I have a `+RTS -N10` server process, it handles 10 concurrent clients okay, got ~5x of single thread throughput; but in handling 20 concurrent clients, each of the 10 CPUs can only be driven to ~10% utilization, the throughput seems even worse than single thread. More clients can even drive it thrashing without much progressing. >>>> >>>> >>>> I can understand that pure STM doesn't scale well after reading [1], and I see it suggested [7] attractive and planned future work toward that direction. >>>> >>>> >>>> But I can't find certain libraries or frameworks addressing large M over small N scenarios, [1] experimented with designated N parallelism, and [7] is rather theoretical to my empirical needs. >>>> >>>> >>>> Can you direct me to some available library implementing the methodology proposed in [7] or other ways tackling this problem? >>>> >>>> >>>> I think the most difficult one is that a transaction should commit with global indices (with possibly unique constraints) atomically updated, and rollback with any violation of constraints, i.e. transactions have to cover global states like indices. Other problems seem more trivial than this. >>>> >>>> >>>> Specifically, [7] states: >>>> >>>> >>>>> It must be emphasized that all of the mechanisms we deploy originate, in one form or another, in the database literature from the 70s and 80s. Our contribution is to adapt these techniques to software transactional memory, providing more effective solutions to important STM problems than prior proposals. >>>> >>>> >>>> I wonder any STM based library has simplified those techniques to be composed right away? I don't really want to implement those mechanisms by myself, rebuilding many wheels from scratch. >>>> >>>> >>>> Best regards, >>>> >>>> Compl >>>> >>>> >>>> >>>> [1] Comparing the performance of concurrent linked-list implementations in Haskell >>>> >>>> https://simonmar.github.io/bib/papers/concurrent-data.pdf >>>> >>>> >>>> [7] M. Herlihy and E. Koskinen. Transactional boosting: a methodology for highly-concurrent transactional objects. In Proc. of PPoPP ’08, pages 207–216. ACM Press, 2008. 
>>>> >>>> https://www.cs.stevens.edu/~ejk/papers/boosting-ppopp08.pdf >>>> >>>> >>>> _______________________________________________ >>>> Haskell-Cafe mailing list >>>> To (un)subscribe, modify options or view archives go to: >>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>> Only members subscribed via the mailman list are allowed to post. >>>> >>>> >>>> >>>> >>>> -- >>>> >>>> Chris Allen >>>> >>>> Currently working on http://haskellbook.com >>>> >>>> _______________________________________________ >>>> Haskell-Cafe mailing list >>>> To (un)subscribe, modify options or view archives go to: >>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>> Only members subscribed via the mailman list are allowed to post. >>>> >>>> >>>> _______________________________________________ >>>> Haskell-Cafe mailing list >>>> To (un)subscribe, modify options or view archives go to: >>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>> Only members subscribed via the mailman list are allowed to post. >>>> _______________________________________________ >>>> Haskell-Cafe mailing list >>>> To (un)subscribe, modify options or view archives go to: >>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>> Only members subscribed via the mailman list are allowed to post. >>>> >>>> _______________________________________________ >>>> Haskell-Cafe mailing list >>>> To (un)subscribe, modify options or view archives go to: >>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>>> Only members subscribed via the mailman list are allowed to post. >>>> >>>> >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed to post. >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From pwr at lowerbound.io Thu Jul 30 11:19:45 2020 From: pwr at lowerbound.io (Peter Robinson) Date: Thu, 30 Jul 2020 19:19:45 +0800 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> <59bd91ed-a25e-d87a-1d22-3f72e0f80828@icloud.com> Message-ID: Hi Compl, > >+ This package provides a proof-of-concept implementation of a skip list >> in STM >> >> This has to mean something but I can't figure out yet. >> >> Dear Peter Robinson, I hope you can see this message and get in the loop >> of discussion. >> > The reason for adding this sentence was that tskiplist hasn't been optimized for production use. Later on, I wrote an implementation of a concurrent skip list with atomic operations that performs significantly better, but it's operations work in the IO monad. 
I'm surprised to hear that you're getting poor performance even when using the stm-container package, which I believe was meant to be used in production. A while ago, I ran some benchmarks comparing concurrent dictionary data structures (such as stm-container) under various workloads. While STMContainers.Map wasn't as fast as the concurrent-hashtable package, the results indicate that the performance doesn't degrade too much under larger workloads. You can find these benchmark results here (10^6 randomly generated insertion/deletion/lookup requests distributed among 32 threads): https://lowerbound.io/blog/bench2-32.html And some explanations about the benchmarks are here: https://lowerbound.io/blog/2019-10-24_concurrent_hash_table_performance.html One issue that I came across when implementing the tskiplist package was this: If a thread wants to insert some item into the skip list, it needs to search for the entry point by performing readTVar operations starting at the list head. So, on average, a thread will read O(log n) TVars (assuming a skip list of n items) and, if any of these O(log n) TVars are modified by a simultaneously running thread, the STM runtime will observe a (false) conflict and rerun the transaction. It's not clear to me how to resolve this issue without access to something like unreadTVar (see [1]). Best, Peter [1] UnreadTVar: Extending Haskell Software Transactional Memory for Performance (2007) by Nehir Sonmez , Cristian Perfumo , Srdjan Stipic , Adrian Cristal , Osman S. Unsal , Mateo Valero. -------------- next part -------------- An HTML attachment was scrubbed... URL: From compl.yue at icloud.com Thu Jul 30 12:10:22 2020 From: compl.yue at icloud.com (YueCompl) Date: Thu, 30 Jul 2020 20:10:22 +0800 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> <59bd91ed-a25e-d87a-1d22-3f72e0f80828@icloud.com> Message-ID: Hi Peter, Great to hear from you! For the record tskiplist (and stm-containers together) did improve my situation a great lot with respect to scalability at concurrency/parallelism! I'm still far from the stage to squeeze last drops of performance, currently I'm just making sure performance wise concerns be reasonable during my PoC in correctness and ergonomics of my HPC architecture (an in-memory graph + out-of-core (mmap) array DBMS powered computation cluster, with shared storage), and after parallelism appears acceptable, I seemingly suffer from serious GC issue at up scaling on process working memory size. I'm suspecting it's because of the added more TVars and/or aggressive circular structures of them in my case, and can not find a way to overcome it by far. Thanks for your detailed information! Best regards, Compl > On 2020-07-30, at 19:19, Peter Robinson wrote: > > Hi Compl, > >+ This package provides a proof-of-concept implementation of a skip list in STM > > This has to mean something but I can't figure out yet. > > Dear Peter Robinson, I hope you can see this message and get in the loop of discussion. > > > The reason for adding this sentence was that tskiplist hasn't been optimized for production use. 
Later on, I wrote an implementation of a concurrent skip list with atomic operations that performs significantly better, but it's operations work in the IO monad. > > I'm surprised to hear that you're getting poor performance even when using the stm-container package, which I believe was meant to be used in production. A while ago, I ran some benchmarks comparing concurrent dictionary data structures (such as stm-container) under various workloads. While STMContainers.Map wasn't as fast as the concurrent-hashtable package, the results indicate that the performance doesn't degrade too much under larger workloads. > > You can find these benchmark results here (10^6 randomly generated insertion/deletion/lookup requests distributed among 32 threads): > https://lowerbound.io/blog/bench2-32.html > And some explanations about the benchmarks are here: > https://lowerbound.io/blog/2019-10-24_concurrent_hash_table_performance.html > > One issue that I came across when implementing the tskiplist package was this: If a thread wants to insert some item into the skip list, it needs to search for the entry point by performing readTVar operations starting at the list head. So, on average, a thread will read O(log n) TVars (assuming a skip list of n items) and, if any of these O(log n) TVars are modified by a simultaneously running thread, the STM runtime will observe a (false) conflict and rerun the transaction. It's not clear to me how to resolve this issue without access to something like unreadTVar (see [1]). > > Best, > Peter > > [1] UnreadTVar: Extending Haskell Software Transactional Memory for Performance (2007) by Nehir Sonmez , Cristian Perfumo , Srdjan Stipic , Adrian Cristal , Osman S. Unsal , Mateo Valero. > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From compl.yue at icloud.com Thu Jul 30 13:28:31 2020 From: compl.yue at icloud.com (YueCompl) Date: Thu, 30 Jul 2020 21:28:31 +0800 Subject: [Haskell-cafe] For STM to practically serve large in-mem datasets with cyclic structures WAS: STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> <59bd91ed-a25e-d87a-1d22-3f72e0f80828@icloud.com> Message-ID: <8D814D14-3F54-4D2E-A3F1-EC8A2DCB1E31@icloud.com> For the record, overhead of STM over IO (or other means where manual composition of transactions needed) based concurrency control, is a price I'm willing to pay in my use case, as it's not machine-performance critical in distributing input data + parameters to a cluster of worker nodes, and collecting their results into permanent storage or a data pipeline. 
But to keep professionally crafting well synced, race-free scheduling code is barely affordable by my org, as shape of datasets, relationship between them and algorithms processing them are varying at fast paces, we have difficulty, or lack the willingness, to hire some workforce specifically to keep each new data pipeline race free, it has to be, but better at cost of machine-hours instead of human head counts. While easily compositing stm code, wrapped in scriptable procedures, will enable our analysts to author the scheduling scripts without too much concerns. Then our programmers can focus on performance critical parts of the data processing code, like optimization of tight-loops. Only if not in the tight loops, I think it's acceptable by us, that up to 2~3 order of magnitude slower for an stm solution compared to its best rivals, as long as it's scalable. For a (maybe cheating) example, if fully optimized code can return result in 10 ms after an analyst clicked a button, we don't bother if unoptimized stm script needs 10 second, so long as the result is correct. In a philosophic thinking, I heard that AT&T had UNIX specifically designed for their Control panel, while their Data panel runs separate software (and on separate hardware obviously), while modern systems have powerful CPUs tempting us to squeeze more performance out of it, and SIMD instructions make it even more tempting, I think we'd better resist it when programming something belong to the Control panel per se, but do it in programming something belong to the Data panel. And appears Data panel programs are being shifted to GPUs nowadays, which feels right. Regards, Compl > On 2020-07-30, at 20:10, YueCompl via Haskell-Cafe wrote: > > Hi Peter, > > Great to hear from you! > > For the record tskiplist (and stm-containers together) did improve my situation a great lot with respect to scalability at concurrency/parallelism! > > I'm still far from the stage to squeeze last drops of performance, currently I'm just making sure performance wise concerns be reasonable during my PoC in correctness and ergonomics of my HPC architecture (an in-memory graph + out-of-core (mmap) array DBMS powered computation cluster, with shared storage), and after parallelism appears acceptable, I seemingly suffer from serious GC issue at up scaling on process working memory size. I'm suspecting it's because of the added more TVars and/or aggressive circular structures of them in my case, and can not find a way to overcome it by far. > > Thanks for your detailed information! > > Best regards, > Compl > > >> On 2020-07-30, at 19:19, Peter Robinson > wrote: >> >> Hi Compl, >> >+ This package provides a proof-of-concept implementation of a skip list in STM >> >> This has to mean something but I can't figure out yet. >> >> Dear Peter Robinson, I hope you can see this message and get in the loop of discussion. >> >> >> The reason for adding this sentence was that tskiplist hasn't been optimized for production use. Later on, I wrote an implementation of a concurrent skip list with atomic operations that performs significantly better, but it's operations work in the IO monad. >> >> I'm surprised to hear that you're getting poor performance even when using the stm-container package, which I believe was meant to be used in production. A while ago, I ran some benchmarks comparing concurrent dictionary data structures (such as stm-container) under various workloads. 
While STMContainers.Map wasn't as fast as the concurrent-hashtable package, the results indicate that the performance doesn't degrade too much under larger workloads. >> >> You can find these benchmark results here (10^6 randomly generated insertion/deletion/lookup requests distributed among 32 threads): >> https://lowerbound.io/blog/bench2-32.html >> And some explanations about the benchmarks are here: >> https://lowerbound.io/blog/2019-10-24_concurrent_hash_table_performance.html >> >> One issue that I came across when implementing the tskiplist package was this: If a thread wants to insert some item into the skip list, it needs to search for the entry point by performing readTVar operations starting at the list head. So, on average, a thread will read O(log n) TVars (assuming a skip list of n items) and, if any of these O(log n) TVars are modified by a simultaneously running thread, the STM runtime will observe a (false) conflict and rerun the transaction. It's not clear to me how to resolve this issue without access to something like unreadTVar (see [1]). >> >> Best, >> Peter >> >> [1] UnreadTVar: Extending Haskell Software Transactional Memory for Performance (2007) by Nehir Sonmez , Cristian Perfumo , Srdjan Stipic , Adrian Cristal , Osman S. Unsal , Mateo Valero. >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From carter.schonwald at gmail.com Thu Jul 30 14:04:04 2020 From: carter.schonwald at gmail.com (Carter Schonwald) Date: Thu, 30 Jul 2020 10:04:04 -0400 Subject: [Haskell-cafe] [ANN] isbn - ISBN Validation and Manipulation In-Reply-To: References: Message-ID: cool! Thanks for sharing! (i mean, who doesn't like to read a good book!) Question: would it be worth considering internally having the digits be stored as half bytes for space usage? eg ISBN10 -> use ~ 5 bytes, and ISBN13 be ~ 7 bytes (this could be within a word64 as a tiny 8 slot array, OR or any sort of vector/array datatype)? (or instead of a decimal binary rep, pack it into a word64 as the number itself for either?) Granted this would complicate some of the internals engineering a teeny bit it looks like, in the current code, if there's no hyphens, it'll keep the input text value as the internal rep, *which can cause space leaks* if it's a slice of a much larger text input you otherwise do not intend to retain. When it generates its own text string, well... the text package uses utf16 (aka 2 bytes per character for valid ascii) as the internal rep, so the buffer representation will occupy 10 * 2 bytes or 13* 2bytes, so 20-26 bytes within the buffer, ignoring the extra word or two of indexing/offsets! point being ... 
it seems like could embed them (with a teeny bit of work) as follows data ISBN = IsIBSN10 ISBN10 | isISBN13 ISBN13 newtype ISBN10 = ISBN10 Word64 newtype ISBN13 = ISBN13 Word64 and then i'd probably be inclined to do use Data.Bits and do the "word 64 as a 16 slot array of word4's," which would also support the base 11 digit at the end encoding constraint, since word4 == base 16 :) then a teeny bit of work to do the right "right" ords and shows etc on this rep etc i hope the design i'm pointing out makes sense for you (and if i'm not being clear, please holler and i'll try to help) and again, thanks for sharing this! -Carter On Wed, Jul 29, 2020 at 1:39 PM Christian Charukiewicz via Haskell-Cafe < haskell-cafe at haskell.org> wrote: > Hello Haskell Cafe, > > I wanted to share my first ever Haskell package: isbn > > https://hackage.haskell.org/package/isbn > > The package is motivated by my need to validate ISBNs (the unique > identifier associated with every book published since 1970) in a Haskell > application I am building. I published isbn as a back in May but yesterday > I made some improvements the API and I think it is now ready to share as > v1.1.0.0. > > I have been using Haskell commercially for a few years, and have made > several contributions to various packages, but as mentioned, this is my > first time authoring and publishing a package. If anyone has any feedback, > I would be happy to hear it. > > Thank you, > > Christian Charukiewicz > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From branimir.maksimovic at gmail.com Thu Jul 30 14:30:24 2020 From: branimir.maksimovic at gmail.com (Branimir Maksimovic) Date: Thu, 30 Jul 2020 16:30:24 +0200 Subject: [Haskell-cafe] For STM to practically serve large in-mem datasets with cyclic structures WAS: STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: <8D814D14-3F54-4D2E-A3F1-EC8A2DCB1E31@icloud.com> References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> <59bd91ed-a25e-d87a-1d22-3f72e0f80828@icloud.com> <8D814D14-3F54-4D2E-A3F1-EC8A2DCB1E31@icloud.com> Message-ID: If you want performance, you should use custom solutions... https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/knucleotide-ghc-1.html This one was taken because of library hashmap and my custom solution hashtable was much faster. Greets. On 7/30/20 3:28 PM, YueCompl via Haskell-Cafe wrote: > For the record, overhead of STM over IO (or other means where manual > composition of transactions needed) based concurrency control, is a > price I'm willing to pay in my use case, as it's not > machine-performance critical in distributing input data + parameters > to a cluster of worker nodes, and collecting their results into > permanent storage or a data pipeline. 
But to keep professionally > crafting well synced, race-free scheduling code is barely affordable > by my org, as shape of datasets, relationship between them and > algorithms processing them are varying at fast paces, we have > difficulty, or lack the willingness, to hire some workforce > specifically to keep each new data pipeline race free, it has to be, > but better at cost of machine-hours instead of human head counts. > > While easily compositing stm code, wrapped in scriptable procedures, > will enable our analysts to author the scheduling scripts without too > much concerns. Then our programmers can focus on performance critical > parts of the data processing code, like optimization of tight-loops. > > Only if not in the tight loops, I think it's acceptable by us, that up > to 2~3 order of magnitude slower for an stm solution compared to its > best rivals, as long as it's scalable. For a (maybe cheating) example, > if fully optimized code can return result in 10 ms after an analyst > clicked a button, we don't bother if unoptimized stm script needs 10 > second, so long as the result is correct. > > In a philosophic thinking, I heard that AT&T had UNIX specifically > designed for their Control panel, while their Data panel runs separate > software (and on separate hardware obviously), while modern systems > have powerful CPUs tempting us to squeeze more performance out of it, > and SIMD instructions make it even more tempting, I think we'd better > resist it when programming something belong to the Control panel per > se, but do it in programming something belong to the Data panel. And > appears Data panel programs are being shifted to GPUs nowadays, which > feels right. > > Regards, > Compl > > >> On 2020-07-30, at 20:10, YueCompl via Haskell-Cafe >> > wrote: >> >> Hi Peter, >> >> Great to hear from you! >> >> For the record tskiplist (and stm-containers together) did improve my >> situation a great lot with respect to scalability at >> concurrency/parallelism! >> >> I'm still far from the stage to squeeze last drops of performance, >> currently I'm just making sure performance wise concerns be >> reasonable during my PoC in correctness and ergonomics of my HPC >> architecture (an in-memory graph + out-of-core (mmap) array DBMS >> powered computation cluster, with shared storage), and after >> parallelism appears acceptable, I seemingly suffer from serious GC >> issue at up scaling on process working memory size. I'm suspecting >> it's because of the added more TVars and/or aggressive circular >> structures of them in my case, and can not find a way to overcome it >> by far. >> >> Thanks for your detailed information! >> >> Best regards, >> Compl >> >> >>> On 2020-07-30, at 19:19, Peter Robinson >> > wrote: >>> >>> Hi Compl, >>> >>> >+ This package provides a proof-of-concept implementation >>> of a skip list in STM >>> >>> This has to mean something but I can't figure out yet. >>> >>> Dear Peter Robinson, I hope you can see this message and get >>> in the loop of discussion. >>> >>> >>>  The reason for adding this sentence was that tskiplist hasn't been >>> optimized for production use. Later on, I wrote an implementation of >>> a concurrent skip list with atomic operations that performs >>> significantly better, but it's operations work in the IO monad. >>> >>> I'm surprised to hear that you're getting poor performance even when >>> using the stm-container package, which I believe was meant to be >>> used in production. 
A while ago, I ran some benchmarks comparing >>> concurrent dictionary data structures (such as stm-container) under >>> various workloads. While STMContainers.Map wasn't as fast as the >>> concurrent-hashtable package, the results indicate that the >>> performance doesn't degrade too much under larger workloads. >>> >>> You can find these benchmark results here (10^6 randomly generated >>> insertion/deletion/lookup requests distributed among 32 threads): >>> https://lowerbound.io/blog/bench2-32.html >>> And some explanations about the benchmarks are here: >>> https://lowerbound.io/blog/2019-10-24_concurrent_hash_table_performance.html >>> >>> One issue that I came across when implementing the tskiplist package >>> was this: If a thread wants to insert some item into the skip list, >>> it needs to search for the entry point by performing readTVar >>> operations starting at the list head. So, on average, a thread will >>> read O(log n) TVars (assuming a skip list of n items) and, if any of >>> these O(log n) TVars are modified by a simultaneously running >>> thread, the STM runtime will observe a (false) conflict and rerun >>> the transaction. It's not clear to me how to resolve this issue >>> without access to something like unreadTVar (see [1]). >>> >>> Best, >>> Peter >>> >>> [1] UnreadTVar: Extending Haskell Software Transactional Memory for >>> Performance (2007)  by Nehir Sonmez , Cristian Perfumo , Srdjan >>> Stipic , Adrian Cristal , Osman S. Unsal , Mateo Valero. >>> >>> _______________________________________________ >>> Haskell-Cafe mailing list >>> To (un)subscribe, modify options or view archives go to: >>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >>> Only members subscribed via the mailman list are allowed to post. >> >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. > > > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From charukiewicz at protonmail.com Thu Jul 30 23:46:17 2020 From: charukiewicz at protonmail.com (Christian Charukiewicz) Date: Thu, 30 Jul 2020 23:46:17 +0000 Subject: [Haskell-cafe] [ANN] isbn - ISBN Validation and Manipulation In-Reply-To: References: Message-ID: <2c4qf8vbS89STvaRcnwViohuPf7-70B0sBEk0yLw0BUMfMG0z5sJ1udwZa9lgoQvLq8KgNy1pdvK97FL4YU_yGsbSrQ8powJ-N3dWiZRp-I=@protonmail.com> Hey Carter, I really appreciate the reply. This is exactly the type of feedback that I find very helpful, as it helps me become a better Haskell programmer. To address your points: 1. Great catch about making a copy of the input. I didn't notice the potential for space leaks here, but what you're saying makes total sense. I just added a call to Data.Text.copy around the inputs of the validation functions, so that should mitigate this risk. I released this as isbn-1.1.0.1 on Hackage: https://hackage.haskell.org/package/isbn-1.1.0.1 2. 
In terms of the changing the internal representation of ISBN to be something other than Text, this is definitely something I'm open to, since it's not that hard to think of a situation where someone is working with a large number of ISBNs and space use matters. I like your suggestion to use Word64, but this would be the first time I have worked with Word64 in this manner, so I'll have to look into it a bit more. I did also receive some other feedback from someone else about the exact same thing. I incorporated your feedback and the other feedback into an issue in the GitHub repo if you'd like to take share your thoughts: https://github.com/charukiewicz/hs-isbn/issues/2 Thanks again, Christian ‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Thursday, July 30, 2020 9:04 AM, Carter Schonwald wrote: > cool! Thanks for sharing! > (i mean, who doesn't like to read a good book!) > > Question: would it be worth considering internally having the digits be stored as half bytes for space usage? eg ISBN10 -> use ~ 5 bytes, and ISBN13 be ~ 7 bytes (this could be within a word64 as a tiny 8 slot array, OR or any sort of vector/array datatype)? (or instead of a decimal binary rep, pack it into a word64 as the number itself for either?) Granted this would complicate some of the internals engineering a teeny bit > > it looks like, in the current code, if there's no hyphens, it'll keep the input text value as the internal rep, *which can cause space leaks* if it's a slice of a much larger text input you otherwise do not intend to retain. When it generates its own text string, well... the text package uses utf16 (aka 2 bytes per character for valid ascii) as the internal rep, so the buffer representation will occupy 10 * 2 bytes or 13* 2bytes, so 20-26 bytes within the buffer, ignoring the extra word or two of indexing/offsets! > > point being ... it seems like could embed them (with a teeny bit of work) as follows > > data ISBN = > IsIBSN10 ISBN10 > | isISBN13 ISBN13 > > newtype ISBN10 = ISBN10 Word64 > newtype ISBN13 = ISBN13 Word64 > > and then i'd probably be inclined to do use Data.Bits and do the "word 64 as a 16 slot array of word4's," which would also support the base 11 digit at the end encoding constraint, since word4 == base 16 :) > then a teeny bit of work to do the right "right" ords and shows etc on this rep etc > > i hope the design i'm pointing out makes sense for you (and if i'm not being clear, please holler and i'll try to help) > > and again, thanks for sharing this! > -Carter > > On Wed, Jul 29, 2020 at 1:39 PM Christian Charukiewicz via Haskell-Cafe wrote: > >> Hello Haskell Cafe, >> >> I wanted to share my first ever Haskell package: isbn >> >> https://hackage.haskell.org/package/isbn >> >> The package is motivated by my need to validate ISBNs (the unique identifier associated with every book published since 1970) in a Haskell application I am building. I published isbn as a back in May but yesterday I made some improvements the API and I think it is now ready to share as v1.1.0.0. >> >> I have been using Haskell commercially for a few years, and have made several contributions to various packages, but as mentioned, this is my first time authoring and publishing a package. If anyone has any feedback, I would be happy to hear it. 
>> >> Thank you, >> >> Christian Charukiewicz >> _______________________________________________ >> Haskell-Cafe mailing list >> To (un)subscribe, modify options or view archives go to: >> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe >> Only members subscribed via the mailman list are allowed to post. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ben at well-typed.com Fri Jul 31 13:36:10 2020 From: ben at well-typed.com (Ben Gamari) Date: Fri, 31 Jul 2020 09:36:10 -0400 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> Message-ID: <87v9i3g62h.fsf@smart-cactus.org> Simon Peyton Jones via Haskell-Cafe writes: > > Compl’s problem is (apparently) that execution becomes dominated by > > GC. That doesn’t sound like a constant-factor overhead from TVars, no > > matter how efficient (or otherwise) they are. It sounds more like a > > space leak to me; perhaps you need some strict evaluation or > > something. > > My point is only: before re-engineering STM it would make sense to get > a much more detailed insight into what is actually happening, and > where the space and time is going. We have tools to do this (heap > profiling, Threadscope, …) but I know they need some skill and insight > to use well. But we don’t have nearly enough insight to draw > meaningful conclusions yet. > > Maybe someone with experience of performance debugging might feel able > to help Compl? > Compl, If you want to discuss the issue feel free to get in touch on IRC. I would be happy to help. It would be great if we had something of a decision tree for performance tuning of Haskell code in the users guide or Wiki. We have so many tools yet there isn't a comprehensive overview of 1. what factors might affect which runtime characteristics of your program 2. which tools can be used to measure which factors 3. how these factors can be improved Cheers, - Ben -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 487 bytes Desc: not available URL: From compl.yue at icloud.com Fri Jul 31 14:35:52 2020 From: compl.yue at icloud.com (YueCompl) Date: Fri, 31 Jul 2020 22:35:52 +0800 Subject: [Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates? In-Reply-To: <87v9i3g62h.fsf@smart-cactus.org> References: <4CAE51E9-2720-4C2F-9971-00FABACA35BC@icloud.com> <74f14ebd-241b-0b4f-03d5-ea130f69a1c6@icloud.com> <41fc9d8e-2356-f3a7-6f00-c8a51d3b95f7@icloud.com> <8136052f-8b5c-c31f-7487-d03f6a86a5d3@icloud.com> <8d25fa78-9a66-6e24-7771-ba0afa127a04@icloud.com> <87v9i3g62h.fsf@smart-cactus.org> Message-ID: <33DE75B7-F33F-48EC-8D08-825D264F9C78@icloud.com> Hi Ben, Thanks as always for your great support! And at the moment I'm working on a minimum working example to reproduce the symptoms, I intend to work out a program depends only on libraries bundled with GHC, so it can be easily diagnosed without my complex env, but so far no reprod yet. I'll come with some piece of code once it can reproduce something. 
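In the meantime, the skeleton I'm experimenting with looks roughly like the following. It is only a sketch: it uses nothing beyond base and stm (which ship with GHC), the node count, worker count and the ring-shaped linking are made-up placeholders, and as said it does not reproduce the GC symptom yet.

-- Hypothetical skeleton: many TVars linked into a circular structure,
-- hammered concurrently by a few workers. Sizes are placeholders.
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM
import Control.Monad (forM, forM_, replicateM)

-- A node that can point back at other nodes, to mimic the circular
-- TVar structures in the real in-memory graph.
data Node = Node
  { nodeVal   :: TVar Int
  , nodeLinks :: TVar [Node]
  }

newNode :: Int -> STM Node
newNode v = Node <$> newTVar v <*> newTVar []

main :: IO ()
main = do
  let nNodes   = 100000 :: Int   -- placeholder size
      nWorkers = 8      :: Int   -- placeholder parallelism
  nodes <- atomically $ do
    ns <- replicateM nNodes (newNode 0)
    -- link each node to the next one, closing the ring, so the
    -- structure of TVars is circular
    forM_ (zip ns (drop 1 ns ++ take 1 ns)) $ \(a, b) ->
      modifyTVar' (nodeLinks a) (b :)
    pure ns
  dones <- forM [1 .. nWorkers] $ \i -> do
    done <- newEmptyMVar
    _ <- forkIO $ do
      -- each worker touches every node in its own small transaction
      forM_ nodes $ \n -> atomically $ modifyTVar' (nodeVal n) (+ i)
      putMVar done ()
    pure done
  mapM_ takeMVar dones
  putStrLn "workload finished"

My plan is to build it with -threaded -rtsopts, run it with +RTS -N -s, and keep scaling nNodes up while watching the GC figures, to see whether the residency/pause growth shows up without any of my other code involved.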
Thanks in advance. Sincerely, Compl > On 2020-07-31, at 21:36, Ben Gamari wrote: > > Simon Peyton Jones via Haskell-Cafe writes: > >>> Compl’s problem is (apparently) that execution becomes dominated by >>> GC. That doesn’t sound like a constant-factor overhead from TVars, no >>> matter how efficient (or otherwise) they are. It sounds more like a >>> space leak to me; perhaps you need some strict evaluation or >>> something. >> >> My point is only: before re-engineering STM it would make sense to get >> a much more detailed insight into what is actually happening, and >> where the space and time is going. We have tools to do this (heap >> profiling, Threadscope, …) but I know they need some skill and insight >> to use well. But we don’t have nearly enough insight to draw >> meaningful conclusions yet. >> >> Maybe someone with experience of performance debugging might feel able >> to help Compl? >> > Compl, > > If you want to discuss the issue feel free to get in touch on IRC. I > would be happy to help. > > It would be great if we had something of a decision tree for performance > tuning of Haskell code in the users guide or Wiki. We have so many tools > yet there isn't a comprehensive overview of > > 1. what factors might affect which runtime characteristics of your > program > 2. which tools can be used to measure which factors > 3. how these factors can be improved > > Cheers, > > - Ben > _______________________________________________ > Haskell-Cafe mailing list > To (un)subscribe, modify options or view archives go to: > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe > Only members subscribed via the mailman list are allowed to post.