[Haskell-cafe] Haskell-Cafe Digest, Vol 165, Issue 19

Keith Ramsay ramsaykeith at gmail.com
Sun May 14 01:20:28 UTC 2017


On 5/12/17, haskell-cafe-request at haskell.org
<haskell-cafe-request at haskell.org> wrote:
> Send Haskell-Cafe mailing list submissions to
> 	haskell-cafe at haskell.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> 	http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> or, via email, send a message with subject or body 'help' to
> 	haskell-cafe-request at haskell.org
>
> You can reach the person managing the list at
> 	haskell-cafe-owner at haskell.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Haskell-Cafe digest..."
>
>
> Today's Topics:
>
>    1. Haskell IDE Engine users? (Alan & Kim Zimmerman)
>    2. Fast JSON validation - reducing allocations (David Turner)
>    3. Re: Fast JSON validation - reducing allocations (Bardur Arantsson)
>    4. Re: Fast JSON validation - reducing allocations (David Turner)
>    5. Re: Fast JSON validation - reducing allocations (Bardur Arantsson)
>    6. Re: Fast JSON validation - reducing allocations (Eric Seidel)
>    7. Re: Fast JSON validation - reducing allocations (Ben Gamari)
>    8. Re: Fast JSON validation - reducing allocations (Ben Gamari)
>    9. Re: Fast JSON validation - reducing allocations (David Turner)
>   10. Re: Fast JSON validation - reducing allocations (Ben Gamari)
>   11. Re: Fast JSON validation - reducing allocations (David Turner)
>   12. Re: Fast JSON validation - reducing allocations (Arjen)
>   13. Re: Fast JSON validation - reducing allocations (David Turner)
>   14. Re: Fast JSON validation - reducing allocations (Arjen)
>   15. Re: Fast JSON validation - reducing allocations (Arjen)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 11 May 2017 16:02:17 +0200
> From: "Alan & Kim Zimmerman" <alan.zimm at gmail.com>
> To: haskell <haskell-cafe at haskell.org>
> Subject: [Haskell-cafe] Haskell IDE Engine users?
> Message-ID:
> 	<CANma=H9bVt9bY=PmTBy4kCrMGUVAdD1A_sWWmc96EazrkmM3Ug at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Stupid question, but is anyone using/integrating HIE into anything at the
> moment?
>
> I am considering rearranging the internals to orient it toward being mainly
> a Language Server Protocol backend, and I want to see who would be
> affected.
>
> Alan
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/1d25664b/attachment-0001.html>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 11 May 2017 17:12:15 +0100
> From: David Turner <dct25-561bs at mythic-beasts.com>
> To: Haskell Cafe <Haskell-Cafe at haskell.org>
> Subject: [Haskell-cafe] Fast JSON validation - reducing allocations
> Message-ID:
> 	<CAErCyQ8wKwzU9uH8iHbT7n++aJB3UHsso5tB4Y7tn=-m5o1W7A at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Dear Café,
>
> We have a stream of ~900-byte JSON documents (lazy ByteStrings) that we
> would like to validate (NB not parse) as quickly as possible. We just need
> to check that the bytes are a syntactically valid JSON document and then
> pass them elsewhere; we do not care what the content is. The documents are
> encoded in UTF-8 and have no extraneous whitespace.
>
> We have made a number of attempts:
>
> 1. Use Aeson to parse the documents and then discard the result. This was
> actually pretty good: Criterion clocked it at ~30us per document.
>
> 2. A hand-written Attoparsec parser. This did quite a bit worse: ~70us per
> document.
>
> 3. Use Get (from the binary package), which was a bit messier as it was
> factored to avoid all backtracking, but managed to get down to around ~20us
> per document.
>
> 4. Build a table-based, Word8-labelled DFA (with a minor hack to deal with
> values-within-values) and run it with ByteString.Lazy.foldl'. Also around
> ~20us per document.
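For concreteness, the table-driven approach in option 4 can be sketched as follows. This is a toy illustration only, not the author's code: the real Word8-labelled JSON tables are built in the linked repository, and here a hypothetical stand-in DFA that accepts non-empty digit strings is used instead.

```haskell
import qualified Data.Array.Unboxed as AU
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Word (Word8)

-- Sketch of a table-based, Word8-labelled DFA run with a strict left
-- fold over a lazy ByteString. The toy table accepts non-empty runs of
-- ASCII digits; state -1 is the reject sink.
type State = Int

transitions :: AU.UArray (State, Word8) State
transitions = AU.array ((0, 0), (1, 255))
  [ ((s, w), next w) | s <- [0, 1], w <- [0 .. 255] ]
  where
    next w | w >= 48 && w <= 57 = 1    -- '0'..'9': accepting state
           | otherwise          = -1   -- anything else: reject

step :: State -> Word8 -> State
step (-1) _ = -1                       -- once rejected, stay rejected
step s    w = transitions AU.! (s, w)

validates :: BL.ByteString -> Bool
validates = (== 1) . BL.foldl' step 0
```

The real validator additionally threads extra state for string literals and nesting ("values-within-values"), but the shape of the loop is the same: one unboxed table lookup per input byte.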
>
> We didn't pursue the Attoparsec idea any further, but the other three all
> allocate ~85kB per document, and this now seems to be the limiting factor
> for performance. In more detail, looking at the threadscope output, we see
> it parse 5 or 6 documents in ~150us (allocating ~500kB) and then run the
> GC, which takes about another 150us. The GC stats indicate that it's only
> copying a couple of kB of live data each time, so most of that 500kB is
> junk.
>
> Now 85kB per ~900-byte document is ~96B allocated per input byte, which is
> about a dozen machine words.
> It's not entirely surprising that there's this much allocation going on
> each time round the loop, but we'd like to try and get rid of it. I've
> tried the usual strictness tricks and have peered in a mystified fashion at
> the output of -ddump-simpl but am none the wiser.
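The "usual strictness tricks" referred to above can be illustrated on a small byte-level fold (a hypothetical example, not the project's code):

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Word (Word8)

-- A bang pattern on the accumulator (together with foldl', which is
-- already strict in it) stops GHC from building a chain of thunks,
-- i.e. one allocation per input byte, as the fold proceeds.
countColons :: BL.ByteString -> Int
countColons = BL.foldl' step 0
  where
    step :: Int -> Word8 -> Int
    step !n 58 = n + 1    -- byte 58 is ':'
    step !n _  = n
```

As the thread goes on to show, though, strictness alone is not always enough: the remaining allocations here came from shared bindings and join points rather than lazy accumulators.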
>
> There's a copy of the relevant code for option 4 at
> https://github.com/DaveCTurner/json-validator. I've hacked around with it a
> bit since producing the numbers above, so it's now apparently a bit slower
> than Aeson but allocates less too (~65kB).
>
> Could anyone help, e.g. by pointing me at the bit in the Core that is
> allocating within the main loop?
>
> Many thanks,
>
> David
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/4911e5f8/attachment-0001.html>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 11 May 2017 18:30:18 +0200
> From: Bardur Arantsson <spam at scientician.net>
> To: haskell-cafe at haskell.org
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> 	allocations
> Message-ID: <of23ik$qog$1 at blaine.gmane.org>
> Content-Type: text/plain; charset=utf-8
>
> On 2017-05-11 18:12, David Turner wrote:
>> Dear Café,
>>
>> We have a stream of ~900-byte JSON documents (lazy ByteStrings) that we
>> would like to validate (NB not parse) as quickly as possible. We just
>> need to check that the bytes are a syntactically valid JSON document and
>> then pass them elsewhere; we do not care what the content is. The
>> documents are encoded in UTF-8 and have no extraneous whitespace.
>>
>
> No particular recommendations, but you might want to look into
> semi-indexing[1] as a strategy. It looks plausible that it would be
> possible to do that without a lot of allocation; see the paper for
> details. (I think there's also a demo implementation in Python on GitHub.)
>
> [1] http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf
>
>
>
> ------------------------------
>
> Message: 4
> Date: Thu, 11 May 2017 17:59:32 +0100
> From: David Turner <dct25-561bs at mythic-beasts.com>
> To: Haskell Cafe <haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> 	allocations
> Message-ID:
> 	<CAErCyQ_svveeCq1fsvsNKxsgD0tDaQJqYJVse7W3JeuHVHSpEA at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Interesting, thanks for the link. However we're trying to do _even less_
> than that - this is just the "scan the document to produce the Elias-Fano
> encoding" step, except without needing to keep a hold of it for later. It's
> not quite as trivial as that paper makes out as (a) it doesn't mention the
> possibility that the documents might not be well-formed, and (b) it doesn't
> really talk about dealing with the special delimiting characters within
> string literals, for which you need to drag along a bunch of state while
> you're scanning.
>
> This'd be pretty trivial to do in a C-like language now that we've got the DFA
> tables built, and I may resort to that at some point if we can't work out
> how to avoid the allocations in Haskell-land, but I'd quite like to be able
> to solve problems of this form without unnecessarily resorting to C.
>
> Cheers,
>
> On 11 May 2017 at 17:30, Bardur Arantsson <spam at scientician.net> wrote:
>
>> On 2017-05-11 18:12, David Turner wrote:
>> > Dear Café,
>> >
>> > We have a stream of ~900-byte JSON documents (lazy ByteStrings) that we
>> > would like to validate (NB not parse) as quickly as possible. We just
>> > need to check that the bytes are a syntactically valid JSON document and
>> > then pass them elsewhere; we do not care what the content is. The
>> > documents are encoded in UTF-8 and have no extraneous whitespace.
>> >
>>
>> No particular recommendations, but you might want to look into
>> semi-indexing[1] as a strategy. It looks plausible that it would be
>> possible to do that without a lot of allocation; see the paper for
>> details. (I think there's also a demo implementation in Python on GitHub.)
>>
>> [1] http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf
>>
>> _______________________________________________
>> Haskell-Cafe mailing list
>> To (un)subscribe, modify options or view archives go to:
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>> Only members subscribed via the mailman list are allowed to post.
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/3feb556a/attachment-0001.html>
>
> ------------------------------
>
> Message: 5
> Date: Thu, 11 May 2017 19:11:04 +0200
> From: Bardur Arantsson <spam at scientician.net>
> To: haskell-cafe at haskell.org
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> 	allocations
> Message-ID: <of25v2$9o2$1 at blaine.gmane.org>
> Content-Type: text/plain; charset=utf-8
>
> On 2017-05-11 18:59, David Turner wrote:
>> Interesting, thanks for the link. However we're trying to do _even less_
>> than that - this is just the "scan the document to produce the
>> Elias-Fano encoding" step, except without needing to keep a hold of it
>> for later. It's not quite as trivial as that paper makes out as (a) it
>> doesn't mention the possibility that the documents might not be
>> well-formed, and (b) it doesn't really talk about dealing with the
>> special delimiting characters within string literals, for which you need
>> to drag along a bunch of state while you're scanning.
>
> My thinking was actually that it could be used to reduce the allocations
> since everything (AFAIR, it's been a while) would basically just be
> linear data structures and indexes.
>
>> This'd be pretty trivial to do in a C-like language now we've got the
>> DFA tables built, and I may resort to that at some point if we can't
>> work out how to avoid the allocations in Haskell-land, but I'd quite
>> like to be able to solve problems of this form without unnecessarily
>> resorting to C.
>
> Well, there's always
>
>      https://hackage.haskell.org/package/inline-c
>
> :)
>
> Regards,
>
>
>
> ------------------------------
>
> Message: 6
> Date: Thu, 11 May 2017 10:26:49 -0700
> From: Eric Seidel <eric at seidel.io>
> To: haskell-cafe at haskell.org
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> 	allocations
> Message-ID:
> 	<1494523609.3067104.973553536.32A155B5 at webmail.messagingengine.com>
> Content-Type: text/plain; charset="utf-8"
>
>
> On Thu, May 11, 2017, at 09:12, David Turner wrote:
>> Could anyone help, e.g. by pointing me at the bit in the Core that is
>> allocating within the main loop?
>
> GHC has a -ticky flag that tracks precisely where allocations are
> happening. It's quite low level (you may have to stare at the STG in
> addition to the Core to interpret the results) but I've found it
> incredibly useful when trying to understand performance swings in GHC's
> own benchmarks.
>
> https://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky
>
> You can enable it per module, which is a nice break from the
> rebuild-the-world -prof, but the flip side IIRC is that if the
> allocations are happening in a module that wasn't compiled with -ticky
> (e.g. somewhere inside Data.Array) you might not see them.
>
> Hope this helps!
> Eric
>
>
> ------------------------------
>
> Message: 7
> Date: Thu, 11 May 2017 13:57:58 -0400
> From: Ben Gamari <ben at smart-cactus.org>
> To: David Turner <dct25-561bs at mythic-beasts.com>
> Cc: haskell-cafe at haskell.org
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> 	allocations
> Message-ID: <87k25ngtsp.fsf at ben-laptop.smart-cactus.org>
> Content-Type: text/plain; charset="utf-8"
>
> Eric Seidel <eric at seidel.io> writes:
>
>> On Thu, May 11, 2017, at 09:12, David Turner wrote:
>>> Could anyone help, e.g. by pointing me at the bit in the Core that is
>>> allocating within the main loop?
>>
>> GHC has a -ticky flag that tracks precisely where allocations are
>> happening. It's quite low level (you may have to stare at the STG in
>> addition to the Core to interpret the results) but I've found it
>> incredibly useful when trying to understand performance swings in GHC's
>> own benchmarks.
>>
>> https://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky
>>
>> You can enable it per module, which is a nice break from the
>> rebuild-the-world -prof, but the flip side IIRC is that if the
>> allocations are happening in a module that wasn't compiled with -ticky
>> (e.g. somewhere inside Data.Array) you might not see them.
>>
> Indeed, -ticky is wonderful for this sort of thing since it doesn't
> affect Core-to-Core optimization, meaning that the instrumented program
> will behave very similarly to the uninstrumented version (just a bit
> slower due to counter overhead).
>
> I had a quick look at your example over lunch. I started like this
> (after commenting out the aeson benchmark in Main),
>
>     $ ghc app/Main.hs -isrc  -O2 -ddump-stg -fforce-recomp \
>           -ddump-to-file -dsuppress-idinfo -ddump-simpl -rtsopts \
>           -ticky -ticky-allocd
>     $ app/Main +RTS -rticky.out
>
> This produces a Ticky report (named ticky.out) in the current directory.
> This includes a table which in my case looked something like,
>
>         Entries       Alloc      Alloc'd  Non-void Arguments  STG Name
> ----------------------------------------------------------------------
>             123         192           48  1 L       go9{v slyX} (main@main:Automaton)
>               2           0           32  1 L       go8{v slyH} (main@main:Automaton)
>         1465016    58600640    146501600  0         $w$j1{v slMc} (main@main:Automaton) in slHB
>       161884268           0  11761148448  0         $w$j1{v slJT} (main@main:Automaton) in slHA
>               0           0     11720128  4 ppww    $s$wgo2{v slHz} (main@main:Automaton) in r4s
>       163532411 11790448768     13185144  5 wwLpp   $wgo7{v slHA} (main@main:Automaton) in r4s
>         1831270   167011824     13185144  6 ppwLww  $s$wgo1{v slHB} (main@main:Automaton) in r4s
>          183127           0     13185144  0         $w$j{v slGN} (main@main:Automaton) in r4s
>             992        8624        10944  1 L       go7{v slwV} (main@main:Automaton) in r4m
>               3          72            0  1 C       main@main:Automaton.showGraph120{v r3O}
>             681           0            0  2 ML      go6{v rkSd} (main@main:Automaton)
>
> The important things we care about here are Alloc (that is how many
> bytes each name allocates) and Alloc'd (that is, how many of each
> closure are allocated in bytes). [Aside: we explicitly enabled the
> Alloc'd column during compilation with the -ticky-allocd flag; if you
> omit it this column will contain only zeros. In general I find it
> helpful to have so I generally enable it, despite the overhead it
> introduces.]
>
> The table suggests a few nice starting points: $w$j1_slJT is allocated
> quite a bit, probably by $wgo7_slHA. Next we can turn to the STG output
> (src/Automaton.dump-stg) to try to understand why this is. Reading STG
> generally tends to be quite hard due to the rather large, deeply nested
> trees that most programs compile to. That being said, a good editor can
> help immensely (I also have my ghc-dump [1] package which I've been
> meaning to add STG support to, which would enable some interesting
> analysis techniques).
>
> Unfortunately, my lunch is now all gone so I'll have to drop this here
> for now. However, I'll try to pick it up again tonight. Do keep us
> apprised of your progress!
>
> Cheers,
>
> - Ben
>
>
> [1] https://github.com/bgamari/ghc-dump
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: signature.asc
> Type: application/pgp-signature
> Size: 487 bytes
> Desc: not available
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/7005fe35/attachment-0001.sig>
>
> ------------------------------
>
> Message: 8
> Date: Thu, 11 May 2017 14:40:14 -0400
> From: Ben Gamari <ben at smart-cactus.org>
> To: David Turner <dct25-561bs at mythic-beasts.com>, Haskell Cafe
> 	<Haskell-Cafe at haskell.org>
> Cc: Luke Maurer <maurerl at cs.uoregon.edu>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> 	allocations
> Message-ID: <87bmqzgru9.fsf at ben-laptop.smart-cactus.org>
> Content-Type: text/plain; charset="utf-8"
>
> Ccing Luke Maurer under the assumption that he will appreciate seeing
> the fruits of his labor.
>
>
> David Turner <dct25-561bs at mythic-beasts.com> writes:
>
>> Dear Café,
>>
> (snip)
>>
>> There's a copy of the relevant code for option 4 at
>> https://github.com/DaveCTurner/json-validator. I've hacked around with it
>> a
>> bit since producing the numbers above, so it's now apparently a bit slower
>> than Aeson but allocates less too (~65kB).
>>
>> Could anyone help, e.g. by pointing me at the bit in the Core that is
>> allocating within the main loop?
>>
> While turning this over in my head I realized that this is the sort of
> program which may be helped significantly by GHC 8.2's improved join
> point handling. Indeed the timings seem to support this hypothesis:
>
> GHC 8.0.2
>
>     benchmarking json-validator/Automaton/testEvent
>     time                 22.84 μs   (22.76 μs .. 22.94 μs)
>                          1.000 R²   (1.000 R² .. 1.000 R²)
>     mean                 22.84 μs   (22.76 μs .. 22.94 μs)
>     std dev              297.4 ns   (221.8 ns .. 378.8 ns)
>
>     Alloc rate     4,015,256,078,038 bytes per MUT second
>
> GHC 8.2
>
>     benchmarking json-validator/Automaton/testEvent
>     time                 9.221 μs   (9.141 μs .. 9.318 μs)
>                          0.998 R²   (0.996 R² .. 1.000 R²)
>     mean                 9.163 μs   (9.084 μs .. 9.356 μs)
>     std dev              399.8 ns   (193.0 ns .. 745.4 ns)
>     variance introduced by outliers: 54% (severely inflated)
>
>     Alloc rate           123,141,635 bytes per MUT second
>
>
> Wow! I suspect your allocations have now vanished.
>
> I didn't verify that the improvement really was due to more join points,
> but it seems quite likely. Well done Luke!
>
> Cheers,
>
> - Ben
>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: signature.asc
> Type: application/pgp-signature
> Size: 487 bytes
> Desc: not available
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/f0c0ff60/attachment-0001.sig>
>
> ------------------------------
>
> Message: 9
> Date: Thu, 11 May 2017 19:56:00 +0100
> From: David Turner <dct25-561bs at mythic-beasts.com>
> To: Haskell Cafe <Haskell-Cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> 	allocations
> Message-ID:
> 	<CAErCyQ-FyBCEQ-2zTFd5NyeM_Pp_LFGv=w1cr4_2UByfoPNoFw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Ben, Eric, thanks for your help. A whole new world of low-level statistics
> opens up before me...
>
> On 11 May 2017 at 18:57, Ben Gamari <ben at smart-cactus.org> wrote:
>> Eric Seidel <eric at seidel.io> writes:
>>
>> > On Thu, May 11, 2017, at 09:12, David Turner wrote:
>> >> Could anyone help, e.g. by pointing me at the bit in the Core that is
>> >> allocating within the main loop?
>> >
>> > GHC has a -ticky flag that tracks precisely where allocations are
>> > happening. It's quite low level (you may have to stare at the STG in
>> > addition to the Core to interpret the results) but I've found it
>> > incredibly useful when trying to understand performance swings in GHC's
>> > own benchmarks.
>> >
>> > https://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky
>> >
>> > You can enable it per module, which is a nice break from the
>> > rebuild-the-world -prof, but the flip side IIRC is that if the
>> > allocations are happening in a module that wasn't compiled with -ticky
>> > (e.g. somewhere inside Data.Array) you might not see them.
>>
>> Indeed, -ticky is wonderful for this sort of thing since it doesn't
>> affect Core-to-Core optimization, meaning that the instrumented program
>> will behave very similarly to the uninstrumented version (just a bit
>> slower due to counter overhead).
>>
>> I had a quick look at your example over lunch. I started like this
>> (after commenting out the aeson benchmark in Main),
>>
>>     $ ghc app/Main.hs -isrc  -O2 -ddump-stg -fforce-recomp \
>>           -ddump-to-file -dsuppress-idinfo -ddump-simpl -rtsopts \
>>           -ticky -ticky-allocd
>>     $ app/Main +RTS -rticky.out
>
>
> Ok, I was just about to ask about the Alloc'd column as mine was all
> zeroes, but the -ticky-allocd was what I needed, thanks.
>
>> The table suggests a few nice starting points: $w$j1_slJT is allocated
>> quite a bit, probably by $wgo7_slHA. Next we can turn to the STG output
>> (src/Automaton.dump-stg) to try to understand why this is. Reading STG
>> generally tends to be quite hard due to the rather large, deeply nested
>> trees that most programs compile to.
>
> You're not wrong.
>
> I've trawled through the STG as best as I can for a newbie. The function
> with a large number in the 'Alloc' column looks pretty much to be the heart
> of the `step` function. There's a lot of nesting but it looks largely to be
> just checking the bounds on array indices, which I don't think allocates
> anything.
>
> However I see this little snippet at the very bottom of the nesting:
>
> case
>     indexArray# [ds1_sl1b
>                  y2_sl5H]
> of
> _ [Occ=Dead]
> { (##) ipv8_sl60 [Occ=Once!] ->
>       case
>           ipv8_sl60
>       of
>       _ [Occ=Dead]
>       { GHC.Word.W8# dt6_sl62 [Occ=Once] ->
>             case
>                 ss_sl5v
>             of
>             dt7_sl63
>             { __DEFAULT ->
>                   (#,,#) [__word 1
>                           dt6_sl62
>                           dt7_sl63];
>             };
>       };
> };
>
>
> I'm guessing that the `indexArray` call is the line that reads `nextState =
> aTransitionsTable makeAutomaton AU.! (currentState, nextByte)`, and to me
> it looks like it might be unboxing the thing it's getting out of the array.
> Not sure that counts as an allocation, but I'm surprised anyway.
>
> I'll keep digging anyway, but it'd be good to hear of any progress at your
> end too. Hope it didn't spoil your lunch!
>
> Cheers,
>
>> While turning this over in my head I realized that this is the sort of
>> program which may be helped significantly by GHC 8.2's improved join
>> point handling. Indeed the timings seem to support this hypothesis:
>>
>> GHC 8.0.2
>>
>>     benchmarking json-validator/Automaton/testEvent
>>     time                 22.84 μs   (22.76 μs .. 22.94 μs)
>>                          1.000 R²   (1.000 R² .. 1.000 R²)
>>     mean                 22.84 μs   (22.76 μs .. 22.94 μs)
>>     std dev              297.4 ns   (221.8 ns .. 378.8 ns)
>>
>>     Alloc rate     4,015,256,078,038 bytes per MUT second
>>
>> GHC 8.2
>>
>>     benchmarking json-validator/Automaton/testEvent
>>     time                 9.221 μs   (9.141 μs .. 9.318 μs)
>>                          0.998 R²   (0.996 R² .. 1.000 R²)
>>     mean                 9.163 μs   (9.084 μs .. 9.356 μs)
>>     std dev              399.8 ns   (193.0 ns .. 745.4 ns)
>>     variance introduced by outliers: 54% (severely inflated)
>>
>>     Alloc rate           123,141,635 bytes per MUT second
>>
>>
>> Wow! I suspect your allocations have now vanished.
>
> Ooo, that's more like it.
>
> Could you run again using the following to get Criterion's estimate of the
> allocations-per-call?
>
> json-validator-exe --regress allocated:iters +RTS -T
>
> Cheers,
>
> David
>
> On 11 May 2017 at 19:45, David Turner <dave.c.turner at gmail.com> wrote:
>
>> On 11 May 2017 at 19:40, Ben Gamari <ben at smart-cactus.org> wrote:
>>
>>> Ccing Luke Maurer under the assumption that he will appreciate seeing
>>> the fruits of his labor.
>>>
>>>
>>> David Turner <dct25-561bs at mythic-beasts.com> writes:
>>>
>>> > Dear Café,
>>> >
>>> (snip)
>>> >
>>> > There's a copy of the relevant code for option 4 at
>>> > https://github.com/DaveCTurner/json-validator. I've hacked around with
>>> it a
>>> > bit since producing the numbers above, so it's now apparently a bit
>>> slower
>>> > than Aeson but allocates less too (~65kB).
>>> >
>>> > Could anyone help, e.g. by pointing me at the bit in the Core that is
>>> > allocating within the main loop?
>>> >
>>> While turning this over in my head I realized that this is the sort of
>>> program which may be helped significantly by GHC 8.2's improved join
>>> point handling. Indeed the timings seem to support this hypothesis:
>>>
>>> GHC 8.0.2
>>>
>>>     benchmarking json-validator/Automaton/testEvent
>>>     time                 22.84 μs   (22.76 μs .. 22.94 μs)
>>>                          1.000 R²   (1.000 R² .. 1.000 R²)
>>>     mean                 22.84 μs   (22.76 μs .. 22.94 μs)
>>>     std dev              297.4 ns   (221.8 ns .. 378.8 ns)
>>>
>>>     Alloc rate     4,015,256,078,038 bytes per MUT second
>>>
>>> GHC 8.2
>>>
>>>     benchmarking json-validator/Automaton/testEvent
>>>     time                 9.221 μs   (9.141 μs .. 9.318 μs)
>>>                          0.998 R²   (0.996 R² .. 1.000 R²)
>>>     mean                 9.163 μs   (9.084 μs .. 9.356 μs)
>>>     std dev              399.8 ns   (193.0 ns .. 745.4 ns)
>>>     variance introduced by outliers: 54% (severely inflated)
>>>
>>>     Alloc rate           123,141,635 bytes per MUT second
>>>
>>>
>>> Wow! I suspect your allocations have now vanished.
>>>
>>
>>
>> Ooo, that's more like it.
>>
>> Could you run again using the following to get Criterion's estimate of the
>> allocations-per-call?
>>
>> json-validator-exe --regress allocated:iters +RTS -T
>>
>> Cheers,
>>
>> David
>>
>>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/dbaa6e9f/attachment-0001.html>
>
> ------------------------------
>
> Message: 10
> Date: Thu, 11 May 2017 15:32:41 -0400
> From: Ben Gamari <ben at smart-cactus.org>
> To: David Turner <dave.c.turner at gmail.com>
> Cc: Haskell Cafe <haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> 	allocations
> Message-ID: <878tm3gpeu.fsf at ben-laptop.smart-cactus.org>
> Content-Type: text/plain; charset="utf-8"
>
> David Turner <dave.c.turner at gmail.com> writes:
>
>> Ben, Eric, thanks for your help. A whole new world of low-level statistics
>> opens up before me...
>>
> No worries!
>
>> You're not wrong.
>>
>> I've trawled through the STG as best as I can for a newbie. The function
>> with a large number in the 'Alloc' column looks pretty much to be the
>> heart
>> of the `step` function. There's a lot of nesting but it looks largely to
>> be
>> just checking the bounds on array indices, which I don't think allocates
>> anything.
>>
> Something that I should have mentioned earlier is that STG has the nice
> property that all allocation is syntactically obvious: allocated
> closures manifest as `let`s. This makes it fairly easy to pick out
> possible allocation sites, even in large dumps.
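As a rough (hypothetical) illustration of that point, consider a tiny function and the approximate STG it compiles to; both the boxed result of the addition and the `Just` cell are heap allocations, and both appear as let-bound closures in the `-ddump-stg` output:

```haskell
-- Both the boxed Int and the Just constructor are heap allocations,
-- and in STG each one manifests as a `let`.
bump :: Int -> Maybe Int
bump x = Just (x + 1)

-- STG, approximately (invented names):
--   bump = \ x -> case x of
--     I# x# -> let sat = I# (x# +# 1#)   -- allocation: boxed Int
--              in Just sat               -- allocation: Just cell
```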
>
>> However I see this little snippet at the very bottom of the nesting:
>>
>> case
>>     indexArray# [ds1_sl1b
>>                  y2_sl5H]
>> of
>> _ [Occ=Dead]
>> { (##) ipv8_sl60 [Occ=Once!] ->
>>       case
>>           ipv8_sl60
>>       of
>>       _ [Occ=Dead]
>>       { GHC.Word.W8# dt6_sl62 [Occ=Once] ->
>>             case
>>                 ss_sl5v
>>             of
>>             dt7_sl63
>>             { __DEFAULT ->
>>                   (#,,#) [__word 1
>>                           dt6_sl62
>>                           dt7_sl63];
>>             };
>>       };
>> };
>>
>>
>> I'm guessing that the `indexArray` call is the line that reads `nextState
>> =
>> aTransitionsTable makeAutomaton AU.! (currentState, nextByte)`, and to me
>> it looks like it might be unboxing the thing it's getting out of the
>> array.
>> Not sure that counts as an allocation, but I'm surprised anyway.
>>
> The unboxing should just involve looking at the value field of the W8#
> closure (e.g. a memory dereference) and putting the value in a machine
> register. There is no allocation here as far as I can see.
>
>
>> I'll keep digging anyway, but it'd be good to hear of any progress at your
>> end too. Hope it didn't spoil your lunch!
>
> Far from it! It was nice to have a puzzle to ponder.
>
> Cheers,
>
> - Ben
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: signature.asc
> Type: application/pgp-signature
> Size: 487 bytes
> Desc: not available
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/abf07018/attachment-0001.sig>
>
> ------------------------------
>
> Message: 11
> Date: Fri, 12 May 2017 09:10:50 +0100
> From: David Turner <dct25-561bs at mythic-beasts.com>
> To: Ben Gamari <ben at smart-cactus.org>, Haskell Cafe
> 	<haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> 	allocations
> Message-ID:
> 	<CAErCyQ8dDyZPa8z_8hC8U0o+6Q6QeP1yxh3+Hjjj8s--WNq9ng at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Morning all,
>
> On 11 May 2017 at 20:32, Ben Gamari <ben at smart-cactus.org> wrote:
>
>>
>> Something that I should have mentioned earlier is that STG has the nice
>> property that all allocation is syntactically obvious: allocated
>> closures manifest as `let`s. This makes it fairly easy to pick out
>> possible allocation sites, even in large dumps.
>>
>
>
> Ah, that's very useful to know!
>
> Armed with that knowledge, I came to the conclusion that the allocation was
> for the sharing of the `nextState` variable. Inlining it brings it down to
> 20us and 22kB per iteration.
>
> https://github.com/DaveCTurner/json-validator/commit/ec994ec9226ca7bc2e76f19bef98f42e0b233524
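The change can be caricatured as follows. `lookupNext` is a hypothetical stand-in for the real table lookup, and for a strict Int GHC's analyses may well remove the `let` on their own, so this only gestures at the shape of the fix, not the real code:

```haskell
-- Hypothetical stand-in for the transition-table lookup.
lookupNext :: Int -> Int -> Int
lookupNext s b = if b == 0 then -1 else s + 1

-- Before: the shared binding shows up as a let-bound closure in STG,
-- i.e. a heap allocation on every step of the loop.
stepShared :: Int -> Int -> Int
stepShared s b =
  let next = lookupNext s b              -- shared across both uses
  in  if next < 0 then -1 else next

-- After: each use site recomputes the value, so nothing has to be
-- kept alive across the branch and the let can disappear.
stepInlined :: Int -> Int -> Int
stepInlined s b =
  if lookupNext s b < 0 then -1 else lookupNext s b
```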
>
> Getting closer, but it looks like waiting for 8.2 is a better answer.
> Looking forward to it!
>
> Cheers,
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170512/1e039627/attachment-0001.html>
>
> ------------------------------
>
> Message: 12
> Date: Fri, 12 May 2017 10:27:40 +0200
> From: Arjen <arjenvanweelden at gmail.com>
> To: David Turner <dct25-561bs at mythic-beasts.com>, Ben Gamari
> 	<ben at smart-cactus.org>, Haskell Cafe <haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> 	allocations
> Message-ID: <1494577660.20209.1.camel at gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> On Fri, 2017-05-12 at 09:10 +0100, David Turner wrote:
>> Morning all,
>>
>> On 11 May 2017 at 20:32, Ben Gamari <ben at smart-cactus.org> wrote:
>> > Something that I should have mentioned earlier is that STG has the
>> > nice
>> > property that all allocation is syntactically obvious: allocated
>> > closures manifest as `let`s. This makes it fairly easy to pick out
>> > possible allocation sites, even in large dumps.
>>
>>
>> Ah, that's very useful to know!
>>
>> Armed with that knowledge, I came to the conclusion that the
>> allocation was for the sharing of the `nextState` variable. Inlining
>> it brings it down to 20us and 22kB per iteration.
>>
>> https://github.com/DaveCTurner/json-validator/commit/ec994ec9226ca7bc2e76f19bef98f42e0b233524
>>
>> Getting closer, but it looks like waiting for 8.2 is a better answer.
>> Looking forward to it!
>>
>> Cheers,
>>
>
> Maybe this is a silly question, and please let me know why if so, but:
>
> Has anyone thought about parallelizing it for multiple messages in
> order to "produce garbage faster"? While reducing allocation will make
> the single validations faster, doing multiple ones might improve the
> throughput per GC ratio. This assumes that the amount of live data in
> the heap is small, making GC sort of constant time, and having multiple
> cores available.
>
> I wonder whether a few strategically placed par's and pseq's might
> allow you to scale horizontally without requiring invasive changes to
> the program's structure. Apologies for not trying to do this myself
> first ;-).
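For concreteness, the kind of `par`/`pseq` placement being suggested can be sketched using only base's GHC.Conc (the `validate` function below is a hypothetical stand-in for the real validator):

```haskell
import GHC.Conc (par, pseq)

-- Hypothetical stand-in for the real JSON validator.
validate :: String -> Bool
validate = not . null

-- Validate messages in chunks of n: spark evaluation of the remaining
-- suffix with `par`, force the current chunk with `pseq`, then combine.
validateChunked :: Int -> [String] -> Bool
validateChunked _ []   = True
validateChunked n msgs =
  let (chunk, rest) = splitAt n msgs
      here  = all validate chunk
      there = validateChunked n rest
  in there `par` (here `pseq` (here && there))

main :: IO ()
main = print (validateChunked 8 (replicate 1000 "{\"ok\":true}"))
```

Built with `-threaded` and run with `+RTS -N`, the sparks can be picked up by idle capabilities; without those flags it still runs, just sequentially.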
>
> kind regards, Arjen
>
>
> ------------------------------
>
> Message: 13
> Date: Fri, 12 May 2017 10:48:01 +0100
> From: David Turner <dct25-561bs at mythic-beasts.com>
> To: Arjen <arjenvanweelden at gmail.com>, Haskell Cafe
> 	<haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> 	allocations
> Message-ID:
> 	<CAErCyQ9A+f4bFR7PGrWcFDHJVik__3pQNHac4GXZOt6gW8WLqQ at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> On 12 May 2017 at 09:27, Arjen <arjenvanweelden at gmail.com> wrote:
>
>>
>> Maybe this is a silly question, and please let me know why if so, but:
>>
>> Has anyone thought about parallelizing it for multiple messages in
>> order to "produce garbage faster"? While reducing allocation will make
>> the single validations faster, doing multiple ones might improve the
>> throughput per GC ratio. This assumes that the amount of live data in
>> the heap is small, making GC sort of constant time, and having multiple
>> cores available.
>>
>
> Not a silly question at all. Adding the following incantation:
>
>     `using` parListChunk 100 rseq
>
> does quite happily spread things across all 4 cores on my development
> machine, and it's certainly a bit faster. To give some stats, it processes
> ~24 events between GCs rather than ~6, and collects ~2MB rather than
> ~500kB. The throughput becomes a lot less consistent, however, at least
> partly due to some bigger GC pauses along the way. As Ben's statistics
> showed, our allocation rate on one thread is around 4TBps, which is already
> stressing the GC out a bit, and parallelising it doesn't make that problem
> any easier.
>
> I know in the OP I said "we have a stream" (accidental MLK misquote) but in
> fact there are a bunch of parallel streams whose number exceeds the number
> of available cores, so we don't anticipate any enormous benefit from
> spreading the processing of any one stream across multiple cores:
> single-threaded performance is what we think we should be concentrating on.
>
> Cheers,
>
> David
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170512/90b18fdd/attachment-0001.html>
>
> ------------------------------
>
> Message: 14
> Date: Fri, 12 May 2017 11:52:04 +0200
> From: Arjen <arjenvanweelden at gmail.com>
> To: David Turner <dct25-561bs at mythic-beasts.com>, Ben Gamari
> 	<ben at smart-cactus.org>, Haskell Cafe <haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> 	allocations
> Message-ID: <1494582724.20209.3.camel at gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> On Fri, 2017-05-12 at 10:27 +0200, Arjen wrote:
>> On Fri, 2017-05-12 at 09:10 +0100, David Turner wrote:
>> > Morning all,
>> >
>> > On 11 May 2017 at 20:32, Ben Gamari <ben at smart-cactus.org> wrote:
>> > > Something that I should have mentioned earlier is that STG has the
>> > > nice property that all allocation is syntactically obvious: allocated
>> > > closures manifest as `let`s. This makes it fairly easy to pick out
>> > > possible allocation sites, even in large dumps.
>> >
>> >
>> > Ah, that's very useful to know!
>> >
>> > Armed with that knowledge, I came to the conclusion that the
>> > allocation was for the sharing of the `nextState` variable. Inlining
>> > it brings it down to 20us and 22kB per iteration.
>> >
>> > https://github.com/DaveCTurner/json-validator/commit/ec994ec9226ca7bc2e76f19bef98f42e0b233524
>> >
>> > Getting closer, but it looks like waiting for 8.2 is a better answer.
>> > Looking forward to it!
>> >
>> > Cheers,
>> >
>>
>> Maybe this is a silly question, and please let me know why if so,
>> but:
>>
>> Has anyone thought about parallelizing it for multiple messages in
>> order to "produce garbage faster"? While reducing allocation will
>> make
>> the single validations faster, doing multiple ones might improve the
>> throughput per GC ratio. This assumes that the amount of live data in
>> the heap is small, making GC sort of constant time, and having
>> multiple
>> cores available.
>>
>> I wonder whether a few strategically placed par's and pseq's might
>> allow you to scale horizontally without requiring invasive changes to
>> the program's structure. Apologies for not trying to do this myself
>> first ;-).
>>
>
> Unfortunately, this does not seem to help. I tried it and while I did
> see 3.5 and 3.95 CPU's used with parList and parListChunk 8, the
> timings show it to be slower than sequential evaluation of a list of
> testEvent. Thank you for the problem to try this on :-)
>
> Please let me know if I made a mistake in the testing code below.
>
> kind regards, Arjen
>
> ---8<---
>
> benchmarking json-validator/Automaton/testEvent
> time                 26.57 μs   (26.07 μs .. 27.12 μs)
>                      0.987 R²   (0.973 R² .. 0.995 R²)
> mean                 31.01 μs   (28.17 μs .. 44.49 μs)
> std dev              15.65 μs   (5.258 μs .. 35.77 μs)
> variance introduced by outliers: 99% (severely inflated)
>
> benchmarking json-validator/Automaton-List/testEventList
> time                 65.11 ms   (61.54 ms .. 69.54 ms)
>                      0.982 R²   (0.951 R² .. 0.996 R²)
> mean                 59.18 ms   (56.28 ms .. 63.22 ms)
> std dev              5.801 ms   (4.344 ms .. 8.238 ms)
> variance introduced by outliers: 32% (moderately inflated)
>
> benchmarking json-validator/Automaton-ParList/testEventList
> time                 243.9 ms   (214.9 ms .. 270.6 ms)
>                      0.996 R²   (0.986 R² .. 1.000 R²)
> mean                 253.7 ms   (243.6 ms .. 260.8 ms)
> std dev              10.32 ms   (6.781 ms .. 13.19 ms)
> variance introduced by outliers: 16% (moderately inflated)
>
> benchmarking json-validator/Automaton-ParListChunk/testEventList
> time                 211.4 ms   (193.1 ms .. 232.3 ms)
>                      0.997 R²   (0.990 R² .. 1.000 R²)
> mean                 200.3 ms   (193.0 ms .. 206.6 ms)
> std dev              9.256 ms   (7.106 ms .. 10.21 ms)
> variance introduced by outliers: 14% (moderately inflated)
>
>   19,225,632,224 bytes allocated in the heap
>      109,968,376 bytes copied during GC
>        2,062,736 bytes maximum residency (20 sample(s))
>        2,250,352 bytes maximum slop
>               10 MB total memory in use (0 MB lost due to fragmentation)
>
>                                      Tot time (elapsed)  Avg pause  Max pause
>   Gen  0     28722 colls, 28722 par   17.829s   1.591s     0.0001s    0.0160s
>   Gen  1        20 colls,    19 par    0.255s   0.054s     0.0027s    0.0058s
>
>   Parallel GC work balance: 7.41% (serial 0%, perfect 100%)
>
>   TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4)
>
>   SPARKS: 25625 (25570 converted, 0 overflowed, 0 dud, 20 GC'd, 35 fizzled)
>
>   INIT    time    0.001s  (  0.002s elapsed)
>   MUT     time   44.403s  ( 22.013s elapsed)
>   GC      time   18.084s  (  1.645s elapsed)
>   EXIT    time    0.001s  (  0.001s elapsed)
>   Total   time   62.490s  ( 23.660s elapsed)
>
>   Alloc rate    432,981,654 bytes per MUT second
>
>   Productivity  71.1% of total user, 93.0% of total elapsed
>
> gc_alloc_block_sync: 25370
> whitehole_spin: 0
> gen[0].sync: 405
> gen[1].sync: 22
>
> ---8<---
>
> {-# LANGUAGE OverloadedStrings #-}
>
> module Main where
>
> import           Criterion.Main
> import           Data.Aeson
> import qualified Data.ByteString.Lazy as BL
> import qualified Data.Text.Encoding   as T
> import qualified Data.Text.IO as T
> import qualified Data.Text as T (append, pack)
> import qualified Automaton
> import Control.Parallel.Strategies
>
> testEvent :: Int -> BL.ByteString
> testEvent i = BL.fromStrict $ T.encodeUtf8 (T.append "{\"stream\":\"actuals-stream\",\"submitter\":{\"type\":\"other\",\"description\":\"redactedredac\"},\"driverActivities\":[{\"driverActivity\":{\"journey\":{\"headcode\":\"1A01\",\"description\":\"redactedredactedredactedredactedredactedredacte\"},\"activity\":[{\"arrivalTime\":null,\"sequence\":1,\"tiploc\":\"REDACTE\",\"stop\":true,\"departureTime\":\"2016-06-09T18:22:28.000000000000Z\"},{\"arrivalTime\":\"2016-06-09T18:24:43.000000000000Z\",\"sequence\":2,\"tiploc\":\"REDACTE\",\"stop\":true,\"departureTime\":\"2016-06-09T18:25:51.000000000000Z\"},{\"arrivalTime\":\"2016-06-09T18:26:58.000000000000Z\",\"sequence\":3,\"tiploc\":\"REDACT\",\"stop\":true,\"departureTime\":\"2016-06-09T18:28:08.000000000000Z\"},{\"arrivalTime\":\"2016-06-09T18:29:57.000000000000Z\",\"sequence\":4,\"tiploc\":\"REDACTE\",\"stop\":true,\"departureTime\":null}]},\"activityUserId\":\"521be60a-02f2-4892-b468-c17d9c1c4fcf\"}],\"submissionTime\":\"2016-06-09T18:36:45.831486000000Z\",\"type\":\"driverActivityLogged" (T.append (T.pack (show i)) "\"}"))
>
> data AnyJSON = AnyJSON
>   deriving (Show, Eq)
>
> instance FromJSON AnyJSON where
>   parseJSON _ = pure AnyJSON
>
> main :: IO ()
> main = print (testEventList [0]) >> defaultMain
>   [ bgroup "json-validator/Automaton"
>     [ bench "testEvent" $ whnf Automaton.isValidJson (testEvent 0)
>     ]
>   , bgroup "json-validator/Automaton-List"
>     [ bench "testEventList" $ whnf (\es -> and (testEventList es `using` rseq)) [1..1000]
>     ]
>   , bgroup "json-validator/Automaton-ParList"
>     [ bench "testEventList" $ whnf (\es -> and (testEventList es `using` parList rseq)) [1..1000]
>     ]
>   , bgroup "json-validator/Automaton-ParListChunk"
>     [ bench "testEventList" $ whnf (\es -> and (testEventList es `using` parListChunk 8 rseq)) [1..1000]
>     ]
>   ]
>   where
>     testEventList = map (Automaton.isValidJson . testEvent)
>
>
>
> ------------------------------
>
> Message: 15
> Date: Fri, 12 May 2017 12:00:02 +0200
> From: Arjen <arjenvanweelden at gmail.com>
> To: David Turner <dct25-561bs at mythic-beasts.com>, Haskell Cafe
> 	<haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> 	allocations
> Message-ID: <1494583202.20209.5.camel at gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> On Fri, 2017-05-12 at 10:48 +0100, David Turner wrote:
>> On 12 May 2017 at 09:27, Arjen <arjenvanweelden at gmail.com> wrote:
>> > Maybe this is a silly question, and please let me know why if so,
>> > but:
>> >
>> > Has anyone thought about parallelizing it for multiple messages in
>> > order to "produce garbage faster"? While reducing allocation will
>> > make
>> > the single validations faster, doing multiple ones might improve
>> > the
>> > throughput per GC ratio. This assumes that the amount of live data
>> > in
>> > the heap is small, making GC sort of constant time, and having
>> > multiple
>> > cores available.
>>
>> Not a silly question at all. Adding the following incantation:
>>
>>     `using` parListChunk 100 rseq
>>
>> does quite happily spread things across all 4 cores on my development
>> machine, and it's certainly a bit faster. To give some stats, it
>> processes ~24 events between GCs rather than ~6, and collects ~2MB
>> rather than ~500kB. The throughput becomes a lot less consistent,
>> however, at least partly due to some bigger GC pauses along the way.
>> As Ben's statistics showed, our allocation rate on one thread is
>> around 4TBps, which is already stressing the GC out a bit, and
>> parallelising it doesn't make that problem any easier.
>>
>> I know in the OP I said "we have a stream" (accidental MLK misquote)
>> but in fact there are a bunch of parallel streams whose number
>> exceeds the number of available cores, so we don't anticipate any
>> enormous benefit from spreading the processing of any one stream
>> across multiple cores: single-threaded performance is what we think
>> we should be concentrating on.
>>
>> Cheers,
>>
>> David
>>
> Happy to read that you had more success with parListChunk than I did.
> I expect(ed) sparks to be cheap, so this might still help if the various
> incoming streams are very unbalanced in length and/or message size.
> I agree that multi-processing a single stream does not make much sense
> if you already have multiple concurrent streams.
>
> Nice to see that adding parallelism in Haskell (GHC) is that easy (in
> this case) and gives a very good speed-up factor!
>
> kind regards, Arjen
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>
>
> ------------------------------
>
> End of Haskell-Cafe Digest, Vol 165, Issue 19
> *********************************************
>


More information about the Haskell-Cafe mailing list