[Haskell-cafe] Haskell-Cafe Digest, Vol 165, Issue 19
Keith Ramsay
ramsaykeith at gmail.com
Sun May 14 01:20:28 UTC 2017
On 5/12/17, haskell-cafe-request at haskell.org
<haskell-cafe-request at haskell.org> wrote:
> Send Haskell-Cafe mailing list submissions to
> haskell-cafe at haskell.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> or, via email, send a message with subject or body 'help' to
> haskell-cafe-request at haskell.org
>
> You can reach the person managing the list at
> haskell-cafe-owner at haskell.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of Haskell-Cafe digest..."
>
>
> Today's Topics:
>
> 1. Haskell IDE Engine users? (Alan & Kim Zimmerman)
> 2. Fast JSON validation - reducing allocations (David Turner)
> 3. Re: Fast JSON validation - reducing allocations (Bardur Arantsson)
> 4. Re: Fast JSON validation - reducing allocations (David Turner)
> 5. Re: Fast JSON validation - reducing allocations (Bardur Arantsson)
> 6. Re: Fast JSON validation - reducing allocations (Eric Seidel)
> 7. Re: Fast JSON validation - reducing allocations (Ben Gamari)
> 8. Re: Fast JSON validation - reducing allocations (Ben Gamari)
> 9. Re: Fast JSON validation - reducing allocations (David Turner)
> 10. Re: Fast JSON validation - reducing allocations (Ben Gamari)
> 11. Re: Fast JSON validation - reducing allocations (David Turner)
> 12. Re: Fast JSON validation - reducing allocations (Arjen)
> 13. Re: Fast JSON validation - reducing allocations (David Turner)
> 14. Re: Fast JSON validation - reducing allocations (Arjen)
> 15. Re: Fast JSON validation - reducing allocations (Arjen)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 11 May 2017 16:02:17 +0200
> From: "Alan & Kim Zimmerman" <alan.zimm at gmail.com>
> To: haskell <haskell-cafe at haskell.org>
> Subject: [Haskell-cafe] Haskell IDE Engine users?
> Message-ID:
> <CANma=H9bVt9bY=PmTBy4kCrMGUVAdD1A_sWWmc96EazrkmM3Ug at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Stupid question, but is anyone using/integrating HIE into anything at the
> moment?
>
> I am considering rearranging the internals to orient it toward being mainly
> a Language Server Protocol backend, and I want to see who would be
> affected.
>
> Alan
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/1d25664b/attachment-0001.html>
>
> ------------------------------
>
> Message: 2
> Date: Thu, 11 May 2017 17:12:15 +0100
> From: David Turner <dct25-561bs at mythic-beasts.com>
> To: Haskell Cafe <Haskell-Cafe at haskell.org>
> Subject: [Haskell-cafe] Fast JSON validation - reducing allocations
> Message-ID:
> <CAErCyQ8wKwzU9uH8iHbT7n++aJB3UHsso5tB4Y7tn=-m5o1W7A at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Dear Café,
>
> We have a stream of ~900-byte JSON documents (lazy ByteStrings) that we
> would like to validate (NB not parse) as quickly as possible. We just need
> to check that the bytes are a syntactically valid JSON document and then
> pass them elsewhere; we do not care what the content is. The documents are
> encoded in UTF-8 and have no extraneous whitespace.
>
> We have made a number of attempts:
>
> 1. Use Aeson to parse the documents and then discard the result. This was
> actually pretty good: Criterion clocked it at ~30us per document.
>
> 2. A hand-written Attoparsec parser. This did quite a bit worse: ~70us per
> document.
>
> 3. Use Get (from the binary package), which was a bit messier as it was
> factored to avoid all backtracking, but managed to get down to around ~20us
> per document.
>
> 4. Build a table-based, Word8-labelled DFA (with a minor hack to deal with
> values-within-values) and run it with ByteString.Lazy.foldl'. Also around
> ~20us per document.
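[For readers wanting the shape of option 4, here is a minimal, hypothetical sketch of a table-driven DFA run with ByteString.Lazy.foldl'. The toy automaton below accepts only digit bytes, not JSON; `table`, `valid` and the state numbering are illustrative, not the repository's code.]

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.ByteString.Lazy as BL
import qualified Data.ByteString.Lazy.Char8 as BLC
import Data.Array.Unboxed (UArray, listArray, (!))
import Data.Word (Word8)

-- Toy 2-state automaton: state 0 accepts "only digits so far", state 1 is
-- a reject sink. A real JSON validator needs many more states (plus the
-- values-within-values hack mentioned above), but the driving loop is the same.
table :: UArray (Int, Word8) Int
table = listArray ((0, 0), (1, 255))
  [ if s == 0 && w >= 48 && w <= 57 then 0 else 1   -- 48..57 are '0'..'9'
  | s <- [0, 1], w <- [0 .. 255] ]

valid :: BL.ByteString -> Bool
valid bs = BL.foldl' step 0 bs == 0
  where
    step !s w = table ! (s, w)   -- strict accumulator: no thunk per byte
```

The fold keeps the state strict so each byte is one table lookup rather than a growing thunk chain.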
>
> We didn't pursue the Attoparsec idea any further, but the other three all
> allocate ~85kB per document, and this now seems to be the limiting factor
> for performance. In more detail, looking at the threadscope output, we see
> it parse 5 or 6 documents in ~150us (allocating ~500kB) and then run the
> GC, which takes about another 150us. The GC stats indicate that it's only
> copying a couple of kB of live data each time, so most of that 500kB is
> junk.
>
> Now 85kB per document is ~96 bytes allocated per input byte, i.e. about a
> dozen machine words per byte.
> It's not entirely surprising that there's this much allocation going on
> each time round the loop, but we'd like to try and get rid of it. I've
> tried the usual strictness tricks and have peered in a mystified fashion at
> the output of -ddump-simpl but am none the wiser.
>
> There's a copy of the relevant code for option 4 at
> https://github.com/DaveCTurner/json-validator. I've hacked around with it a
> bit since producing the numbers above, so it's now apparently a bit slower
> than Aeson but allocates less too (~65kB).
>
> Could anyone help, e.g. by pointing me at the bit in the Core that is
> allocating within the main loop?
>
> Many thanks,
>
> David
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/4911e5f8/attachment-0001.html>
>
> ------------------------------
>
> Message: 3
> Date: Thu, 11 May 2017 18:30:18 +0200
> From: Bardur Arantsson <spam at scientician.net>
> To: haskell-cafe at haskell.org
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> allocations
> Message-ID: <of23ik$qog$1 at blaine.gmane.org>
> Content-Type: text/plain; charset=utf-8
>
> On 2017-05-11 18:12, David Turner wrote:
>> Dear Café,
>>
>> We have a stream of ~900byte JSON documents (lazy ByteStrings) that we
>> would like to validate (NB not parse) as quickly as possible. We just
>> need to check that the bytes are a syntactically valid JSON document and
>> then pass them elsewhere; we do not care what the content is. The
>> documents are encoded in UTF-8 and have no extraneous whitespace.
>>
>
> No particular recommendations, but you might want to look into
> semi-indexing[1] as a strategy. It looks plausible that it would be
> possible to do that without a lot of allocation; see the paper for
> details. (I think there's also a demo implementation in Python on GH.)
>
> [1] http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf
>
>
>
> ------------------------------
>
> Message: 4
> Date: Thu, 11 May 2017 17:59:32 +0100
> From: David Turner <dct25-561bs at mythic-beasts.com>
> To: Haskell Cafe <haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> allocations
> Message-ID:
> <CAErCyQ_svveeCq1fsvsNKxsgD0tDaQJqYJVse7W3JeuHVHSpEA at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Interesting, thanks for the link. However we're trying to do _even less_
> than that - this is just the "scan the document to produce the Elias-Fano
> encoding" step, except without needing to keep a hold of it for later. It's
> not quite as trivial as that paper makes out, as (a) it doesn't mention the
> possibility that the documents might not be well-formed, and (b) it doesn't
> really talk about dealing with the special delimiting characters within
> string literals, for which you need to drag along a bunch of state while
> you're scanning.
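[The state-dragging in point (b) can be sketched with a toy three-mode scanner. This is illustrative only: `Mode` and `scanByte` are made-up names, a real validator works over Word8 rather than Char, and it would also need to reject invalid escapes like `\u12`.]

```haskell
-- Toy model of the per-byte state needed for string literals: outside a
-- string, inside one, or just after a backslash. Only when the scanner is
-- in Plain mode may '{', '}', ':' and friends be treated as structural.
data Mode = Plain | InString | Escaped
  deriving (Eq, Show)

scanByte :: Mode -> Char -> Mode
scanByte Plain    '"'  = InString
scanByte Plain    _    = Plain
scanByte InString '\\' = Escaped
scanByte InString '"'  = Plain
scanByte InString _    = InString
scanByte Escaped  _    = InString   -- an escaped char never closes the string
```

Folding `scanByte` over the input with `foldl` and ending in any mode other than `Plain` indicates an unterminated string literal.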
>
> This'd be pretty trivial to do in a C-like language now we've got the DFA
> tables built, and I may resort to that at some point if we can't work out
> how to avoid the allocations in Haskell-land, but I'd quite like to be able
> to solve problems of this form without unnecessarily resorting to C.
>
> Cheers,
>
> On 11 May 2017 at 17:30, Bardur Arantsson <spam at scientician.net> wrote:
>
>> On 2017-05-11 18:12, David Turner wrote:
>> > Dear Café,
>> >
>> > We have a stream of ~900byte JSON documents (lazy ByteStrings) that we
>> > would like to validate (NB not parse) as quickly as possible. We just
>> > need to check that the bytes are a syntactically valid JSON document and
>> > then pass them elsewhere; we do not care what the content is. The
>> > documents are encoded in UTF-8 and have no extraneous whitespace.
>> >
>>
>> No particular recommendations, but you might want to look into
>> semi-indexing[1] as a strategy. It looks plausible that it would be
>> possible to do that without a lot of allocation; see the paper for
>> details. (I think there's also a demo implementation in Python on GH.)
>>
>> [1] http://www.di.unipi.it/~ottavian/files/semi_index_cikm.pdf
>>
>> _______________________________________________
>> Haskell-Cafe mailing list
>> To (un)subscribe, modify options or view archives go to:
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>> Only members subscribed via the mailman list are allowed to post.
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/3feb556a/attachment-0001.html>
>
> ------------------------------
>
> Message: 5
> Date: Thu, 11 May 2017 19:11:04 +0200
> From: Bardur Arantsson <spam at scientician.net>
> To: haskell-cafe at haskell.org
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> allocations
> Message-ID: <of25v2$9o2$1 at blaine.gmane.org>
> Content-Type: text/plain; charset=utf-8
>
> On 2017-05-11 18:59, David Turner wrote:
>> Interesting, thanks for the link. However we're trying to do _even less_
>> than that - this is just the "scan the document to produce the
>> Elias-Fano encoding" step, except without needing to keep a hold of it
>> for later. It's not quite as trivial as that paper makes out as (a) it
>> doesn't mention the possibility that the documents might not be
>> well-formed, and (b) it doesn't really talk about dealing with the
>> special delimiting characters within string literals, for which you need
>> to drag along a bunch of state while you're scanning.
>
> My thinking was actually that it could be used to reduce the allocations
> since everything (AFAIR, it's been a while) would basically just be
> linear data structures and indexes.
>
>> This'd be pretty trivial to do in a C-like language now we've got the
>> DFA tables built, and I may resort to that at some point if we can't
>> work out how to avoid the allocations in Haskell-land, but I'd quite
>> like to be able to solve problems of this form without unnecessarily
>> resorting to C.
>
> Well, there's always
>
> https://hackage.haskell.org/package/inline-c
>
> :)
>
> Regards,
>
>
>
> ------------------------------
>
> Message: 6
> Date: Thu, 11 May 2017 10:26:49 -0700
> From: Eric Seidel <eric at seidel.io>
> To: haskell-cafe at haskell.org
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> allocations
> Message-ID:
> <1494523609.3067104.973553536.32A155B5 at webmail.messagingengine.com>
> Content-Type: text/plain; charset="utf-8"
>
>
> On Thu, May 11, 2017, at 09:12, David Turner wrote:
>> Could anyone help, e.g. by pointing me at the bit in the Core that is
>> allocating within the main loop?
>
> GHC has a -ticky flag that tracks precisely where allocations are
> happening. It's quite low level (you may have to stare at the STG in
> addition to the Core to interpret the results) but I've found it
> incredibly useful when trying to understand performance swings in GHC's
> own benchmarks.
>
> https://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky
>
> You can enable it per module, which is a nice break from the
> rebuild-the-world -prof, but the flip side IIRC is that if the
> allocations are happening in a module that wasn't compiled with -ticky
> (e.g. somewhere inside Data.Array) you might not see them.
>
> Hope this helps!
> Eric
>
>
> ------------------------------
>
> Message: 7
> Date: Thu, 11 May 2017 13:57:58 -0400
> From: Ben Gamari <ben at smart-cactus.org>
> To: David Turner <dct25-561bs at mythic-beasts.com>
> Cc: haskell-cafe at haskell.org
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> allocations
> Message-ID: <87k25ngtsp.fsf at ben-laptop.smart-cactus.org>
> Content-Type: text/plain; charset="utf-8"
>
> Eric Seidel <eric at seidel.io> writes:
>
>> On Thu, May 11, 2017, at 09:12, David Turner wrote:
>>> Could anyone help, e.g. by pointing me at the bit in the Core that is
>>> allocating within the main loop?
>>
>> GHC has a -ticky flag that tracks precisely where allocations are
>> happening. It's quite low level (you may have to stare at the STG in
>> addition to the Core to interpret the results) but I've found it
>> incredibly useful when trying to understand performance swings in GHC's
>> own benchmarks.
>>
>> https://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky
>>
>> You can enable it per module, which is a nice break from the
>> rebuild-the-world -prof, but the flip side IIRC is that if the
>> allocations are happening in a module that wasn't compiled with -ticky
>> (e.g. somewhere inside Data.Array) you might not see them.
>>
> Indeed, -ticky is wonderful for this sort of thing since it doesn't
> affect Core-to-Core optimization, meaning that the instrumented program
> will behave very similarly to the uninstrumented version (just a bit
> slower due to counter overhead).
>
> I had a quick look at your example over lunch. I started like this
> (after commenting out the aeson benchmark in Main),
>
> $ ghc app/Main.hs -isrc -O2 -ddump-stg -fforce-recomp \
> -ddump-to-file -dsuppress-idinfo -ddump-simpl -rtsopts \
> -ticky -ticky-allocd
> $ app/Main +RTS -rticky.out
>
> This produces a Ticky report (named ticky.out) in the current directory.
> This includes a table which in my case looked something like,
>
>     Entries       Alloc      Alloc'd  Non-void Arguments  STG Name
> --------------------------------------------------------------------------------
>         123         192           48  1 L       go9{v slyX} (main@main:Automaton)
>           2           0           32  1 L       go8{v slyH} (main@main:Automaton)
>     1465016    58600640    146501600  0         $w$j1{v slMc} (main@main:Automaton) in slHB
>   161884268           0  11761148448  0         $w$j1{v slJT} (main@main:Automaton) in slHA
>           0           0     11720128  4 ppww    $s$wgo2{v slHz} (main@main:Automaton) in r4s
>   163532411 11790448768     13185144  5 wwLpp   $wgo7{v slHA} (main@main:Automaton) in r4s
>     1831270   167011824     13185144  6 ppwLww  $s$wgo1{v slHB} (main@main:Automaton) in r4s
>      183127           0     13185144  0         $w$j{v slGN} (main@main:Automaton) in r4s
>         992        8624        10944  1 L       go7{v slwV} (main@main:Automaton) in r4m
>           3          72            0  1 C       main@main:Automaton.showGraph120{v r3O}
>         681           0            0  2 ML      go6{v rkSd} (main@main:Automaton)
>
> The important things we care about here are Alloc (that is how many
> bytes each name allocates) and Alloc'd (that is, how many of each
> closure are allocated in bytes). [Aside: we explicitly enabled the
> Alloc'd column during compilation with the -ticky-allocd flag; if you
> omit it this column will contain only zeros. In general I find it
> helpful to have so I generally enable it, despite the overhead it
> introduces.]
>
> The table suggests a few nice starting points: $w$j1_slJT is allocated
> quite a bit, probably by $wgo7_slHA. Next we can turn to the STG output
> (src/Automaton.dump-stg) to try to understand why this is. Reading STG
> generally tends to be quite hard due to the rather large, deeply nested
> trees that most programs compile to. That being said, a good editor can
> help immensely (I also have my ghc-dump [1] package which I've been
> meaning to add STG support to, which would enable some interesting
> analysis techniques).
>
> Unfortunately, my lunch is now all gone so I'll have to drop this here
> for now. However, I'll try to pick it up again tonight. Do keep us
> apprised of your progress!
>
> Cheers,
>
> - Ben
>
>
> [1] https://github.com/bgamari/ghc-dump
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: signature.asc
> Type: application/pgp-signature
> Size: 487 bytes
> Desc: not available
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/7005fe35/attachment-0001.sig>
>
> ------------------------------
>
> Message: 8
> Date: Thu, 11 May 2017 14:40:14 -0400
> From: Ben Gamari <ben at smart-cactus.org>
> To: David Turner <dct25-561bs at mythic-beasts.com>, Haskell Cafe
> <Haskell-Cafe at haskell.org>
> Cc: Luke Maurer <maurerl at cs.uoregon.edu>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> allocations
> Message-ID: <87bmqzgru9.fsf at ben-laptop.smart-cactus.org>
> Content-Type: text/plain; charset="utf-8"
>
> Ccing Luke Maurer under the assumption that he will appreciate seeing
> the fruits of his labor.
>
>
> David Turner <dct25-561bs at mythic-beasts.com> writes:
>
>> Dear Café,
>>
> (snip)
>>
>> There's a copy of the relevant code for option 4 at
>> https://github.com/DaveCTurner/json-validator. I've hacked around with it
>> a
>> bit since producing the numbers above, so it's now apparently a bit slower
>> than Aeson but allocates less too (~65kB).
>>
>> Could anyone help, e.g. by pointing me at the bit in the Core that is
>> allocating within the main loop?
>>
> While turning this over in my head I realized that this is the sort of
> program which may be helped significantly by GHC 8.2's improved join
> point handling. Indeed the timings seem to support this hypothesis:
>
> GHC 8.0.2
>
> benchmarking json-validator/Automaton/testEvent
> time 22.84 μs (22.76 μs .. 22.94 μs)
> 1.000 R² (1.000 R² .. 1.000 R²)
> mean 22.84 μs (22.76 μs .. 22.94 μs)
> std dev 297.4 ns (221.8 ns .. 378.8 ns)
>
> Alloc rate 4,015,256,078,038 bytes per MUT second
>
> GHC 8.2
>
> benchmarking json-validator/Automaton/testEvent
> time 9.221 μs (9.141 μs .. 9.318 μs)
> 0.998 R² (0.996 R² .. 1.000 R²)
> mean 9.163 μs (9.084 μs .. 9.356 μs)
> std dev 399.8 ns (193.0 ns .. 745.4 ns)
> variance introduced by outliers: 54% (severely inflated)
>
> Alloc rate 123,141,635 bytes per MUT second
>
>
> Wow! I suspect your allocations have now vanished.
>
> I didn't verify that the improvement really was due to more join points,
> but it seems quite likely. Well done Luke!
>
> Cheers,
>
> - Ben
>
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: signature.asc
> Type: application/pgp-signature
> Size: 487 bytes
> Desc: not available
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/f0c0ff60/attachment-0001.sig>
>
> ------------------------------
>
> Message: 9
> Date: Thu, 11 May 2017 19:56:00 +0100
> From: David Turner <dct25-561bs at mythic-beasts.com>
> To: Haskell Cafe <Haskell-Cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> allocations
> Message-ID:
> <CAErCyQ-FyBCEQ-2zTFd5NyeM_Pp_LFGv=w1cr4_2UByfoPNoFw at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Ben, Eric, thanks for your help. A whole new world of low-level statistics
> opens up before me...
>
> On 11 May 2017 at 18:57, Ben Gamari <ben at smart-cactus.org> wrote:
>> Eric Seidel <eric at seidel.io> writes:
>>
>> > On Thu, May 11, 2017, at 09:12, David Turner wrote:
>> >> Could anyone help, e.g. by pointing me at the bit in the Core that is
>> >> allocating within the main loop?
>> >
>> > GHC has a -ticky flag that tracks precisely where allocations are
>> > happening. It's quite low level (you may have to stare at the STG in
>> > addition to the Core to interpret the results) but I've found it
>> > incredibly useful when trying to understand performance swings in GHC's
>> > own benchmarks.
>> >
>> > https://ghc.haskell.org/trac/ghc/wiki/Debugging/TickyTicky
>> >
>> > You can enable it per module, which is a nice break from the
>> > rebuild-the-world -prof, but the flip side IIRC is that if the
>> > allocations are happening in a module that wasn't compiled with -ticky
>> > (e.g. somewhere inside Data.Array) you might not see them.
>>
>> Indeed, -ticky is wonderful for this sort of thing since it doesn't
>> affect Core-to-Core optimization, meaning that the instrumented program
>> will behave very similarly to the uninstrumented version (just a bit
>> slower due to counter overhead).
>>
>> I had a quick look at your example over lunch. I started like this
>> (after commenting out the aeson benchmark in Main),
>>
>> $ ghc app/Main.hs -isrc -O2 -ddump-stg -fforce-recomp \
>> -ddump-to-file -dsuppress-idinfo -ddump-simpl -rtsopts \
>> -ticky -ticky-allocd
>> $ app/Main +RTS -rticky.out
>
>
> Ok, I was just about to ask about the Alloc'd column as mine was all
> zeroes, but the -ticky-allocd was what I needed, thanks.
>
>> The table suggests a few nice starting points: $w$j1_slJT is allocated
>> quite a bit, probably by $wgo7_slHA. Next we can turn to the STG output
>> (src/Automaton.dump-stg) to try to understand why this is. Reading STG
>> generally tends to be quite hard due to the rather large, deeply nested
>> trees that most programs compile to.
>
> You're not wrong.
>
> I've trawled through the STG as best as I can for a newbie. The function
> with a large number in the 'Alloc' column looks pretty much to be the heart
> of the `step` function. There's a lot of nesting but it looks largely to be
> just checking the bounds on array indices, which I don't think allocates
> anything.
>
> However I see this little snippet at the very bottom of the nesting:
>
> case indexArray# [ds1_sl1b y2_sl5H] of _ [Occ=Dead]
> { (##) ipv8_sl60 [Occ=Once!] ->
>     case ipv8_sl60 of _ [Occ=Dead]
>     { GHC.Word.W8# dt6_sl62 [Occ=Once] ->
>         case ss_sl5v of dt7_sl63
>         { __DEFAULT ->
>             (#,,#) [__word 1 dt6_sl62 dt7_sl63];
>         };
>     };
> };
>
>
> I'm guessing that the `indexArray` call is the line that reads `nextState =
> aTransitionsTable makeAutomaton AU.! (currentState, nextByte)`, and to me
> it looks like it might be unboxing the thing it's getting out of the array.
> Not sure that counts as an allocation, but I'm surprised anyway.
>
> I'll keep digging anyway, but it'd be good to hear of any progress at your
> end too. Hope it didn't spoil your lunch!
>
> Cheers,
>
>> While turning this over in my head I realized that this is the sort of
>> program which may be helped significantly by GHC 8.2's improved join
>> point handling. Indeed the timings seem to support this hypothesis:
>>
>> GHC 8.0.2
>>
>> benchmarking json-validator/Automaton/testEvent
>> time 22.84 μs (22.76 μs .. 22.94 μs)
>> 1.000 R² (1.000 R² .. 1.000 R²)
>> mean 22.84 μs (22.76 μs .. 22.94 μs)
>> std dev 297.4 ns (221.8 ns .. 378.8 ns)
>>
>> Alloc rate 4,015,256,078,038 bytes per MUT second
>>
>> GHC 8.2
>>
>> benchmarking json-validator/Automaton/testEvent
>> time 9.221 μs (9.141 μs .. 9.318 μs)
>> 0.998 R² (0.996 R² .. 1.000 R²)
>> mean 9.163 μs (9.084 μs .. 9.356 μs)
>> std dev 399.8 ns (193.0 ns .. 745.4 ns)
>> variance introduced by outliers: 54% (severely inflated)
>>
>> Alloc rate 123,141,635 bytes per MUT second
>>
>>
>> Wow! I suspect your allocations have now vanished.
>
> Ooo, that's more like it.
>
> Could you run again using the following to get Criterion's estimate of the
> allocations-per-call?
>
> json-validator-exe --regress allocated:iters +RTS -T
>
> Cheers,
>
> David
>
> On 11 May 2017 at 19:45, David Turner <dave.c.turner at gmail.com> wrote:
>
>> On 11 May 2017 at 19:40, Ben Gamari <ben at smart-cactus.org> wrote:
>>
>>> Ccing Luke Maurer under the assumption that he will appreciate seeing
>>> the fruits of his labor.
>>>
>>>
>>> David Turner <dct25-561bs at mythic-beasts.com> writes:
>>>
>>> > Dear Café,
>>> >
>>> (snip)
>>> >
>>> > There's a copy of the relevant code for option 4 at
>>> > https://github.com/DaveCTurner/json-validator. I've hacked around with
>>> it a
>>> > bit since producing the numbers above, so it's now apparently a bit
>>> slower
>>> > than Aeson but allocates less too (~65kB).
>>> >
>>> > Could anyone help, e.g. by pointing me at the bit in the Core that is
>>> > allocating within the main loop?
>>> >
>>> While turning this over in my head I realized that this is the sort of
>>> program which may be helped significantly by GHC 8.2's improved join
>>> point handling. Indeed the timings seem to support this hypothesis:
>>>
>>> GHC 8.0.2
>>>
>>> benchmarking json-validator/Automaton/testEvent
>>> time 22.84 μs (22.76 μs .. 22.94 μs)
>>> 1.000 R² (1.000 R² .. 1.000 R²)
>>> mean 22.84 μs (22.76 μs .. 22.94 μs)
>>> std dev 297.4 ns (221.8 ns .. 378.8 ns)
>>>
>>> Alloc rate 4,015,256,078,038 bytes per MUT second
>>>
>>> GHC 8.2
>>>
>>> benchmarking json-validator/Automaton/testEvent
>>> time 9.221 μs (9.141 μs .. 9.318 μs)
>>> 0.998 R² (0.996 R² .. 1.000 R²)
>>> mean 9.163 μs (9.084 μs .. 9.356 μs)
>>> std dev 399.8 ns (193.0 ns .. 745.4 ns)
>>> variance introduced by outliers: 54% (severely inflated)
>>>
>>> Alloc rate 123,141,635 bytes per MUT second
>>>
>>>
>>> Wow! I suspect your allocations have now vanished.
>>>
>>
>>
>> Ooo, that's more like it.
>>
>> Could you run again using the following to get Criterion's estimate of the
>> allocations-per-call?
>>
>> json-validator-exe --regress allocated:iters +RTS -T
>>
>> Cheers,
>>
>> David
>>
>>
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/dbaa6e9f/attachment-0001.html>
>
> ------------------------------
>
> Message: 10
> Date: Thu, 11 May 2017 15:32:41 -0400
> From: Ben Gamari <ben at smart-cactus.org>
> To: David Turner <dave.c.turner at gmail.com>
> Cc: Haskell Cafe <haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> allocations
> Message-ID: <878tm3gpeu.fsf at ben-laptop.smart-cactus.org>
> Content-Type: text/plain; charset="utf-8"
>
> David Turner <dave.c.turner at gmail.com> writes:
>
>> Ben, Eric, thanks for your help. A whole new world of low-level statistics
>> opens up before me...
>>
> No worries!
>
>> You're not wrong.
>>
>> I've trawled through the STG as best as I can for a newbie. The function
>> with a large number in the 'Alloc' column looks pretty much to be the
>> heart
>> of the `step` function. There's a lot of nesting but it looks largely to
>> be
>> just checking the bounds on array indices, which I don't think allocates
>> anything.
>>
> Something that I should have mentioned earlier is that STG has the nice
> property that all allocation is syntactically obvious: allocated
> closures manifest as `let`s. This makes it fairly easy to pick out
> possible allocation sites, even in large dumps.
>
>> However I see this little snippet at the very bottom of the nesting:
>>
>> case indexArray# [ds1_sl1b y2_sl5H] of _ [Occ=Dead]
>> { (##) ipv8_sl60 [Occ=Once!] ->
>>     case ipv8_sl60 of _ [Occ=Dead]
>>     { GHC.Word.W8# dt6_sl62 [Occ=Once] ->
>>         case ss_sl5v of dt7_sl63
>>         { __DEFAULT ->
>>             (#,,#) [__word 1 dt6_sl62 dt7_sl63];
>>         };
>>     };
>> };
>>
>>
>> I'm guessing that the `indexArray` call is the line that reads `nextState
>> =
>> aTransitionsTable makeAutomaton AU.! (currentState, nextByte)`, and to me
>> it looks like it might be unboxing the thing it's getting out of the
>> array.
>> Not sure that counts as an allocation, but I'm surprised anyway.
>>
> The unboxing should just involve looking at the value field of the W8#
> closure (e.g. a memory dereference) and putting the value in a machine
> register. There is no allocation here as far as I can see.
>
>
>> I'll keep digging anyway, but it'd be good to hear of any progress at your
>> end too. Hope it didn't spoil your lunch!
>
> Far from it! It was nice to have a puzzle to ponder.
>
> Cheers,
>
> - Ben
> -------------- next part --------------
> A non-text attachment was scrubbed...
> Name: signature.asc
> Type: application/pgp-signature
> Size: 487 bytes
> Desc: not available
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170511/abf07018/attachment-0001.sig>
>
> ------------------------------
>
> Message: 11
> Date: Fri, 12 May 2017 09:10:50 +0100
> From: David Turner <dct25-561bs at mythic-beasts.com>
> To: Ben Gamari <ben at smart-cactus.org>, Haskell Cafe
> <haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> allocations
> Message-ID:
> <CAErCyQ8dDyZPa8z_8hC8U0o+6Q6QeP1yxh3+Hjjj8s--WNq9ng at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> Morning all,
>
> On 11 May 2017 at 20:32, Ben Gamari <ben at smart-cactus.org> wrote:
>
>>
>> Something that I should have mentioned earlier is that STG has the nice
>> property that all allocation is syntactically obvious: allocated
>> closures manifest as `let`s. This makes it fairly easy to pick out
>> possible allocation sites, even in large dumps.
>>
>
>
> Ah, that's very useful to know!
>
> Armed with that knowledge, I came to the conclusion that the allocation was
> for the sharing of the `nextState` variable. Inlining it brings it down to
> 20us and 22kB per iteration.
>
> https://github.com/DaveCTurner/json-validator/commit/ec994ec9226ca7bc2e76f19bef98f42e0b233524
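[The kind of change involved can be illustrated with a contrived pair of functions. These are hypothetical, not taken from json-validator, and with -O2 GHC's worker/wrapper usually unboxes a small Int like this anyway; the point is the STG reading rule that a `let` is the thing to look for.]

```haskell
import Data.Bits (shiftL)

-- Shared form: `next` is one let-bound value used on both branches. If the
-- simplifier cannot keep it unboxed, that `let` appears in the STG dump as
-- a heap allocation per loop iteration.
stepShared :: Int -> Int -> Int
stepShared s w =
  let next = (s `shiftL` 8) + w
  in if next > 4096 then next - 4096 else next

-- Inlined form: the expression is duplicated into each use site, so there
-- is no `let` and nothing to allocate (at the cost of recomputation, which
-- GHC's CSE may or may not undo).
stepInlined :: Int -> Int -> Int
stepInlined s w =
  if (s `shiftL` 8) + w > 4096
    then (s `shiftL` 8) + w - 4096
    else (s `shiftL` 8) + w
```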
>
> Getting closer, but it looks like waiting for 8.2 is a better answer.
> Looking forward to it!
>
> Cheers,
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20170512/1e039627/attachment-0001.html>
>
> ------------------------------
>
> Message: 12
> Date: Fri, 12 May 2017 10:27:40 +0200
> From: Arjen <arjenvanweelden at gmail.com>
> To: David Turner <dct25-561bs at mythic-beasts.com>, Ben Gamari
> <ben at smart-cactus.org>, Haskell Cafe <haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> allocations
> Message-ID: <1494577660.20209.1.camel at gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> On Fri, 2017-05-12 at 09:10 +0100, David Turner wrote:
>> Morning all,
>>
>> On 11 May 2017 at 20:32, Ben Gamari <ben at smart-cactus.org> wrote:
>> > Something that I should have mentioned earlier is that STG has the
>> > nice
>> > property that all allocation is syntactically obvious: allocated
>> > closures manifest as `let`s. This makes it fairly easy to pick out
>> > possible allocation sites, even in large dumps.
>>
>>
>> Ah, that's very useful to know!
>>
>> Armed with that knowledge, I came to the conclusion that the
>> allocation was for the sharing of the `nextState` variable. Inlining
>> it brings it down to 20us and 22kB per iteration.
>>
>> https://github.com/DaveCTurner/json-validator/commit/ec994ec9226ca7bc
>> 2e76f19bef98f42e0b233524
>>
>> Getting closer, but it looks like waiting for 8.2 is a better answer.
>> Looking forward to it!
>>
>> Cheers,
>>
>
> Maybe this is a silly question, and please let me know why if so, but:
>
> Has anyone thought about parallelizing it for multiple messages in
> order to "produce garbage faster"? While reducing allocation will make
> the single validations faster, doing multiple ones might improve the
> throughput per GC ratio. This assumes that the amount of live data in
> the heap is small, making GC sort of constant time, and having multiple
> cores available.
>
> I wonder whether a few strategically placed par's and pseq's might
> allow you to scale horizontally without requiring invasive changes to
> the program's structure. Apologies for not trying to do this myself
> first ;-).
>
> kind regards, Arjen
>
>
> ------------------------------
>
> Message: 13
> Date: Fri, 12 May 2017 10:48:01 +0100
> From: David Turner <dct25-561bs at mythic-beasts.com>
> To: Arjen <arjenvanweelden at gmail.com>, Haskell Cafe
> <haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> allocations
> Message-ID:
> <CAErCyQ9A+f4bFR7PGrWcFDHJVik__3pQNHac4GXZOt6gW8WLqQ at mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
>
> On 12 May 2017 at 09:27, Arjen <arjenvanweelden at gmail.com> wrote:
>
>>
>> Maybe this is a silly question, and please let me know why if so, but:
>>
>> Has anyone thought about parallelizing it for multiple messages in
>> order to "produce garbage faster"? While reducing allocation will make
>> the single validations faster, doing multiple ones might improve the
>> throughput per GC ratio. This assumes that the amount of live data in
>> the heap is small, making GC sort of constant time, and having multiple
>> cores available.
>>
>
> Not a silly question at all. Adding the following incantation:
>
> `using` parListChunk 100 rseq
>
> does quite happily spread things across all 4 cores on my development
> machine, and it's certainly a bit faster. To give some stats, it processes
> ~24 events between GCs rather than ~6, and collects ~2MB rather than
> ~500kB. The throughput becomes a lot less consistent, however, at least
> partly due to some bigger GC pauses along the way. As Ben's statistics
> showed, our allocation rate on one thread is around 4TBps, which is already
> stressing the GC out a bit, and parallelising it doesn't make that problem
> any easier.
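For reference, the incantation above slots in roughly like this. A minimal sketch, with `isValid` standing in for the real validator: the list is evaluated in parallel chunks of 100 elements, one spark per chunk.

```haskell
import Control.Parallel.Strategies (parListChunk, rseq, using)

-- Stand-in for the real validator.
isValid :: Int -> Bool
isValid = even

main :: IO ()
main = do
  -- Evaluate the result list in parallel, 100 elements per spark.
  let results = map isValid [1 .. 1000] `using` parListChunk 100 rseq
  print (length (filter id results))
```

Compile with `-threaded` and run with `+RTS -N` to spread the sparks across cores.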
>
> I know in the OP I said "we have a stream" (accidental MLK misquote) but in
> fact there are a bunch of parallel streams whose number exceeds the number
> of available cores, so we don't anticipate any enormous benefit from
> spreading the processing of any one stream across multiple cores:
> single-threaded performance is what we think we should be concentrating on.
>
> Cheers,
>
> David
>
> ------------------------------
>
> Message: 14
> Date: Fri, 12 May 2017 11:52:04 +0200
> From: Arjen <arjenvanweelden at gmail.com>
> To: David Turner <dct25-561bs at mythic-beasts.com>, Ben Gamari
> <ben at smart-cactus.org>, Haskell Cafe <haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> allocations
> Message-ID: <1494582724.20209.3.camel at gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> On Fri, 2017-05-12 at 10:27 +0200, Arjen wrote:
>> On Fri, 2017-05-12 at 09:10 +0100, David Turner wrote:
>> > Morning all,
>> >
>> > On 11 May 2017 at 20:32, Ben Gamari <ben at smart-cactus.org> wrote:
>> > > Something that I should have mentioned earlier is that STG has the
>> > > nice property that all allocation is syntactically obvious:
>> > > allocated closures manifest as `let`s. This makes it fairly easy to
>> > > pick out possible allocation sites, even in large dumps.
>> >
>> >
>> > Ah, that's very useful to know!
>> >
>> > Armed with that knowledge, I came to the conclusion that the
>> > allocation was for the sharing of the `nextState` variable. Inlining
>> > it brings it down to 20us and 22kB per iteration.
>> >
>> > https://github.com/DaveCTurner/json-validator/commit/ec994ec9226ca7bc2e76f19bef98f42e0b233524
>> >
>> > Getting closer, but it looks like waiting for 8.2 is a better answer.
>> > Looking forward to it!
>> >
>> > Cheers,
>> >
>>
>> Maybe this is a silly question, and please let me know why if so,
>> but:
>>
>> Has anyone thought about parallelizing it for multiple messages in
>> order to "produce garbage faster"? While reducing allocation will make
>> the single validations faster, doing multiple ones might improve the
>> throughput per GC ratio. This assumes that the amount of live data in
>> the heap is small, making GC sort of constant time, and having multiple
>> cores available.
>>
>> I wonder whether a few strategically placed par's and pseq's might
>> allow you to scale horizontally without requiring invasive changes to
>> the program's structure. Apologies for not trying to do this myself
>> first ;-).
>>
>
> Unfortunately, this does not seem to help. I tried it, and while I did
> see 3.5 and 3.95 CPUs used with parList and parListChunk 8, the
> timings show it to be slower than sequential evaluation of a list of
> testEvent. Thank you for the problem to try this on :-)
>
> Please let me know if I made a mistake in the testing code below.
>
> kind regards, Arjen
>
> ---8<---
>
> benchmarking json-validator/Automaton/testEvent
> time 26.57 μs (26.07 μs .. 27.12 μs)
> 0.987 R² (0.973 R² .. 0.995 R²)
> mean 31.01 μs (28.17 μs .. 44.49 μs)
> std dev 15.65 μs (5.258 μs .. 35.77 μs)
> variance introduced by outliers: 99% (severely inflated)
>
> benchmarking json-validator/Automaton-List/testEventList
> time 65.11 ms (61.54 ms .. 69.54 ms)
> 0.982 R² (0.951 R² .. 0.996 R²)
> mean 59.18 ms (56.28 ms .. 63.22 ms)
> std dev 5.801 ms (4.344 ms .. 8.238 ms)
> variance introduced by outliers: 32% (moderately inflated)
>
> benchmarking json-validator/Automaton-ParList/testEventList
> time 243.9 ms (214.9 ms .. 270.6 ms)
> 0.996 R² (0.986 R² .. 1.000 R²)
> mean 253.7 ms (243.6 ms .. 260.8 ms)
> std dev 10.32 ms (6.781 ms .. 13.19 ms)
> variance introduced by outliers: 16% (moderately inflated)
>
> benchmarking json-validator/Automaton-ParListChunk/testEventList
> time 211.4 ms (193.1 ms .. 232.3 ms)
> 0.997 R² (0.990 R² .. 1.000 R²)
> mean 200.3 ms (193.0 ms .. 206.6 ms)
> std dev 9.256 ms (7.106 ms .. 10.21 ms)
> variance introduced by outliers: 14% (moderately inflated)
>
> 19,225,632,224 bytes allocated in the heap
> 109,968,376 bytes copied during GC
> 2,062,736 bytes maximum residency (20 sample(s))
> 2,250,352 bytes maximum slop
>             10 MB total memory in use (0 MB lost due to fragmentation)
>
>                                    Tot time (elapsed)  Avg pause  Max pause
>   Gen  0     28722 colls, 28722 par   17.829s   1.591s    0.0001s    0.0160s
>   Gen  1        20 colls,    19 par    0.255s   0.054s    0.0027s    0.0058s
>
> Parallel GC work balance: 7.41% (serial 0%, perfect 100%)
>
> TASKS: 13 (1 bound, 12 peak workers (12 total), using -N4)
>
> SPARKS: 25625 (25570 converted, 0 overflowed, 0 dud, 20 GC'd, 35 fizzled)
>
> INIT time 0.001s ( 0.002s elapsed)
> MUT time 44.403s ( 22.013s elapsed)
> GC time 18.084s ( 1.645s elapsed)
> EXIT time 0.001s ( 0.001s elapsed)
> Total time 62.490s ( 23.660s elapsed)
>
> Alloc rate 432,981,654 bytes per MUT second
>
> Productivity 71.1% of total user, 93.0% of total elapsed
>
> gc_alloc_block_sync: 25370
> whitehole_spin: 0
> gen[0].sync: 405
> gen[1].sync: 22
>
> ---8<---
>
> {-# LANGUAGE OverloadedStrings #-}
>
> module Main where
>
> import Criterion.Main
> import Data.Aeson
> import qualified Data.ByteString.Lazy as BL
> import qualified Data.Text.Encoding as T
> import qualified Data.Text.IO as T
> import qualified Data.Text as T (append, pack)
> import qualified Automaton
> import Control.Parallel.Strategies
>
> testEvent :: Int -> BL.ByteString
> testEvent i = BL.fromStrict $ T.encodeUtf8 (T.append
>     "{\"stream\":\"actuals-stream\",\"submitter\":{\"type\":\"other\",\
>     \\"description\":\"redactedredac\"},\"driverActivities\":\
>     \[{\"driverActivity\":{\"journey\":{\"headcode\":\"1A01\",\
>     \\"description\":\"redactedredactedredactedredactedredactedredacte\"},\
>     \\"activity\":[{\"arrivalTime\":null,\"sequence\":1,\
>     \\"tiploc\":\"REDACTE\",\"stop\":true,\
>     \\"departureTime\":\"2016-06-09T18:22:28.000000000000Z\"},\
>     \{\"arrivalTime\":\"2016-06-09T18:24:43.000000000000Z\",\
>     \\"sequence\":2,\"tiploc\":\"REDACTE\",\"stop\":true,\
>     \\"departureTime\":\"2016-06-09T18:25:51.000000000000Z\"},\
>     \{\"arrivalTime\":\"2016-06-09T18:26:58.000000000000Z\",\
>     \\"sequence\":3,\"tiploc\":\"REDACT\",\"stop\":true,\
>     \\"departureTime\":\"2016-06-09T18:28:08.000000000000Z\"},\
>     \{\"arrivalTime\":\"2016-06-09T18:29:57.000000000000Z\",\
>     \\"sequence\":4,\"tiploc\":\"REDACTE\",\"stop\":true,\
>     \\"departureTime\":null}]},\
>     \\"activityUserId\":\"521be60a-02f2-4892-b468-c17d9c1c4fcf\"}],\
>     \\"submissionTime\":\"2016-06-09T18:36:45.831486000000Z\",\
>     \\"type\":\"driverActivityLogged" (T.append (T.pack (show i)) "\"}"))
>
> data AnyJSON = AnyJSON
> deriving (Show, Eq)
>
> instance FromJSON AnyJSON where
> parseJSON _ = pure AnyJSON
>
> main :: IO ()
> main = print (testEventList [0]) >> defaultMain
>   [ bgroup "json-validator/Automaton"
>       [ bench "testEvent" $ whnf Automaton.isValidJson (testEvent 0) ]
>   , bgroup "json-validator/Automaton-List"
>       [ bench "testEventList" $
>           whnf (\es -> and (testEventList es `using` rseq)) [1..1000] ]
>   , bgroup "json-validator/Automaton-ParList"
>       [ bench "testEventList" $
>           whnf (\es -> and (testEventList es `using` parList rseq)) [1..1000] ]
>   , bgroup "json-validator/Automaton-ParListChunk"
>       [ bench "testEventList" $
>           whnf (\es -> and (testEventList es `using` parListChunk 8 rseq)) [1..1000] ]
>   ]
>   where
>     testEventList = map (Automaton.isValidJson . testEvent)
>
>
>
> ------------------------------
>
> Message: 15
> Date: Fri, 12 May 2017 12:00:02 +0200
> From: Arjen <arjenvanweelden at gmail.com>
> To: David Turner <dct25-561bs at mythic-beasts.com>, Haskell Cafe
> <haskell-cafe at haskell.org>
> Subject: Re: [Haskell-cafe] Fast JSON validation - reducing
> allocations
> Message-ID: <1494583202.20209.5.camel at gmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> On Fri, 2017-05-12 at 10:48 +0100, David Turner wrote:
>> On 12 May 2017 at 09:27, Arjen <arjenvanweelden at gmail.com> wrote:
>> > Maybe this is a silly question, and please let me know why if so, but:
>> >
>> > Has anyone thought about parallelizing it for multiple messages in
>> > order to "produce garbage faster"? While reducing allocation will make
>> > the single validations faster, doing multiple ones might improve the
>> > throughput per GC ratio. This assumes that the amount of live data in
>> > the heap is small, making GC sort of constant time, and having multiple
>> > cores available.
>>
>> Not a silly question at all. Adding the following incantation:
>>
>> `using` parListChunk 100 rseq
>>
>> does quite happily spread things across all 4 cores on my development
>> machine, and it's certainly a bit faster. To give some stats, it
>> processes ~24 events between GCs rather than ~6, and collects ~2MB
>> rather than ~500kB. The throughput becomes a lot less consistent,
>> however, at least partly due to some bigger GC pauses along the way.
>> As Ben's statistics showed, our allocation rate on one thread is
>> around 4TBps, which is already stressing the GC out a bit, and
>> parallelising it doesn't make that problem any easier.
>>
>> I know in the OP I said "we have a stream" (accidental MLK misquote)
>> but in fact there are a bunch of parallel streams whose number
>> exceeds the number of available cores, so we don't anticipate any
>> enormous benefit from spreading the processing of any one stream
>> across multiple cores: single-threaded performance is what we think
>> we should be concentrating on.
>>
>> Cheers,
>>
>> David
>>
> Happy to read that you had more success with parListChunk than I did.
> I expect(ed) sparks to be cheap, so this might help if the various
> incoming streams are very unbalanced in length and/or message sizes.
> I agree that multi-processing a single stream does not make much sense
> if you already have multiple concurrent streams.
>
> Nice to see that adding parallelism in Haskell (GHC) is that easy
> (in this case), and with a very good speed-up factor!
>
> kind regards, Arjen
>
>
> ------------------------------
>
> Subject: Digest Footer
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>
>
> ------------------------------
>
> End of Haskell-Cafe Digest, Vol 165, Issue 19
> *********************************************
>
More information about the Haskell-Cafe mailing list