[Haskell-cafe] Fast JSON validation - reducing allocations
ben at well-typed.com
Fri May 12 13:44:11 UTC 2017
David Turner <dct25-561bs at mythic-beasts.com> writes:
> On 12 May 2017 at 09:27, Arjen <arjenvanweelden at gmail.com> wrote:
>> Maybe this is a silly question, and please let me know why if so, but:
>> Has anyone thought about parallelizing it for multiple messages in
>> order to "produce garbage faster"? While reducing allocation will make
>> the single validations faster, doing multiple ones might improve the
>> throughput per GC ratio. This assumes that the amount of live data in
>> the heap is small, making GC sort of constant time, and having multiple
>> cores available.
> Not a silly question at all. Adding the following incantation:
> `using` parListChunk 100 rseq
> does quite happily spread things across all 4 cores on my development
> machine, and it's certainly a bit faster. To give some stats, it processes
> ~24 events between GCs rather than ~6, and collects ~2MB rather than
> ~500kB. The throughput becomes a lot less consistent, however, at least
> partly due to some bigger GC pauses along the way. As Ben's statistics
showed, our allocation rate on one thread is around 4GBps, which is already
> stressing the GC out a bit, and parallelising it doesn't make that problem
> any easier.
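For readers following along, the incantation above slots into code roughly like this (a minimal sketch, not David's actual code: the message type, the validate function, and the chunk size of 100 are placeholders; `Control.Parallel.Strategies` comes from the parallel package):

```haskell
import Control.Parallel.Strategies (parListChunk, rseq, using)

-- Stand-in for the real JSON validation: any pure function will do.
validate :: Int -> Bool
validate n = even (n * n + 1)

-- Validate a batch of messages, evaluating the result list in
-- chunks of 100 in parallel across the available capabilities.
validateAll :: [Int] -> [Bool]
validateAll msgs = map validate msgs `using` parListChunk 100 rseq

main :: IO ()
main = print (length (filter id (validateAll [1 .. 1000])))
```

Build with `-threaded` and run with `+RTS -N4` to spread the chunks across four capabilities; the strategy changes evaluation order only, never the results.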
To elaborate on this a bit: in my experience allocation rate becomes
even more important for performance (both latency and throughput) when
introducing parallelism. The reason is that our GC is a stop-the-world
collector and consequently whenever one thread needs to GC, it must stop
and wait for all others to stop as well. This means that your program
will be synchronizing constantly, even if your threads are working on
completely non-intersecting sets of heap objects, greatly limiting your
usable parallelism. You can observe this behavior using the eventlog.
It is possible to tune the GC to trade latency for throughput without
changing your program. Increasing the size of the nursery will mean that
each thread will take longer to fill its nursery and consequently
require less frequent GC. However, this means that each minor GC will
take longer, so you are trading shorter, more frequent pauses for
longer, less frequent ones.
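Concretely, both the eventlog observation and the nursery tuning described above are a matter of RTS flags rather than code changes (the flag values below are illustrative, not a recommendation):

```
ghc -O2 -threaded -eventlog -rtsopts Validate.hs   # build with eventlog support
./Validate +RTS -N4 -A64m -l -RTS                  # 4 capabilities, 64 MB nursery
                                                   # per capability, write eventlog
```

`-A` sets the allocation area (nursery) size per capability, and `-l` writes a `Validate.eventlog` file that can be inspected with ThreadScope to see the GC synchronization points.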
For this reason, if your program is truly embarrassingly parallel, it is
usually preferable performance-wise to run multiple processes
concurrently than to use pure parallelism (a sad truth in a harsh
reality, in my opinion).
Note that there have been efforts to address this issue in the past:
Simon Marlow has a paper examining a GC scheme with thread-local heaps,
allowing threads to perform minor collections while other threads
continue to run. Unfortunately this introduced a significant amount of
complexity into the RTS which the performance numbers just didn't
justify.
The new compact regions feature also bears on this issue, as it gives
users tools for better managing the effect of their long-lived live data
on GC performance. That being said, this wouldn't help in your case as
you have little long-lived live data.
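As a sketch of what compact regions look like in practice (GHC 8.2 or later, via the ghc-compact package; the data below is a stand-in for real long-lived state):

```haskell
import GHC.Compact (compact, getCompact)

-- Copy a structure into a compact region; the GC thereafter treats
-- the whole region as a single object and never traverses its
-- interior during collections.
compactRoundTrip :: [Int] -> IO [Int]
compactRoundTrip xs = do
  region <- compact xs
  pure (getCompact region)

main :: IO ()
main = do
  ys <- compactRoundTrip [1 .. 100000]
  print (length ys)
```

`getCompact` hands back the in-region copy, which can be used as ordinary immutable data; the win is that a large long-lived structure stops contributing to minor and major GC trace times.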
Some form of the linear types proposal may also some day offer help
here as it could in principle allow alternatives to today's GC'd heap.
For instance, in your case you may want to use an arena allocation
strategy: when starting a document allocate an arena which will serve as
your nursery, run your computation, copy your result out to the GC'd
heap at the end, and throw away the arena.
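To make the arena idea concrete, here is a purely hypothetical sketch: none of these names (withArena, Arena, parseInto, copyOut) exist in any library today, and the linear arrow is there only to stop the arena escaping its scope:

```haskell
-- Hypothetical API, for illustration only.
withArena :: (Arena %1 -> Ur a) -> a

parseDocument :: ByteString -> Result
parseDocument bs =
  withArena $ \arena ->
    -- all intermediate allocation lands in the arena...
    let ast = parseInto arena bs
    -- ...and only the final result is copied out to the GC'd heap
    -- before the arena is discarded wholesale
    in Ur (copyOut (validate ast))
```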
Finally, I hope that someone will some day pick up the RTS work again.
This might mean trying another approach to thread-local heaps, allowing
multiple independent heaps per RTS, or even allowing multiple RTSs per
process. It is sad that our parallelism story is limited in the ways
that it is; there are still a number of stones left unturned.