[Haskell-cafe] Yet another binary reader (10x faster than Hackage's binary; )

Tue Mar 19 21:58:07 UTC 2019

I was investigating Haskell’s binary serialization libraries and has found
that lot of them (binary, cereal and protocol-buffers) has pretty similar
design. They capable to consume lazy bytestrings and either produce result,
report error or produce partial result, indicating that more input data
required. Such design leads to quite complex readers, and often ends up
with producing closures in runtime. If you take a close look at binary’s
microbenches (binary in general bit faster than cereal, so I focused on
binary) you will discover, that it triggers GC even during simple tasks,
such as reading a bunch of Word64 and summing results. And it relatively
slow. Binary’s performance limited to hundreds of megabytes per second even
in simple scenarios, but actual memory bandwidth should allow us process
gigabytes per second. So, I decided to try a different approach.

My approach is based on two assumptions so if it does not suit your case it
obliviously will not work for you.

First – when dealing with binary data we always know exactly how many bytes
we want to read. This happens because we either read a fixed size primitive
type or length-prefixed variable size data (possibly not “prefixed”, but
you always know expected message length from somewhere).

Second – chunks of data are continuous. I.e. we working with strict
bytestrings more often. Even when you processing chunked stream of data
with unknown number of messages (such as Mesos RecordIO Format) you should
always be able to allocate continuous memory buffer for a single entity you
want to deserialize.

If it’s ok for you lets go on. First thing I’ve created was so called
static reader. Static – because reader knows how mush bytes it wants to
consume at compile time. Such readers suitable for reading combinations of
primitive values. Here the implementation
https://github.com/PROTEINE-INSAIDERS/lev-tolstoy/blob/master/src/Lev/Reader/Static.hs
Static
reader is somewhat monadic (you can use do-syntax with
RebindableSyntax extension
like here
https://github.com/PROTEINE-INSAIDERS/lev-tolstoy/blob/master/bench/Bench/Lev/Reader/Static.hs
)
but it actually indexed monad and it unfortunately incompatible with
traditional monads. Offsets and sizes are calculates at type level during
the compile time and according to microbences static reader does not
introduced any overhead compared to hand-written reader using the
low-level indexInt64OffAddr# function (
https://github.com/PROTEINE-INSAIDERS/lev-tolstoy/blob/master/bench/Bench/Handwritten.hs
 )

Next thing was a dynamic reader. Dynamic readers are composed from static
readers, they are fully monadic and allow you to read length-prefixed data
structures (you read length with wrapped static reader and then perform
normal monadic binding to supply a reader which reads a value). Here the
implementation
https://github.com/PROTEINE-INSAIDERS/lev-tolstoy/blob/master/src/Lev/Reader/Dynamic.hs

I’ve discovered, that typeclasses are surprisingly fast. So, it’s
completely ok to have something like “(Cursor c) => … Reader” where cursor
contains functions for overflow checking and memory buffer progression. I
was not able to achieve same performance when I passed buffer
callbacks explicitly,
possibly because of inlining of specialization magic. So, I've created
ByteStringCursor (
https://github.com/PROTEINE-INSAIDERS/lev-tolstoy/blob/master/src/Lev/Reader/ByteString.hs
)
which allows to read strict bytestrings. I suppose that cursors introduce
flexibility: you may read strict bytestring or you may define custom cursor
for reading custom memory buffers.

Finally, I’ve written some microbenches for my cases and also ported one
microbench from binary. And my implementation was able to run 10x-15x
faster than binary on strict bytestring. Moreover, it does not trigger GC
during the run at all! You may find the project and all the benches here
https://github.com/PROTEINE-INSAIDERS/lev-tolstoy

So, what's the purpose of my message? It’s quite unlikely that I’ll release
my code on the Hackage. This is because I’m a hobbyist Haskell developer
and doing it all in my free time. But before mr. Putin will close Internet
in Russia I’d like to demonstrate that it’s possible to write very fast
binary deserializer taking in account several assumptions which seems quite
reasonably for me. So, feel free to borrow the ideas, or share the ideas,
if you know how it could be implemented better.

-- 
Sincerely, Stanislav Chernichkin.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20190320/0a52efd3/attachment.html>