[Haskell-cafe] data.binary get reading beyond end of input bytestring?

Wed Jul 28 10:32:16 EDT 2010

Conrad Parker <conrad at metadecks.org> writes:

> Hi,
>
> I am reading data from a file as strict bytestrings and processing
> them in an iteratee. As the parsing code uses Data.Binary, the
> strict bytestrings are then converted to lazy bytestrings (using
> fromWrap which Gregory Collins posted here in January:
>
> -- | wrapped bytestring -> lazy bytestring
> fromWrap :: I.WrappedByteString Word8 -> L.ByteString
> fromWrap = L.fromChunks . (:[]) . I.unWrap

This just makes a 1-chunk lazy bytestring:

    (L.fromChunks . (:[])) :: S.ByteString -> L.ByteString
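
As a minimal sketch of what that means in practice (the helper names `toLazy` and `chunksOf1` are made up for illustration, and the iteratee wrapper is left out), the lazy value is backed by exactly the one strict chunk it was built from, so there are no further bytes behind it for a parser to reach:

```haskell
import qualified Data.ByteString as S
import qualified Data.ByteString.Lazy as L

-- Wrap one strict chunk as a lazy ByteString (hypothetical helper name).
toLazy :: S.ByteString -> L.ByteString
toLazy = L.fromChunks . (:[])

-- The strict chunks backing the lazy value. For a non-empty input this is
-- just the original chunk; fromChunks drops empty chunks, so an empty
-- input yields no chunks at all.
chunksOf1 :: S.ByteString -> [S.ByteString]
chunksOf1 = L.toChunks . toLazy
```

The length of `toLazy s` is therefore the same as the length of `s`.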

> ). The parsing is then done with the library function
> Data.Binary.Get.runGetState:
>
> -- | Run the Get monad applies a 'get'-based parser on the input
> -- ByteString. Additional to the result of get it returns the number of
> -- consumed bytes and the rest of the input.
> runGetState :: Get a -> L.ByteString -> Int64 -> (a, L.ByteString, Int64)
>
> The issue I am seeing is that runGetState consumes more bytes than the
> length of the input bytestring, while reporting an
> apparently successful get (ie. it does not call error/fail). I was
> able to work around this by checking if the bytes consumed > input
> length, and if so to ignore the result of get and simply prepend the
> input bytestring to the next chunk in the continuation.

Something smells fishy here. I have a hard time believing that binary is
reading more input than is available. Could you post more code, please?
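
For comparison, here is a small sketch of the behaviour I would expect when a Get parser asks for more bytes than the input holds. It assumes a newer release of binary that provides `runGetOrFail` (it is not the `runGetState` call quoted above), and `readWord32` is a hypothetical helper name:

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Binary.Get (getWord32be, runGetOrFail)
import Data.Int (Int64)
import Data.Word (Word32)

-- Try to read one big-endian Word32. On success, return the value together
-- with the number of bytes consumed; on failure, return the parser's
-- error message instead of a bogus result.
readWord32 :: L.ByteString -> Either String (Word32, Int64)
readWord32 input =
    case runGetOrFail getWord32be input of
        Left (_, _, msg)       -> Left msg
        Right (_, consumed, x) -> Right (x, consumed)
```

With a four-byte input this should report four bytes consumed; with a two-byte input it should come back as a `Left` rather than claiming to have consumed more than it was given.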

> However I am curious as to why this apparent lack of bounds checking
> happens. My guess is that Get does not check the length of the input
> bytestring, perhaps to avoid forcing lazy bytestring inputs; does that
> make sense?
>
> Would a better long-term solution be to use a strict-bytestring binary
> parser (like cereal)? So far I've avoided that as there is
> not yet a corresponding ieee754 parser.

If you're using iteratees, you could try attoparsec + attoparsec-iteratee,
which would be a more natural way to bolt parsers together. The
attoparsec-iteratee package exports:

    parserToIteratee :: (Monad m) =>
                        Parser a
                     -> IterateeG WrappedByteString Word8 m a

Attoparsec is an incremental parser, so this technique allows you to
parse a stream in constant space (i.e. without necessarily having to
retain all of the input). It also hides the details of the annoying
buffering/bytestring twiddling you would be forced to do otherwise.
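
To illustrate the incremental part, here is a rough sketch using plain attoparsec rather than attoparsec-iteratee. It assumes the `Data.Attoparsec.ByteString` module of current attoparsec releases; `word32be` and `parseChunks` are hypothetical names for this example:

```haskell
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as S
import Data.Bits (shiftL, (.|.))
import Data.Word (Word32)

-- Parse four bytes as a big-endian Word32.
word32be :: A.Parser Word32
word32be = do
    bs <- A.take 4
    return (S.foldl' (\acc b -> (acc `shiftL` 8) .|. fromIntegral b) 0 bs)

-- Feed strict chunks to a parser one at a time. Empty chunks are skipped
-- because attoparsec treats an empty chunk as end of input.
parseChunks :: A.Parser a -> [S.ByteString] -> Maybe a
parseChunks p chunks =
    case filter (not . S.null) chunks of
        []       -> finish (A.parse p S.empty)
        (c : cs) -> finish (foldl A.feed (A.parse p c) cs)
  where
    -- A parser still waiting for input is told that no more is coming.
    finish (A.Done _ x)  = Just x
    finish (A.Partial k) = case k S.empty of
                               A.Done _ x -> Just x
                               _          -> Nothing
    finish _             = Nothing
```

A value split across two chunks, e.g. `parseChunks word32be [S.pack [0,0], S.pack [1,2]]`, is parsed without the caller having to glue the chunks together first.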

Cheers,
G
-- 
Gregory Collins <greg at gregorycollins.net>