[Haskell-cafe] data.binary get reading beyond end of input bytestring?

Max Cantor mxcantor at gmail.com
Wed Jul 28 23:38:50 EDT 2010


I have a similar issue, I think. The problem with attoparsec is that it only covers the unmarshalling side: writing data to disk still requires manually marshalling values into ByteStrings. Data.Binary with Data.Derive provides a clean, proven (decode . encode == id) way of doing this.
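
To make the round-trip point concrete, here is a minimal sketch of what I mean. The Header record and its fields are made up for illustration, and the instance is written by hand here; a derived instance should behave the same way.

```haskell
import Data.Binary (Binary (..), decode, encode)
import Data.Word (Word16, Word32)

-- Hypothetical record, used only for illustration.
data Header = Header
  { version       :: Word16
  , payloadLength :: Word32
  } deriving (Eq, Show)

-- Hand-written instance: 'put' and 'get' must mirror each other field by field.
instance Binary Header where
  put (Header v n) = put v >> put n
  get = Header <$> get <*> get

-- Round-trip property: decoding an encoded value gives the value back.
roundTrips :: Header -> Bool
roundTrips h = decode (encode h) == h

main :: IO ()
main = print (roundTrips (Header 1 42))
```

The same instance covers both directions, which is exactly what I am missing on the attoparsec side.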

If there's a way to accomplish this with attoparsec, I'd love to know.

Max

On Jul 28, 2010, at 10:32 PM, Gregory Collins wrote:

> Conrad Parker <conrad at metadecks.org> writes:
> 
>> Hi,
>> 
>> I am reading data from a file as strict bytestrings and processing
>> them in an iteratee. As the parsing code uses Data.Binary, the
>> strict bytestrings are then converted to lazy bytestrings (using
>> fromWrap which Gregory Collins posted here in January:
>> 
>> -- | wrapped bytestring -> lazy bytestring
>> fromWrap :: I.WrappedByteString Word8 -> L.ByteString
>> fromWrap = L.fromChunks . (:[]) . I.unWrap
> 
> This just makes a 1-chunk lazy bytestring:
> 
>    (L.fromChunks . (:[])) :: S.ByteString -> L.ByteString
> 
> 
>> ). The parsing is then done with the library function
>> Data.Binary.Get.runGetState:
>> 
>> -- | Run the Get monad, applying a 'get'-based parser to the input
>> -- ByteString. In addition to the result of get, it returns the number of
>> -- consumed bytes and the rest of the input.
>> runGetState :: Get a -> L.ByteString -> Int64 -> (a, L.ByteString, Int64)
>> 
>> The issue I am seeing is that runGetState consumes more bytes than the
>> length of the input bytestring, while reporting an apparently
>> successful get (i.e. it does not call error/fail). I was able to work
>> around this by checking whether bytes consumed > input length and, if
>> so, ignoring the result of get and simply prepending the input
>> bytestring to the next chunk in the continuation.
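
As I read it, the workaround amounts to something like the sketch below. The name step and the Either encoding are my own invention, not Conrad's code; it only assumes the runGetState signature quoted above.

```haskell
import qualified Data.ByteString.Lazy as L
import Data.Binary.Get (Get, getWord8, runGetState)

-- Hypothetical helper: run a parser over one chunk.
-- Left  = the parser claims to have consumed more than the chunk holds, so
--         hand the whole chunk back to be prepended to the next one.
-- Right = the parsed value and the unconsumed remainder.
step :: Get a -> L.ByteString -> Either L.ByteString (a, L.ByteString)
step g input
  | consumed > L.length input = Left input
  | otherwise                 = Right (x, rest)
  where
    -- The parsed value x is never forced in the Left case.
    (x, rest, consumed) = runGetState g input 0

main :: IO ()
main = print (step getWord8 (L.pack [7, 8]))
```

On a Left, the caller would concatenate the returned chunk with the next one and try again.
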
> 
> Something smells fishy here. I have a hard time believing that binary is
> reading more input than is available. Could you post more code please?
> 
> 
>> However I am curious as to why this apparent lack of bounds checking
>> happens. My guess is that Get does not check the length of the input
>> bytestring, perhaps to avoid forcing lazy bytestring inputs; does that
>> make sense?
>> 
>> Would a better long-term solution be to use a strict-bytestring binary
>> parser (like cereal)? So far I've avoided that as there is
>> not yet a corresponding ieee754 parser.
> 
> If you're using iteratees you could try attoparsec + attoparsec-iteratee
> which would be a more natural way to bolt parsers together. The
> attoparsec-iteratee package exports:
> 
>    parserToIteratee :: (Monad m) =>
>                        Parser a
>                     -> IterateeG WrappedByteString Word8 m a
> 
> Attoparsec is an incremental parser so this technique allows you to
> parse a stream in constant space (i.e. without necessarily having to
> retain all of the input). It also hides the details of the annoying
> buffering/bytestring twiddling you would be forced to do otherwise.
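
For anyone following along, the incremental behaviour can be seen without the iteratee wrapper, using attoparsec's own parse and feed. This is only a sketch: the record format (one length byte, then that many payload bytes) and the feedChunks helper are made up for illustration, and the module name assumes a version of attoparsec that exposes Data.Attoparsec.ByteString.

```haskell
import qualified Data.Attoparsec.ByteString as A
import qualified Data.ByteString as S

-- Hypothetical format: one length byte followed by that many payload bytes.
record :: A.Parser S.ByteString
record = do
  n <- A.anyWord8
  A.take (fromIntegral n)

-- Feed strict chunks one at a time while the parser asks for more input.
-- Stops as soon as the parser is done or has failed; any chunks not yet
-- fed at that point are simply dropped by this sketch.
feedChunks :: A.Result a -> [S.ByteString] -> A.Result a
feedChunks r@(A.Partial _) (c : cs) = feedChunks (A.feed r c) cs
feedChunks r _ = r

main :: IO ()
main =
  case feedChunks (A.parse record (S.pack [3, 10])) [S.pack [20], S.pack [30, 99]] of
    A.Done rest payload -> print (S.unpack payload, S.unpack rest)
    A.Fail _ _ msg      -> putStrLn ("parse failed: " ++ msg)
    A.Partial _         -> putStrLn "need more input"
```

The parser never sees more than the chunks it has been fed, so there is no question of it reading past the end of the input.
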
> 
> Cheers,
> G
> -- 
> Gregory Collins <greg at gregorycollins.net>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe


