[Haskell-cafe] data.binary get reading beyond end of input bytestring?

Thu Jul 29 02:53:11 EDT 2010

On 28 July 2010 23:32, Gregory Collins <greg at gregorycollins.net> wrote:
> Conrad Parker <conrad at metadecks.org> writes:
>
>> Hi,
>>
>> I am reading data from a file as strict bytestrings and processing
>> them in an iteratee. As the parsing code uses Data.Binary, the
>> strict bytestrings are then converted to lazy bytestrings (using
>> fromWrap which Gregory Collins posted here in January:
>>
>> -- | wrapped bytestring -> lazy bytestring
>> fromWrap :: I.WrappedByteString Word8 -> L.ByteString
>> fromWrap = L.fromChunks . (:[]) . I.unWrap
>
> This just makes a 1-chunk lazy bytestring:
>
>    (L.fromChunks . (:[])) :: S.ByteString -> L.ByteString
>
>
>> ). The parsing is then done with the library function
>> Data.Binary.Get.runGetState:
>>
>> -- | Run the Get monad applies a 'get'-based parser on the input
>> -- ByteString. Additional to the result of get it returns the number of
>> -- consumed bytes and the rest of the input.
>> runGetState :: Get a -> L.ByteString -> Int64 -> (a, L.ByteString, Int64)
>>
>> The issue I am seeing is that runGetState consumes more bytes than the
>> length of the input bytestring, while reporting an
>> apparently successful get (ie. it does not call error/fail). I was
>> able to work around this by checking if the bytes consumed > input
>> length, and if so to ignore the result of get and simply prepend the
>> input bytestring to the next chunk in the continuation.
>
> Something smells fishy here. I have a hard time believing that binary is
> reading more input than is available? Could you post more code please?

The issue seems to just be the return value for "bytes consumed" from
getLazyByteString. Here's a small example.

conrad at hunter:~/src/haskell/binary-overrun$ cat overrun.hs
{-# LANGUAGE OverloadedStrings #-}

import Data.Binary
import Data.Binary.Get
import qualified Data.ByteString.Lazy.Char8 as C

data TenChars = TenChars C.ByteString deriving (Show)

instance Binary TenChars where
    get = getLazyByteString 10 >>= return . TenChars
    put = undefined

consume bs = do
    let (ret, rem, len) = runGetState (get :: Get TenChars) bs 0
    putStrLn $ "Input: " ++ show bs ++ ", length " ++ (show $ C.length bs)
    putStrLn $ "    consumed " ++ (show len) ++ " bytes without error."
    putStrLn $ "    Output: " ++ show ret
    putStrLn $ "    Remain: " ++ show rem

main = do
    consume "1234567890ABCDE"
    consume "1234567890"
    consume "12345"
conrad at hunter:~/src/haskell/binary-overrun$ ./overrun
Input: Chunk "1234567890ABCDE" Empty, length 15
    consumed 10 bytes without error.
    Output: TenChars (Chunk "1234567890" Empty)
    Remain: Chunk "ABCDE" Empty
Input: Chunk "1234567890" Empty, length 10
    consumed 10 bytes without error.
    Output: TenChars (Chunk "1234567890" Empty)
    Remain: Empty
Input: Chunk "12345" Empty, length 5
    consumed 10 bytes without error.
    Output: TenChars (Chunk "12345" Empty)
    Remain: Empty

Here, the third example claims to have consumed 10 bytes out of the
available 5, and does not fail. The issue is that this return value
cannot be used for maintaining offsets. It is documented that it will
not fail, but the returned len value seems to be incorrect.

I've now added a check that fails if the returned bytestring is
shorter than required.

>> However I am curious as to why this apparent lack of bounds checking
>> happens. My guess is that Get does not check the length of the input
>> bytestring, perhaps to avoid forcing lazy bytestring inputs; does that
>> make sense?
>>
>> Would a better long-term solution be to use a strict-bytestring binary
>> parser (like cereal)? So far I've avoided that as there is
>> not yet a corresponding ieee754 parser.
>
> If you're using iteratees you could try attoparsec + attoparsec-iteratee
> which would be a more natural way to bolt parsers together. The
> attoparsec-iteratee package exports:
>
>    parserToIteratee :: (Monad m) =>
>                        Parser a
>                     -> IterateeG WrappedByteString Word8 m a
>
> Attoparsec is an incremental parser so this technique allows you to
> parse a stream in constant space (i.e. without necessarily having to
> retain all of the input). It also hides the details of the annoying
> buffering/bytestring twiddling you would be forced to do otherwise.

thanks for the pointer :)

Conrad.