[Haskell-cafe] Data.Binary poor read performance

jutaro jnf at arcor.de
Tue Feb 24 17:42:26 EST 2009

wren ng thornton wrote:
> If you have many identical strings then you will save lots by memoizing 
> your strings into Integers, and then serializing that memo table and the 
> integerized version of your data structure. The amount of savings 
> decreases as the number of duplications decrease, though since you don't 
> need the memo table itself you should be able to serialize it in a way 
> that doesn't have much overhead.

I had problems with the size of the allocated heap space after serializing 
and loading data with the binary package. The reason was that 
binary does not support sharing of identical elements and considered a 
restricted solution for strings and certain other data types first, but 
came up with a generic solution in the end.
(I did it just last weekend).

I put the Binary monad in a state transformer with maps for memoization:
type PutShared = St.StateT (Map Object Int, Int) PutM ()
type GetShared = St.StateT (IntMap Object) Bin.Get

In addition to standard get ant put methods:
class (Typeable α, Ord α, Eq α) ⇒ BinaryShared α  where
    put :: α  →  PutShared
    get :: GetShared α
I added putShared and getShared methods with memoization:
    putShared :: (α →  PutShared) →  α →  PutShared
    getShared :: GetShared α →  GetShared α 

For types that I don't want memoization I can either refer to the underlying 
binary monad for primitive types, e.g.:
instance BinaryShared Int where
    put = lift∘Bin.put
    get = lift Bin.get
or stay in the BinaryShared monad for types of which I may memoize
components, e.g.:
instance (BinaryShared a, BinaryShared b) ⇒ BinaryShared (a,b) where
    put (a,b)          = put a ≫ put b
    get                 = liftM2 (,) get get

And for types for which I want memoization, I wrap it with putShared and
getShared ,e.g:
instance BinaryShared a ⇒ BinaryShared [a] where
    put    = putShared (λl →  lift (Bin.put (length l)) ≫ mapM_ put l)
    get    = getShared (do
                n ←  lift (Bin.get :: Bin.Get Int)
                replicateM n get)
This save 1/3 of heap space to my application. I didn't measure time.
Maybe it would be useful to have something like this in the binary module.


PS: And here is the dirty implementation, in the case someone finds it

    putShared :: (α →  PutShared) →  α →  PutShared
    putShared fput v = do
        (dict, unique) ←  St.get
        case (ObjC v) `Map.lookup` dict of
            Just i  →  lift (Bin.putWord8 0 ≫ putWord64be (fromIntegral i))
            Nothing →  do
                St.put (dict,unique + 1)
                lift (Bin.putWord8 1)
                lift (putWord64be (fromIntegral unique))
                fput v
                (dict2, unique2) ←  St.get
                let newDict = Map.insert (ObjC v) unique dict2
                St.put (newDict,unique2)

    getShared :: GetShared α →  GetShared α
    getShared f = do
        dict ←  St.get
        w ←  lift Bin.getWord8
        case w of
            0 →  do
                i   ←  lift (liftM fromIntegral (getWord64be))
                case  IMap.lookup i dict of
                    Just (ObjC obj) →  return (forceJust (cast obj)
                                            "Shared≫getShared: Cast failed")
                    Nothing →  error ◊ "Shared≫getShared : Dont find in Map
" ⊕ show i
            1 →  do
                i   ←  lift (liftM fromIntegral (getWord64be))
                obj ←  f
                dict2 ←  St.get
                St.put (IMap.insert i (ObjC obj) dict2)
                return obj
            _ →  error ◊ "Shared≫getShared : Encoding error"

data Object = ∀ α. (Typeable α, Ord α, Eq α) ⇒ ObjC {unObj :: α}

instance Eq Object where
    (ObjC a) ≡ (ObjC b) = if typeOf a ≠ typeOf b
                                then False
                                else (Just a) ≡ cast b -- can someone
explain to me why this works?

instance Ord Object where
    compare (ObjC a) (ObjC b) = if typeOf a ≠ typeOf b
                                then compare
((unsafePerformIO∘typeRepKey∘typeOf) a)
((unsafePerformIO∘typeRepKey∘typeOf) b)
                                else compare (Just a) (cast b)

encodeSer :: BinaryShared a ⇒ a →  L.ByteString
encodeSer v = runPut (evalStateT (put v) (Map.empty,0))

decodeSer :: BinaryShared α  ⇒ L.ByteString →  α
decodeSer =  runGet (evalStateT get IMap.empty)

View this message in context: http://www.nabble.com/Data.Binary-poor-read-performance-tp22167466p22192337.html
Sent from the Haskell - Haskell-Cafe mailing list archive at Nabble.com.

More information about the Haskell-Cafe mailing list