Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]

Don Stewart dons at galois.com
Tue Feb 24 02:59:16 EST 2009


dons:
> wren:
> > Neil Mitchell wrote:
> >> 2) The storage for String seems to be raw strings, which is nice.
> >> Would I get a substantial speedup by moving to bytestrings instead of
> >> strings? If I hashed the strings and stored common ones in a hash
> >> table is it likely to be a big win?
> >
> > Bytestrings should help. The big wins in this application are likely to  
> > be cache issues, though the improved memory/GC overhead is nice too.
> >
> 
> Here's a quick demo using Data.Binary directly.
> 
> Now, let's read back in and decode it back to a Map 
> 
>     main = do
>         [f] <- getArgs
>         m   <- decodeFile f
>         print (M.size (m :: M.Map B.ByteString Int))
>         print "done"
> 
> Easy enough:
> 
>     $ time ./A dict +RTS -K20M
>     52848
>     "done"
>     ./A dict +RTS -K20M  1.51s user 0.06s system 99% cpu 1.582 total


  
> Compressed dictionary is much smaller. Let's load it back in and unpickle it:
>     
>     main = do
>         [f] <- getArgs
>         m <- (decode . decompress) `fmap` L.readFile f
>         print (M.size (m :: M.Map B.ByteString Int))
>         print "done"
> 
> Also cute. But how does it run:
> 
>     $ time ./A dict.gz
>     52848
>     "done"
>     ./A dict.gz  0.28s user 0.03s system 98% cpu 0.310 total
> 
> Interesting. So extracting the Map from a compressed bytestring in memory is
> a fair bit faster than loading it directly, uncompressed, from disk.
> 


Note that the difference, as Duncan and Bulat pointed out, is a bit
surprising. Perhaps the Map instance is doing something odd? We already
know that bytestring IO is fine.
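
For reference, the Binary instance for Map round-trips via an ordered
association list. Roughly, the shape is something like this (a sketch from
memory, not the package's exact source, and not meant to be compiled
alongside binary's own instance):

    -- Sketch only: the binary package already defines this instance.
    -- Details may differ (e.g. fromAscList vs fromDistinctAscList), but the
    -- idea is: write the map out as a sorted [(k,v)], rebuild it on the way in.
    instance (Ord k, Binary k, Binary v) => Binary (M.Map k v) where
        put = put . M.toAscList
        get = M.fromDistinctAscList `fmap` get

If that's really all it does, decoding a Map should cost little more than
decoding the list plus an O(n) rebuild, which makes the gap above all the
more puzzling.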

Just serialising straight lists of pairs,

    import Data.Binary
    import Data.List
    import qualified Data.ByteString.Char8 as B
    import qualified Data.ByteString.Lazy  as L
    import System.Environment
    import qualified Data.Map as M
    import Codec.Compression.GZip

    main = do
        [f] <- getArgs
        s   <- B.readFile f
        let m = [ (head n, length n) | n <- (group . B.lines $ s) ]
        L.writeFile "dict" . encode $ m
        print "done"

    $ time ./binary /usr/share/dict/cracklib-small
    "done"
    ./binary /usr/share/dict/cracklib-small  0.13s user 0.04s system 99% cpu 0.173 total

    $ du -hs dict 
    1.3M    dict

And reading them back in,

    main = do
        [f] <- getArgs
        m <- decode `fmap` L.readFile f
        print (length (m :: [(B.ByteString,Int)]))
        print "done"

    $ time ./binary dict
    52848
    "done"
    ./binary dict  0.04s user 0.01s system 99% cpu 0.047 total

Is fast. So there's some complication in the Map serialisation. Adding in zlib,
to check,

    main = do
        [f] <- getArgs
        s   <- B.readFile f
        let m = [ (head n, length n) | n <- (group . B.lines $ s) ]
        L.writeFile "dict.gz" . compress . encode $ m
        print "done"

    $ time ./binary /usr/share/dict/cracklib-small 
    "done"
    ./binary /usr/share/dict/cracklib-small  0.25s user 0.03s system 100% cpu 0.277 total

Compression takes longer, as expected, and reading it back in,

    main = do
        [f] <- getArgs
        m <- (decode . decompress) `fmap` L.readFile f
        print (length (m :: [(B.ByteString,Int)]))
        print "done"

    $ time ./binary dict.gz
    52848
    "done"
    ./binary dict.gz  0.03s user 0.01s system 98% cpu 0.040 total

About the same.

Looks like the Map reading/writing via association lists could do with
further work.
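
In the meantime, a possible workaround (a sketch, assuming the keys come out
strictly ascending, which they do here since the dictionary file is sorted
and duplicates were grouped): decode the fast list representation and rebuild
the Map by hand. M.fromDistinctAscList is O(n); if the ordering can't be
trusted, M.fromList works too at O(n log n). Same imports as the programs
above:

    main = do
        [f] <- getArgs
        -- decode the compressed association list, which we know is fast
        xs <- (decode . decompress) `fmap` L.readFile f
        -- rebuild the Map explicitly from the sorted, distinct pairs
        let m = M.fromDistinctAscList (xs :: [(B.ByteString, Int)])
        print (M.size m)
        print "done"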

Anyone want to dig around in the Map instance? (There are also some patches for
an alternative lazy Map serialisation, if people are keen to load maps lazily -- happstack devs?).

-- Don

