Pickling a finite map (Binary + zlib) [was: [Haskell-cafe] Data.Binary poor read performance]
Don Stewart
dons at galois.com
Mon Feb 23 20:03:14 EST 2009
wren:
> Neil Mitchell wrote:
>> 2) The storage for String seems to be raw strings, which is nice.
>> Would I get a substantial speedup by moving to bytestrings instead of
>> strings? If I hashed the strings and stored common ones in a hash
>> table is it likely to be a big win?
>
> Bytestrings should help. The big wins in this application are likely to
> be cache issues, though the improved memory/GC overhead is nice too.
>
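On the hashing question: the idea is interning, i.e. mapping each distinct
string to a small Int id, so repeated strings are stored once and serialize
as Ints thereafter. A minimal sketch (the intern helper is hypothetical, not
from any library); whether it wins depends on how much repetition your data
actually has:

import qualified Data.ByteString.Char8 as B
import qualified Data.Map as M

-- assign each distinct string a small Int id, first come first served;
-- returns the shared table plus the id stream
intern :: [B.ByteString] -> (M.Map B.ByteString Int, [Int])
intern = go M.empty []
  where
    go tbl acc []     = (tbl, reverse acc)
    go tbl acc (s:ss) = case M.lookup s tbl of
        Just i  -> go tbl (i:acc) ss
        Nothing -> go (M.insert s (M.size tbl) tbl) (M.size tbl : acc) ss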
Here's a quick demo using Data.Binary directly.
First, let's read in the dictionary file and build a big, worst-case finite
map of words to their occurrence count (always 1).
import Data.Binary
import Data.List
import qualified Data.ByteString.Char8 as B
import System.Environment
import qualified Data.Map as M

main = do
    [f] <- getArgs
    s   <- B.readFile f
    -- one entry per distinct word, mapped to its count (always 1 here)
    let m = M.fromList [ (head n, length n) | n <- group . B.lines $ s ]
    encodeFile "dict" m
    print "done"
So that writes a "dict" file containing a binary-encoded Map ByteString Int.
Everything is compiled with ghc -O2 --make.
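(For reference, encodeFile is essentially the pure encoder composed with a
lazy bytestring file write; the signature is real, and the one-liner below,
with L the lazy bytestring module, is the gist of the definition:)

encodeFile :: Binary a => FilePath -> a -> IO ()
encodeFile f = L.writeFile f . encode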
$ time ./A /usr/share/dict/cracklib-small
"done"
./A /usr/share/dict/cracklib-small 0.28s user 0.03s system 94% cpu 0.331 total
That yields the dictionary file on disk:
$ du -hs dict
1.3M dict
Now, let's read the dictionary back in and decode it to a Map:
-- same imports as above
main = do
    [f] <- getArgs
    m <- decodeFile f
    -- force the Map by asking for its size
    print (M.size (m :: M.Map B.ByteString Int))
    print "done"
Easy enough, though we need a bigger stack (+RTS -K20M) to decode the large structure:
$ time ./A dict +RTS -K20M
52848
"done"
./A dict +RTS -K20M 1.51s user 0.06s system 99% cpu 1.582 total
Ok. So 1.5s to decode a 1.3M Map. There may be better ways to build the Map,
since we know the keys will arrive sorted, but the generic Data.Binary
instance can't assume that.
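(If you control the format yourself, you can exploit the ordering: serialize
the ascending association list and rebuild with M.fromDistinctAscList, which
skips the rebalancing comparisons. A sketch with hypothetical names, not what
the stock instance does:)

import Data.Binary
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy as L
import qualified Data.Map as M

-- assumes the pairs are written in strictly ascending key order
encodeSorted :: M.Map B.ByteString Int -> L.ByteString
encodeSorted = encode . M.toAscList

decodeSorted :: L.ByteString -> M.Map B.ByteString Int
decodeSorted = M.fromDistinctAscList . decode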
Since encode and decode are pure functions on lazy bytestrings, we can also
try the trick of compressing/decompressing the dictionary in memory.
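The types compose directly (specialised here; compress and decompress come
from the zlib package):

encode     :: Binary a => a -> L.ByteString
decode     :: Binary a => L.ByteString -> a
compress   :: L.ByteString -> L.ByteString
decompress :: L.ByteString -> L.ByteString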
Compressing the dictionary:
import Data.Binary
import Data.List
import qualified Data.ByteString.Char8 as B
import qualified Data.ByteString.Lazy as L
import System.Environment
import qualified Data.Map as M
import Codec.Compression.GZip

main = do
    [f] <- getArgs
    s <- B.readFile f
    let m = M.fromList [ (head n, length n) | n <- group . B.lines $ s ]
    -- encode to a lazy bytestring, gzip it in memory, write it out
    L.writeFile "dict.gz" . compress . encode $ m
    print "done"
The "compress . encode" pipeline is pretty cool, imo:
$ time ./A /usr/share/dict/cracklib-small
"done"
./A /usr/share/dict/cracklib-small 0.38s user 0.02s system 85% cpu 0.470 total
Ok. So building the compressed dictionary takes only a bit longer than the uncompressed one (zlib is fast).
$ du -hs dict.gz
216K dict.gz
The compressed dictionary is about 6x smaller. Let's load it back in and unpickle it:
-- same imports as above
main = do
    [f] <- getArgs
    -- read the file, decompress lazily, then decode
    m <- (decode . decompress) `fmap` L.readFile f
    print (M.size (m :: M.Map B.ByteString Int))
    print "done"
Also cute. But how does it run:
$ time ./A dict.gz
52848
"done"
./A dict.gz 0.28s user 0.03s system 98% cpu 0.310 total
Interesting. So extracting the Map from a compressed bytestring in memory is
a fair bit faster (roughly 5x here) than loading it directly, uncompressed,
from disk.
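If this becomes a habit, the pattern wraps up into two small helpers
(hypothetical names, same imports as the programs above):

encodeGzFile :: Binary a => FilePath -> a -> IO ()
encodeGzFile f = L.writeFile f . compress . encode

decodeGzFile :: Binary a => FilePath -> IO a
decodeGzFile f = (decode . decompress) `fmap` L.readFile f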
Neil, does that give you some ballpark figures to work with?
-- Don