[Haskell-cafe] implementing python-style dictionary in Haskell

Tue Nov 18 07:01:56 EST 2008

2008/11/18 kenny lu <haskellmail at gmail.com>:
> Dear all,
>
> I am trying to implement the python-style dictionary in Haskell.
>
> Python dictionary is a data structure that maps one key to one value.
> For instance, a python dictionary
> d = {'a':1, 'b':2 }
> maps key 'a' to 1, 'b' to 2.
> Python dictionary allows for update. e.g. the statement
> d['a'] = 3
> changes the value pointed by 'a' from 1 to 3.
>
> Internally, python dictionary is implemented using hash table.
>
> My first attempt is to use Data.HashTable. However it was immediately
> abandoned, as I realize the memory usage is unreasonably huge.
>
> ...
>
> I tested all three implementations by building a dictionary of size 1000000.
> The result shows that the Map and the Trie approaches handle collision well,
> but
> the IntMap approach does not.
>
>
> Here is a comparison of memory usage
>
> Map     : 345 MB
> IntMap : 146 MB
> Trie     : 282 MB
> Python : 94 MB
>
> Here is a comparison of execution time (on an intel dual core 2.0G)
>
> Map: 26 sec
> IntMap: 9 sec
> Trie: 12 sec
> Python: 2.24 sec
>
>
> The above number shows that my implementations of python style dictionary
> are space/time in-efficient as compared to python.
>
> Can some one point out what's wrong with my implementations?
>

First of all, you use Strings. That's a very bad thing when you care
about memory restrictions. Fire up ghci type something like this:
> let aas = replicate (1024*1024*10) 'a'
> -- 22 Mb memory usage
> length aas
> 10485760
> -- 270 Mb memory usage
10 Mb string caused as much as 250 Mb increase in ghci's memory consumption.

My guess? Use hashtables with ByteStrings.
I rewrote part of your code. Results are quite promising.

Haskell:
121 Mb total memory in use

 INIT  time    0.02s  (  0.00s elapsed)
 MUT   time    0.84s  (  1.00s elapsed)
 GC    time    1.97s  (  2.02s elapsed)
 EXIT  time    0.00s  (  0.00s elapsed)
 Total time    2.83s  (  3.02s elapsed)

 %GC time      69.6%  (66.8% elapsed)
Python:
$ time python dict.py
256

real    0m2.278s
user    0m2.233s
sys     0m0.078s

memory: 101 Mb (as reported by Windows' Task Manager).

The code:

--- cut ---

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Int
import Data.Bits

(...)

parse_a_line_BS :: BS.ByteString -> (BS.ByteString,Int)
parse_a_line_BS line = case BS.words line of
		      [key,val] -> (key,(read . BS.unpack) val)
		      _ -> error " parse error.  "

main :: IO ()
main = do
  dict <- HT.new (==) hashByteString
  indata <- (map parse_a_line_BS `fmap` BS.lines `fmap` BS.readFile "in.txt")
  mapM_ (\ (k,v) -> HT.insert dict k v) indata
  HT.lookup dict (BS.pack "key256") >>= \v -> case v of
                                                Just vv -> putStrLn (show vv)
                                                Nothing -> putStrLn
("Not found")

-- derived from Data.HashTable.hashString
hashByteString :: BS.ByteString -> Int32
hashByteString = BS.foldl' f golden
   where f m c = fromIntegral (ord c) * magic + hashInt32 m
         magic = 0xdeadbeef

hashInt32 :: Int32 -> Int32
hashInt32 x = mulHi x golden + x

mulHi :: Int32 -> Int32 -> Int32
mulHi a b = fromIntegral (r `shiftR` 32)
   where r :: Int64
         r = fromIntegral a * fromIntegral b

golden :: Int32
golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32

--- cut ---

I had to rewrite hashString to work for ByteStrings - basically it's
just using different foldl'.

All best

Christopher Skrzętnicki