[Haskell-cafe] implementing python-style dictionary in Haskell

Don Stewart dons at galois.com
Tue Nov 18 10:33:29 EST 2008


Which version of GHC and which version of the Data.ByteString library?
There was an inlining bug related to Data.Map /Data.IntMap performance
fixed between the 6.8.x release and the current bytestring release.

In testing, Data.Map with strict bytestring keys matched the python (C
implemented) dictionary, after I fixed the inlining for word lookups.

You'll need to be using bytestring 0.9.1.x though.

-- Don

haskellmail:
>    Dear all,
> 
>    I am trying to implement the python-style dictionary in Haskell.
> 
>    Python dictionary is a data structure that maps one key to one value.
>    For instance, a python dictionary
>    d = {'a':1, 'b':2 }
>    maps key 'a' to 1, 'b' to 2.
>    Python dictionary allows for update. e.g. the statement
>    d['a'] = 3
>    changes the value pointed by 'a' from 1 to 3.
> 
>    Internally, python dictionary is implemented using hash table.
> 
>    My first attempt is to use Data.HashTable. However it was immediately
>    abandoned, as I realize the memory usage is unreasonably huge.
> 
>    == SECOND ATTEMPT ==
>    My second attempt is to use Data.Map
> 
>    {-# OPTIONS_GHC -fglasgow-exts #-}
>    module Main where
> 
>    import qualified Data.HashTable as HT
>    import qualified Data.IntMap as IM
>    import qualified Data.Map as DM
>    import qualified Data.ByteString.Char8 as S
>    import Data.Char
> 
>    -- the Dict type class
>    class Dict d k v | d -> k v where
>        empty :: d
>        insert :: k -> v -> d -> d
>        lookup :: k -> d -> Maybe v
>        update :: k -> v -> d -> d
> 
>    -- Let's use string as key
>    type Key = String
> 
>    -- insert key-value pairs into a dictionary
>    fromList :: Dict d k a => [(k,a)] -> d
>    fromList l =
>        foldl (\d (key,val) -> insert key val d) empty l
> 
>    instance Dict (DM.Map S.ByteString a) Key a where
>        empty = DM.empty
>        insert key val dm =
>        let packed_key = S.pack key
>        in DM.insert packed_key val dm
>        lookup key dm =
>        let packed_key = S.pack key
>            in DM.lookup packed_key dm
>        update key val dm =
>        let packed_key = S.pack key
>            in DM.update (\x -> Just val) packed_key dm
> 
>    Which kinda works, however since Map is implemented using a balanced tree,
>    therefore,
>    when as the dictionary grows, it takes a long time to insert new key-value
>    pair.
> 
>    == THIRD ATTEMPT ==
>    My third attempt is to use Data.IntMap
> 
>    -- an implementation of Dict using IntMap
>    instance Dict (IM.IntMap a) Key a where
>        empty = IM.empty
>        insert key val im =
>        let int_key = fromIntegral (HT.hashString key)
>        in IM.insert int_key val im
>        lookup key im =
>        let int_key = fromIntegral (HT.hashString key)
>        in IM.lookup int_key im
>        update key val im =
>        let int_key = fromIntegral (HT.hashString key)
>            in IM.update (\x -> Just val) int_key im
> 
>    This implementation is faster than the Map approach, however this
>    implementation
>    can't handle collision well, two keys which are hashed into the same
>    integer will overwrite each other.
> 
>    == FOURTH ATTEMPT ==
> 
>    My fourth implementation is to use Trie. The idea is to split a string (a
>    key) into
>    a list of 4-character chunks. Each chunk can be mapped into a 32-bit
>    integer without collision.
>    We then insert the value with this list of chunks into the Trie.
> 
>    -- an implementation of Dict using Trie
>    instance Dict (Trie a) Key a where
>        empty = emptyTrie
>        insert key val trie =
>        let key_chain = chain key
>        in insertTrie key_chain val trie
>        lookup key trie =
>        let key_chain = chain key
>            in lookupTrie key_chain trie
>        update key val trie =
>        let key_chain = chain key
>            in updateTrie key_chain val trie
> 
>    -- an auxillary function that "splits" string into small pieces,
>    -- 4 characters per piece, 4 chars = 32 bit
>    chain :: Key -> [Key]
>    chain k | length k > 4 = let (k',ks) = splitAt 4 k
>                 in (k':chain ks)
>        | otherwise    = [k]
> 
>    -- a collision-free hash function which turns four chars into Int32
>    safehash :: [Char] -> Int
>    safehash cs | length cs > 4 = error "safehash failed."
>            | otherwise =
>            sum [ (ord c)*(256^i)   | (c,i) <- zip cs [0..3] ]
> 
>    -- a trie datatype
>    data Trie a = Trie [a] (IM.IntMap (Trie a))
> 
>    -- the empty trie
>    emptyTrie = Trie [] (IM.empty)
> 
>    -- insert value into the trie
>    insertTrie :: [String] -> a -> Trie a -> Trie a
>    insertTrie [] i (Trie is maps) = Trie (i:is) maps
>    insertTrie (word:words) i (Trie is maps) =
>        let key = safehash word
>        in case IM.lookup key maps of
>         { Just trie -> let trie' = insertTrie words i trie
>                    maps' = IM.update (\x -> Just trie') key maps
>                in Trie is maps'
>         ; Nothing -> let trie = emptyTrie
>                  trie' = insertTrie words i trie
>                  maps' = IM.insert key trie' maps
>                  in Trie is maps'
>         }
> 
>    -- lookup value from the trie
>    lookupTrie :: [String] -> Trie a -> Maybe a
>    lookupTrie [] (Trie vs _) =
>        case vs of
>          [] -> Nothing
>          (x:_) -> Just x
>    lookupTrie (word:words) (Trie is maps) =
>        let key = safehash word
>        in case IM.lookup key maps of
>           Just trie -> lookupTrie words trie
>           Nothing   -> Nothing
> 
>    -- update the trie with the given value.
>    updateTrie :: [String] -> a -> Trie a -> Trie a
>    -- we only update the first value and leave the rest unchanged.
>    updateTrie [] y (Trie (x:xs) maps) = Trie (y:xs) maps
>    updateTrie (word:words) v  (Trie is maps) =
>        let key = safehash word
>        in case IM.lookup key maps of
>           Just trie -> let trie' = updateTrie words v trie
>                    maps'  = IM.update (\x -> Just trie') key maps
>                in Trie is maps'
>           Nothing   -> Trie is maps
> 
>    == BENCH MARK ==
> 
>    I have a main function which builds a dictionary from a text file.
>    Each line of the file is a key-value pair separated by a space.
> 
>    e.g.
> 
>    key1 1
>    key2 2
>    ...
> 
>    main :: IO ()
>    main = do { content <- readFile "in.txt"
>          ; let -- change this following type annotation
>                -- to change different type of the dictionary
>                -- dict :: DM.Map S.ByteString Int
>                -- dict :: IM.IntMap Int
>                dict :: Trie Int
>                dict = fromList (map parse_a_line  (lines content))
>          ; case Main.lookup "key256" dict of
>            { Just v -> putStrLn (show v)
>            ; Nothing -> putStrLn "Not found"
>            }
>            -- read a line here so that we can pause the program
>                -- and look at the memory usage.
>          ; v <- readLn
>          ; putStrLn v
>          }
>        where  parse_a_line :: String -> (Key,Int)
>               parse_a_line line = case words line of
>                      [key,val] -> (key,read val)
>                      _ -> error " parse error.  "
> 
>    I tested all three implementations by building a dictionary of size
>    1000000.
>    The result shows that the Map and the Trie approaches handle collision
>    well, but
>    the IntMap approach does not.
> 
>    Here is a comparison of memory usage
> 
>    Map     : 345 MB
>    IntMap : 146 MB
>    Trie     : 282 MB
>    Python : 94 MB
> 
>    Here is a comparison of execution time (on an intel dual core 2.0G)
> 
>    Map: 26 sec
>    IntMap: 9 sec
>    Trie: 12 sec
>    Python: 2.24 sec
> 
>    The above number shows that my implementations of python style dictionary 
>    are space/time in-efficient as compared to python.
> 
>    Can some one point out what's wrong with my implementations?
> 
>    I've attached my code in the tgz file.
> 
>    Cheers,
>    Kenny


> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe



More information about the Haskell-Cafe mailing list