[Haskell-cafe] implementing python-style dictionary in Haskell

Don Stewart dons at galois.com
Tue Nov 18 10:44:05 EST 2008


Great. Assuming you're following the advice to use bytestrings, please
install the newest bytestring library version, here,

    http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring

Data.Map or Data.IntMap with bytestrings should be quite efficient.
(or use a trie if more precision is needed)

-- Don

haskellmail:
>    Dear Don,
>    I am using GHC 6.8.1
> 
>    Regards,
>    Kenny
> 
>    On Tue, Nov 18, 2008 at 11:33 PM, Don Stewart <[1]dons at galois.com> wrote:
> 
>      Which version of GHC and which version of the Data.ByteString library?
>      There was an inlining bug related to Data.Map /Data.IntMap performance
>      fixed between the 6.8.x release and the current bytestring release.
> 
>      In testing, Data.Map with strict bytestring keys matched the python (C
>      implemented) dictionary, after I fixed the inlining for word lookups.
> 
>      You'll need to be using bytestring 0.9.1.x though.
> 
>      -- Don
> 
>      haskellmail:
>      >    Dear all,
>      >
>      >    I am trying to implement the python-style dictionary in Haskell.
>      >
>      >    Python dictionary is a data structure that maps one key to one
>      value.
>      >    For instance, a python dictionary
>      >    d = {'a':1, 'b':2 }
>      >    maps key 'a' to 1, 'b' to 2.
>      >    Python dictionary allows for update. e.g. the statement
>      >    d['a'] = 3
>      >    changes the value pointed by 'a' from 1 to 3.
>      >
>      >    Internally, python dictionary is implemented using hash table.
>      >
>      >    My first attempt is to use Data.HashTable. However it was
>      immediately
>      >    abandoned, as I realize the memory usage is unreasonably huge.
>      >
>      >    == SECOND ATTEMPT ==
>      >    My second attempt is to use Data.Map
>      >
>      >    {-# OPTIONS_GHC -fglasgow-exts #-}
>      >    module Main where
>      >
>      >    import qualified Data.HashTable as HT
>      >    import qualified Data.IntMap as IM
>      >    import qualified Data.Map as DM
>      >    import qualified Data.ByteString.Char8 as S
>      >    import Data.Char
>      >
>      >    -- the Dict type class
>      >    class Dict d k v | d -> k v where
>      >        empty :: d
>      >        insert :: k -> v -> d -> d
>      >        lookup :: k -> d -> Maybe v
>      >        update :: k -> v -> d -> d
>      >
>      >    -- Let's use string as key
>      >    type Key = String
>      >
>      >    -- insert key-value pairs into a dictionary
>      >    fromList :: Dict d k a => [(k,a)] -> d
>      >    fromList l =
>      >        foldl (\d (key,val) -> insert key val d) empty l
>      >
>      >    instance Dict (DM.Map S.ByteString a) Key a where
>      >        empty = DM.empty
>      >        insert key val dm =
>      >        let packed_key = S.pack key
>      >        in DM.insert packed_key val dm
>      >        lookup key dm =
>      >        let packed_key = S.pack key
>      >            in DM.lookup packed_key dm
>      >        update key val dm =
>      >        let packed_key = S.pack key
>      >            in DM.update (\x -> Just val) packed_key dm
>      >
>      >    Which kinda works, however since Map is implemented using a
>      balanced tree,
>      >    therefore,
>      >    when as the dictionary grows, it takes a long time to insert new
>      key-value
>      >    pair.
>      >
>      >    == THIRD ATTEMPT ==
>      >    My third attempt is to use Data.IntMap
>      >
>      >    -- an implementation of Dict using IntMap
>      >    instance Dict (IM.IntMap a) Key a where
>      >        empty = IM.empty
>      >        insert key val im =
>      >        let int_key = fromIntegral (HT.hashString key)
>      >        in IM.insert int_key val im
>      >        lookup key im =
>      >        let int_key = fromIntegral (HT.hashString key)
>      >        in IM.lookup int_key im
>      >        update key val im =
>      >        let int_key = fromIntegral (HT.hashString key)
>      >            in IM.update (\x -> Just val) int_key im
>      >
>      >    This implementation is faster than the Map approach, however this
>      >    implementation
>      >    can't handle collision well, two keys which are hashed into the
>      same
>      >    integer will overwrite each other.
>      >
>      >    == FOURTH ATTEMPT ==
>      >
>      >    My fourth implementation is to use Trie. The idea is to split a
>      string (a
>      >    key) into
>      >    a list of 4-character chunks. Each chunk can be mapped into a
>      32-bit
>      >    integer without collision.
>      >    We then insert the value with this list of chunks into the Trie.
>      >
>      >    -- an implementation of Dict using Trie
>      >    instance Dict (Trie a) Key a where
>      >        empty = emptyTrie
>      >        insert key val trie =
>      >        let key_chain = chain key
>      >        in insertTrie key_chain val trie
>      >        lookup key trie =
>      >        let key_chain = chain key
>      >            in lookupTrie key_chain trie
>      >        update key val trie =
>      >        let key_chain = chain key
>      >            in updateTrie key_chain val trie
>      >
>      >    -- an auxillary function that "splits" string into small pieces,
>      >    -- 4 characters per piece, 4 chars = 32 bit
>      >    chain :: Key -> [Key]
>      >    chain k | length k > 4 = let (k',ks) = splitAt 4 k
>      >                 in (k':chain ks)
>      >        | otherwise    = [k]
>      >
>      >    -- a collision-free hash function which turns four chars into Int32
>      >    safehash :: [Char] -> Int
>      >    safehash cs | length cs > 4 = error "safehash failed."
>      >            | otherwise =
>      >            sum [ (ord c)*(256^i)   | (c,i) <- zip cs [0..3] ]
>      >
>      >    -- a trie datatype
>      >    data Trie a = Trie [a] (IM.IntMap (Trie a))
>      >
>      >    -- the empty trie
>      >    emptyTrie = Trie [] (IM.empty)
>      >
>      >    -- insert value into the trie
>      >    insertTrie :: [String] -> a -> Trie a -> Trie a
>      >    insertTrie [] i (Trie is maps) = Trie (i:is) maps
>      >    insertTrie (word:words) i (Trie is maps) =
>      >        let key = safehash word
>      >        in case IM.lookup key maps of
>      >         { Just trie -> let trie' = insertTrie words i trie
>      >                    maps' = IM.update (\x -> Just trie') key maps
>      >                in Trie is maps'
>      >         ; Nothing -> let trie = emptyTrie
>      >                  trie' = insertTrie words i trie
>      >                  maps' = IM.insert key trie' maps
>      >                  in Trie is maps'
>      >         }
>      >
>      >    -- lookup value from the trie
>      >    lookupTrie :: [String] -> Trie a -> Maybe a
>      >    lookupTrie [] (Trie vs _) =
>      >        case vs of
>      >          [] -> Nothing
>      >          (x:_) -> Just x
>      >    lookupTrie (word:words) (Trie is maps) =
>      >        let key = safehash word
>      >        in case IM.lookup key maps of
>      >           Just trie -> lookupTrie words trie
>      >           Nothing   -> Nothing
>      >
>      >    -- update the trie with the given value.
>      >    updateTrie :: [String] -> a -> Trie a -> Trie a
>      >    -- we only update the first value and leave the rest unchanged.
>      >    updateTrie [] y (Trie (x:xs) maps) = Trie (y:xs) maps
>      >    updateTrie (word:words) v  (Trie is maps) =
>      >        let key = safehash word
>      >        in case IM.lookup key maps of
>      >           Just trie -> let trie' = updateTrie words v trie
>      >                    maps'  = IM.update (\x -> Just trie') key maps
>      >                in Trie is maps'
>      >           Nothing   -> Trie is maps
>      >
>      >    == BENCH MARK ==
>      >
>      >    I have a main function which builds a dictionary from a text file.
>      >    Each line of the file is a key-value pair separated by a space.
>      >
>      >    e.g.
>      >
>      >    key1 1
>      >    key2 2
>      >    ...
>      >
>      >    main :: IO ()
>      >    main = do { content <- readFile "in.txt"
>      >          ; let -- change this following type annotation
>      >                -- to change different type of the dictionary
>      >                -- dict :: DM.Map S.ByteString Int
>      >                -- dict :: IM.IntMap Int
>      >                dict :: Trie Int
>      >                dict = fromList (map parse_a_line  (lines content))
>      >          ; case Main.lookup "key256" dict of
>      >            { Just v -> putStrLn (show v)
>      >            ; Nothing -> putStrLn "Not found"
>      >            }
>      >            -- read a line here so that we can pause the program
>      >                -- and look at the memory usage.
>      >          ; v <- readLn
>      >          ; putStrLn v
>      >          }
>      >        where  parse_a_line :: String -> (Key,Int)
>      >               parse_a_line line = case words line of
>      >                      [key,val] -> (key,read val)
>      >                      _ -> error " parse error.  "
>      >
>      >    I tested all three implementations by building a dictionary of size
>      >    1000000.
>      >    The result shows that the Map and the Trie approaches handle
>      collision
>      >    well, but
>      >    the IntMap approach does not.
>      >
>      >    Here is a comparison of memory usage
>      >
>      >    Map     : 345 MB
>      >    IntMap : 146 MB
>      >    Trie     : 282 MB
>      >    Python : 94 MB
>      >
>      >    Here is a comparison of execution time (on an intel dual core 2.0G)
>      >
>      >    Map: 26 sec
>      >    IntMap: 9 sec
>      >    Trie: 12 sec
>      >    Python: 2.24 sec
>      >
>      >    The above number shows that my implementations of python style
>      dictionary
>      >    are space/time in-efficient as compared to python.
>      >
>      >    Can some one point out what's wrong with my implementations?
>      >
>      >    I've attached my code in the tgz file.
>      >
>      >    Cheers,
>      >    Kenny
> 
>      > _______________________________________________
>      > Haskell-Cafe mailing list
>      > [2]Haskell-Cafe at haskell.org
>      > [3]http://www.haskell.org/mailman/listinfo/haskell-cafe
> 
> References
> 
>    Visible links
>    1. mailto:dons at galois.com
>    2. mailto:Haskell-Cafe at haskell.org
>    3. http://www.haskell.org/mailman/listinfo/haskell-cafe


More information about the Haskell-Cafe mailing list