[Haskell-cafe] implementing python-style dictionary in Haskell
Don Stewart
dons at galois.com
Tue Nov 18 10:33:29 EST 2008
Which version of GHC and which version of the Data.ByteString library?
There was an inlining bug related to Data.Map /Data.IntMap performance
fixed between the 6.8.x release and the current bytestring release.
In testing, Data.Map with strict bytestring keys matched the python (C
implemented) dictionary, after I fixed the inlining for word lookups.
You'll need to be using bytestring 0.9.1.x though.
-- Don
haskellmail:
> Dear all,
>
> I am trying to implement the python-style dictionary in Haskell.
>
> Python dictionary is a data structure that maps one key to one value.
> For instance, a python dictionary
> d = {'a':1, 'b':2 }
> maps key 'a' to 1, 'b' to 2.
> Python dictionary allows for update. e.g. the statement
> d['a'] = 3
> changes the value pointed by 'a' from 1 to 3.
>
> Internally, python dictionary is implemented using hash table.
>
> My first attempt is to use Data.HashTable. However it was immediately
> abandoned, as I realize the memory usage is unreasonably huge.
>
> == SECOND ATTEMPT ==
> My second attempt is to use Data.Map
>
> {-# OPTIONS_GHC -fglasgow-exts #-}
> module Main where
>
> import qualified Data.HashTable as HT
> import qualified Data.IntMap as IM
> import qualified Data.Map as DM
> import qualified Data.ByteString.Char8 as S
> import Data.Char
>
> -- the Dict type class
> class Dict d k v | d -> k v where
> empty :: d
> insert :: k -> v -> d -> d
> lookup :: k -> d -> Maybe v
> update :: k -> v -> d -> d
>
> -- Let's use string as key
> type Key = String
>
> -- insert key-value pairs into a dictionary
> fromList :: Dict d k a => [(k,a)] -> d
> fromList l =
> foldl (\d (key,val) -> insert key val d) empty l
>
> instance Dict (DM.Map S.ByteString a) Key a where
> empty = DM.empty
> insert key val dm =
> let packed_key = S.pack key
> in DM.insert packed_key val dm
> lookup key dm =
> let packed_key = S.pack key
> in DM.lookup packed_key dm
> update key val dm =
> let packed_key = S.pack key
> in DM.update (\x -> Just val) packed_key dm
>
> Which kinda works, however since Map is implemented using a balanced tree,
> therefore,
> when as the dictionary grows, it takes a long time to insert new key-value
> pair.
>
> == THIRD ATTEMPT ==
> My third attempt is to use Data.IntMap
>
> -- an implementation of Dict using IntMap
> instance Dict (IM.IntMap a) Key a where
> empty = IM.empty
> insert key val im =
> let int_key = fromIntegral (HT.hashString key)
> in IM.insert int_key val im
> lookup key im =
> let int_key = fromIntegral (HT.hashString key)
> in IM.lookup int_key im
> update key val im =
> let int_key = fromIntegral (HT.hashString key)
> in IM.update (\x -> Just val) int_key im
>
> This implementation is faster than the Map approach, however this
> implementation
> can't handle collision well, two keys which are hashed into the same
> integer will overwrite each other.
>
> == FOURTH ATTEMPT ==
>
> My fourth implementation is to use Trie. The idea is to split a string (a
> key) into
> a list of 4-character chunks. Each chunk can be mapped into a 32-bit
> integer without collision.
> We then insert the value with this list of chunks into the Trie.
>
> -- an implementation of Dict using Trie
> instance Dict (Trie a) Key a where
> empty = emptyTrie
> insert key val trie =
> let key_chain = chain key
> in insertTrie key_chain val trie
> lookup key trie =
> let key_chain = chain key
> in lookupTrie key_chain trie
> update key val trie =
> let key_chain = chain key
> in updateTrie key_chain val trie
>
> -- an auxillary function that "splits" string into small pieces,
> -- 4 characters per piece, 4 chars = 32 bit
> chain :: Key -> [Key]
> chain k | length k > 4 = let (k',ks) = splitAt 4 k
> in (k':chain ks)
> | otherwise = [k]
>
> -- a collision-free hash function which turns four chars into Int32
> safehash :: [Char] -> Int
> safehash cs | length cs > 4 = error "safehash failed."
> | otherwise =
> sum [ (ord c)*(256^i) | (c,i) <- zip cs [0..3] ]
>
> -- a trie datatype
> data Trie a = Trie [a] (IM.IntMap (Trie a))
>
> -- the empty trie
> emptyTrie = Trie [] (IM.empty)
>
> -- insert value into the trie
> insertTrie :: [String] -> a -> Trie a -> Trie a
> insertTrie [] i (Trie is maps) = Trie (i:is) maps
> insertTrie (word:words) i (Trie is maps) =
> let key = safehash word
> in case IM.lookup key maps of
> { Just trie -> let trie' = insertTrie words i trie
> maps' = IM.update (\x -> Just trie') key maps
> in Trie is maps'
> ; Nothing -> let trie = emptyTrie
> trie' = insertTrie words i trie
> maps' = IM.insert key trie' maps
> in Trie is maps'
> }
>
> -- lookup value from the trie
> lookupTrie :: [String] -> Trie a -> Maybe a
> lookupTrie [] (Trie vs _) =
> case vs of
> [] -> Nothing
> (x:_) -> Just x
> lookupTrie (word:words) (Trie is maps) =
> let key = safehash word
> in case IM.lookup key maps of
> Just trie -> lookupTrie words trie
> Nothing -> Nothing
>
> -- update the trie with the given value.
> updateTrie :: [String] -> a -> Trie a -> Trie a
> -- we only update the first value and leave the rest unchanged.
> updateTrie [] y (Trie (x:xs) maps) = Trie (y:xs) maps
> updateTrie (word:words) v (Trie is maps) =
> let key = safehash word
> in case IM.lookup key maps of
> Just trie -> let trie' = updateTrie words v trie
> maps' = IM.update (\x -> Just trie') key maps
> in Trie is maps'
> Nothing -> Trie is maps
>
> == BENCH MARK ==
>
> I have a main function which builds a dictionary from a text file.
> Each line of the file is a key-value pair separated by a space.
>
> e.g.
>
> key1 1
> key2 2
> ...
>
> main :: IO ()
> main = do { content <- readFile "in.txt"
> ; let -- change this following type annotation
> -- to change different type of the dictionary
> -- dict :: DM.Map S.ByteString Int
> -- dict :: IM.IntMap Int
> dict :: Trie Int
> dict = fromList (map parse_a_line (lines content))
> ; case Main.lookup "key256" dict of
> { Just v -> putStrLn (show v)
> ; Nothing -> putStrLn "Not found"
> }
> -- read a line here so that we can pause the program
> -- and look at the memory usage.
> ; v <- readLn
> ; putStrLn v
> }
> where parse_a_line :: String -> (Key,Int)
> parse_a_line line = case words line of
> [key,val] -> (key,read val)
> _ -> error " parse error. "
>
> I tested all three implementations by building a dictionary of size
> 1000000.
> The result shows that the Map and the Trie approaches handle collision
> well, but
> the IntMap approach does not.
>
> Here is a comparison of memory usage
>
> Map : 345 MB
> IntMap : 146 MB
> Trie : 282 MB
> Python : 94 MB
>
> Here is a comparison of execution time (on an intel dual core 2.0G)
>
> Map: 26 sec
> IntMap: 9 sec
> Trie: 12 sec
> Python: 2.24 sec
>
> The above number shows that my implementations of python style dictionary
> are space/time in-efficient as compared to python.
>
> Can some one point out what's wrong with my implementations?
>
> I've attached my code in the tgz file.
>
> Cheers,
> Kenny
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
More information about the Haskell-Cafe
mailing list