[Haskell-cafe] implementing python-style dictionary in Haskell
kenny lu
haskellmail at gmail.com
Tue Nov 18 10:36:21 EST 2008
Dear Don,
I am using GHC 6.8.1
Regards,
Kenny
On Tue, Nov 18, 2008 at 11:33 PM, Don Stewart <dons at galois.com> wrote:
> Which version of GHC and which version of the Data.ByteString library?
> There was an inlining bug related to Data.Map /Data.IntMap performance
> fixed between the 6.8.x release and the current bytestring release.
>
> In testing, Data.Map with strict bytestring keys matched the python (C
> implemented) dictionary, after I fixed the inlining for word lookups.
>
> You'll need to be using bytestring 0.9.1.x though.
>
> -- Don
>
> haskellmail:
> > Dear all,
> >
> > I am trying to implement the python-style dictionary in Haskell.
> >
> > Python dictionary is a data structure that maps one key to one value.
> > For instance, a python dictionary
> > d = {'a':1, 'b':2 }
> > maps key 'a' to 1, 'b' to 2.
> > Python dictionary allows for update. e.g. the statement
> > d['a'] = 3
> > changes the value pointed by 'a' from 1 to 3.
> >
> > Internally, python dictionary is implemented using hash table.
> >
> > My first attempt is to use Data.HashTable. However it was immediately
> > abandoned, as I realize the memory usage is unreasonably huge.
> >
> > == SECOND ATTEMPT ==
> > My second attempt is to use Data.Map
> >
> > {-# OPTIONS_GHC -fglasgow-exts #-}
> > module Main where
> >
> > import qualified Data.HashTable as HT
> > import qualified Data.IntMap as IM
> > import qualified Data.Map as DM
> > import qualified Data.ByteString.Char8 as S
> > import Data.Char
> >
> > -- the Dict type class
> > class Dict d k v | d -> k v where
> > empty :: d
> > insert :: k -> v -> d -> d
> > lookup :: k -> d -> Maybe v
> > update :: k -> v -> d -> d
> >
> > -- Let's use string as key
> > type Key = String
> >
> > -- insert key-value pairs into a dictionary
> > fromList :: Dict d k a => [(k,a)] -> d
> > fromList l =
> > foldl (\d (key,val) -> insert key val d) empty l
> >
> > instance Dict (DM.Map S.ByteString a) Key a where
> > empty = DM.empty
> > insert key val dm =
> > let packed_key = S.pack key
> > in DM.insert packed_key val dm
> > lookup key dm =
> > let packed_key = S.pack key
> > in DM.lookup packed_key dm
> > update key val dm =
> > let packed_key = S.pack key
> > in DM.update (\x -> Just val) packed_key dm
> >
> > Which kinda works, however since Map is implemented using a balanced
> tree,
> > therefore,
> > when as the dictionary grows, it takes a long time to insert new
> key-value
> > pair.
> >
> > == THIRD ATTEMPT ==
> > My third attempt is to use Data.IntMap
> >
> > -- an implementation of Dict using IntMap
> > instance Dict (IM.IntMap a) Key a where
> > empty = IM.empty
> > insert key val im =
> > let int_key = fromIntegral (HT.hashString key)
> > in IM.insert int_key val im
> > lookup key im =
> > let int_key = fromIntegral (HT.hashString key)
> > in IM.lookup int_key im
> > update key val im =
> > let int_key = fromIntegral (HT.hashString key)
> > in IM.update (\x -> Just val) int_key im
> >
> > This implementation is faster than the Map approach, however this
> > implementation
> > can't handle collision well, two keys which are hashed into the same
> > integer will overwrite each other.
> >
> > == FOURTH ATTEMPT ==
> >
> > My fourth implementation is to use Trie. The idea is to split a string
> (a
> > key) into
> > a list of 4-character chunks. Each chunk can be mapped into a 32-bit
> > integer without collision.
> > We then insert the value with this list of chunks into the Trie.
> >
> > -- an implementation of Dict using Trie
> > instance Dict (Trie a) Key a where
> > empty = emptyTrie
> > insert key val trie =
> > let key_chain = chain key
> > in insertTrie key_chain val trie
> > lookup key trie =
> > let key_chain = chain key
> > in lookupTrie key_chain trie
> > update key val trie =
> > let key_chain = chain key
> > in updateTrie key_chain val trie
> >
> > -- an auxillary function that "splits" string into small pieces,
> > -- 4 characters per piece, 4 chars = 32 bit
> > chain :: Key -> [Key]
> > chain k | length k > 4 = let (k',ks) = splitAt 4 k
> > in (k':chain ks)
> > | otherwise = [k]
> >
> > -- a collision-free hash function which turns four chars into Int32
> > safehash :: [Char] -> Int
> > safehash cs | length cs > 4 = error "safehash failed."
> > | otherwise =
> > sum [ (ord c)*(256^i) | (c,i) <- zip cs [0..3] ]
> >
> > -- a trie datatype
> > data Trie a = Trie [a] (IM.IntMap (Trie a))
> >
> > -- the empty trie
> > emptyTrie = Trie [] (IM.empty)
> >
> > -- insert value into the trie
> > insertTrie :: [String] -> a -> Trie a -> Trie a
> > insertTrie [] i (Trie is maps) = Trie (i:is) maps
> > insertTrie (word:words) i (Trie is maps) =
> > let key = safehash word
> > in case IM.lookup key maps of
> > { Just trie -> let trie' = insertTrie words i trie
> > maps' = IM.update (\x -> Just trie') key maps
> > in Trie is maps'
> > ; Nothing -> let trie = emptyTrie
> > trie' = insertTrie words i trie
> > maps' = IM.insert key trie' maps
> > in Trie is maps'
> > }
> >
> > -- lookup value from the trie
> > lookupTrie :: [String] -> Trie a -> Maybe a
> > lookupTrie [] (Trie vs _) =
> > case vs of
> > [] -> Nothing
> > (x:_) -> Just x
> > lookupTrie (word:words) (Trie is maps) =
> > let key = safehash word
> > in case IM.lookup key maps of
> > Just trie -> lookupTrie words trie
> > Nothing -> Nothing
> >
> > -- update the trie with the given value.
> > updateTrie :: [String] -> a -> Trie a -> Trie a
> > -- we only update the first value and leave the rest unchanged.
> > updateTrie [] y (Trie (x:xs) maps) = Trie (y:xs) maps
> > updateTrie (word:words) v (Trie is maps) =
> > let key = safehash word
> > in case IM.lookup key maps of
> > Just trie -> let trie' = updateTrie words v trie
> > maps' = IM.update (\x -> Just trie') key maps
> > in Trie is maps'
> > Nothing -> Trie is maps
> >
> > == BENCH MARK ==
> >
> > I have a main function which builds a dictionary from a text file.
> > Each line of the file is a key-value pair separated by a space.
> >
> > e.g.
> >
> > key1 1
> > key2 2
> > ...
> >
> > main :: IO ()
> > main = do { content <- readFile "in.txt"
> > ; let -- change this following type annotation
> > -- to change different type of the dictionary
> > -- dict :: DM.Map S.ByteString Int
> > -- dict :: IM.IntMap Int
> > dict :: Trie Int
> > dict = fromList (map parse_a_line (lines content))
> > ; case Main.lookup "key256" dict of
> > { Just v -> putStrLn (show v)
> > ; Nothing -> putStrLn "Not found"
> > }
> > -- read a line here so that we can pause the program
> > -- and look at the memory usage.
> > ; v <- readLn
> > ; putStrLn v
> > }
> > where parse_a_line :: String -> (Key,Int)
> > parse_a_line line = case words line of
> > [key,val] -> (key,read val)
> > _ -> error " parse error. "
> >
> > I tested all three implementations by building a dictionary of size
> > 1000000.
> > The result shows that the Map and the Trie approaches handle collision
> > well, but
> > the IntMap approach does not.
> >
> > Here is a comparison of memory usage
> >
> > Map : 345 MB
> > IntMap : 146 MB
> > Trie : 282 MB
> > Python : 94 MB
> >
> > Here is a comparison of execution time (on an intel dual core 2.0G)
> >
> > Map: 26 sec
> > IntMap: 9 sec
> > Trie: 12 sec
> > Python: 2.24 sec
> >
> > The above number shows that my implementations of python style
> dictionary
> > are space/time in-efficient as compared to python.
> >
> > Can some one point out what's wrong with my implementations?
> >
> > I've attached my code in the tgz file.
> >
> > Cheers,
> > Kenny
>
>
> > _______________________________________________
> > Haskell-Cafe mailing list
> > Haskell-Cafe at haskell.org
> > http://www.haskell.org/mailman/listinfo/haskell-cafe
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20081118/b029eb85/attachment.htm
More information about the Haskell-Cafe
mailing list