[Haskell-cafe] implementing python-style dictionary in Haskell

kenny lu haskellmail at gmail.com
Tue Nov 18 10:36:21 EST 2008


Dear Don,
I am using GHC 6.8.1

Regards,
Kenny

On Tue, Nov 18, 2008 at 11:33 PM, Don Stewart <dons at galois.com> wrote:

> Which version of GHC and which version of the Data.ByteString library?
> There was an inlining bug related to Data.Map /Data.IntMap performance
> fixed between the 6.8.x release and the current bytestring release.
>
> In testing, Data.Map with strict bytestring keys matched the python (C
> implemented) dictionary, after I fixed the inlining for word lookups.
>
> You'll need to be using bytestring 0.9.1.x though.
>
> -- Don
>
> haskellmail:
> >    Dear all,
> >
> >    I am trying to implement the python-style dictionary in Haskell.
> >
> >    Python dictionary is a data structure that maps one key to one value.
> >    For instance, a python dictionary
> >    d = {'a':1, 'b':2 }
> >    maps key 'a' to 1, 'b' to 2.
> >    Python dictionary allows for update. e.g. the statement
> >    d['a'] = 3
> >    changes the value pointed by 'a' from 1 to 3.
> >
> >    Internally, python dictionary is implemented using hash table.
> >
> >    My first attempt is to use Data.HashTable. However it was immediately
> >    abandoned, as I realize the memory usage is unreasonably huge.
> >
> >    == SECOND ATTEMPT ==
> >    My second attempt is to use Data.Map
> >
> >    {-# OPTIONS_GHC -fglasgow-exts #-}
> >    module Main where
> >
> >    import qualified Data.HashTable as HT
> >    import qualified Data.IntMap as IM
> >    import qualified Data.Map as DM
> >    import qualified Data.ByteString.Char8 as S
> >    import Data.Char
> >
> >    -- the Dict type class
> >    class Dict d k v | d -> k v where
> >        empty :: d
> >        insert :: k -> v -> d -> d
> >        lookup :: k -> d -> Maybe v
> >        update :: k -> v -> d -> d
> >
> >    -- Let's use string as key
> >    type Key = String
> >
> >    -- insert key-value pairs into a dictionary
> >    fromList :: Dict d k a => [(k,a)] -> d
> >    fromList l =
> >        foldl (\d (key,val) -> insert key val d) empty l
> >
> >    instance Dict (DM.Map S.ByteString a) Key a where
> >        empty = DM.empty
> >        insert key val dm =
> >        let packed_key = S.pack key
> >        in DM.insert packed_key val dm
> >        lookup key dm =
> >        let packed_key = S.pack key
> >            in DM.lookup packed_key dm
> >        update key val dm =
> >        let packed_key = S.pack key
> >            in DM.update (\x -> Just val) packed_key dm
> >
> >    Which kinda works, however since Map is implemented using a balanced
> tree,
> >    therefore,
> >    when as the dictionary grows, it takes a long time to insert new
> key-value
> >    pair.
> >
> >    == THIRD ATTEMPT ==
> >    My third attempt is to use Data.IntMap
> >
> >    -- an implementation of Dict using IntMap
> >    instance Dict (IM.IntMap a) Key a where
> >        empty = IM.empty
> >        insert key val im =
> >        let int_key = fromIntegral (HT.hashString key)
> >        in IM.insert int_key val im
> >        lookup key im =
> >        let int_key = fromIntegral (HT.hashString key)
> >        in IM.lookup int_key im
> >        update key val im =
> >        let int_key = fromIntegral (HT.hashString key)
> >            in IM.update (\x -> Just val) int_key im
> >
> >    This implementation is faster than the Map approach, however this
> >    implementation
> >    can't handle collision well, two keys which are hashed into the same
> >    integer will overwrite each other.
> >
> >    == FOURTH ATTEMPT ==
> >
> >    My fourth implementation is to use Trie. The idea is to split a string
> (a
> >    key) into
> >    a list of 4-character chunks. Each chunk can be mapped into a 32-bit
> >    integer without collision.
> >    We then insert the value with this list of chunks into the Trie.
> >
> >    -- an implementation of Dict using Trie
> >    instance Dict (Trie a) Key a where
> >        empty = emptyTrie
> >        insert key val trie =
> >        let key_chain = chain key
> >        in insertTrie key_chain val trie
> >        lookup key trie =
> >        let key_chain = chain key
> >            in lookupTrie key_chain trie
> >        update key val trie =
> >        let key_chain = chain key
> >            in updateTrie key_chain val trie
> >
> >    -- an auxillary function that "splits" string into small pieces,
> >    -- 4 characters per piece, 4 chars = 32 bit
> >    chain :: Key -> [Key]
> >    chain k | length k > 4 = let (k',ks) = splitAt 4 k
> >                 in (k':chain ks)
> >        | otherwise    = [k]
> >
> >    -- a collision-free hash function which turns four chars into Int32
> >    safehash :: [Char] -> Int
> >    safehash cs | length cs > 4 = error "safehash failed."
> >            | otherwise =
> >            sum [ (ord c)*(256^i)   | (c,i) <- zip cs [0..3] ]
> >
> >    -- a trie datatype
> >    data Trie a = Trie [a] (IM.IntMap (Trie a))
> >
> >    -- the empty trie
> >    emptyTrie = Trie [] (IM.empty)
> >
> >    -- insert value into the trie
> >    insertTrie :: [String] -> a -> Trie a -> Trie a
> >    insertTrie [] i (Trie is maps) = Trie (i:is) maps
> >    insertTrie (word:words) i (Trie is maps) =
> >        let key = safehash word
> >        in case IM.lookup key maps of
> >         { Just trie -> let trie' = insertTrie words i trie
> >                    maps' = IM.update (\x -> Just trie') key maps
> >                in Trie is maps'
> >         ; Nothing -> let trie = emptyTrie
> >                  trie' = insertTrie words i trie
> >                  maps' = IM.insert key trie' maps
> >                  in Trie is maps'
> >         }
> >
> >    -- lookup value from the trie
> >    lookupTrie :: [String] -> Trie a -> Maybe a
> >    lookupTrie [] (Trie vs _) =
> >        case vs of
> >          [] -> Nothing
> >          (x:_) -> Just x
> >    lookupTrie (word:words) (Trie is maps) =
> >        let key = safehash word
> >        in case IM.lookup key maps of
> >           Just trie -> lookupTrie words trie
> >           Nothing   -> Nothing
> >
> >    -- update the trie with the given value.
> >    updateTrie :: [String] -> a -> Trie a -> Trie a
> >    -- we only update the first value and leave the rest unchanged.
> >    updateTrie [] y (Trie (x:xs) maps) = Trie (y:xs) maps
> >    updateTrie (word:words) v  (Trie is maps) =
> >        let key = safehash word
> >        in case IM.lookup key maps of
> >           Just trie -> let trie' = updateTrie words v trie
> >                    maps'  = IM.update (\x -> Just trie') key maps
> >                in Trie is maps'
> >           Nothing   -> Trie is maps
> >
> >    == BENCH MARK ==
> >
> >    I have a main function which builds a dictionary from a text file.
> >    Each line of the file is a key-value pair separated by a space.
> >
> >    e.g.
> >
> >    key1 1
> >    key2 2
> >    ...
> >
> >    main :: IO ()
> >    main = do { content <- readFile "in.txt"
> >          ; let -- change this following type annotation
> >                -- to change different type of the dictionary
> >                -- dict :: DM.Map S.ByteString Int
> >                -- dict :: IM.IntMap Int
> >                dict :: Trie Int
> >                dict = fromList (map parse_a_line  (lines content))
> >          ; case Main.lookup "key256" dict of
> >            { Just v -> putStrLn (show v)
> >            ; Nothing -> putStrLn "Not found"
> >            }
> >            -- read a line here so that we can pause the program
> >                -- and look at the memory usage.
> >          ; v <- readLn
> >          ; putStrLn v
> >          }
> >        where  parse_a_line :: String -> (Key,Int)
> >               parse_a_line line = case words line of
> >                      [key,val] -> (key,read val)
> >                      _ -> error " parse error.  "
> >
> >    I tested all three implementations by building a dictionary of size
> >    1000000.
> >    The result shows that the Map and the Trie approaches handle collision
> >    well, but
> >    the IntMap approach does not.
> >
> >    Here is a comparison of memory usage
> >
> >    Map     : 345 MB
> >    IntMap : 146 MB
> >    Trie     : 282 MB
> >    Python : 94 MB
> >
> >    Here is a comparison of execution time (on an intel dual core 2.0G)
> >
> >    Map: 26 sec
> >    IntMap: 9 sec
> >    Trie: 12 sec
> >    Python: 2.24 sec
> >
> >    The above number shows that my implementations of python style
> dictionary
> >    are space/time in-efficient as compared to python.
> >
> >    Can some one point out what's wrong with my implementations?
> >
> >    I've attached my code in the tgz file.
> >
> >    Cheers,
> >    Kenny
>
>
> > _______________________________________________
> > Haskell-Cafe mailing list
> > Haskell-Cafe at haskell.org
> > http://www.haskell.org/mailman/listinfo/haskell-cafe
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20081118/b029eb85/attachment.htm


More information about the Haskell-Cafe mailing list