[Haskell-cafe] implementing python-style dictionary in Haskell
Krzysztof Skrzętnicki
gtener at gmail.com
Tue Nov 18 07:01:56 EST 2008
2008/11/18 kenny lu <haskellmail at gmail.com>:
> Dear all,
>
> I am trying to implement the python-style dictionary in Haskell.
>
> Python dictionary is a data structure that maps one key to one value.
> For instance, a python dictionary
> d = {'a':1, 'b':2 }
> maps key 'a' to 1, 'b' to 2.
> Python dictionary allows for update. e.g. the statement
> d['a'] = 3
> changes the value pointed by 'a' from 1 to 3.
>
> Internally, python dictionary is implemented using hash table.
>
> My first attempt is to use Data.HashTable. However it was immediately
> abandoned, as I realize the memory usage is unreasonably huge.
>
> ...
>
> I tested all three implementations by building a dictionary of size 1000000.
> The result shows that the Map and the Trie approaches handle collision well,
> but
> the IntMap approach does not.
>
>
> Here is a comparison of memory usage
>
> Map : 345 MB
> IntMap : 146 MB
> Trie : 282 MB
> Python : 94 MB
>
> Here is a comparison of execution time (on an intel dual core 2.0G)
>
> Map: 26 sec
> IntMap: 9 sec
> Trie: 12 sec
> Python: 2.24 sec
>
>
> The above number shows that my implementations of python style dictionary
> are space/time in-efficient as compared to python.
>
> Can some one point out what's wrong with my implementations?
>
First of all, you use Strings. That's a very bad thing when you care
about memory restrictions. Fire up ghci type something like this:
> let aas = replicate (1024*1024*10) 'a'
> -- 22 Mb memory usage
> length aas
> 10485760
> -- 270 Mb memory usage
10 Mb string caused as much as 250 Mb increase in ghci's memory consumption.
My guess? Use hashtables with ByteStrings.
I rewrote part of your code. Results are quite promising.
Haskell:
121 Mb total memory in use
INIT time 0.02s ( 0.00s elapsed)
MUT time 0.84s ( 1.00s elapsed)
GC time 1.97s ( 2.02s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 2.83s ( 3.02s elapsed)
%GC time 69.6% (66.8% elapsed)
Python:
$ time python dict.py
256
real 0m2.278s
user 0m2.233s
sys 0m0.078s
memory: 101 Mb (as reported by Windows' Task Manager).
The code:
--- cut ---
import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Int
import Data.Bits
(...)
parse_a_line_BS :: BS.ByteString -> (BS.ByteString,Int)
parse_a_line_BS line = case BS.words line of
[key,val] -> (key,(read . BS.unpack) val)
_ -> error " parse error. "
main :: IO ()
main = do
dict <- HT.new (==) hashByteString
indata <- (map parse_a_line_BS `fmap` BS.lines `fmap` BS.readFile "in.txt")
mapM_ (\ (k,v) -> HT.insert dict k v) indata
HT.lookup dict (BS.pack "key256") >>= \v -> case v of
Just vv -> putStrLn (show vv)
Nothing -> putStrLn
("Not found")
-- derived from Data.HashTable.hashString
hashByteString :: BS.ByteString -> Int32
hashByteString = BS.foldl' f golden
where f m c = fromIntegral (ord c) * magic + hashInt32 m
magic = 0xdeadbeef
hashInt32 :: Int32 -> Int32
hashInt32 x = mulHi x golden + x
mulHi :: Int32 -> Int32 -> Int32
mulHi a b = fromIntegral (r `shiftR` 32)
where r :: Int64
r = fromIntegral a * fromIntegral b
golden :: Int32
golden = 1013904242 -- = round ((sqrt 5 - 1) * 2^32) :: Int32
--- cut ---
I had to rewrite hashString to work for ByteStrings - basically it's
just using different foldl'.
All best
Christopher Skrzętnicki
More information about the Haskell-Cafe
mailing list