[Haskell-cafe] Tons of retainers when inserting 611, 756 elements into a Trie

Denis Bueno dbueno at gmail.com
Fri Sep 12 19:05:50 EDT 2008


Dear haskell-cafe,

I've got an anagram-finder ("puzzler") that uses a "dictionary" datatype, which
in turn uses a trie.  In src/hs/Main.hs, I create a new dictionary from a word
list (a file containing one word per line) and perform a query on it in order to
force it to actually load something from disk.  By problem is that though my
word list is just shy of 7MB, puzzler occupies nearly 1GB of resident memory as
soon as the dictionary is loaded.  I'm using GHC 6.8.3 and Gtk2hs 0.9.13 on OS X
10.4.11.

To see for yourself, check out the code with git and run it:

    $ git clone git://github.com/dbueno/puzzler.git
    $ git checkout -b origin/bsbench
    $ cabal configure              # needs gtk2hs
    $ cabal build
    $ ./dist/build/puzzler/puzzler # look at the resident memory

Retainer profiling (attachment "puzzler-hr.pdf") shows that `insertWith' on my
trie is responsible for most of the memory.  As you will see, in
`makeDictionary' I try to strictly apply the combining function to combat these
retainers.  But the problem doesn't go away.

More strangely: puzzler has a second executable, puzzler-test, which runs unit
tests, including loading the exact same anagram query. However, puzzler-test
never uses more than 250MB or so.

The dictionary (in module Puzzler.Anagram) is an array of bytestrings paired
with a Trie mapping bytestrings to sets of indices in the array.

> data Dictionary = Dictionary
>     { dictWords :: Words
>     , sortWords :: Trie IntSet }
> type Words = Array Int ByteString

The Trie is a custom bytestring trie datatype.  Each level is mapped to by the
character it represents.  If that character is a terminal character for some
word (Right (t,x)), then that word maps to x.  Otherwise no word ends at that
character, but there may be longer words, so the trie continues with (Left t).

> newtype Trie a = Trie { unTrie :: IntMap (Either (Trie a) (Trie a, a)) }
>     deriving (Eq, Ord, Show)

The following is where _I think_ the error lies (src/hs/Puzzler/Anagram.hs):

> -- | Creates an anagram dictionary from a file of words, one per line.  The
> -- words may be compound, as long as there is one per line.
> createDictionary :: FilePath -> IO Dictionary
> createDictionary path = (makeDictionary . lines) `liftM` readFile path
>
> -- | Makes an anagram dictionary from a list of words.
> makeDictionary :: [ByteString] -> Dictionary
> makeDictionary ws = Dictionary
>     { dictWords = dw
>     , sortWords = go end Trie.empty }
>   where
>     dw = listArray (0, length ws - 1) ws ; (begin, end) = bounds dw
>     go i t | seq t $ False = undefined
>            | i < begin = t
>            | otherwise = go (i-1)
>                          -- $ Trie.insertWith Set.union (BS.sort (dw!i)) (Set.singleton i) t
>                          $ insertWith' Set.union (BS.sort (dw!i)) (Set.singleton i) t
>
>     insertWith' f k v m | seq k $ False = undefined
>     insertWith' f k v m = case Trie.lookup k m of
>           Nothing -> Trie.insert k v m
>           Just s  -> (Trie.insert k $! f v s) m

I've been scouring haskell-cafe for similar issues, and all seem to have been
solved by a function similar to my `insertWith''.  Uncommenting
`Trie.insertWith'
and commenting out the rest doesn't appear to differ noticeably in memory
consumption.  The implementation of `Trie.insertWith' is:

> insertWith :: (a -> a -> a) -> ByteString -> a -> Trie a -> Trie a
> insertWith f bs x t@(Trie m) = case uncons bs of
>       Nothing      -> t
>       Just (c, cs) -> Trie (Map.alter myAlter (ord c) m)
>         where
>           myAlter Nothing               = Just $ cons (insertWith f cs x empty, x)
>           myAlter (Just (Left  t))      = Just $ cons (insertWith f cs x t, x)
>           myAlter (Just (Right (t,x'))) =
>               Just $ Right (insertWith f cs x t, if isLast then f x x' else x')
>
>           -- If `cs' is empty then `c' is the last char of the word to insert;
>           -- that is, we're done.
>           cons (t, x) = if isLast then Right (t, x) else Left t
>           isLast = BS.null cs

Any ideas?

                              Denis
-------------- next part --------------
A non-text attachment was scrubbed...
Name: puzzler-hr.pdf
Type: application/pdf
Size: 17099 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20080912/3869b8a8/puzzler-hr.pdf


More information about the Haskell-Cafe mailing list