[Haskell-cafe] Trying to reduce memory costs of String duplicates

wren ng thornton wren at freegeek.org
Sat Sep 5 20:31:28 EDT 2009


Günther Schmidt wrote:
> Hi all,
> 
> I'm reading in a data of 216k records into a map of Key, Values Pairs, 
> the values being strings.
> 
> As it happens out of 216k String values there really are only about 6.6k 
> distinct string values, so I could save a lot of RAM if I was able to 
> "insert" only actually *new* string values into the map and use 
> references to (string) values that already are in memory instead.
> 
> Is there a container that would, if I wanted to insert an element, 
> return a pair of either the previously inserted, equal value and the 
> container unchanged, or the new, previously unknown value and the new 
> container amended by that element?

If by "strings" you allow ByteStrings, then you could use the 
bytestring-trie package[1]. This will be more worthwhile than other 
approaches if your 6.6k strings have a lot of repeated prefixes, since 
repeated prefixes will be shared among the unique strings.

Something like the following should work:

     intern    :: ByteString -> Trie ByteString
               -> (ByteString, Trie ByteString)

     intern s t = (fromJust (lookup s t'), t')
         where
         t' = alterBy (\_ _ -> maybe (Just s) Just) s s t


[1] 
http://hackage.haskell.org/cgi-bin/hackage-scripts/package/bytestring-trie

-- 
Live well,
~wren


More information about the Haskell-Cafe mailing list