[Haskell-cafe] Sharing on equality
wren ng thornton
wren at freegeek.org
Mon Dec 19 23:46:58 CET 2011
On 12/13/11 10:52 AM, Johan Brinch wrote:
> Hey all,
>
> Can GHC eliminate one of two equal ByteStrings, when they are compared
> and turns out to be equal?
>
>
> Say i have a map, ByteString -> Int.
>
> I now do a lookup on a ByteString and if it exists, I insert this
> ByteString into a list.
>
> Is it possible to avoid using more memory, than used by the keys in
> the map + the list structure?
>
> I.e. is it possible to eliminate the redundant ByteStrings somehow?
Somehow? yes. Probably the easiest route is just to use:
-- Or better yet, use one of the HashMap structures
type MyMap a = Map ByteString (ByteString,a)
myInsert :: ByteString -> a -> MyMap a -> MyMap a
myInsert k v = insert k (k,v)
myLookup :: ByteString -> MyMap a -> Maybe a
myLookup k = fmap snd . lookup k
...
If you really care a lot about memory overhead and sharing, there are
two other approaches which are more work but have nice payoffs.
The first option is to use a trie structure which automatically prunes
the key segments and allows you to reconstruct the keys. The
bytestring-trie package does this. This approach has its own set of
costs and benefits compared to plain mapping structures, so whether you
want to go down this road will depend on what functionality you need. If
you don't actually need the trie functionality (e.g., the ability to get
the subtrie of all keys with a given prefix, or the ability to look up
the values for all prefixes of a given key), then you're probably better
off using a hashtable-based solution, like the hashmap or
unordered-containers packages.
The second option is to use interning in order to ensure uniqueness of
your expensive structures. That is, you implement the interface:
type InternId
-- e.g., newtype InternId = IId Int
instance Eq InternId
-- this should be faster than (Eq a)
-- you should also have instances for use in unboxed arrays, etc
type InternTable a
-- e.g., newtype InternTable a = IT (IntMap a, Map a InternId)
intern :: a -> InternTable a -> (InternTable a, InternId)
unintern :: InternId -> InternTable a -> a
...
And then you pass around the InternIds instead of the actual strings.
This ensures low memory overhead for passing around and duplicating
things, ensures fast equality comparisons, and ensures uniqueness
whenever you need to deal with the actual structure instead of an
identifier for it. Of course, you can generalize this idea to any data
structure, not just strings; just google for "hash consing".
I have a decently tuned implementation of ByteString interning laying
around, which I'm hoping to put on Hackage before classes start up again
in January. Of course, for extremely fine tuning we'd want to adjust the
size of the InternId representation based on a heuristic upper-bound on
the number of items we'll have, so that we can pack the unboxed arrays
more tightly or do tricks like packing four 8-bit ids into a Word32. Of
course, presenting a nice API for that would require using associated
types which is GHC-only (the fundep version wouldn't be quite so
pretty). I'll probably do that in the future once I locate some round tuits.
--
Live well,
~wren
More information about the Haskell-Cafe
mailing list