Proposal: add laziness to Data.Map / IntMap

Scott Dillard sedillard at ucdavis.edu
Mon Aug 4 20:08:54 EDT 2008


Hi,

I found myself wanting a lazy version of Data.Map today, and by "lazy" I
mean in the node-subtree pointers. I trust the library writers when they put
strictness annotations in the fields of the tree nodes, so I'm wondering
what the various implications of lazy subtrees are, beyond the obvious speed
hit. Does this cause space leaks beyond what lazy value pointers can already
cause? Can someone point me to some reading that discusses this?

Anyway, I'm posting to libraries (rather than haskell-cafe) to gauge
opinion about a possible rewrite of Data.Map and IntMap to remove the
strictness annotations (bangs) from the node constructors and move them into
the functions (as seqs). "Rewrite" is maybe too strong a word; "significant
patch" is more like it. It would involve only those functions that construct
Bin values directly, which is not that many. Less than a day's work, I think
(and yes, that means I volunteer). The semantics of everything remain
unchanged, but it opens up the possibility of lazy versions of some functions.
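Concretely, the change is just to the subtree fields of the node
constructor. Here's a minimal sketch (simplified, not the actual Data.Map
internals; the names MapS/MapL are made up for illustration):

```haskell
-- Current style: strict subtree fields.  Building a Bin forces both
-- children, so constructing any node rebuilds its whole subtree.
data MapS k a = TipS
              | BinS !Int !k a !(MapS k a) !(MapS k a)

-- Proposed style: lazy subtree fields.  Functions that want the old
-- behaviour add explicit seqs; lazy variants simply leave them out.
data MapL k a = TipL
              | BinL !Int !k a (MapL k a) (MapL k a)

-- With lazy fields, a node whose subtrees are bottom can still be
-- inspected without forcing them:
rootVal :: MapL k a -> Maybe a
rootVal TipL             = Nothing
rootVal (BinL _ _ x _ _) = Just x
```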

The most useful result of this would be a lazy map (little m). Here's
Data.Map.mapWithKey:

  mapWithKey f Tip = Tip
  mapWithKey f (Bin sx kx x l r)
    = Bin sx kx (f kx x) (mapWithKey f l) (mapWithKey f r)

De-banged and then re-strictified, it would look like this:

  mapWithKey f Tip = Tip
  mapWithKey f (Bin sx kx x l r)
    = seq l' $ seq r' $ Bin sx kx (f kx x) l' r'
    where l' = mapWithKey f l
          r' = mapWithKey f r

Looking at the first version, it's clear that when constructing a new
map you should only have to pay for the subtrees that you actually use. If
you perform only a handful of lookups and then throw the new tree away, why
build the whole thing?
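To make that concrete, here is a toy lazy tree (hypothetical, not
Data.Map's actual representation) showing that after a map, a lookup forces
only the path it traverses; a bottom subtree on the untouched side is never
built:

```haskell
-- A toy search tree with lazy subtree fields.
data T k a = Tip | Bin k a (T k a) (T k a)

mapT :: (a -> b) -> T k a -> T k b
mapT _ Tip           = Tip
mapT f (Bin k x l r) = Bin k (f x) (mapT f l) (mapT f r)

lookupT :: Ord k => k -> T k a -> Maybe a
lookupT _ Tip = Nothing
lookupT k (Bin k' x l r)
  | k < k'    = lookupT k l
  | k > k'    = lookupT k r
  | otherwise = Just x

-- The right subtree is bottom, but the lookup goes left, so the
-- corresponding thunk (mapT f undefined) is never forced:
demo :: Maybe Int
demo = lookupT 1 (mapT (+1) (Bin 2 20 (Bin 1 10 Tip Tip) undefined))
-- demo evaluates to Just 11
```

With strict subtree fields, mapT would have diverged on the undefined
subtree before the lookup even started.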

To further motivate, let me explain my use case. I have cyclical data
structures (graphs, more or less) that I mutate frequently, so I store them
in a Map, indexed by some Ord thing, let's say Int, so I'd have something
like Map Int [Int] (but not that exactly, and nothing like Data.Graph). This
is great for mutations because I can use backtracking, but for lookups it's
a burden on both me and the CPU. So I memoize the thing into something like
"data Node a = Node a [Node a]". I can do this memoization using
Data.Map.mapWithKey, with the Nodes built therein referring back to the
produced Map. But then, what if I only crawl a small portion of this
cyclical network of Nodes? Why should I have to pay for the whole thing to
be rebuilt? That defeats the purpose of the memoization, which is to amortize
the cost of following edges in the mutable graph.
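For concreteness, here's a sketch of that memoization (the names memoize,
value, and neighbors are mine, not a proposed API). Note that the knot-tying
itself already works with today's Data.Map because the values are lazy; what
the proposal would save is rebuilding the O(n) spine of the result when only
a few nodes are ever crawled:

```haskell
import qualified Data.Map as M

data Node a = Node a [Node a]

value :: Node a -> a
value (Node x _) = x

neighbors :: Node a -> [Node a]
neighbors (Node _ ns) = ns

-- Tie the knot: every Node's successor list points back into the
-- memoized map itself, so following an edge is a pointer chase
-- rather than a fresh Map lookup.
memoize :: M.Map Int [Int] -> M.Map Int (Node Int)
memoize g = memo
  where
    memo           = M.mapWithKey mkNode g
    mkNode k succs = Node k (map (memo M.!) succs)
```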

The pro and con as I see it are:

Pro
- More flexible data structure

Con
- Code is more verbose (see Data.Tree.AVL)
- Only a few (but important) functions can be made lazy

To that last point: note that while mapWithKey can be made lazy for both Map
and IntMap, only IntMap allows a lazy filter and mapMaybe, because it doesn't
rebalance. But I'm wondering how much of the tree needs to be forced when
rebalancing; it should be only O(log n), right? It also matters where the
tree is sourced from, since the source needs to produce the tree lazily. The
regular definition of fromList (= foldr (uncurry insert) empty) admits no
laziness, but maybe successive unions could, if the sub-maps were nearly
disjoint (a not-uncommon case, I think). Does anyone know if any benchmarking
has been done to this end?
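For reference, the fromList definition quoted above, written out against
today's Data.Map (for duplicate-free input it agrees with M.fromList; with
duplicate keys, M.fromList keeps the last occurrence while this foldr keeps
the first). Every insert has to descend the accumulated spine to place its
key, which is why nothing is left unevaluated:

```haskell
import qualified Data.Map as M

-- The "regular" fromList from the text: each insert walks from the
-- root of the accumulator down to a leaf, forcing the spine as it goes.
fromList' :: Ord k => [(k, a)] -> M.Map k a
fromList' = foldr (uncurry M.insert) M.empty
```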

Finally, I'll stress once more that the semantics of the functions currently
exported would be unchanged. This would only allow new lazy versions, named
something like mapWithKeyL or unionL.

So what do you think? Too much for too little?

Scott