[Haskell-cafe] Very large data structures in the heap

Thu Feb 19 09:21:04 UTC 2015

Dr. Olaf Klinke wrote:
> I have a function that builds a (cyclic) data structure for searching 
> text. The resulting object can be serialized and re-built. Real-world 
> problems would, however, result in a tree with about 10^9 nodes. I take 
> it that standard data type declarations are not very memory-efficient, 
> so my (fully evaluated) search tree would not fit into memory as a heap 
> object. Its serialised form should be of tractable size, e.g. by storing 
> an array of labels and the tree structure as a bit array of parentheses.
> 
> I'd like to separate serialization (IO) from traversal of the stucture 
> (pure), but how can one avoid the entire heap object from being built? 
> Can lazyness be exploited?

As a rule of thumb, if you want tight control over memory usage, then 
you need to stick to an execution model that gives you that control -- 
so not lazy evaluation. It certainly is possible to reason about memory 
usage with lazy evaluation, but it's not straightforward.

Here's an idea that could work for you: The problem as you described it 
is that you want to use an algebraic data type

    data Tree a = Bin (Tree a) (Tree a) | Leaf a

but for reasons for space usage, you have to represent it as a 
compressed byte array, for instance like this

    type Tree = ByteString

The drawback of the latter representation is that you lose pattern 
matching, but there is actually a way to sort of avoid that. In 
particular, consider the following function for the algebraic data type:

    caseof :: (Tree a -> Tree a -> b) -> (a -> b) -> (Tree a -> b)
    caseof f g tree = case tree of
         Bin x y -> f x y
         Leaf a  -> g a

This is essentially a case-of expression, written as a function with two 
argument (the two cases). If you implement a function like this for your 
byte array `Tree`, then you can formulate your traversal in terms of 
this case-of expression/function, which handles all the decoding issues. 
Essentially, your `Tree` data type has to be a pointer to the tip of a 
subtree.

You can also augment the `caseof` function to be in the IO monad if you 
want. This way, you can avoid loading the tree in memory altogether. 
Again, the `caseof` function takes care of decoding the data format, 
while the traversal uses `caseof` to make decisions of which parts of 
the tree to inspect.

This trick is also known as "Scott encoding", or "Mogensen-Scott 
encoding". It is very similar to the more well-known "Church encoding", 
but they differ in the case of recursive data types.

Best regards,
Heinrich Apfelmus

--
http://apfelmus.nfshost.com