[Haskell-cafe] Very large data structures in the heap
Heinrich Apfelmus
apfelmus at quantentunnel.de
Thu Feb 19 09:21:04 UTC 2015
Dr. Olaf Klinke wrote:
> I have a function that builds a (cyclic) data structure for searching
> text. The resulting object can be serialized and re-built. Real-world
> problems would, however, result in a tree with about 10^9 nodes. I take
> it that standard data type declarations are not very memory-efficient,
> so my (fully evaluated) search tree would not fit into memory as a heap
> object. Its serialised form should be of tractable size, e.g. by storing
> an array of labels and the tree structure as a bit array of parentheses.
>
> I'd like to separate serialization (IO) from traversal of the stucture
> (pure), but how can one avoid the entire heap object from being built?
> Can lazyness be exploited?
As a rule of thumb, if you want tight control over memory usage, then
you need to stick to an execution model that gives you that control --
so not lazy evaluation. It certainly is possible to reason about memory
usage with lazy evaluation, but it's not straightforward.
Here's an idea that could work for you: The problem as you described it
is that you want to use an algebraic data type
data Tree a = Bin (Tree a) (Tree a) | Leaf a
but for reasons for space usage, you have to represent it as a
compressed byte array, for instance like this
type Tree = ByteString
The drawback of the latter representation is that you lose pattern
matching, but there is actually a way to sort of avoid that. In
particular, consider the following function for the algebraic data type:
caseof :: (Tree a -> Tree a -> b) -> (a -> b) -> (Tree a -> b)
caseof f g tree = case tree of
Bin x y -> f x y
Leaf a -> g a
This is essentially a case-of expression, written as a function with two
argument (the two cases). If you implement a function like this for your
byte array `Tree`, then you can formulate your traversal in terms of
this case-of expression/function, which handles all the decoding issues.
Essentially, your `Tree` data type has to be a pointer to the tip of a
subtree.
You can also augment the `caseof` function to be in the IO monad if you
want. This way, you can avoid loading the tree in memory altogether.
Again, the `caseof` function takes care of decoding the data format,
while the traversal uses `caseof` to make decisions of which parts of
the tree to inspect.
This trick is also known as "Scott encoding", or "Mogensen-Scott
encoding". It is very similar to the more well-known "Church encoding",
but they differ in the case of recursive data types.
Best regards,
Heinrich Apfelmus
--
http://apfelmus.nfshost.com
More information about the Haskell-Cafe
mailing list