Using associated data types to create unpacked data structures

Wed Aug 11 12:03:12 EDT 2010

Hi all,

Inspired by the generic maps example at

    http://www.haskell.org/haskellwiki/GHC/Indexed_types

I tried to use associated data types to create a generic finite map that
unpacks both the key and value into the leaf data constructor.

This makes functions such as lookup faster as the key can be accessed
directly instead than via an indirection. It also makes the data structure
more space efficient (4 words less per key/value pair for weight balanced
trees), which makes it possible to fit more data in main memory (and cache).
Memory overhead is important when working with "Big Data" processing, where
fitting as much data in memory as possible is important. Working with big
data sets is something done daily at companies like Google, Microsoft,
Yahoo, Twitter, Facebook, etc.

We can achieve the above goals using an associated data type like so:

    {-# LANGUAGE MultiParamTypeClasses, TypeFamilies #-}
    module Ex where

    class Unbox k v where
        data Map k v :: *
        empty       :: Map k v
        lookup      :: k -> Map k v -> Maybe v
        insert      :: k -> v -> Map k v -> Map k v

and an instance

    instance Unbox Int Double where
        data Map Int Double = TipIntDouble
                            | BinIntDouble {-# UNPACK #-} !Size
                                           {-# UNPACK #-} !Int
                                           {-# UNPACK #-} !Double
                                           !(Map Int Double)
                                           !(Map Int Double)

        -- implementation elided
        empty = undefined
        lookup k m = undefined
        insert k v m = undefined

    type Size = Int

However, if we try to apply this method to large programs we run into
problems: we need to defined instances for a large number of combinations of
keys/values. We could generate a large number of instances, hoping that
these will be enough for most users' needs, using Template Haskell or CPP.
However, the potential number of instances is very large, about a hundred if
you consider only Prelude types and tens of thousands if you include tuples.
We cannot add instances for types not defined in base without adding a
dependency on all libraries which data types we want to add instances for.

Since we cannot define all instances up-front we'll have to rely on the user
to create instances for the combinations she needs. This is tedious work for
the user; most of the time the instance is an exact copy of the above
instance for Int/Double, modulo renaming of the type arguments and the
constructor names.

Unfortunately our problems don't end here. If we assume for a second that
the user writes the necessary boilerplate (perhaps using a Template Haskell
function that generates it) there are still more problems ahead. It's quite
likely that two different libraries wants an instance for the same types,
and each declare one locally. However, now the poor user can't use both
libraries as there are conflicting instances (or can she using some
extension?) and imports always bring in instances. This problem exists for
type classes in general but we only use ten or so type classes in most
Haskell programs (e.g Functor, Monad, Eq, Ord, and Show) so it doesn't seem
to be a big problem so far.

What to do? It seems that associated data types might not be right tool for
this problem. Could it be extended to work well for this use case? Can it be
made to *scale* to large programs.

Here's an idea: allow default implementations of associated data types, just
like for methods

    class Unbox k v where
        data Map k v :: *
        empty       :: Map k v
        lookup      :: k -> Map k v -> Maybe v
        insert      :: k -> v -> Map k v -> Map k v

        data Map Int Double = Tip
                            | Bin {-# UNPACK #-} !Size
                                  {-# UNPACK #-} !k
                                  {-# UNPACK #-} !v
                                  !(Map k v)
                                  !(Map k v)

        -- implementation elided
        empty = undefined
        lookup k m = undefined
        insert k v m = undefined

and export the definition in the interface file. This would allow
instantiation of the type class without boilerplate

    instance Unbox Int Double  -- no "body"

The compiler would perhaps have to generate unique names for the
constructors for this to work.

This is not enough. We still have two problems left:

    * boilerplate instance declarations (but less so that before), and
    * instance collisions.

Could we automatic create instances whenever the user mentions a type class?
For example, if a program mentions

    f :: Map Int Double -> ...

we know we need an instance for Int/Double and if we can't find one we
derive one using the default definition. It is all the user needs to do when
using the classic containers package. This would completely remove the
boilerplate instance declarations.

We could still use OverlappingInstances to allow the user to provide more
specific instances (with a different implementation), if needed. This is
akin to C++ template specialization.

The compiler will need to help us with the instance collision problem in
that it only generates on instance even if the same type parameters are used
with the Map type in several different modules.

Summary: While associated data types in theory allows us to create more
efficient data structures, the feature doesn't seem to scale to large
programs, for this use case.

Cheers,
Johan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/glasgow-haskell-users/attachments/20100811/4d5212b0/attachment.html