[Haskell-cafe] Best bit LIST data structure

Mon Oct 10 10:59:17 CEST 2011

On Sun, Oct 9, 2011 at 12:11 PM, Roman Beslik <beroal at ukr.net> wrote:

> Yes, if you do not use high-level concepts and optimize everything by hand,
> it requires a lot of testing. :)
>

There are probably more constructive, jibe-free ways to frame this
suggestion...

Regarding testing:  my preference for using a preexisting solution is a
product of 18 years of programming in Scheme without a large base of shared
infrastructure -- I've seen way too much "roll your own X" leading to
trouble.

Regarding high-performance data-structures in Haskell: I wish high-level
concepts were sufficient for their optimization.  But if you look at all the
tricks played by, for example, Johan Tibell and Greg Collins in their
excellent hashmaps and hashtables libraries, that, alas, seems not to be the
case yet.  GHC is in a good position to do inlining and specialization
(making the world safe for type classes), but it can't add unpack and
strictness annotations, nor can it change data representations themselves.

For example, to answer Yves question:

I fail to understand. Why not just:
> > data BitList b = Nil | BitList Int b (BitList b)
> ??
>

That was a "data structure unrolling" to optimize the memory representation
in the common case (<64 bits).  Starting with:

 type I = Int64 -- or whatever
 data BitList = Nil | BL Int I BitList

The recursive datatype can be inlined (once):

 data BitList = Nil | BL  Int I (Nil | BL Int I BitList) *-- not real syntax
*
 data BitList = Nil | BL2 Int I Nil | BL3 Int I Int I BitList* -- distribute
*
 data BitList = Nil | BL2 Int I     | BL3 Int I Int I BitList *-- prune Nil*

This unrolled data structure has two advantages.  It can directly represent
the common case <64 bits with one object, and it can use half the tail
pointers for longer lists.  GHC could conceivably transform code
automatically to enable this unrolling (but it can't now).

However, there are some further observations that really require a human.
Because we are using that extra Int to track the bit position inside the "I"
the Nil case is redundant -- "BL2 0 0" can represent empty.  Further one of
the Ints in the BL3 case is always 64 (sizeof I) and needn't be
represented.  That gives us:

 data BitList = BL2 Int I | BL3 Int I I BitList *-- prune Nil*

Which is pretty much what I used.  Actually, I skipped the "double wide"
second case because I was really only worried about simplifying the
representation for shorter lists and that would indeed complicate the code.

> data *(Bits b) =>* BitList b

FYI, in the bit of code I sent I didn't generalize over the Bits class
because it's really an implementation detail what size chunk is used (just
like in Lazy ByteStrings).  I didn't want to pollute the interface.  That
said, the code should certainly be "CSE"d to make the "64/Int64" choice
swappable.

Best regards,
  -Ryan
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/haskell-cafe/attachments/20111010/2e474516/attachment.htm>