Feedback request: priority queues in containers

Louis Wasserman wasserman.louis at gmail.com
Tue Mar 16 20:17:26 EDT 2010


>
> I suspect the following might be faster:
>
> data BinomForest2 a = Empty
>                    | NonEmpty a [BinomTree2 a]
>
> data BinomTree2 a = BinomTree2 a [BinomTree2 a]
>
> This eliminates the "Skip" constructor, which contributes only to the
> nested type guarantee.
>

Ehehehe.  This is something I'm pretty proud of inventing, because your
implementation is actually significantly *slower*.  The reason is
essentially that I can do a lot more constructor unpacking when I have
access to that much type information about the structure of the tree at each
level.
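
For concreteness, the type-magical representation looks roughly like this -- simplified from what I actually wrote, so take the names as illustrative:

data BinomForest rk a
    = Nil
    | Skip (BinomForest (Succ rk) a)
    | Cons {-# UNPACK #-} !(BinomTree rk a) (BinomForest (Succ rk) a)
-- A rank-k tree is a root element plus a rank-k record of children.
data BinomTree rk a = BinomTree a (rk a)
-- Children of a rank-(k+1) tree: one rank-k tree plus rank-k children.
-- BinomTree is a single-constructor type, so it can be UNPACKed here.
data Succ rk a = Succ {-# UNPACK #-} !(BinomTree rk a) (rk a)
data Zero a = Zero
type BinomHeap = BinomForest Zero

Every subtree's rank is fixed by its position in the forest, so no rank fields are needed, and every child is a field whose type has exactly one constructor -- which is what lets GHC unpack so aggressively.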

You didn't quite implement it correctly, because among other things, we
*need* to track tree rank, at least for the roots, if it's not encoded in
the type.  Here are your data types, with everything unpacked as much as
possible.

data BinomQueue a
    = Nil
    | Cons {-# UNPACK #-} !Int {-# UNPACK #-} !(BinHeap a) (BinomQueue a)
    -- equivalent to [(Int, BinHeap a)], but unrolled;
    -- we only need to track ranks in the roots, and UNPACK inlines
    -- the rank and the Bin fields directly into the Cons cell
data BinomSubtree a
    = Nil'
    | Cons' {-# UNPACK #-} !(BinHeap a) (BinomSubtree a)
    -- equivalent to [BinHeap a], but unrolled
data BinHeap a = Bin a (BinomSubtree a)

I tested, and this implementation actually performs better if the spine is
maintained lazily, so that's the version we'll test.  The full implementation
of your approach (that is, the basic methods: insert, extract, fromList,
toAscList) is attached to the ticket here:
http://hackage.haskell.org/trac/ghc/attachment/ticket/3909/QuickBinom.hs
Feel free to ghc-core it, or tinker with the implementation, but I've done
a fair bit of tinkering, and my results haven't changed.

I ran a benchpress test on heapsorting 25000 Ints, calling performGC after
every run of each method.  SBinomial stands for "sparse binomial heap,"
which is your approach.
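
The harness has roughly this shape (a sketch only: I'm assuming benchpress exports Test.BenchPress.bench :: Int -> IO a -> IO (), and plain sort stands in for each queue's toAscList . fromList so the snippet is self-contained; the iteration count and input are illustrative):

import Control.Exception (evaluate)
import Data.List (sort)
import System.Mem (performGC)
import Test.BenchPress (bench)  -- assumed: bench :: Int -> IO a -> IO ()

-- Stand-in for the heapsort under test (really toAscList . fromList
-- for whichever queue implementation is being measured).
heapsort :: [Int] -> [Int]
heapsort = sort

main :: IO ()
main = bench 100 $ do
    performGC                     -- clean slate between runs
    evaluate (last (heapsort (reverse [1 .. 25000])))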

With -O2, +RTS -H128m:
              min    mean    +/-sd  median    max
Binomial:    0.000   3.440   2.204   4.000   8.001
SBinomial:  24.001  28.562   5.600  28.001  48.003  (ratio: 8.3x slower on average)

With -O2:
              min    mean    +/-sd  median    max
Binomial:    4.000   8.801   2.606   8.001  24.002
SBinomial:  32.002  41.763   8.007  42.003  64.004  (ratio: 4.7x slower on average)

Without optimization, with +RTS -H128m:
              min      mean    +/-sd    median     max
Binomial:     4.001   10.041    3.140    8.001   20.002
SBinomial:   64.004   76.805    8.790   76.005  100.006  (ratio: 7.6x slower on average)

Without optimization:
              min      mean    +/-sd    median     max
Binomial:    12.000   19.761    5.328   16.001   40.002
SBinomial:   72.004   90.126   11.906   88.006  120.008  (ratio: 4.6x slower on average)

These times are measured in milliseconds.  Conclusion: my implementation is
*massively faster*, by a factor of at least 4.5x.  (I spent the last half
hour trying to optimize SBinomial -- it really is that big a difference, and
it's not going to change.)

Here's why I think that's the case.  Even though my representation has the
Skip constructor, how many Skip constructors are there, total?  On average,
half the forest is made up of Skips, so there are only about (log n)/2 Skip
values in the entire queue -- around seven for n = 25000.

But the sparse binomial tree representation can't do anywhere near as much
unpacking; in particular, the set of children of each node (BinomSubtree) is
not a single-constructor type, so it can never be UNPACKed.  That's an O(n)
increase in allocation, all in exchange for an O(log n) shortening of the
spine -- a bad trade.  Most of the work is being done allocating nodes in the
*trees*, rather than along the spine among the roots, and there the crazy,
type-system-exploiting approach does much less allocation, because it can do
a lot more constructor unpacking.
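
To make that concrete, here is where the children of a single rank-3 tree live in each representation (a sketch of the heap structure, not an exact GHC layout):

-- Type-indexed: the children form the chain
--     Succ t2 (Succ t1 (Succ t0 Zero))
-- and each Succ UNPACKs the tree it carries, so a child's root element
-- and child record sit directly in the chain cell.
--
-- Sparse: the same children are the list
--     Cons' (Bin x2 cs2) (Cons' (Bin x1 cs1) (Cons' (Bin x0 cs0) Nil'))
-- Cons' can UNPACK the Bin it holds, but Bin's own children field is
-- the two-constructor BinomSubtree, which can never be unpacked, and
-- pulling a child back out of the list re-boxes a Bin cell.

Let's test this hypothesis: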

My type-magical implementation, -O2, +RTS -sstderr (no -H128m this time):

   3,050,272,052 bytes allocated in the heap
     240,340,552 bytes copied during GC
       1,087,992 bytes maximum residency (201 sample(s))
          53,136 bytes maximum slop
               3 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  5703 collections,     0 parallel,  0.48s,  0.53s elapsed
  Generation 1:   201 collections,     0 parallel,  0.22s,  0.24s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.52s  (  4.74s elapsed)
  GC    time    0.70s  (  0.77s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    5.23s  (  5.51s elapsed)

  %GC time      13.5%  (13.9% elapsed)

  Alloc rate    674,200,547 bytes per MUT second

  Productivity  86.5% of total user, 82.2% of total elapsed

The sparse binomial forest representation, same options:

   5,612,965,672 bytes allocated in the heap
     563,671,500 bytes copied during GC
       1,967,576 bytes maximum residency (202 sample(s))
         107,212 bytes maximum slop
               5 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0: 10602 collections,     0 parallel,  1.60s,  1.67s elapsed
  Generation 1:   202 collections,     0 parallel,  0.28s,  0.30s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.52s  (  6.51s elapsed)
  GC    time    1.87s  (  1.97s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    8.40s  (  8.48s elapsed)

  %GC time      22.3%  (23.2% elapsed)

  Alloc rate    860,302,018 bytes per MUT second

  Productivity  77.7% of total user, 76.9% of total elapsed

Furthermore, if we actually profile the sparse forest representation, we
confirm that a *huge* amount of allocation is being done during tree
melding.
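
(To reproduce that profile, build the attached file -- or a driver around it --
with profiling enabled and read the allocation column of the cost-centre
report; with the current GHC the invocation is something like

  ghc --make -O2 -prof -auto-all QuickBinom.hs
  ./QuickBinom +RTS -p

which writes QuickBinom.prof with time and allocation percentages per cost
centre.)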

Conclusion: the type-magical binomial heap implementation does a lot less
allocating, and is faster even ignoring GC time.  I encourage you to test it
out for yourself, but I think you'll find similar results.  =D

Louis