Feedback request: priority queues in containers
Louis Wasserman
wasserman.louis at gmail.com
Tue Mar 16 20:17:26 EDT 2010
>
> I suspect the following might be faster:
>
> data BinomForest2 a = Empty
>                      | NonEmpty a [BinomTree2 a]
>
> data BinomTree2 a = BinomTree2 a [BinomTree2 a]
>
> This eliminates the "Skip" constructor, which contributes only to the
> nested type guarantee.
>
Ehehehe. This is something I'm pretty proud of inventing, because your
implementation is actually significantly *slower*. The reason is
essentially that I can do a lot more constructor unpacking when I have
access to that much type information about the structure of the tree at each
level.
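For reference, my representation looks roughly like this (a sketch of its
shape; names and details may differ from the actual patch):

-- The rank of the trees at each level of the spine is encoded in the
-- type parameter rk, so it never has to be stored as a field.
data BinomForest rk a
    = Nil
    | Skip (BinomForest (Succ rk) a)
    | Cons {-# UNPACK #-} !(BinomTree rk a) (BinomForest (Succ rk) a)

-- A tree of rank rk: a root plus a child list of type rk a.
data BinomTree rk a = BinomTree a (rk a)

-- Succ rk is a single-constructor type, so the BinomTree field can be
-- unpacked straight into it: child trees never pay for an extra box.
data Succ rk a = Succ {-# UNPACK #-} !(BinomTree rk a) (rk a)

data Zero a = Zero

type BinomHeap a = BinomForest Zero a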
You didn't quite implement it correctly, because among other things, we
*need* to track tree rank, at least for the roots, if it's not encoded in
the type. Here are your data types, with everything unpacked as much as
possible.
data BinomQueue a
    = Nil
    | Cons {-# UNPACK #-} !Int {-# UNPACK #-} !(BinHeap a) (BinomQueue a)
    -- equivalent to [(Int, BinHeap a)], but unrolled;
    -- we only need to track ranks in the roots

data BinomSubtree a
    = Nil'
    | Cons' {-# UNPACK #-} !(BinHeap a) (BinomSubtree a)
    -- equivalent to [BinHeap a], but unrolled

data BinHeap a = Bin a (BinomSubtree a)
I tested, and this implementation actually performs better if the spine is
maintained lazily, so we'll test that version. The full implementation
(that is, the basic methods: insert, extract, fromList, toAscList) of your
approach was attached to the ticket here:
http://hackage.haskell.org/trac/ghc/attachment/ticket/3909/QuickBinom.hs
Feel free to ghc-core it, or tinker with the implementation, but I've done
a fair bit of tinkering, and my results haven't changed.
I ran a benchpress test heapsorting 25000 Ints, calling performGC after
every run of each method. SBinomial stands for "sparse binomial heap,"
i.e. your approach.
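For reference, the harness had roughly this shape (a sketch, not the exact
driver: Test.BenchPress.bench and System.Mem.performGC are real, but the
queue module name and the input generation are stand-ins). bench prints
exactly the min/mean/+-sd/median/max columns shown below.

import Data.IORef (newIORef, readIORef, writeIORef)
import System.Mem (performGC)
import Test.BenchPress (bench)

import qualified PQueue as Q  -- stand-in for the implementation under test

-- Heapsort via the queue's fromList/toAscList.
heapsort :: [Int] -> [Int]
heapsort = Q.toAscList . Q.fromList

main :: IO ()
main = do
    seedRef <- newIORef (1 :: Int)
    bench 100 $ do
        -- regenerate the input on each run so GHC can't float the sort
        -- out of the loop and share one result across all iterations
        seed <- readIORef seedRef
        writeIORef seedRef (seed + 1)
        let xs = take 25000 (iterate (\x -> (x * 16807) `mod` 2147483647) seed)
        sum (heapsort xs) `seq` performGC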
With -O2, +RTS -H128m:
                min     mean   +/-sd   median      max
  Binomial:   0.000    3.440   2.204    4.000    8.001
  SBinomial: 24.001   28.562   5.600   28.001   48.003   (ratio: 8.3x slower on average)

With -O2:
                min     mean   +/-sd   median      max
  Binomial:   4.000    8.801   2.606    8.001   24.002
  SBinomial: 32.002   41.763   8.007   42.003   64.004   (ratio: 4.7x slower on average)

Without optimization, with +RTS -H128m:
                min     mean   +/-sd   median      max
  Binomial:   4.001   10.041   3.140    8.001   20.002
  SBinomial: 64.004   76.805   8.790   76.005  100.006   (ratio: 7.6x slower on average)

Without optimization:
                min     mean   +/-sd   median      max
  Binomial:  12.000   19.761   5.328   16.001   40.002
  SBinomial: 72.004   90.126  11.906   88.006  120.008   (ratio: 4.6x slower on average)
These times are measured in milliseconds. Conclusion: my implementation is
*massively faster*, by a factor of at least 4.5x. (I spent the last half
hour trying to optimize SBinomial -- it really is that big a difference, and
it's not going to change.)
Here's why I think that's the case: even though my version has the Skip
constructor, how many Skip constructors are there, total? On average, half
the forest is made up of Skips, so there are only about (1/2) log n Skip
values in the whole queue. The sparse binomial representation, on the other
hand, can't do anywhere near as much unpacking; in particular, the set of
children of each node is not a single-constructor type, so GHC can't unpack
a tree into the cons cell that holds it. That's an O(n) increase in
allocation, all to shorten the spine by O(log n). That's why it's a bad
trade: most of the work is done allocating nodes in the *trees*, not along
the spine among the roots, and that is exactly where the crazy,
type-system-exploiting approach allocates much less, because it can do a
lot more constructor unpacking. Let's test this hypothesis:
My type-magical implementation, -O2, +RTS -sstderr (no -H128m this time):
  3,050,272,052 bytes allocated in the heap
    240,340,552 bytes copied during GC
      1,087,992 bytes maximum residency (201 sample(s))
         53,136 bytes maximum slop
              3 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:  5703 collections,     0 parallel,  0.48s,  0.53s elapsed
  Generation 1:   201 collections,     0 parallel,  0.22s,  0.24s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.52s  (  4.74s elapsed)
  GC    time    0.70s  (  0.77s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    5.23s  (  5.51s elapsed)

  %GC time      13.5%  (13.9% elapsed)

  Alloc rate    674,200,547 bytes per MUT second

  Productivity  86.5% of total user, 82.2% of total elapsed
The sparse binomial forest representation, same options:
  5,612,965,672 bytes allocated in the heap
    563,671,500 bytes copied during GC
      1,967,576 bytes maximum residency (202 sample(s))
        107,212 bytes maximum slop
              5 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0: 10602 collections,     0 parallel,  1.60s,  1.67s elapsed
  Generation 1:   202 collections,     0 parallel,  0.28s,  0.30s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.52s  (  6.51s elapsed)
  GC    time    1.87s  (  1.97s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    8.40s  (  8.48s elapsed)

  %GC time      22.3%  (23.2% elapsed)

  Alloc rate    860,302,018 bytes per MUT second

  Productivity  77.7% of total user, 76.9% of total elapsed
Furthermore, if we actually profile the sparse forest representation, we
confirm that a *huge* amount of allocation is being done during tree
melding.
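To see where that allocation comes from, here is what the rank-r link step
looks like on the sparse types above (a sketch of the obvious
implementation, not necessarily the exact code in QuickBinom.hs):

-- Link two rank-r trees into one rank-(r+1) tree: the loser becomes the
-- newest child of the winner. Every link allocates a fresh Cons' cell
-- *and* keeps the loser as a separately boxed Bin node.
link :: Ord a => BinHeap a -> BinHeap a -> BinHeap a
link t1@(Bin x1 ts1) t2@(Bin x2 ts2)
  | x1 <= x2  = Bin x1 (Cons' t2 ts1)
  | otherwise = Bin x2 (Cons' t1 ts2)

In the nested encoding, the corresponding step allocates a Succ cell
instead, but the losing tree is unpacked directly into it, so every one of
the many links performed while melding costs one heap object fewer.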
Conclusion: the type-magical binomial heap implementation does a lot less
allocating, and is faster even ignoring GC time. I encourage you to test it
out for yourself, but I think you'll find similar results. =D
Louis