DData

Adrian Hey ahey at iee.org
Thu May 20 17:01:42 EDT 2004


On Thursday 20 May 2004 11:32 am, Ross Paterson wrote:
> > Unfortunately, I think a large part of the time here is probably
> > accounted for by the time taken to generate the test data itself, if the
> > results on my machine are anything to go by. Using your test data
> > increases the execution time of the insertion test (for 100000 pairs)
> > from 0.39 seconds to 1.88 seconds for AVL trees.
>
> Building random data costs more, but random insertions are also more
> costly than ordered ones.  (With ordered insertions, the insertion path
> is much shorter than the average tree depth.)  Also, the difference
> varies between implementations.

I don't know if that's true for the trees used in FiniteMap, but in
AVL trees newly inserted elements always start out as leaves, so I would
guess that the insertion path is always about the same length as the average
tree depth, regardless of insertion order. In fact I find it hard to see how
else things could work, but without knowing the details of other
implementations I guess I can't say for sure.
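To make the "new elements always become leaves" point concrete, here is a
minimal height-annotated AVL sketch (purely illustrative, not the code from
any actual library): insert always recurses down to an empty position, so
the path walked is the full depth of the tree whatever order the keys
arrive in.

```haskell
-- Minimal AVL tree sketch. The Int field caches the subtree height.
data AVL a = E | N Int (AVL a) a (AVL a)

height :: AVL a -> Int
height E           = 0
height (N h _ _ _) = h

-- Smart constructor that recomputes the cached height.
node :: AVL a -> a -> AVL a -> AVL a
node l x r = N (1 + max (height l) (height r)) l x r

-- Insertion always descends to an E, so the new element is born a leaf.
insert :: Ord a => a -> AVL a -> AVL a
insert x E = N 1 E x E
insert x t@(N _ l y r)
  | x < y     = rebalance (node (insert x l) y r)
  | x > y     = rebalance (node l y (insert x r))
  | otherwise = t

-- Restore the AVL invariant (sibling heights differ by at most one)
-- with a single or double rotation.
rebalance :: AVL a -> AVL a
rebalance t@(N _ l x r)
  | height l > height r + 1 =
      case l of
        N _ ll _ lr | height ll >= height lr -> rotR t
        _                                    -> rotR (node (rotL l) x r)
  | height r > height l + 1 =
      case r of
        N _ rl _ rr | height rr >= height rl -> rotL t
        _                                    -> rotL (node l x (rotR r))
  | otherwise = t
rebalance t = t

rotR, rotL :: AVL a -> AVL a
rotR (N _ (N _ ll lx lr) x r) = node ll lx (node lr x r)
rotR t                        = t
rotL (N _ l x (N _ rl rx rr)) = node (node l x rl) rx rr
rotL t                        = t

-- In-order traversal, handy for checking the tree stays sorted.
toList :: AVL a -> [a]
toList E           = []
toList (N _ l x r) = toList l ++ [x] ++ toList r
```

Note that the rebalancing after the recursive call only rearranges nodes on
the insertion path itself, which is why the work done per insertion is
proportional to the depth reached.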

One possible alternative explanation that occurs to me is that ordered
insertions give better cache behaviour. Typically the insertion path will
change little from one insertion to the next, so most of the nodes concerned
will likely be resident in cache already, as they were just created by the
previous insertion. So maybe the apparent difference between AVL trees and
FiniteMap in this case is really due almost entirely to differing heap
burn rate, rather than differences in the notional efficiency of the
rebalancing algorithm.

It would be good to try to establish a set of representative benchmarks
for the various tree types though. Trouble is, with a garbage-collected
language and modern cached CPUs it's hard to be sure that benchmarks are
representative of the true costs in real programs. It's also hard to be
sure that the measured time for random data generation in isolation is
representative of the cost where it runs in the presence of other
cache disrupting code, for example. Every time I've tried to normalise
my results this way I get answers that don't seem to make much sense :-(
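For what it's worth, one way to keep the key-generation cost out of the
measurement is to force the key list fully before starting the clock. The
sketch below is hypothetical (the names, the LCG constants, and the use of
Data.Map as a stand-in for FiniteMap are all my own inventions for
illustration), and as noted above the numbers it produces should be taken
with a grain of salt:

```haskell
import qualified Data.Map as M
import Data.List (foldl')
import System.CPUTime (getCPUTime)

-- A cheap, reproducible "random-ish" key stream (a simple LCG),
-- so no random-number library is needed and the stream can be
-- forced in advance of timing.
lcgStep :: Int -> Int
lcgStep x = (1103515245 * x + 12345) `mod` 2147483648

-- Build a map by repeated insertion, strictly from the left.
build :: [Int] -> M.Map Int ()
build = foldl' (\m k -> M.insert k () m) M.empty

-- Time only the build: force the keys first, then force the map's
-- spine (via size) before reading the clock again.
timeBuild :: String -> [Int] -> IO ()
timeBuild label ks = do
  sum ks `seq` return ()
  t0 <- getCPUTime
  let m = build ks
  M.size m `seq` return ()
  t1 <- getCPUTime
  putStrLn (label ++ ": "
            ++ show (fromIntegral (t1 - t0) / 1e12 :: Double) ++ "s")
```

A driver would then compare, say, `timeBuild "ordered" [1..100000]` against
`timeBuild "randomish" (take 100000 (iterate lcgStep 42))`. Of course this
still says nothing about cache disruption from surrounding code, which is
exactly the problem.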

Regards
--
Adrian Hey 



More information about the Libraries mailing list