mergesort. Reply

Ketil Z. Malde ketil@ii.uib.no
28 Jun 2002 09:44:11 +0200


"Serge D. Mechveliani" <mechvel@botik.ru> writes:

> But        sortBy' (compare) [1 .. n]

> costs too much, even for  n = 11000.
> It costs (on worst data) many times more than  mergeSort.    

Yes, but why do you want to sort sorted data?  

I think the multiple value cost, i.e. that

        sortBy (compare) (take n $ repeat 1)

is equally expensive, is a bigger problem.  Okay, probably because it
caught me unawares, but also because the mantra is "quicksort is
pessimal on sorted data".

But if mergesort (or heapsort for that matter) can be made to behave
nicely, I think that's a good alternative.  I haven't run numbers, but
I was under the impression that mergesort was quite a bit slower than
quicksort in the expected case.  I, for one, am sorting expected, not
worst-case, data :-)

<gripe>
What's this obsession with worst-case behaviour anyway?  While I have
linear-time algorithms I could use, I'm using one that's linear
expected, quadratic worst-case -- but with better cache behaviour. And
why not?  There are O(2^n) possible inputs, who cares about the almost
none that are pessimal?

And that's cache as in the six-orders-of-magnitude access time
difference between RAM vs. disk, not the relatively low cost of L2
cache misses. 
</>

One solution I've seen suggested, is to use quicksort to a depth of
c log n (for some c), and fall back to mergesort thereafter.  Or to
pick a random pivot, rather than the first element.  

BTW, I'm fully in favor of keeping an insertion (or other) sort around
that behaves nicely for sorted/almost sorted data, as a separate
function available for the discriminating programmer.

Okay, that was kind of rambling, to sum up:

1. The default sorting should, in order of approximate preference
        1. scale well (probably means O(n log n))
        2. scale beyond available RAM
        3. be fast (i.e. have low constant overhead)
        ?. be stable (always a nice property)

2. Other sorts should be provided for special cases that the
programmer might know about, and where a different algorithm could be
a win, possibly:
        - short sequences (bubble?)
        - sequences of sequences (radix?)
        - almost sorted/reverse sorted sequences (bubble/insertion?)
        - limited range (bucket?)

-kzm
-- 
If I haven't seen further, it is by standing in the footprints of giants