mergesort

Ketil Z. Malde ketil@ii.uib.no
01 Jul 2002 09:04:37 +0200


Andrew J Bromage <andrew@bromage.org> writes:

>> I, for one, am sorting expected, not worst-case, data :-)
>> What's this obsession with worst-case behaviour anyway?

> The best algorithm to use is the one which exploits known facts about
> the data.  The converse is: The worst algorithm to use is the one whose
> worst case behaviour happens on the sort of data you want to feed it.

Okay.

(As an aside, and for Num type classes: has anybody tried calculating
the average and using that as a pivot?  I mean, I know we really want
the median, but the average is at least readily available.)
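
For what it's worth, a rough sketch of what I mean (made up on the
spot for this mail, names and all, and not tested seriously) might
look like the code below.  Note it needs Fractional rather than plain
Num, since computing the mean involves a division, and it falls back
to an ordinary sort when the partition makes no progress, e.g. when
every element equals the mean:

    import Data.List (partition, sort)

    -- Quicksort-style sort that pivots on the arithmetic mean of the
    -- list instead of on an element.  Sketch only; meanSort is a
    -- made-up name.
    meanSort :: (Ord a, Fractional a) => [a] -> [a]
    meanSort xs
      | length xs < 2               = xs
      | null smaller || null larger = sort xs    -- no progress; give up
      | otherwise                   = meanSort smaller ++ meanSort larger
      where
        m                 = sum xs / fromIntegral (length xs)
        (smaller, larger) = partition (<= m) xs

    -- e.g. meanSort [3.0, 1.0, 2.0, 2.0]  ==>  [1.0, 2.0, 2.0, 3.0]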

> So if, for example, you know that your data will cause expected-case
> behaviour on average, more power to you.

Isn't that what "expected-case" means? :-)

I was going to run through the statistics to work out the expected
running time of quicksort (i.e. its running time when sorting random
data); I wrestled a bit with it, and while I'm fairly sure it's
pretty close to n log n, the proof is at the moment left as an
exercise for the reader...
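
For the record, the standard recurrence for the expected number of
comparisons on n distinct keys in random order is

        C(n) = (n - 1) + (2/n) * sum [ C(k) | k <- [0 .. n-1] ]

which works out to 2*(n+1)*H_n - 4*n, i.e. roughly 2*n*ln n.  The
definition below (my own quick illustration, nothing rigorous) just
evaluates the recurrence, so you can check it against 2*n*ln n:

    import Data.Array

    -- Expected number of comparisons quicksort makes on a random
    -- permutation of n distinct keys, memoised in an array.
    expectedComparisons :: Int -> Double
    expectedComparisons n = table ! n
      where
        table = listArray (0, n) (map c [0 .. n])
        c k | k < 2     = 0
            | otherwise = fromIntegral (k - 1)
                          + 2 / fromIntegral k
                            * sum [ table ! j | j <- [0 .. k - 1] ]

    -- e.g. expectedComparisons 1000 is roughly 10986, against
    --      2 * 1000 * log 1000, which is roughly 13816.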

No matter, let's look at string searching: in theory, Boyer-Moore or
Knuth-Morris-Pratt is faster than a naive method, i.e. just using
something like

        find pat = or . map (pat `isPrefixOf`) . tails
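
or, spelled out with the imports it needs and a type signature, the
same one-liner made self-contained:

    import Data.List (isPrefixOf, tails)

    -- Does pat occur anywhere in the text?  Try it as a prefix of
    -- every suffix; essentially what later versions of Data.List
    -- call isInfixOf.
    find :: Eq a => [a] -> [a] -> Bool
    find pat = or . map (pat `isPrefixOf`) . tails

    -- e.g. find "ell" "hello"  ==  True
    --      find "elx" "hello"  ==  False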

This is worst-case O(n*m), since you may have to traverse the whole
pattern each time before deciding it's not a prefix of a tail, and
moving on.  The competition is O(n+m) which sounds a lot better.

But for random data, the chance of a match on the first character is
one in the alphabet size, the chance of two matches in a row is one
in alpsz², and so on.  Most of the time we only traverse one or two
places before finding a mismatch, and since the algorithm is so much
simpler than the alternatives, it's generally faster.  (I implemented
KMP and measured it, and I think you need *highly* repetitive data
for it to be worth the trouble.)
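
To put a rough number on "one or two places": with independent,
uniformly random characters over an alphabet of size a and a pattern
of length m, the expected number of character comparisons per
alignment is sum [ (1/a)^k | k <- [0 .. m-1] ], which is at most
a/(a-1).  A throwaway definition (mine, and valid only under exactly
those uniform-random assumptions) to play with:

    -- Expected number of character comparisons per alignment of the
    -- naive search, for alphabet size a and pattern length m,
    -- assuming uniform, independent random text and pattern.
    expectedProbes :: Double -> Int -> Double
    expectedProbes a m = sum [ (1 / a) ^ k | k <- [0 .. m - 1] ]

    -- expectedProbes 26 20  is about 1.04   (English-ish alphabet)
    -- expectedProbes  4 20  is about 1.33   (DNA)
    -- expectedProbes  2 20  is about 2.0    (a binary alphabet)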

-kzm
--
If I haven't seen further, it is by standing in the footprints of giants