# mergesort.

**Ketil Z. Malde**

ketil@ii.uib.no

*01 Jul 2002 09:04:37 +0200*

Andrew J Bromage <andrew@bromage.org> writes:
>> I, for one, am sorting expected, not worst-case, data :-)
>> What's this obsession with worst-case behaviour anyway?

> The best algorithm to use is the one which exploits known facts about
> the data. The converse is: The worst algorithm to use is the one whose
> worst case behaviour happens on the sort of data you want to feed it.

Okay.
(As an aside, and for Num type classes: has anybody tried calculating
the average and using that as a pivot? I mean, I know we really want
the median, but the average is at least available?)
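
A minimal sketch of that idea, assuming a Fractional constraint so the mean is computable; this is not from the original message, just an illustration:

```haskell
-- Quicksort with the arithmetic mean as pivot (sketch, not from the post).
-- Needs Fractional as well as Ord, since we have to compute an average.
import Data.List (partition)

meanSort :: (Ord a, Fractional a) => [a] -> [a]
meanSort xs
  | length xs < 2       = xs
  | all (== head xs) xs = xs          -- all elements equal: already sorted
  | otherwise           = meanSort small ++ meanSort large
  where
    pivot          = sum xs / fromIntegral (length xs)  -- the average
    (small, large) = partition (<= pivot) xs
```

When the elements are not all equal, the mean lies strictly below the maximum and at or above the minimum, so both halves are non-empty and the recursion shrinks; the all-equal guard covers the remaining case.
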
> So if, for example, you know that your data will cause expected-case
> behaviour on average, more power to you.

Isn't that what "expected-case" means? :-)
I was going to run through the statistics to work out the expected
running time of a quicksort (i.e. sorting random data); I wrestled a
bit with it, and while I'm fairly sure it's pretty close to n log n,
the proof is at the moment left as an exercise for the reader...
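
For reference, the standard average-case argument (not worked out in the original message) goes roughly like this: with a uniformly random pivot, the expected number of comparisons satisfies

$$T(n) = (n - 1) + \frac{2}{n} \sum_{k=0}^{n-1} T(k),$$

which solves to $T(n) \approx 2 n \ln n$, i.e. $O(n \log n)$.
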
No matter; let's look at string searching instead. In theory, Boyer-Moore
or Knuth-Morris-Pratt is faster than a naive method, i.e. just using
something like

    import Data.List (isPrefixOf, tails)

    find pat = or . map (pat `isPrefixOf`) . tails

This is worst-case O(n*m), since you may have to traverse the whole
pattern each time before deciding it's not a prefix of a tail and moving
on. The competition is O(n+m), which sounds a lot better.
But for random data, the chance of a match on the first character is one
in the alphabet size, the chance of matching two characters is one in the
alphabet size squared, and so on. Most of the time we only traverse one or
two places before finding a mismatch, and since the algorithm is so much
simpler than the alternatives, it's generally faster (I implemented KMP
and measured, and I think you need *highly* repetitive data for it to be
worth the trouble).
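
To make the "one or two places per position" claim concrete, here is a small instrumentation sketch (an illustration, not code from the original post) that counts the character comparisons the naive search performs:

```haskell
-- Count the character comparisons made by the naive search (sketch).
-- At each starting position we compare until the first mismatch, or
-- the whole pattern on a hit.
import Data.List (tails)

charCmps :: Eq a => [a] -> [a] -> Int
charCmps pat text = sum (map (go pat) (tails text))
  where
    go []     _  = 0              -- whole pattern matched: no more work here
    go _      [] = 0              -- ran out of text: nothing left to compare
    go (p:ps) (c:cs)
      | p == c    = 1 + go ps cs  -- characters agree, compare the next pair
      | otherwise = 1             -- first mismatch: give up at this position
```

On random text over an alphabet of size s, each starting position costs roughly s/(s-1) comparisons on average, which is why the simple method holds up so well in practice.
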
-kzm
--
If I haven't seen further, it is by standing in the footprints of giants