GHC Data.List.sort performance question

Tue Jan 15 11:16:23 EST 2008

Marcus D. Gabriel wrote:
> By a rather indirect route, I discovered that I obtain an almost
> factor of two improvement in performance in Data.List.sort if I make
> one small change in the implementation of the function merge which
> supports mergesort and hence sortBy and sort.  Admittedly, the
> improvement was only noticeable to me when sorting for example one
> million random Int.  The current code in libraries/base/Data/List.hs
> for merge is
>
> merge :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
> merge cmp xs [] = xs
> merge cmp [] ys = ys
[snip]
>
> and all that I did was
>
> merge :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
> merge cmp [] ys = ys
> merge cmp xs [] = xs
[snip]
>
> that is, I swapped the order of the first two lines.  By the way, the
> second version is much less likely to overflow the stack than the
> first version.
>
> Can some confirm this?  If you confirm it, can someone explain to me
> why one obtains this performance improvement?  I currently just do not
> grasp the point.

I'm not sure why there is a performance difference, but there is
one major difference in the behaviour of these two implementations:

The library version evaluates the list from right to left (i.e. it
compares the last two elements of the list first), because the third
argument of 'merge' is forced before the second one.

Swapping those two lines of 'merge' results in a version which compares
the first two elements of the list first.

This means that the library version produces a stack overflow on
lists generated in an iterate like fashion (say, take 1000000 [0..]).
The modified version produces a stack overflow on the reverse of
that list, but I believe such lists are much rarer in practice.

Did you look at the GC stats for the two sort implementations?

Bertram