[Haskell-beginners] sorting almost sorted list

Wed Sep 28 00:00:06 CEST 2011

On Tuesday 27 September 2011, 04:46:29, Dennis Raddle wrote:
> I have a problem from music. I have a list of notes sorted by begin
> time. I want to sort them by end time.
> 
> Notes that are sorted by begin time are usually close to sorted by end
> time, because notes tend to cluster within a small range of durations.
> 
> What algorithms are available to me (hopefully in a library) that are
> good with this kind of thing?

Data.List.sort/sortBy should be fine most of the time. Most algorithms that 
perform significantly better in some situations are rather specific and not 
available in libraries.  Nobody will implement them until they're needed, 
and then the implementor may think that it's not worth publishing as a 
library as it's too specialised. Thus I doubt you'll find a library 
function adapted to your special needs.

Leaving that aside, which algorithms are best depends on various things.
The most important are, I think, what kind of almost-sortedness you have 
and what your space and laziness requirements are.
Most sorting algorithms require O(n) space and can't lazily produce the 
output[1], but in your situation, that is possible (assuming the data is 
sufficiently nice, if the last note to begin is the first to end, you need 
O(n) space and can't produce incremental output).

Regarding the kind of almost-sortedness, if you have long monotonic runs 
with few out-of-place elements in between, like

[2 .. 1000] ++ [1] ++ [1002 .. 5000] ++ [1001] ++ [5001 .. 10000],

Data.List.sort[By] will be quite good, less so if you have some jittering 
superimposed on a monotonic list, like

concat [[n+1,n] | n <- [0, 2 .. 10000]].

I suppose your situation is more like the second.

Then an insertion sort usually does rather well, it can often outperform 
algorithms with lower (worst case) complexity. (If the average displacement 
needed for sorting is d, insertion sort takes O(n*(d+1)) time; if d is much 
smaller than log n, insertion sort is very good.)
Like Data.List's mergesort, insertion sort is generic, needs O(n) space, 
and can't produce incremental output. It is easy to implement,

inSortBy cmp = foldr ins []
  where
    ins x [] = [x]
    ins x (y:ys) = case cmp x y of
                     GT -> y : ins x ys
                     _ -> x : y : ys

but all is not roses; for long input lists, that builds a large thunk which 
may blow your stack. Insertion sort works best on mutable arrays where you 
don't have the problem of building large thunks. However, it also works 
well on short enough lists. Benchmarking with Yitz's list generator 
(version 2), it can be more than twice as fast as Yitz's sortAlmostBy when 
the lists are at most a few thousand elements long and the average 
displacement is small, but it is much slower if the lists are a few ten-
thousand elements long or the average displacement is large.
Better at dealing with longer lists and larger displacements is the left-
fold version of insertion sort,

linSortBy cmp = reverse . foldl' ins []
  where
    ins [] a = [a]
    ins l@(b:bs) a =
      case cmp a b of
        LT -> case ins bs a of
                k@(_:_) -> b : k
                [] -> error "impossible"
        _  -> a : l

(note: This is adapted to the case of an almost sorted list, for an almost 
reverse-sorted list, you'd use the obvious modification of ins and not need 
the reverse at the end, giving better performance. For a more or less 
random list, no version of insertion sort does well.)
The strictness is essential, using foldl instead of foldl' leads to worse 
performance than the right fold gives, making the insertion ins less strict 
isn't nearly as bad, but still hurts - a bit if the displacements are 
small, more if they're large.
This version of insertion sort can keep up with sortAlmostBy much longer, 
but it suffers from the same problem as the right-fold insertion sort, only 
less.

Moving to specifically tailored sorting algorithms, we have Yitz's nice 
sortAlmostBy. It requires that you know a bound for the number of notes 
beginning not before but ending before any given note, but if you do, it 
doesn't suffer too badly if you overestimate (as long as your estimate is 
still small relative to the length of the list). Of course, if you 
underestimate, it'll probably produce wrong results.

Then, more specifically tailored to the problem, the algorithm posted by 
David Fletcher earlier, taking advantage of the fact that you know the 
starting times of the notes and that a note can't end before it began (and 
that the list is sorted by starting time).
My (more generic) implementation of it:

sortABy :: (a -> a -> Ordering) -> (a -> a -> Ordering) -> [a] -> [a]
sortABy _ _ [] = []
sortABy cmp1 cmp2 (x:xs) = go [] x xs
  where
    ins a [] = [a]
    ins a l@(b:bs) = case cmp1 a b of
                       LT -> a : l
                       _ -> case ins a bs of
                              k@(_:_) -> b : k
                              [] -> error "oops"
    go !store y [] = y : store
    go store y zzs@(z:zs) =
        case cmp2 y z of
          GT -> case cmp1 y z of
                  GT -> go (ins y store) z zs
                  _  -> go (ins z store) y zs
          _  -> y : case store of
                      (u:us) -> go us u zzs
                      [] -> go [] z zs

The first comparison function compares the end times, the second compares 
the end time of the first argument to the starting time of the second.
As for the left-fold insertion sort, making the insertion stricter gains a 
bit of speed (or more than a bit for larger displacements).
It has the advantage over sortAlmostBy that you don't need to know a bound,
and under favourable circumstances (if there never are many notes beginning 
during any given note's lifetime), can be much faster (much meaning 
something like a factor of 1.5 or so).
However, it doesn't take larger displacements well, and in extreme cases 
degrades to a badly adapted insertion sort.
You could prevent that by using a good heap/priority queue for the store 
instead of the list. That would make it slower in the good cases, but 
guarantee better worst-case behaviour.

Both specialised algorithms can produce incremental output and run in 
constant space if the preconditions are satisfied. Which is better depends 
on the nature of your data.

Cheers,
Daniel

[1] Generic sorting algorithms can at best produce incremental output after 
the entire input has been partially processed, since the minimum could 
occur at any position. Thus they necessarily use at least O(n) space. 
[ignoring obscenities like

sort [] = []
sort xs = let (x,ct) = findMinCount xs
          in replicate ct x ++ sort' x (recalculate xs)

findMinCount (x:xs) = go 1 x xs
  where
    go k x [] = (x,k)
    go k x (y:ys)
      | y < x = go 1 y ys
      | y == x = go (k+1) x ys
      | otherwise = go k x ys

sort' x xs =
  case dropWhile (<= x) xs of
    [] -> []
    (y:ys) -> let (z,ct) = findMinCount1 x 1 y ys
              in replicate ct z ++ sort' z (recalculate xs)

findMinCount1 x k y [] = (y,k)
findMinCount1 x k y (z:zs)
  | z <= x = findMinCount1 x k y zs
  | z < y  = findMinCount1 x 1 z zs
  | z == y = findMinCount1 x (k+1) y zs
  | otherwise = findMinCount1 x k y zs

which relies on (==) identifying only truly indistinguishable values and a 
hypothetical 'recalculate' which recalculates the list lazily in O(1) 
space. But it was fun to come up with.]