[Haskell-beginners] sorting almost sorted list
Daniel Fischer
daniel.is.fischer at googlemail.com
Tue Sep 27 16:04:08 CEST 2011
On Tuesday 27 September 2011, 11:32:35, Yitzchak Gale wrote:
> I wrote:
> > Can you guarantee for some value of m that for each note N,
> > only the first m notes following N might end earlier than N?
> > If so, then the following simple algorithm is linear
> > and runs in constant space. You could then use:
> > sortAlmostBy m (comparing endTime)
> > ...You might be able to do a little better than this.
>
> After running some tests, it seems empirically that
> you can use m `div` 2 instead of m for lists of even
> length, which gives you a nice speedup for larger
> values of m. For lists of prime length, the best
> you can do seems to be m-1. It's more complicated
> for lists of non-prime odd length.
>
> I can't immediately see why any of that is true.
>
> Daniel to the rescue perhaps?
I have to confess, I'm not even sure what you mean.
Okay, m is the maximal number of following notes that may end earlier than
the current one. But to which list(s) does the length refer?
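(For reference, since sortAlmostBy itself isn't shown anywhere in the
thread: a minimal sketch of the kind of function the discussion assumes.
It splits the input into chunks of length m, sorts each chunk, and merges
neighbouring chunks in one linear pass; all names and details below are
my reconstruction, not Yitzchak's actual code.)

import Data.List (sortBy)

-- Sketch only; assumes m >= 1 and that for every element, only the
-- m elements following it may have to end up before it.
sortAlmostBy :: Int -> (a -> a -> Ordering) -> [a] -> [a]
sortAlmostBy m cmp = go . map (sortBy cmp) . chunksOf m
  where
    go []       = []
    go (c : cs) = step c cs

    -- 'carry' is sorted and at most m long; under the precondition,
    -- the first m elements of each merge are already in their final
    -- positions and can be emitted.
    step carry []       = carry
    step carry (c : cs) =
        let (done, rest) = splitAt m (merge carry c)
        in  done ++ step rest cs

    merge xs [] = xs
    merge [] ys = ys
    merge xs@(x : xt) ys@(y : yt)
        | cmp y x == LT = y : merge xs yt
        | otherwise     = x : merge xt ys

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = c : chunksOf n rest
  where (c, rest) = splitAt n xs

(Emitting the first m elements of each merge is safe because, by the
precondition, everything in chunk k+2 and later is >= everything in
chunks up to k, and hence >= the m smallest elements of each merge.)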
>
> test n k = fmap (all isSorted . map (sortAlmostBy n compare)) $ rands k
With what parameters did you call test?
Prelude Almost System.Random> test 12 16
True
(0.12 secs, 178509776 bytes)
Prelude Almost System.Random> test 12 17
False
(0.00 secs, 592904 bytes)
The only big variation in runtimes I find is due to a violation of the
preconditions of sortAlmostBy, in which case 'all' short-circuits.
(But most of the runtime is spent generating the lists. If I completely
evaluate such a list of lists before sorting, then m has a measurable
impact on the time even when 'all' doesn't short-circuit; in that
situation the runtime is basically monotonic in m, as expected.)
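(Concretely, one way to take list generation out of the measurement,
using the deepseq package; this phrasing of the experiment is mine:)

import Control.DeepSeq (force)
import Control.Exception (evaluate)

-- Generate the test lists and evaluate them completely, so that a
-- subsequent timing of the sort doesn't also pay for generation.
forcedRands :: Int -> IO [[Int]]
forcedRands d = evaluate . force =<< rands d

-- ghci> :set +s
-- ghci> xss <- forcedRands 16                            -- generation paid here
-- ghci> all isSorted (map (sortAlmostBy 12 compare) xss) -- sort timed here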
>
> isSorted xs = and . zipWith (<=) xs $ drop 1 xs
>
> rands d = fmap (take 100 . map (concat . zipWith (map . (+)) [0,d ..]
> . chunksOf d) . chunksOf 1000 . randomRs (0, d-1)) newStdGen
You generate 100 lists of length 1000, such that the entries at indices i
with k*d <= i < (k+1)*d lie between k*d (inclusive) and (k+1)*d (exclusive);
with d = 4, say, a list begins with four entries drawn from [0,4), then
four from [4,8), and so on.
Now, if you split those lists into chunks of length m, the preconditions of
sortAlmostBy are only guaranteed to hold if each of the d-long chunks from
the generation meets at most two m-long chunks.
That is the case if (and, provided the entire list is long enough, only if)
m + gcd m d >= d
If m + gcd m d < d, you will sooner or later encounter a d-chunk meeting
three (or more) m-chunks. The probability that the first element of that
d-chunk is larger than the last is nearly 1/2 [exactly (d-1)/(2d)], so it
doesn't take many such situations before you get a list that
sortAlmostBy m leaves unsorted.
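(A quick sanity check of that condition against the transcript above;
throwaway code of mine:)

precondHolds :: Int -> Int -> Bool
precondHolds m d = m + gcd m d >= d

-- precondHolds 12 16 == True    (test 12 16 gave True)
-- precondHolds 12 17 == False   (test 12 17 gave False)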
So, if running time is monotonic in m, then under the assumption that the
preconditions of sortAlmostBy hold, the optimal choice of m is
min { m \in N : m + gcd m d >= d }
Since gcd m d divides both m and d, it also divides d - m; hence for m < d
the condition holds exactly when d - m divides d. For even d, the minimum
is d/2; for prime d, it is d-1; in general, it is (1 - 1/p)*d, where p is
the smallest prime factor of d.
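(Both claims are easy to check mechanically; again my own throwaway code:)

-- Smallest m satisfying the condition, by search and in closed form
-- (closedForm assumes d >= 2).
optimalM :: Int -> Int
optimalM d = head [m | m <- [1 ..], m + gcd m d >= d]

closedForm :: Int -> Int
closedForm d = d - d `div` p          -- i.e. (1 - 1/p) * d
  where p = head [q | q <- [2 ..], d `mod` q == 0]

-- optimalM 16 == 8, optimalM 17 == 16, optimalM 15 == 10,
-- and optimalM d == closedForm d for every d >= 2.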
I haven't analysed your merging function, but at first glance, merging c
chunks of length m each is O(c*m); so, since sorting each chunk is
O(m*log m), the overall cost is
O(c*m*log m) = O(n*log m) = O(n*log (n/c))
It has a larger overhead than sortBy, so if n is small or m is large,
sortBy is likely to be better; but if n is large and m is relatively
small, sortAlmostBy can be significantly faster (though not dramatically,
excluding extreme cases where you can do much better still).
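(As a plausibility check of the sketch given earlier, with m = 1, i.e.
only adjacent elements out of order; my own throwaway code:)

import Data.List (sort)

-- With only adjacent swaps the sketch must agree with a full sort.
checkAdjacent :: Bool
checkAdjacent =
    let xs = concat [[i + 1, i] | i <- [0, 2 .. 99998]] :: [Int]
    in  sortAlmostBy 1 compare xs == sort xs    -- True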