# [Haskell-beginners] sorting almost sorted list

Tue Sep 27 16:04:08 CEST 2011

On Tuesday 27 September 2011, 11:32:35, Yitzchak Gale wrote:
> I wrote:
> > Can you guarantee for some value of m that for each note N,
> > only the first m notes following N might end earlier than N?
> > If so, then the following simple algorithm is linear
> > and runs in constant space. You could then use:
> > sortAlmostBy m (comparing endTime)
> > ...You might be able to do a little better than this.
>
> After running some tests, it seems empirically that
> you can use m div 2 instead of m for lists of even
> length, which gives you a nice speedup for larger
> values of m. For lists of prime length, the best
> you can do seems to be m-1. It's more complicated
> for lists of non-prime odd length.
>
> I can't immediately see why any of that is true.
>
> Daniel to the rescue perhaps?

I have to confess, I'm not even sure what you mean.
Okay, m is the maximal number of following notes that may end earlier than
the current one. But to which list(s) does the length refer?

>
> test n k = fmap (all isSorted . map (sortAlmostBy n compare)) $ rands k

With what parameters did you call test?

Prelude Almost System.Random> test 12 16
True
(0.12 secs, 178509776 bytes)
Prelude Almost System.Random> test 12 17
False
(0.00 secs, 592904 bytes)

The only big variation in runtimes I find is due to a violation of the
preconditions of sortAlmostBy, when all shortcuts. (But most of the
runtime is used for generating the lists. If I completely evaluate such a
list of lists prior to sorting, then m has a measurable impact on time even
when all doesn't shortcut; however, in that situation it's basically
monotonic in m, as expected.)

> isSorted xs = and . zipWith (<=) xs $ drop 1 xs
>
> rands d = fmap (take 100 . map (concat . zipWith (map . (+)) [0,d ..]
> . chunksOf d) . chunksOf 1000 . randomRs (0, d-1)) newStdGen

You generate 100 lists of length 1000, such that the entries at index
k*d <= i < (k+1)*d are between k*d (inclusive) and (k+1)*d (exclusive).
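Stripped of the randomness, that structure is easy to check; a minimal sketch (using an arbitrary fixed stream of values from [0, d-1] in place of randomRs, and a local chunksOf in place of the split package's — the helper names here are mine):

```haskell
import Data.List (unfoldr)

-- Local stand-in for Data.List.Split.chunksOf.
chunksOf :: Int -> [a] -> [[a]]
chunksOf n = unfoldr (\xs -> if null xs then Nothing else Just (splitAt n xs))

-- The same shaping rands applies: add k*d to the k-th d-chunk of a
-- stream of values drawn from [0, d-1].
almostSorted :: Int -> [Int] -> [Int]
almostSorted d = concat . zipWith (map . (+)) [0, d ..] . chunksOf d

-- Every entry at index k*d <= i < (k+1)*d lands in [k*d, (k+1)*d).
inBounds :: Int -> [Int] -> Bool
inBounds d ys = and [ k * d <= y && y < (k + 1) * d
                    | (i, y) <- zip [0 ..] ys, let k = i `div` d ]
```

For instance, inBounds 5 (almostSorted 5 (take 1000 (cycle [3,0,4,1,2]))) holds for any input stream with values in [0, 4].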

Now, if you split those lists into chunks of length m, the preconditions of
sortAlmostBy are only guaranteed to hold if each of the d-long chunks from
the generation meets at most two m-long chunks.
That is the case if (and only if, if the entire list is long enough)

m + gcd m d >= d

If m + gcd m d < d, you will sooner or later encounter a d-chunk meeting
three (or more) m-chunks. The probability that the first element of that
d-chunk is larger than the last is nearly 1/2 [precisely (d-1)/(2d)], so it
doesn't take many such situations before you get a non-sorted list from
sortAlmostBy m.
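That boundary condition can be verified numerically; here is a small sketch (function names are mine) counting how many m-chunks each d-chunk meets:

```haskell
-- Number of m-long chunks that the k-th d-long chunk
-- (indices k*d .. (k+1)*d - 1) overlaps.
meets :: Int -> Int -> Int -> Int
meets m d k = ((k + 1) * d - 1) `div` m - (k * d) `div` m + 1

-- Worst case over enough d-chunks to cover a full period of the
-- alignment (k*d mod m repeats with period at most m).
maxMeets :: Int -> Int -> Int
maxMeets m d = maximum [meets m d k | k <- [0 .. m]]
```

With d = 12, maxMeets 6 12 is 2 (since m + gcd m d = 12 >= d), while maxMeets 5 12 is 3, and checking small d and m exhaustively agrees with the "if and only if" above.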

So, if running time is monotonic in m, under the assumption that the
preconditions of sortAlmostBy hold, the optimal choice of m is

min { m \in N : m + gcd m d >= d }

For even d, that's d/2; for prime d, it's d-1; in general, it's (1 - 1/p)*d,
where p is the smallest prime factor of d.
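That closed form can be checked against the definition directly (the names here are mine):

```haskell
-- Smallest m for which the chunking precondition can still hold.
optimalM :: Int -> Int
optimalM d = head [m | m <- [1 ..], m + gcd m d >= d]

-- Smallest prime factor (defined for d >= 2).
leastFactor :: Int -> Int
leastFactor d = head [p | p <- [2 ..], d `mod` p == 0]

-- (1 - 1/p) * d, computed in integers as d - d/p.
closedForm :: Int -> Int
closedForm d = d - d `div` leastFactor d
```

For example, optimalM 12 = 6, optimalM 7 = 6, and optimalM 15 = 10, each agreeing with closedForm.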

I haven't analysed your merging function, but at first glance, merging c
chunks of length m each is O(c*m); since sorting each chunk is
O(m*log m), the overall cost would be

O(c*m*log m) = O(n*log m) = O(n*log (n/c))

It has larger overhead than sortBy, so if n is small or m is large, sortBy
is likely to be better, but if n is large and m relatively small,
sortAlmostBy can be significantly (but not dramatically, excluding extreme
cases where you can do much better) faster.
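For reference, here is one hypothetical reconstruction of sortAlmostBy consistent with the chunk-and-merge analysis above (Yitzchak's actual code isn't quoted in this message, so this is my sketch, not necessarily his): sort each m-chunk, then stream-merge adjacent chunks, emitting m finished elements per step so only O(m) elements are live at a time.

```haskell
import Data.List (sortBy)

-- Local stand-in for Data.List.Split.chunksOf.
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = ys : chunksOf n zs
  where (ys, zs) = splitAt n xs

-- Ordinary ordered merge of two sorted lists.
mergeBy :: (a -> a -> Ordering) -> [a] -> [a] -> [a]
mergeBy _   xs [] = xs
mergeBy _   [] ys = ys
mergeBy cmp xxs@(x:xs) yys@(y:ys)
  | cmp y x == LT = y : mergeBy cmp xxs ys
  | otherwise     = x : mergeBy cmp xs yys

-- Sort each m-chunk, then stream-merge adjacent chunks.  The emitted
-- m-prefix of each merge is final: under the precondition, no element
-- more than m positions later can be smaller than it.
sortAlmostBy :: Int -> (a -> a -> Ordering) -> [a] -> [a]
sortAlmostBy m cmp = go . map (sortBy cmp) . chunksOf m
  where
    go (c1 : c2 : cs) = done ++ go (rest : cs)
      where (done, rest) = splitAt m (mergeBy cmp c1 c2)
    go cs = concat cs
```

Merging only adjacent chunks suffices because elements of chunk i+2 are more than m positions after every element of chunk i, so by the precondition none of them can be smaller. On inputs that violate the precondition (e.g. the d = 12, m = 5 misalignment above), this version, like any such scheme, can return an unsorted list.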