[Haskell-beginners] sorting almost sorted list
Daniel Fischer
daniel.is.fischer at googlemail.com
Tue Sep 27 16:04:08 CEST 2011
On Tuesday 27 September 2011, 11:32:35, Yitzchak Gale wrote:
> I wrote:
> > Can you guarantee for some value of m that for each note N,
> > only the first m notes following N might end earlier than N?
> > If so, then the following simple algorithm is linear
> > and runs in constant space. You could then use:
> > sortAlmostBy m (comparing endTime)
> > ...You might be able to do a little better than this.
>
> After running some tests, it seems empirically that
> you can use m `div` 2 instead of m for lists of even
> length, which gives you a nice speedup for larger
> values of m. For lists of prime length, the best
> you can do seems to be m-1. It's more complicated
> for lists of non-prime odd length.
>
> I can't immediately see why any of that is true.
>
> Daniel to the rescue perhaps?
I have to confess, I'm not even sure what you mean.
Okay, m is the maximal number of following notes that may end earlier than
the current one. But to which list(s) does the length refer?
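(For reference, since sortAlmostBy itself isn't shown anywhere in the
thread: a minimal sketch of the kind of function the discussion assumes.
It splits the input into chunks of length m, sorts each chunk, and merges
neighbouring chunks in one linear pass; all names and details below are
my reconstruction, not Yitzchak's actual code.)

import Data.List (sortBy)

-- Sketch only; assumes m >= 1 and that for every element, only the
-- m elements following it may have to end up before it.
sortAlmostBy :: Int -> (a -> a -> Ordering) -> [a] -> [a]
sortAlmostBy m cmp = go . map (sortBy cmp) . chunksOf m
  where
    go []       = []
    go (c : cs) = step c cs

    -- 'carry' is sorted and at most m long; under the precondition,
    -- the first m elements of each merge are already in their final
    -- positions and can be emitted.
    step carry []       = carry
    step carry (c : cs) =
        let (done, rest) = splitAt m (merge carry c)
        in  done ++ step rest cs

    merge xs [] = xs
    merge [] ys = ys
    merge xs@(x : xt) ys@(y : yt)
        | cmp y x == LT = y : merge xs yt
        | otherwise     = x : merge xt ys

chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = c : chunksOf n rest
  where (c, rest) = splitAt n xs

(Emitting the first m elements of each merge is safe because, by the
precondition, everything in chunk k+2 and later is >= everything in
chunks up to k, and hence >= the m smallest elements of each merge.)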
>
> test n k = fmap (all isSorted . map (sortAlmostBy n compare)) $ rands k
With what parameters did you call test?
Prelude Almost System.Random> test 12 16
True
(0.12 secs, 178509776 bytes)
Prelude Almost System.Random> test 12 17
False
(0.00 secs, 592904 bytes)
The only big variation in runtimes I find is due to a violation of the
preconditions of sortAlmostBy, in which case 'all' short-circuits.
(But most of the runtime is spent generating the lists. If I completely
evaluate such a list of lists before sorting, then m has a measurable
impact on the time even when 'all' doesn't short-circuit; in that
situation the runtime is basically monotonic in m, as expected.)
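(Concretely, one way to take list generation out of the measurement,
using the deepseq package; this phrasing of the experiment is mine:)

import Control.DeepSeq (force)
import Control.Exception (evaluate)

-- Generate the test lists and evaluate them completely, so that a
-- subsequent timing of the sort doesn't also pay for generation.
forcedRands :: Int -> IO [[Int]]
forcedRands d = evaluate . force =<< rands d

-- ghci> :set +s
-- ghci> xss <- forcedRands 16                            -- generation paid here
-- ghci> all isSorted (map (sortAlmostBy 12 compare) xss) -- sort timed here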
>
> isSorted xs = and . zipWith (<=) xs $ drop 1 xs
>
> rands d = fmap (take 100 . map (concat . zipWith (map . (+)) [0,d ..]
> . chunksOf d) . chunksOf 1000 . randomRs (0, d-1)) newStdGen
You generate 100 lists of length 1000, such that the entries at indices i
with k*d <= i < (k+1)*d lie between k*d (inclusive) and (k+1)*d (exclusive);
with d = 4, say, a list begins with four entries drawn from [0,4), then
four from [4,8), and so on.
Now, if you split those lists into chunks of length m, the preconditions of
sortAlmostBy are only guaranteed to hold if each of the d-long chunks from
the generation meets at most two m-long chunks.
That is the case if (and, provided the entire list is long enough, only if)
m + gcd m d >= d
If m + gcd m d < d, you will sooner or later encounter a d-chunk meeting
three (or more) m-chunks. The probability that the first element of that
d-chunk is larger than the last is nearly 1/2 [exactly (d-1)/(2d)], so it
doesn't take many such situations before you get a list that
sortAlmostBy m leaves unsorted.
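(A quick sanity check of that condition against the transcript above;
throwaway code of mine:)

precondHolds :: Int -> Int -> Bool
precondHolds m d = m + gcd m d >= d

-- precondHolds 12 16 == True    (test 12 16 gave True)
-- precondHolds 12 17 == False   (test 12 17 gave False)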
So, if running time is monotonic in m, then under the assumption that the
preconditions of sortAlmostBy hold, the optimal choice of m is
min { m \in N : m + gcd m d >= d }
Since gcd m d divides both m and d, it also divides d - m; hence for m < d
the condition holds exactly when d - m divides d. For even d, the minimum
is d/2; for prime d, it is d-1; in general, it is (1 - 1/p)*d, where p is
the smallest prime factor of d.
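(Both claims are easy to check mechanically; again my own throwaway code:)

-- Smallest m satisfying the condition, by search and in closed form
-- (closedForm assumes d >= 2).
optimalM :: Int -> Int
optimalM d = head [m | m <- [1 ..], m + gcd m d >= d]

closedForm :: Int -> Int
closedForm d = d - d `div` p          -- i.e. (1 - 1/p) * d
  where p = head [q | q <- [2 ..], d `mod` q == 0]

-- optimalM 16 == 8, optimalM 17 == 16, optimalM 15 == 10,
-- and optimalM d == closedForm d for every d >= 2.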
I haven't analysed your merging function, but at first glance, merging c
chunks of length m each is O(c*m); so, since sorting each chunk is
O(m*log m), the overall cost is
O(c*m*log m) = O(n*log m) = O(n*log (n/c))
It has a larger overhead than sortBy, so if n is small or m is large,
sortBy is likely to be better; but if n is large and m is relatively
small, sortAlmostBy can be significantly faster (though not dramatically,
excluding extreme cases where you can do much better still).
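(As a plausibility check of the sketch given earlier, with m = 1, i.e.
only adjacent elements out of order; my own throwaway code:)

import Data.List (sort)

-- With only adjacent swaps the sketch must agree with a full sort.
checkAdjacent :: Bool
checkAdjacent =
    let xs = concat [[i + 1, i] | i <- [0, 2 .. 99998]] :: [Int]
    in  sortAlmostBy 1 compare xs == sort xs    -- True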