[Haskell-beginners] sorting almost sorted list

Tue Sep 27 20:16:18 CEST 2011

On Tuesday 27 September 2011, 18:34:01, Yitzchak Gale wrote:
> Hi Daniel,
> 
> Thanks for clearing up the mystery
> about the dependence on prime factorization!
> Very interesting.
> 
> > at first glance, merging c
> > chunks of length m each is O(c*m); so, since sorting each chunk is
> > O(m*log m), it'd overall be
> > 
> > O(c*m*log m) = O(n*log m) = O(n*log (n/c))
> 
> No. In the given use case m is a small constant and
> c is proportional to n, so the overall complexity is O(n).

Not a contradiction. If m is a constant (> 1), O(n*log m) = O(n)
(and sorting chunks of length m is O(m*log m) = O(1) then).

> 
> > It has larger overhead than sortBy,
> 
> That's true. But it's asymptotically better.
> 
> Given Dennis' use case, it sounds like
> m will probably always be < 10.
> Input size will typically be a few hundred,
> but could be a few thousand. sortBy
> would indeed be just fine in those cases.
> 
> > so if n is small or m is large, sortBy
> > is likely to be better, but if n is large and m relatively small,
> > sortAlmostBy can be significantly (but not dramatically,
> > excluding extreme cases where you can do much better) faster.
> 
> At most m will be on the order of 100 if we
> need to process all the parts of a large
> symphony orchestra combined together
> (a very unlikely use case admittedly),
> and in that case the input size could be as
> much as 10^6, or possibly even more for
> something like the Mahler.

Yeah, but log (10^6) is still a fairly small number, so counting reduction 
steps, it wouldn't be dramatically better. However, I forgot to take memory 
usage into account, so in terms of wall-clock (or CPU) time, it could be 
dramatically better even for not too large n and not too small m.
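
To put a rough number on the "counting reduction steps" part, here is a small sketch. It assumes a crude cost model that is not from the measurements in this thread: about n*log2 n comparisons for a general sort, and about n*log2 m for sorting chunks of length m, with the O(n) merge ignored. The names estimateFull, estimateChunked and ratio are made up for illustration.

```haskell
-- Crude comparison-count model (an assumption, not a measurement):
-- a general-purpose sort needs roughly n * log2 n comparisons.
estimateFull :: Double -> Double
estimateFull n = n * logBase 2 n

-- Sorting n/m chunks of length m needs roughly n * log2 m comparisons
-- (the linear-time merge of the chunks is ignored here).
estimateChunked :: Double -> Double -> Double
estimateChunked n m = n * logBase 2 m

-- How many times more comparisons the general sort needs under this model.
ratio :: Double -> Double -> Double
ratio n m = estimateFull n / estimateChunked n m
```

Under that model, ratio 1e6 100 comes out at about 3 (since log (10^6) / log (10^2) = 3), which is a gain but not a dramatic one. Memory effects are not captured by this estimate at all.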

> 
> (Come to think of it, it's pretty amazing that
> Mahler wrote so many notes and actually got
> people to play them.)

People do stranger things. People listen to Wagner, for example.

> 
> There I believe sortAlmostBy could be quite
> an improvement.

Certainly. (I just included Data.List.sort in the benchmark; for these short 
lists, it takes less than twice as long as sortAlmostBy.)

> I'd be interested to hear ideas to make it even better though.

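-- Merge a list of sorted chunks, keeping two streams active at a time.
mergeAABy :: (a -> a -> Ordering) -> [[a]] -> [a]
mergeAABy cmp ((x:xs):(y:ys):zss) = foo x xs y ys zss
  where
    -- u:us and v:vs are the two streams being merged,
    -- wss holds the chunks not yet touched.
    foo u us v vs wss =
      case cmp u v of
        GT -> v : case vs of
                    (b:bs) -> foo u us b bs wss
                    -- second stream exhausted: emit the rest of the
                    -- first, then start afresh on the remaining chunks
                    [] -> u : us ++ mergeAABy cmp wss
        _ -> u : case us of
                   (c:cs) -> foo c cs v vs wss
                   -- first stream exhausted: continue with v:vs
                   -- and the next chunk, if there is one
                   [] -> case wss of
                           ((w:ws):kss) -> foo v vs w ws kss
                           _ -> v : vs
mergeAABy _ [xs] = xs
mergeAABy _ _ = []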

seems to be a bit faster (I have made the lists longer, replacing 2000 with 
5000):

dafis@schwartz:~/Haskell/BeginnersTesting> ./benchAlmost -nis 20 -kis 20
(20,20)  -- print parameters to make sure they're evaluated
284090563  -- print sum of all random Ints to make sure they're evaluated
warming up
estimating clock resolution...
mean is 2.183860 us (320001 iterations)
found 41404 outliers among 319999 samples (12.9%)
  2443 (0.8%) low severe
  38961 (12.2%) high severe
estimating cost of a clock call...
mean is 50.65112 ns (14 iterations)
found 2 outliers among 14 samples (14.3%)
  2 (14.3%) high mild

benchmarking Yitz
collecting 100 samples, 1 iterations each, in estimated 6.537795 s
mean: 67.05025 ms, lb 66.59363 ms, ub 67.62786 ms, ci 0.950
std dev: 2.616184 ms, lb 2.179318 ms, ub 3.100722 ms, ci 0.950

benchmarking Mine
collecting 100 samples, 1 iterations each, in estimated 5.566311 s
mean: 56.91069 ms, lb 56.54846 ms, ub 57.37602 ms, ci 0.950
std dev: 2.100948 ms, lb 1.722233 ms, ub 2.585056 ms, ci 0.950

With n = k = 50:

benchmarking Yitz
collecting 100 samples, 1 iterations each, in estimated 7.366776 s
mean: 75.63603 ms, lb 75.17202 ms, ub 76.20199 ms, ci 0.950
std dev: 2.617970 ms, lb 2.227238 ms, ub 3.040563 ms, ci 0.950

benchmarking Mine
collecting 100 samples, 1 iterations each, in estimated 6.517696 s
mean: 68.00708 ms, lb 67.26089 ms, ub 69.04105 ms, ci 0.950
std dev: 4.463628 ms, lb 3.484169 ms, ub 5.950722 ms, ci 0.950

n = k = 100:

benchmarking Yitz
collecting 100 samples, 1 iterations each, in estimated 9.756303 s
mean: 87.39906 ms, lb 86.90397 ms, ub 87.99232 ms, ci 0.950
std dev: 2.763955 ms, lb 2.382008 ms, ub 3.168909 ms, ci 0.950

benchmarking Mine
collecting 100 samples, 1 iterations each, in estimated 7.505393 s
mean: 77.05801 ms, lb 76.51954 ms, ub 77.74727 ms, ci 0.950
std dev: 3.117756 ms, lb 2.594126 ms, ub 3.769922 ms, ci 0.950
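
For anyone who wants to try the mergeAABy from this mail on its own, here is a minimal, self-contained sketch of a chunk-sort-merge pipeline around it. The helpers chunksOfSketch and sortAlmostBySketch are hypothetical names written for this illustration and are not necessarily the sortAlmostBy discussed in the thread. The sketch assumes that every element of a chunk is less than or equal to every element of the chunk after the next one, so that only neighbouring chunks can overlap; if the input violates that, the result is not guaranteed to be sorted.

```haskell
import Data.List (sortBy)

-- Hypothetical helper: split a list into chunks of length m
-- (the last chunk may be shorter). Never produces empty chunks.
chunksOfSketch :: Int -> [a] -> [[a]]
chunksOfSketch m xs
  | m <= 0    = error "chunksOfSketch: chunk length must be positive"
  | null xs   = []
  | otherwise = let (h, t) = splitAt m xs in h : chunksOfSketch m t

-- mergeAABy as posted above, repeated so the example is self-contained.
mergeAABy :: (a -> a -> Ordering) -> [[a]] -> [a]
mergeAABy cmp ((x:xs):(y:ys):zss) = foo x xs y ys zss
  where
    foo u us v vs wss =
      case cmp u v of
        GT -> v : case vs of
                    (b:bs) -> foo u us b bs wss
                    [] -> u : us ++ mergeAABy cmp wss
        _ -> u : case us of
                   (c:cs) -> foo c cs v vs wss
                   [] -> case wss of
                           ((w:ws):kss) -> foo v vs w ws kss
                           _ -> v : vs
mergeAABy _ [xs] = xs
mergeAABy _ _ = []

-- Hypothetical wrapper: sort each chunk of length m, then merge the
-- sorted chunks. Assumes only neighbouring chunks overlap in value.
sortAlmostBySketch :: Int -> (a -> a -> Ordering) -> [a] -> [a]
sortAlmostBySketch m cmp = mergeAABy cmp . map (sortBy cmp) . chunksOfSketch m
```

For example, sortAlmostBySketch 2 compare [2,1,4,3,6,5,8,7] gives [1,2,3,4,5,6,7,8], since each out-of-place element only has to move within its own chunk or into a neighbouring one.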