Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

Sun Mar 26 17:05:19 UTC 2017

Interesting. You are right, performance for sorting random lists has
priority over performance for sorting already-sorted lists.

Ignore the numbers for my previous version. Can you compare GHC's sort,
your proposal, and gSort below?

gSort :: Ord a => [a] -> [a]
gSort = gSortBy compare
gSortBy cmp = mergeAll . sequences
  where
    sequences (a:b:xs)
      | a `cmp` b == GT = descending b [a]  xs
      | otherwise       = ascending  b (a:) xs
    sequences xs = [xs]

    descending a as (b:bs)
      | a `cmp` b == GT = descending b (a:as) bs
    descending a as bs  = (a:as) : sequences bs

    ascending a as (b:bs)
      | a `cmp` b /= GT = ascending b (\ys -> as (a:ys)) bs
    ascending a as bs   = as [a] `seq` as [a] : sequences bs

    mergeAll [x] = x
    mergeAll xs  = mergeAll (mergePairs xs)

    mergePairs (a:b:xs) = merge a b : mergePairs xs
    mergePairs xs       = xs

    merge as@(a:as') bs@(b:bs')
      | a `cmp` b == GT = b : merge as  bs'
      | otherwise       = a : merge as' bs
    merge [] bs         = bs
    merge as []         = as

Thanks,
Sid

On Sun, Mar 26, 2017 at 9:19 AM, Gregory Popovitch <greg7mdp at gmail.com>
wrote:

> Thank you @Siddhanathan! I welcome any improvement you may make, as I said
> I
> am very far from a Haskell expert.
>
> I just tested your change with my test project
> (https://github.com/greg7mdp/ghc-sort)
> and here are my results (mean times in ms):
>
> input                        GHC sort          Orig proposal        your
> change
> ------------------------------------------------------------
> ----------------
> ---
> sorted ints (ascending)      153               467                  139
> sorted ints (descending)     152               472                  599
> random ints                 2824              2077                 2126
> random strings              6564              5613                 5983
>
> Your change is a definite improvement for sorted integers in ascending
> order, but is worse for other cases.
>
> Is there a real need to optimize the sort for already sorted list? Of
> course
> it should not be a degenerate
> case and take longer than sorting random numbers, but this is not the case
> here. Sorting already sorted
> lists is, even with my version, over 4 times faster than sorting random
> lists. This sounds perfectly
> acceptable to me, and I feel that trying to optimize this specific case
> further, if it comes at the
> detriment of the general case, is not desirable.
>
> Thanks,
>
> greg
>
> ________________________________
>
> From: siddhanathan at gmail.com [mailto:siddhanathan at gmail.com] On Behalf Of
> Siddhanathan Shanmugam
> Sent: Sunday, March 26, 2017 11:41 AM
> To: Gregory Popovitch
> Cc: Haskell Libraries
> Subject: Re: Proposal: a new implementation for Data.List.sort and
> Data.List.sortBy, which has better performance characteristics and is more
> laziness-friendly.
>
>
> Thank you! This identifies a space leak in base which went unnoticed for 7
> years.
>
> Your implementation can be improved further. Instead of splitting into
> pairs, you could instead split into lists of sorted sublists by replacing
> the pairs function with the following
>
>     pair = foldr f []
>       where
>         f x [] = [[x]]
>         f x (y:ys)
>           | x `cmp` head y == LT = (x:y):ys
>           | otherwise            = [x]:y:ys
>
> This should give you the same performance improvements for sorting random
> lists, but better performance while sorting ascending lists.
>
> The version in base takes it one step further by using a DList to handle
> the
> descending case efficiently as well, except there's a space leak right now
> because of which it is slower.
>
> On Sun, Mar 26, 2017 at 7:21 AM, Gregory Popovitch <greg7mdp at gmail.com>
> wrote:
>
>
>
>         Motivation:
>         ----------
>
>         Data.List.sort is a very important functionality in Haskell. I
> believe that
>         the proposed implementation is:
>
>         - significantly faster than the current implementation on unsorted
> lists,
>         typically 14% to 27% faster
>         - more laziness-friendly, i.e.:
>             take 3 $ sort l
>           will require significantly less comparisons than the current
>         implementation
>
>         Proposed Implementation
>         -----------------------
>
>         sort :: (Ord a) => [a] -> [a]
>         sort =  sortBy compare
>
>         sortBy cmp [] = []
>         sortBy cmp xs = head $ until (null.tail) reduce (pair xs)
>           where
>             pair (x:y:t) | x `cmp` y == GT  = [y, x] : pair t
>                          | otherwise        = [x, y] : pair t
>             pair [x] = [[x]]
>             pair []  = []
>
>             reduce (v:w:x:y:t) = merge v' x' : reduce t
>                                  where v' = merge v w
>                                        x' = merge x y
>
>             reduce (x:y:t) = merge x y : reduce t
>             reduce xs      = xs
>
>             merge xs []           = xs
>             merge []  ys          = ys
>             merge xs@(x:xs') ys@(y:ys')
>                  | x `cmp` y == GT  = y : merge xs  ys'
>                  | otherwise        = x : merge xs' ys
>
>
>         Effect and Interactions
>         -----------------------
>
>         I have a stack project with a criterion test for this new
> implementation,
>         available at https://github.com/greg7mdp/ghc-sort
> <https://github.com/greg7mdp/ghc-sort> .
>         I ran the tests on an Ubuntu 14.0.2 VM and GHC 8.0.2, and had the
> following
>         results:
>
>         - sorting of random lists of integers is 27% faster
>         - sorting of random lists of strings is 14% faster
>         - sorting of already sorted lists is significantly slower, but
> still
> much
>         faster than sorting random lists
>         - proposed version is more laziness friendly. For example this
> version of
>         sortBy requires 11 comparisons to find
>           the smallest element of a 15 element list, while the default
>         Data.List.sortBy requires 15 comparisons.
>           (see
>
> https://github.com/greg7mdp/ghc-sort/blob/master/src/sort_with_trace.hs
> <https://github.com/greg7mdp/ghc-sort/blob/master/src/sort_with_trace.hs>
> )
>
>
>         Test results
>         ------------
>
>         Criterion output (descending/ascending results are for already
> sorted
>         lists).
>         I barely understand what Criterion does, and I am puzzled with the
> various
>         "T" output - maybe there is a bug in my bench code:
>
>         vagrant at vagrant-ubuntu-trusty-64:/vagrant$ stack exec ghc-sort
>         benchmarking ascending ints/ghc
>         TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTtime                 160.6 ms
> (153.4
>         ms .. 167.8 ms)
>                              0.997 R²   (0.986 R² .. 1.000 R²)
>         mean                 161.7 ms   (158.3 ms .. 165.9 ms)
>         std dev              5.210 ms   (3.193 ms .. 7.006 ms)
>         variance introduced by outliers: 12% (moderately inflated)
>
>         benchmarking ascending ints/greg
>         TTTTTTTTTTTTTTTTtime                 473.8 ms   (398.6 ms .. 554.9
> ms)
>                              0.996 R²   (0.987 R² .. 1.000 R²)
>         mean                 466.2 ms   (449.0 ms .. 475.0 ms)
>         std dev              14.94 ms   (0.0 s .. 15.29 ms)
>         variance introduced by outliers: 19% (moderately inflated)
>
>         benchmarking descending ints/ghc
>         TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTtime                 165.1 ms
> (148.2
>         ms .. 178.2 ms)
>                              0.991 R²   (0.957 R² .. 1.000 R²)
>         mean                 158.7 ms   (154.0 ms .. 164.3 ms)
>         std dev              7.075 ms   (4.152 ms .. 9.903 ms)
>         variance introduced by outliers: 12% (moderately inflated)
>
>         benchmarking descending ints/greg
>         TTTTTTTTTTTTTTTTtime                 471.7 ms   (419.8 ms .. 508.3
> ms)
>                              0.999 R²   (0.995 R² .. 1.000 R²)
>         mean                 476.0 ms   (467.5 ms .. 480.0 ms)
>         std dev              7.447 ms   (67.99 as .. 7.865 ms)
>         variance introduced by outliers: 19% (moderately inflated)
>
>         benchmarking random ints/ghc
>         TTTTTTTTTTTTTTTTtime                 2.852 s    (2.564 s .. 3.019
> s)
>                              0.999 R²   (0.997 R² .. 1.000 R²)
>         mean                 2.812 s    (2.785 s .. 2.838 s)
>         std dev              44.06 ms   (543.9 as .. 44.97 ms)
>         variance introduced by outliers: 19% (moderately inflated)
>
>         benchmarking random ints/greg
>         TTTTTTTTTTTTTTTTtime                 2.032 s    (1.993 s .. 2.076
> s)
>                              1.000 R²   (1.000 R² .. 1.000 R²)
>         mean                 2.028 s    (2.019 s .. 2.033 s)
>         std dev              7.832 ms   (0.0 s .. 8.178 ms)
>         variance introduced by outliers: 19% (moderately inflated)
>
>         benchmarking shakespeare/ghc
>         TTTTTTTTTTTTTTTTtime                 6.504 s    (6.391 s .. 6.694
> s)
>                              1.000 R²   (1.000 R² .. 1.000 R²)
>         mean                 6.499 s    (6.468 s .. 6.518 s)
>         std dev              28.85 ms   (0.0 s .. 32.62 ms)
>         variance introduced by outliers: 19% (moderately inflated)
>
>         benchmarking shakespeare/greg
>         TTTTTTTTTTTTTTTTtime                 5.560 s    (5.307 s .. 5.763
> s)
>                              1.000 R²   (0.999 R² .. 1.000 R²)
>         mean                 5.582 s    (5.537 s .. 5.607 s)
>         std dev              39.30 ms   (0.0 s .. 43.49 ms)
>         variance introduced by outliers: 19% (moderately inflated)
>
>
>         Costs and Drawbacks
>         -------------------
>
>         The only cost I see is the reduced performance when sorting already
> sorted
>         lists. However, since this remains quite efficient, indeed over 4
> times
>         faster than sorting unsorted lists, I think it is an acceptable
> tradeoff.
>
>         Final note
>         ----------
>
>         My Haskell is very rusty. I worked on this a couple years ago when
> I
> was
>         learning Haskell, and meant to propose it to the Haskell community,
> but
>         never got to it at the time.
>
>         _______________________________________________
>         Libraries mailing list
>         Libraries at haskell.org
>         http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
> <http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries>
>
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/libraries/attachments/20170326/bbdce2f5/attachment-0001.html>