Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

Gregory Popovitch greg7mdp at gmail.com
Sun Mar 26 17:47:47 UTC 2017


Thanks again @Siddhanathan! Looks like your gSort fixes the main issue with
Data.List.sort().
 
I have updated the test programs in https://github.com/greg7mdp/ghc-sort to
include your new version.

Here are the results (your new version looks like a definite improvement vs
the current GHC one):

input                        GHC sort         My Orig proposal     gSort
----------------------------------------------------------------------------
---
sorted ints (ascending)      151               456                  148
sorted ints (descending)     152               466                  155
random ints                 2732              2006                 2004
random strings              6564              5549                 5528


So replacing the current GHC version with gSort is a no brainer, as it is
better in all regards. 

However, still my version is more laziness-friendly, i.e. it requires fewer
comparisons to get the 
N smallest elements of a list (see
https://github.com/greg7mdp/ghc-sort/blob/master/src/sort_with_trace.hs).

I wonder if this might not be a more useful trait than being able to sort
already sorted lists super fast.

Thanks,

greg

________________________________

From: siddhanathan at gmail.com [mailto:siddhanathan at gmail.com] On Behalf Of
Siddhanathan Shanmugam
Sent: Sunday, March 26, 2017 1:05 PM
To: Gregory Popovitch
Cc: Haskell Libraries
Subject: Re: Proposal: a new implementation for Data.List.sort and
Data.List.sortBy, which has better performance characteristics and is more
laziness-friendly.


Interesting. You are right, performance for sorting random lists has
priority over performance for sorting already-sorted lists. 

Ignore the numbers for my previous version. Can you compare GHC's sort, your
proposal, and gSort below? 


gSort :: Ord a => [a] -> [a]
gSort = gSortBy compare
gSortBy cmp = mergeAll . sequences
  where
    sequences (a:b:xs)
      | a `cmp` b == GT = descending b [a]  xs
      | otherwise       = ascending  b (a:) xs
    sequences xs = [xs]


    descending a as (b:bs)
      | a `cmp` b == GT = descending b (a:as) bs
    descending a as bs  = (a:as) : sequences bs


    ascending a as (b:bs)
      | a `cmp` b /= GT = ascending b (\ys -> as (a:ys)) bs
    ascending a as bs   = as [a] `seq` as [a] : sequences bs


    mergeAll [x] = x
    mergeAll xs  = mergeAll (mergePairs xs)


    mergePairs (a:b:xs) = merge a b : mergePairs xs
    mergePairs xs       = xs


    merge as@(a:as') bs@(b:bs')
      | a `cmp` b == GT = b : merge as  bs'
      | otherwise       = a : merge as' bs
    merge [] bs         = bs
    merge as []         = as


Thanks,
Sid


On Sun, Mar 26, 2017 at 9:19 AM, Gregory Popovitch <greg7mdp at gmail.com>
wrote:


	Thank you @Siddhanathan! I welcome any improvement you may make, as
I said I
	am very far from a Haskell expert.
	
	I just tested your change with my test project
	(https://github.com/greg7mdp/ghc-sort
<https://github.com/greg7mdp/ghc-sort> )
	and here are my results (mean times in ms):
	
	input                        GHC sort          Orig proposal
your
	change
	
----------------------------------------------------------------------------
	---
	sorted ints (ascending)      153               467
139
	sorted ints (descending)     152               472
599
	random ints                 2824              2077
2126
	random strings              6564              5613
5983
	
	Your change is a definite improvement for sorted integers in
ascending
	order, but is worse for other cases.
	
	Is there a real need to optimize the sort for already sorted list?
Of course
	it should not be a degenerate
	case and take longer than sorting random numbers, but this is not
the case
	here. Sorting already sorted
	lists is, even with my version, over 4 times faster than sorting
random
	lists. This sounds perfectly
	acceptable to me, and I feel that trying to optimize this specific
case
	further, if it comes at the
	detriment of the general case, is not desirable.
	
	Thanks,
	
	greg
	
	________________________________
	
	From: siddhanathan at gmail.com [mailto:siddhanathan at gmail.com] On
Behalf Of
	Siddhanathan Shanmugam
	Sent: Sunday, March 26, 2017 11:41 AM
	To: Gregory Popovitch
	Cc: Haskell Libraries
	Subject: Re: Proposal: a new implementation for Data.List.sort and
	Data.List.sortBy, which has better performance characteristics and
is more
	laziness-friendly.
	


	Thank you! This identifies a space leak in base which went unnoticed
for 7
	years.
	
	Your implementation can be improved further. Instead of splitting
into
	pairs, you could instead split into lists of sorted sublists by
replacing
	the pairs function with the following
	
	    pair = foldr f []
	      where
	        f x [] = [[x]]
	        f x (y:ys)
	          | x `cmp` head y == LT = (x:y):ys
	          | otherwise            = [x]:y:ys
	
	This should give you the same performance improvements for sorting
random
	lists, but better performance while sorting ascending lists.
	
	The version in base takes it one step further by using a DList to
handle the
	descending case efficiently as well, except there's a space leak
right now
	because of which it is slower.
	
	On Sun, Mar 26, 2017 at 7:21 AM, Gregory Popovitch
<greg7mdp at gmail.com>
	wrote:
	
	
	
	        Motivation:
	        ----------
	
	        Data.List.sort is a very important functionality in Haskell.
I
	believe that
	        the proposed implementation is:
	
	        - significantly faster than the current implementation on
unsorted
	lists,
	        typically 14% to 27% faster
	        - more laziness-friendly, i.e.:
	            take 3 $ sort l
	          will require significantly less comparisons than the
current
	        implementation
	
	        Proposed Implementation
	        -----------------------
	
	        sort :: (Ord a) => [a] -> [a]
	        sort =  sortBy compare
	
	        sortBy cmp [] = []
	        sortBy cmp xs = head $ until (null.tail) reduce (pair xs)
	          where
	            pair (x:y:t) | x `cmp` y == GT  = [y, x] : pair t
	                         | otherwise        = [x, y] : pair t
	            pair [x] = [[x]]
	            pair []  = []
	
	            reduce (v:w:x:y:t) = merge v' x' : reduce t
	                                 where v' = merge v w
	                                       x' = merge x y
	
	            reduce (x:y:t) = merge x y : reduce t
	            reduce xs      = xs
	
	            merge xs []           = xs
	            merge []  ys          = ys
	            merge xs@(x:xs') ys@(y:ys')
	                 | x `cmp` y == GT  = y : merge xs  ys'
	                 | otherwise        = x : merge xs' ys
	
	
	        Effect and Interactions
	        -----------------------
	
	        I have a stack project with a criterion test for this new
	implementation,
	        available at https://github.com/greg7mdp/ghc-sort
<https://github.com/greg7mdp/ghc-sort> 
	
	<https://github.com/greg7mdp/ghc-sort
<https://github.com/greg7mdp/ghc-sort> > .
	        I ran the tests on an Ubuntu 14.0.2 VM and GHC 8.0.2, and
had the
	following
	        results:
	
	        - sorting of random lists of integers is 27% faster
	        - sorting of random lists of strings is 14% faster
	        - sorting of already sorted lists is significantly slower,
but still
	much
	        faster than sorting random lists
	        - proposed version is more laziness friendly. For example
this
	version of
	        sortBy requires 11 comparisons to find
	          the smallest element of a 15 element list, while the
default
	        Data.List.sortBy requires 15 comparisons.
	          (see
	
	
https://github.com/greg7mdp/ghc-sort/blob/master/src/sort_with_trace.hs
<https://github.com/greg7mdp/ghc-sort/blob/master/src/sort_with_trace.hs> 
	
<https://github.com/greg7mdp/ghc-sort/blob/master/src/sort_with_trace.hs
<https://github.com/greg7mdp/ghc-sort/blob/master/src/sort_with_trace.hs> >
)
	


	        Test results
	        ------------
	
	        Criterion output (descending/ascending results are for
already
	sorted
	        lists).
	        I barely understand what Criterion does, and I am puzzled
with the
	various
	        "T" output - maybe there is a bug in my bench code:
	
	        vagrant at vagrant-ubuntu-trusty-64:/vagrant$ stack exec
ghc-sort
	        benchmarking ascending ints/ghc
	        TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTtime
160.6 ms
	(153.4
	        ms .. 167.8 ms)
	                             0.997 R²   (0.986 R² .. 1.000 R²)
	        mean                 161.7 ms   (158.3 ms .. 165.9 ms)
	        std dev              5.210 ms   (3.193 ms .. 7.006 ms)
	        variance introduced by outliers: 12% (moderately inflated)
	
	        benchmarking ascending ints/greg
	        TTTTTTTTTTTTTTTTtime                 473.8 ms   (398.6 ms ..
554.9
	ms)
	                             0.996 R²   (0.987 R² .. 1.000 R²)
	        mean                 466.2 ms   (449.0 ms .. 475.0 ms)
	        std dev              14.94 ms   (0.0 s .. 15.29 ms)
	        variance introduced by outliers: 19% (moderately inflated)
	
	        benchmarking descending ints/ghc
	        TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTtime
165.1 ms
	(148.2
	        ms .. 178.2 ms)
	                             0.991 R²   (0.957 R² .. 1.000 R²)
	        mean                 158.7 ms   (154.0 ms .. 164.3 ms)
	        std dev              7.075 ms   (4.152 ms .. 9.903 ms)
	        variance introduced by outliers: 12% (moderately inflated)
	
	        benchmarking descending ints/greg
	        TTTTTTTTTTTTTTTTtime                 471.7 ms   (419.8 ms ..
508.3
	ms)
	                             0.999 R²   (0.995 R² .. 1.000 R²)
	        mean                 476.0 ms   (467.5 ms .. 480.0 ms)
	        std dev              7.447 ms   (67.99 as .. 7.865 ms)
	        variance introduced by outliers: 19% (moderately inflated)
	
	        benchmarking random ints/ghc
	        TTTTTTTTTTTTTTTTtime                 2.852 s    (2.564 s ..
3.019 s)
	                             0.999 R²   (0.997 R² .. 1.000 R²)
	        mean                 2.812 s    (2.785 s .. 2.838 s)
	        std dev              44.06 ms   (543.9 as .. 44.97 ms)
	        variance introduced by outliers: 19% (moderately inflated)
	
	        benchmarking random ints/greg
	        TTTTTTTTTTTTTTTTtime                 2.032 s    (1.993 s ..
2.076 s)
	                             1.000 R²   (1.000 R² .. 1.000 R²)
	        mean                 2.028 s    (2.019 s .. 2.033 s)
	        std dev              7.832 ms   (0.0 s .. 8.178 ms)
	        variance introduced by outliers: 19% (moderately inflated)
	
	        benchmarking shakespeare/ghc
	        TTTTTTTTTTTTTTTTtime                 6.504 s    (6.391 s ..
6.694 s)
	                             1.000 R²   (1.000 R² .. 1.000 R²)
	        mean                 6.499 s    (6.468 s .. 6.518 s)
	        std dev              28.85 ms   (0.0 s .. 32.62 ms)
	        variance introduced by outliers: 19% (moderately inflated)
	
	        benchmarking shakespeare/greg
	        TTTTTTTTTTTTTTTTtime                 5.560 s    (5.307 s ..
5.763 s)
	                             1.000 R²   (0.999 R² .. 1.000 R²)
	        mean                 5.582 s    (5.537 s .. 5.607 s)
	        std dev              39.30 ms   (0.0 s .. 43.49 ms)
	        variance introduced by outliers: 19% (moderately inflated)
	
	
	        Costs and Drawbacks
	        -------------------
	
	        The only cost I see is the reduced performance when sorting
already
	sorted
	        lists. However, since this remains quite efficient, indeed
over 4
	times
	        faster than sorting unsorted lists, I think it is an
acceptable
	tradeoff.
	
	        Final note
	        ----------
	
	        My Haskell is very rusty. I worked on this a couple years
ago when I
	was
	        learning Haskell, and meant to propose it to the Haskell
community,
	but
	        never got to it at the time.
	
	        _______________________________________________
	        Libraries mailing list
	        Libraries at haskell.org
	        http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
<http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries> 
	
	<http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
<http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries> >
	
	
	
	
	





More information about the Libraries mailing list