Proposal: a new implementation for Data.List.sort and Data.List.sortBy, which has better performance characteristics and is more laziness-friendly.

Sun Mar 26 16:19:54 UTC 2017

Thank you @Siddhanathan! I welcome any improvement you may make, as I said I
am very far from a Haskell expert. 

I just tested your change with my test project
(https://github.com/greg7mdp/ghc-sort) 
and here are my results (mean times in ms):

input                        GHC sort          Orig proposal        your
change
----------------------------------------------------------------------------
---
sorted ints (ascending)      153               467                  139
sorted ints (descending)     152               472                  599
random ints                 2824              2077                 2126
random strings              6564              5613                 5983

Your change is a definite improvement for sorted integers in ascending
order, but is worse for other cases.

Is there a real need to optimize the sort for already sorted list? Of course
it should not be a degenerate 
case and take longer than sorting random numbers, but this is not the case
here. Sorting already sorted 
lists is, even with my version, over 4 times faster than sorting random
lists. This sounds perfectly 
acceptable to me, and I feel that trying to optimize this specific case
further, if it comes at the 
detriment of the general case, is not desirable.

Thanks,

greg

________________________________

From: siddhanathan at gmail.com [mailto:siddhanathan at gmail.com] On Behalf Of
Siddhanathan Shanmugam
Sent: Sunday, March 26, 2017 11:41 AM
To: Gregory Popovitch
Cc: Haskell Libraries
Subject: Re: Proposal: a new implementation for Data.List.sort and
Data.List.sortBy, which has better performance characteristics and is more
laziness-friendly.

Thank you! This identifies a space leak in base which went unnoticed for 7
years. 

Your implementation can be improved further. Instead of splitting into
pairs, you could instead split into lists of sorted sublists by replacing
the pairs function with the following

    pair = foldr f []
      where
        f x [] = [[x]]
        f x (y:ys)
          | x `cmp` head y == LT = (x:y):ys
          | otherwise            = [x]:y:ys

This should give you the same performance improvements for sorting random
lists, but better performance while sorting ascending lists.

The version in base takes it one step further by using a DList to handle the
descending case efficiently as well, except there's a space leak right now
because of which it is slower.

On Sun, Mar 26, 2017 at 7:21 AM, Gregory Popovitch <greg7mdp at gmail.com>
wrote:

	Motivation:
	----------

	Data.List.sort is a very important functionality in Haskell. I
believe that
	the proposed implementation is:

	- significantly faster than the current implementation on unsorted
lists,
	typically 14% to 27% faster
	- more laziness-friendly, i.e.:
	    take 3 $ sort l
	  will require significantly less comparisons than the current
	implementation

	Proposed Implementation
	-----------------------

	sort :: (Ord a) => [a] -> [a]
	sort =  sortBy compare

	sortBy cmp [] = []
	sortBy cmp xs = head $ until (null.tail) reduce (pair xs)
	  where
	    pair (x:y:t) | x `cmp` y == GT  = [y, x] : pair t
	                 | otherwise        = [x, y] : pair t
	    pair [x] = [[x]]
	    pair []  = []

	    reduce (v:w:x:y:t) = merge v' x' : reduce t
	                         where v' = merge v w
	                               x' = merge x y

	    reduce (x:y:t) = merge x y : reduce t
	    reduce xs      = xs

	    merge xs []           = xs
	    merge []  ys          = ys
	    merge xs@(x:xs') ys@(y:ys')
	         | x `cmp` y == GT  = y : merge xs  ys'
	         | otherwise        = x : merge xs' ys

	Effect and Interactions
	-----------------------

	I have a stack project with a criterion test for this new
implementation,
	available at https://github.com/greg7mdp/ghc-sort
<https://github.com/greg7mdp/ghc-sort> .
	I ran the tests on an Ubuntu 14.0.2 VM and GHC 8.0.2, and had the
following
	results:

	- sorting of random lists of integers is 27% faster
	- sorting of random lists of strings is 14% faster
	- sorting of already sorted lists is significantly slower, but still
much
	faster than sorting random lists
	- proposed version is more laziness friendly. For example this
version of
	sortBy requires 11 comparisons to find
	  the smallest element of a 15 element list, while the default
	Data.List.sortBy requires 15 comparisons.
	  (see

https://github.com/greg7mdp/ghc-sort/blob/master/src/sort_with_trace.hs
<https://github.com/greg7mdp/ghc-sort/blob/master/src/sort_with_trace.hs> )

	Test results
	------------

	Criterion output (descending/ascending results are for already
sorted
	lists).
	I barely understand what Criterion does, and I am puzzled with the
various
	"T" output - maybe there is a bug in my bench code:

	vagrant at vagrant-ubuntu-trusty-64:/vagrant$ stack exec ghc-sort
	benchmarking ascending ints/ghc
	TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTtime                 160.6 ms
(153.4
	ms .. 167.8 ms)
	                     0.997 R²   (0.986 R² .. 1.000 R²)
	mean                 161.7 ms   (158.3 ms .. 165.9 ms)
	std dev              5.210 ms   (3.193 ms .. 7.006 ms)
	variance introduced by outliers: 12% (moderately inflated)

	benchmarking ascending ints/greg
	TTTTTTTTTTTTTTTTtime                 473.8 ms   (398.6 ms .. 554.9
ms)
	                     0.996 R²   (0.987 R² .. 1.000 R²)
	mean                 466.2 ms   (449.0 ms .. 475.0 ms)
	std dev              14.94 ms   (0.0 s .. 15.29 ms)
	variance introduced by outliers: 19% (moderately inflated)

	benchmarking descending ints/ghc
	TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTtime                 165.1 ms
(148.2
	ms .. 178.2 ms)
	                     0.991 R²   (0.957 R² .. 1.000 R²)
	mean                 158.7 ms   (154.0 ms .. 164.3 ms)
	std dev              7.075 ms   (4.152 ms .. 9.903 ms)
	variance introduced by outliers: 12% (moderately inflated)

	benchmarking descending ints/greg
	TTTTTTTTTTTTTTTTtime                 471.7 ms   (419.8 ms .. 508.3
ms)
	                     0.999 R²   (0.995 R² .. 1.000 R²)
	mean                 476.0 ms   (467.5 ms .. 480.0 ms)
	std dev              7.447 ms   (67.99 as .. 7.865 ms)
	variance introduced by outliers: 19% (moderately inflated)

	benchmarking random ints/ghc
	TTTTTTTTTTTTTTTTtime                 2.852 s    (2.564 s .. 3.019 s)
	                     0.999 R²   (0.997 R² .. 1.000 R²)
	mean                 2.812 s    (2.785 s .. 2.838 s)
	std dev              44.06 ms   (543.9 as .. 44.97 ms)
	variance introduced by outliers: 19% (moderately inflated)

	benchmarking random ints/greg
	TTTTTTTTTTTTTTTTtime                 2.032 s    (1.993 s .. 2.076 s)
	                     1.000 R²   (1.000 R² .. 1.000 R²)
	mean                 2.028 s    (2.019 s .. 2.033 s)
	std dev              7.832 ms   (0.0 s .. 8.178 ms)
	variance introduced by outliers: 19% (moderately inflated)

	benchmarking shakespeare/ghc
	TTTTTTTTTTTTTTTTtime                 6.504 s    (6.391 s .. 6.694 s)
	                     1.000 R²   (1.000 R² .. 1.000 R²)
	mean                 6.499 s    (6.468 s .. 6.518 s)
	std dev              28.85 ms   (0.0 s .. 32.62 ms)
	variance introduced by outliers: 19% (moderately inflated)

	benchmarking shakespeare/greg
	TTTTTTTTTTTTTTTTtime                 5.560 s    (5.307 s .. 5.763 s)
	                     1.000 R²   (0.999 R² .. 1.000 R²)
	mean                 5.582 s    (5.537 s .. 5.607 s)
	std dev              39.30 ms   (0.0 s .. 43.49 ms)
	variance introduced by outliers: 19% (moderately inflated)

	Costs and Drawbacks
	-------------------

	The only cost I see is the reduced performance when sorting already
sorted
	lists. However, since this remains quite efficient, indeed over 4
times
	faster than sorting unsorted lists, I think it is an acceptable
tradeoff.

	Final note
	----------

	My Haskell is very rusty. I worked on this a couple years ago when I
was
	learning Haskell, and meant to propose it to the Haskell community,
but
	never got to it at the time.

	_______________________________________________
	Libraries mailing list
	Libraries at haskell.org
	http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
<http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries>