Proposal (Trac ticket #3671): Add takeRec, genericTakeRec and spanRec to Data.List

Thu Nov 19 05:47:31 EST 2009

Dear Malcolm, Ian, et al.

On Thu, 2009-11-19 at 04:14 +0000, Malcolm Wallace wrote:
> I suppose the only convincing argument is empirical. 

Hear, hear! Although... the empirical argument changes with the versions
of the compiler. Anyone up for an evaluation of alternative
implementations of all functions in base? :p

> Using the  
> following simple and unscientific benchmark, it turns out that take 
> +drop is ~ 2x faster than splitAt.  Maybe list fusion or something is  
> kicking in.

Is it possible that the tuple wrapping and unwrapping in the
splitAt-implementation hurts optimization?

On Wed, 2009-11-18 at 14:47 +0000, Ian Lynagh wrote:
> My breaks has generally been such that
>     breaks "123,456,,78" == ["123", "456", "", "78"]
> but the details probably depend on exactly what I've been using it
for.
> 
> I don't remember ever needing yours. I'd have thought that
>     breaks :: (a -> Bool) -> [a] -> [([a], [a])]
> would make more sense, but personally I'd vote for not adding a breaks
> at all.

The more I looked at the spans/breaks, the more I figured that didn't
quite cover the majority of cases for which I hack in extra
functionality. I think I have it down to the bare bones of what I was
missing; there's no function in Data.List to segment a list based on
sequence properties. In other words, there is no way to extract runs (or
clumps, if you prefer). An alternative suggestion, thus:

runs :: (a -> a -> Bool) -> [a] -> [[a]]
runs p xs = ...

which produces a list of runs, i.e. the first result is that prefix of
xs, such that for all consecutive elements e_i, e_{i+1}, the property
holds, i.e. p e_i e_{i+1} -->> True.

Although not exactly equivalent, spans' can be implemented as:

spans' p = runs (\x y -> p x == p y)

the difference being the first span:

> spans odd [2,3,4,5,7,9,8,0,3,5,9]
[[],[2],[3],[4],[5,7,9],[8,0],[3,5,9]]
> spans' odd [2,3,4,5,7,9,8,0,3,5,9]
[[2],[3],[4],[5,7,9],[8,0],[3,5,9]]

This difference is of no consequence to the types of programs I used
spans in. This new implementation makes spans so simple that inclusion
in Data.List is no longer necessary. So my new proposal would be to
include groupsOf and runs.

Now for the empirical stuff. I have two implementations:

runs :: (a -> a -> Bool) -> [a] -> [[a]]
runs _ [ ] = []
runs _ [x] = [[x]]
runs p (x:xs) = r : runs p xs' 
	where
	(r,xs') = run x xs
	cons' x (xs,y) = (x:xs,y)
	run y [] = ([y],[])
	run y l@(x:xs) | p y x     = cons' y $ run x xs
	               | otherwise = ([y],l)

runsAlt :: (a -> a -> Bool) -> [a] -> [[a]]
runsAlt _ [] = [[]]
runsAlt _ xs@[x] = [xs]
runsAlt p (x:xs@(y:_))
	| p x y     = (x : head xs') : tail xs'
	| otherwise = [x]:xs'
	where
	xs' = runsAlt p xs

(I welcome suggestions for improvements.)

Used in the program:

bigList = concat $ replicate 10000000 [5,6,9,1,3,4,2]
main = print (length (runs (>) bigList))

compiled without and with -O2:

runs       : 5.40s user 0.03s system 99% cpu 5.438 total
runsOpt    : 4.40s user 0.03s system 99% cpu 4.440 total
runsAlt    : 4.89s user 0.04s system 99% cpu 4.934 total
runsAltOpt : 4.14s user 0.03s system 99% cpu 4.207 total

We have a winner ;)

Regards,
Philip