nubBy, groupBy specification

Thu Dec 31 19:35:05 EST 2009

Hello,

I've recently been irritated with (what I consider to be) broken
versions of nubBy in GHC 6.10.4 and 6.12.1. For a while, groupBy was
broken in the same way, but it seems to be fixed at least in 6.10.4.

The problem is roughly caused by nubBy being specified only for
symmetric relations, where this is actually a rather silly constraint,
given that it captures a very common pattern of sieving (that at least
I'd personally not want to have to write recursively every time).

My most common usage of nubBy is actually probably nubBy (>=), (which
has to be changed to nubBy (<=) for recent GHCs), and variations on
it. This is extremely useful as a kind of lazier approximation of
maximum or minimum. I occasionally use nubBy with equivalence
relations (after these, I'd say (==) `on` foo is probably my next most
common use case), but it's surprising how often that's actually not
quite the right thing. Of course, there's also the inefficient but
pretty example of constructing the list of primes using the
divisibility relation and nubBy, which getting this wrong also breaks.

What do people think of the following spec?

nubBy f xs produces a subsequence ys of xs which satisfies:
1) f x y is False whenever x occurs before y in ys.
2) ys is maximal with respect to containment subject to condition 1.
3) The sequence of indices of selected elements is lexicographically
minimal subject to conditions 1 and 2. (That is, the selections are
made greedily.)

Similarly, I think groupBy f xs should be specified as:

groupBy f xs produces a list of lists yss of contiguous elements of xs
such that:
1) concat yss = xs
2) Each element of yss is a nonempty list (y:ys) such that  all (f y) ys.
3) The sequence of lengths of elements of ys is lexicographically
maximal subject to conditions 1 and 2. (Again, this just means it
greedily prefers to put elements in earlier lists rather than later
ones.)

Note that the code given in the Haskell 98 spec does the right thing
with respect to these specifications, but Haskell 98 says that they're
not specified for non-equivalence relations, while I think that they
actually should be.

A general policy that I think we should try to adopt is that the
expression graph produced by a higher order function on lists, in
terms of the supplied functions, should as far as reasonably possible
maintain the left-to-right order of occurrences of elements as they
occurred in the list. This makes the precise behaviour easy to
remember, and tends to keep expression graphs easy to draw. foldr,
foldl, scanr, scanl, mapAccumL, nubBy and groupBy (as specified
above), all follow this rule (of course, many simpler HOFs do as
well). mapAccumR as specified strangely does not, which makes it
another thing I'd like to change. (See:
http://cale.yi.org/index.php/Fold_Diagrams)

cheers!
 - Cale