Proposal #2629: Data.List: Replace nub;
add nubOrd, nubInt, nubWith
Bart Massey
bart at cs.pdx.edu
Tue Sep 30 01:46:23 EDT 2008
Bart Massey <bart <at> cs.pdx.edu> writes:
> http://hackage.haskell.org/trac/ghc/ticket/2629
>
> Everyone always complains about nub, but nobody ever does
> anything about it except (map head . group . sort), which
> isn't lazy and isn't always faster.
Just thought I'd post a note to thank you all for the informative and
interesting discussion of my proposal so far. Thanks much to all who are
helping to figure out what we should do!
It looks to me like there's consensus on at least one point so far. If someone
is really not agreeing with this, please post.
(1) The StopList class is too "heavyweight" to naturally
live in Data.List and should not be exported. Given
this, the obvious choices are (a) to drop nubWith from
the interface altogether, or (b) to replace it with a
version that takes a stoplist function and value as an
argument, as suggested by Krasimir Angelov earlier:
nubWith :: (a -> b -> Maybe b) -> b -> [a] -> [a]
I've implemented and benchmarked (b), and it seems to
perform fine. You can get it from
git://svcs.cs.pdx.edu/git/nub.git
in the branch nubWith-function if you want to look for
yourself.
(Another possibility which hasn't been discussed is (c1)
to include a new module with StopList and some related
functions in it, and make Data.List depend on it, or
(c2) to throw StopLists into some other existing base
module such as Data.Function. Why Data.Function?
Because the other thing I want them for is a later
proposal to add transitiveClosure and
reflexiveTransitiveClosure of functions to the
libraries, and this is the natural place to put those.)
Areas that still need discussion include:
(2) It's probably reasonable to drop nubInt from the
interface. It has roughly the same complexity as
nubOrd, and thus should just be an optimization rule,
since we want to avoid nubX for an arbitrarily long list
X of typeclasses.
(3) Since nubOrd has a dramatically different asymptotic
time complexity than nub, enough so that many programs
that need the former effectively won't work with the
latter, we should expose nubOrd separately and not rely
on optimizer magic to pick it. (At least I *hope*
there's some consensus here, since I strongly agree with
those who feel this way :-). Time and space usage are,
in practical terms, that part of the semantics that
distinguish a programming language from a mathematical
notation. I'm a programmer.)
Note, though, that one can easily imagine a nubAscii
that is linear-time rather than the n log n time of
nubOrd / nubInt, using a small bit vector to track the
Chars. Certainly nubBool has this property. Hmm.
What's our criterion for when the performance difference
between two functions constitutes a practical semantic
difference? I'm claiming it's asymptotic complexity
class, in which case (a) we should probably figure out how
to expose some kind of nubFinite. Or we could just take
the position that in the vast universe of possible
functions, our library cannot provide them all. In
which case (1b) nubWith is back in. I guess I tend
toward (1b).
(4) Should (a) the StopList class and friends also be banned
from the *implementation*? Or is it (b) OK to use them
internally in the source code of Data.List? I'm
actually more inclined toward (a), ironically. It makes
for simpler and more portable code, and since we're not
going to give the user the StopLists, we can avoid being
so clever.
(5) Isaac Dupree proposes nubEq as a synonym for nub, for
use in cleaning up code. Is this a good idea? In
particular, would we deprecate nub at this point?
I'll roll a new proposal draft soon based on current and near future feedback.
Again, thanks to all.
Bart Massey
bart <at> cs.pdx.edu
[Is there really an address miner anywhere in the world
that's still really fooled by the above?]
More information about the Libraries
mailing list