[Haskell-cafe] ordNub

Wed Oct 16 17:32:32 UTC 2013

On 16/10/13 17:46, Yitzchak Gale wrote:
> On Sun, Oct 13, 2013 at 6:25 PM, Niklas Hambüchen <mail at nh2.me> wrote:
>> ordNub behaves has the same behaviour as nub, while (Set.toList .
>> Set.fromList) doesn't.
> 
> Perhaps you mean that they produce exactly the same
> output when fully evaluated. But I doubt that their behavior
> is *exactly* the same in every aspect. Given the differences
> in strictness between sets and lists, can you prove that
> nub and nubOrd have *exactly* the same strictness properties
> in every possible case?

Yes. The elements are are evaluated in the same order as in `nub`.
We can see that by comparing the implementations.

If you have objects that need not be fully evaluated to use (==) on
them, and need not fully evaluated *in a different way* when using
`compare` on them, then of course the strictness properties are different.
That is why one function has "Eq a =>" and the other one "Ord a =>" at
the front.

>> ...ordNub is *always* faster than nub, even for singleton lists.
> 
> This is also a claim that I doubt. You say that you tested the
> case of singletons. How about this:
> 
> 2 : replicate 100000 1 ++ [3]
> 
> Well, you could optimize for that by having nubOrd squeeze
> runs of of equal elements in the input. But then how about
> something like this:
> 
> [3,2,1] ++ take 100000 (cycle [3,2,1]) ++ [4]
> 
> Or variations on that, cycling some other fairly small number of
> elements. Set containers must do extra comparisons, whereas
> the of cost running down the first few hundred elements of a linked
> list (as long as the compiler fuses the iteration down to a tight
> loop in the output code and you remain within the cache) is near zero.
> How could nubOrd be as fast as ord in these kinds of cases?

I make the following, really cool proposal:

* You add your test cases that you just mentioned to the benchmark list
in "ordnub.hs"
* run `cabal build`
* run `dist/build/ordnub/ordnub -o report.html`.

Then you will see that ordNub is, indeed, faster.

> (as long as the compiler fuses the iteration down to a tight
> loop in the output code and you remain within the cache)

Even though a list is conceptually simpler than a set, a list node and a
set node are not that different (as I explained in the last email).
Both have a content and a pointer to the next node. The only difference
is that the Set node is optimised with strictness and unboxing, and the
list one is not.

> The only advantage of nubOrd
> is for very large lists where the complexity advantage overtakes
> the excellent constants of nub to provide better speed.

Sorry, I think this is absolutely invented.

What "excellent constants" do you mean?

How large are "very large lists"? I find lists of length 1000 a very
common case. An `n log2 n` algorithm is already 100 times faster on
those than an n² one.

I do not dispute that there exist n² algorithms that are faster than
others with better complexity for small inputs.

nub is not one of them.