Data.Set optimisations [was: [Haskell-cafe] library for set of heterogeneous values?]

Anthony Clayden anthony_clayden at
Sun Aug 14 11:03:52 UTC 2016

Firstly, to correct myself:

> Contrast that with Data.Set [*], you could stream the contents to a list and filter.
> (Which would be horribly inefficient for my purposes.) 

Thank you Libraries team. Data.Set seems to have been seriously revamped since I last looked, and now includes a 'proper' filter function. (It used to stream the Set to a List; filter the List; build a new Set.)

> [*] Data.Set seems to have been swallowed into the new collections library,
> which is not currently maintained 

Uhh. That was a dead link (of ~10 years ago, but Google still took me to it). Data.Set is in the **containers** library supported on hackage.

But I have some q's about the optimisations:
The hedge-union algorithm seems to have been taken out, as inefficient.
(per treeowl on github). But the code on hackage still uses hedge??
So the docos are out of date. (And quite a bit of commentary on StackOverflow.) 
Was it really not more efficient?

The Set data type's constructors got reordered (Bin is now first, Tip is second).
On grounds it "improves the benchmark by up to 10%".
Are you sure it makes that much difference?
Are you sure it makes any difference at all?
The culprit seems to be pattern matching. It makes me feel there's something seriously wrong with GHC's code generator.
(And the semantics of ADTs is that the first constructor is ordered first if you go `deriving (Ord)`. So there's usually a design reason driving the ordering of constructors -- eg Nothing ordered before Just; [] ordered before (:), which makes String ordering do the right thing.)

Set is a purely functional data structure, no in-situ updating of nodes, no pointer semantics
unlike imperative/low-level implementations of balanced binary trees.
That means Tips (empty end-points) genuinely get allocated/built and accessed and garbaged on insert -- or does all that strictness annotation/INLINing alter that?

Contrast that in an imperative/low-level model they'd be null pointers, nothing allocated.
Note the number of Tips is equal the number of values in the tree, + 1.
Would it save construction/destruction/access overhead to have three extra constructors?
| both sub-trees are empty 
| left (only) is empty
| right (only) is empty

If it's true per above that each constructor adds a performance penalty just doing the pattern matching, I guess the answer is 'no' :-(


-------------- next part --------------
An HTML attachment was scrubbed...
URL: <>

More information about the Libraries mailing list