[Haskell-cafe] ANNOUNCE: set-monad

Sat Jun 16 09:57:20 CEST 2012

Hi George,

George Giorgidze wrote:
> I would like to announce the first release of the set-monad library.
>
> On Hackage: http://hackage.haskell.org/package/set-monad

Very cool. Seems to work fine. But I am wondering about the impact of 
using your package on asymptotic complexity (and thereby, on performance).

 From your implementation:
> data Set a where
>   Prim   :: (Ord a) => S.Set a -> Set a
>   [...]
>   Zero   :: Set a
>   Plus   :: Set a -> Set a -> Set a

I notice that this part of your datatype looks like a binary tree of 
sets. Lets see how your run function treats it:

> run :: (Ord a) => Set a -> S.Set a
> run (Prim s)              = s
> [...]
> run (Zero)                = S.empty
> run (Plus ma mb)          = S.union (run ma) (run mb)

As expected, the Prim-Zero-Plus structure is treated as a binary tree of 
sets, and run collects all the sets together. That is probably fine, 
because it should have the same complexity as combining these sets in 
the first place.

> run (Bind (Prim s) f)     = S.foldl' S.union S.empty (S.map (run . f) s)
> [...]
> run (Bind Zero _)         = S.empty
> run (Bind (Plus ma mb) f) = run (Plus (Bind ma f) (Bind mb f))
> [...]

But when I use the binary tree of sets on the left-hand side of a bind, 
your run function has to naively traverse the whole tree. So if the same 
elements are included in many sets in the tree of sets, these elements 
will be processed more than once, so the overall complexity is worse 
than necessary.

Here's a ghci session that seems to confirm my suspicion. I first define 
a function using set-monad's convenient monad instance for sets:
> $ :m +Control.Monad Data.Set.Monad
> $ let s1 `times` s2 = do {e1 <- s1; e2 <- s2; return (e1, e2)}

Now I produce some test data:
> $ let numbers = fromList [1 .. 1000]
> $ let unioned = numbers `union` numbers
> $ let mplused = numbers `mplus` numbers

Note that these three sets are all equivalent.
> $ (numbers == unioned, numbers == mplused, unioned == mplused)
> (True, True, True)

However, they behave differently when passed to the times function 
above. numbers and unioned are similarly "fast":
> $ :set +s
> $ size $ numbers `times` numbers
> 1000000
> (2.56 secs, 1315452984 bytes)
>
> $ size $ unioned `times` unioned
> (2.39 secs, 1314950600 bytes)
> 1000000

(Why is unioned faster then numbers? Is union doing some rebalancing? 
Can I trigger that effect directly?)

But mplused is much slower:
> $ size $ mplused `times` mplused
> 1000000
> (10.83 secs, 5324731444 bytes)

I suspect that I can achieve similar results by using the list monad and 
converting to a set in the very end. In what situations can I benefit 
from set-monad?

   Tillmann