[Haskell-cafe] ANNOUNCE: set-monad

Fri Jun 22 00:04:48 CEST 2012

Hi Tillmann,

Thanks for your interesting question regarding the performance
overheads of the Data.Set.Monad wrapper compared to the original
Data.Set library.

If you use set-specific functions there will not be any difference in
asymptotic complexity between Data.Set.Monad and Data.Set.

In terms of raw performance, there will be a tiny overhead when
unpacking sets from the Prim constructor. But this should be
negligible, as set-specific operations performed using Data.Set under
the hood will likely dominate the runtime.

Now, how does the runtime complexity of overloaded functions from the
Functor, Monad and other supported type classes compared to the
set-specific ones?

You are right that there is a problem when mplus is composed with >>=.
I have not done a proper investigation and benchmarking, but I think
(and hope) this should be the only problematic case when it comes to
runtime complexity. Luckily, there is an easy workaround (see below).

Let us take a detailed look at this case:

run (Bind (Plus ma mb) f) = run (Plus (Bind ma f) (Bind mb f))

Should ma mb be equal, this will run the function f twice
(unnecessarily). So does this mean that there is only the factor of 2
slowdown compared to the case where the sets are combined before the
bind. Unfortunately, the answer is no, ma and mb could be constructed
with Plus as well. In the worst case the slowdown is k, where k is the
number equal sets in the tree of Pluses (in the worst case all sets in
the tree are equal) in the left hand side of the bind.

The key to the approach used in set-monad is to make progress with the
evaluation of the unconstrained constructors (i.e., Return, Bind, Zero
and Plus) without using constrained set-specific operations. It turns
out that for several cases one can progress with the evaluation
without duplicating f (evaluation relies on monoid laws, Plus is
associative and Zero is left and right identity of Plus). I have
pushed those optimisations to the repo. These optimisations also cover
your example.

But there is one case for which I have not found a way to progress
with the evaluation without duplicating f:

run (Bind (Plus (Bind _ _) (Bind _ _)) f)

For this case, I am still duplicating f. I will have a look whether it
is possible to avoid duplication.

Luckily, the aforementioned limitation can be avoided by using the
mappend function from the Monoid type class instead of mplus from the
MonadPlus type class. Also the desugaring of the do and monad
comprehension notations do not produce calls to mplus.

In your email, you have also asked:

> I suspect that I can achieve similar results by using the list monad and
> converting to a set in the very end. In what situations can I benefit from
> set-monad?

Sometimes set is a more appropriate data structure than list. That is
why the modules like Data.Set and Data.HashSet exist in the first
place. Different operations on sets and lists may have different
runtime complexities. For example, if your problem requires frequent
checking of whether an element exists in a collection then set may be
more appropriate than list. This is just one example.

More generally, if the programmer does not care about order or does
not want duplicates in a collection, again set seems to be more
appropriate than list. If the programmer uses lists in this scenario,
she needs to explicitly eliminate duplicates. In addition to being
inconvenient (if the programmer has to do it frequently), duplicate
elimination for lists is an expensive operation.

The set-monad library itself (by exporting the Monad instance for
sets) could also have an educational value, by allowing the
comprehension notation to be used for sets (and not just for lists).
This recreates the original set comprehension notation from
mathematics in Haskell. Math definitions that use comprehensions can
be directly transcribed to Haskell. It could also be useful to
highlight conceptual differences between lists and sets in Haskell
without different notations getting in the way.

Cheers, George

On 16 June 2012 09:57, Tillmann Rendel <rendel at informatik.uni-marburg.de> wrote:
> Hi George,
>
>
> George Giorgidze wrote:
>>
>> I would like to announce the first release of the set-monad library.
>>
>> On Hackage: http://hackage.haskell.org/package/set-monad
>
>
> Very cool. Seems to work fine. But I am wondering about the impact of using
> your package on asymptotic complexity (and thereby, on performance).
>
>
> From your implementation:
>>
>> data Set a where
>>  Prim   :: (Ord a) => S.Set a -> Set a
>>  [...]
>>  Zero   :: Set a
>>  Plus   :: Set a -> Set a -> Set a
>
>
> I notice that this part of your datatype looks like a binary tree of sets.
> Lets see how your run function treats it:
>
>> run :: (Ord a) => Set a -> S.Set a
>> run (Prim s)              = s
>> [...]
>> run (Zero)                = S.empty
>> run (Plus ma mb)          = S.union (run ma) (run mb)
>
>
> As expected, the Prim-Zero-Plus structure is treated as a binary tree of
> sets, and run collects all the sets together. That is probably fine, because
> it should have the same complexity as combining these sets in the first
> place.
>
>> run (Bind (Prim s) f)     = S.foldl' S.union S.empty (S.map (run . f) s)
>> [...]
>> run (Bind Zero _)         = S.empty
>> run (Bind (Plus ma mb) f) = run (Plus (Bind ma f) (Bind mb f))
>> [...]
>
>
> But when I use the binary tree of sets on the left-hand side of a bind, your
> run function has to naively traverse the whole tree. So if the same elements
> are included in many sets in the tree of sets, these elements will be
> processed more than once, so the overall complexity is worse than necessary.
>
>
>
> Here's a ghci session that seems to confirm my suspicion. I first define a
> function using set-monad's convenient monad instance for sets:
>>
>> $ :m +Control.Monad Data.Set.Monad
>> $ let s1 `times` s2 = do {e1 <- s1; e2 <- s2; return (e1, e2)}
>
>
> Now I produce some test data:
>>
>> $ let numbers = fromList [1 .. 1000]
>> $ let unioned = numbers `union` numbers
>> $ let mplused = numbers `mplus` numbers
>
>
> Note that these three sets are all equivalent.
>>
>> $ (numbers == unioned, numbers == mplused, unioned == mplused)
>> (True, True, True)
>
>
> However, they behave differently when passed to the times function above.
> numbers and unioned are similarly "fast":
>>
>> $ :set +s
>> $ size $ numbers `times` numbers
>> 1000000
>> (2.56 secs, 1315452984 bytes)
>>
>> $ size $ unioned `times` unioned
>> (2.39 secs, 1314950600 bytes)
>> 1000000
>
>
> (Why is unioned faster then numbers? Is union doing some rebalancing? Can I
> trigger that effect directly?)
>
> But mplused is much slower:
>>
>> $ size $ mplused `times` mplused
>> 1000000
>> (10.83 secs, 5324731444 bytes)
>
>
>
> I suspect that I can achieve similar results by using the list monad and
> converting to a set in the very end. In what situations can I benefit from
> set-monad?
>
>  Tillmann