Add `take`/`drop`/`splitAt` to `Data.Map`/`Data.Set`

Tue Mar 8 00:33:46 UTC 2016

Sets and maps don't inherently have orderings, but Set and Map do. I think
you could still make an argument for retaining the constraint, but it's a
thin one. It is possible to imagine that there could be some *other*
Ord-based representation of sets and maps for which having the ordering
directly available would lead to more efficient splits. Retaining the
constraint could then be seen as forward compatibility with such a
hypothetical reimplementation.
On Mar 7, 2016 7:26 PM, "Dan Burton" <danburton.email at gmail.com> wrote:

> I would prefer that the Ord constraint be retained in the type signature,
> even if not used in the implementation. Sets and Maps conceptually do not
> have an ordering; the Ord constraint indicates in which order one is
> sequencing the values.
>
> -- Dan Burton
>
> On Mon, Mar 7, 2016 at 4:14 PM, Gabriel Gonzalez <gabriel439 at gmail.com>
> wrote:
>
>> I would like to propose adding `take`/`drop`/`splitAt` to both `Data.Map`
>> and `Data.Set` as originally requested in:
>>
>> https://github.com/haskell/containers/issues/135
>>
>> The motivation behind this proposal is three-fold:
>>
>> * for convenience - these functions are commonly used to implement
>> pagination or previews of maps/sets
>> * for type accuracy - the public API impose an unnecessary `Ord`
>> constraint
>> * for efficiency - these can be implemented more efficiently using the
>> internal API
>>
>> Currently the only way you can implement this functionality via the
>> public API is to use `lookupIndex`/`elemAt` + `split`.  For example, one
>> way to implement `Data.Set.take` is:
>>
>>
>> take :: Ord a => Int -> Set a -> Set a
>> take n m
>>     | n      <  0 = empty
>>     | size m <= n = m
>>     | otherwise   = lt
>>   where
>>     (lt, _) = split k m
>>     k       = elemAt n m
>> {-# INLINE take #-}
>>
>>
>> This implementation incurs an unnecessary `Ord` constraint due to a
>> roundabout way of computing `take`: this extracts the element at the given
>> index and then works backwards from the element’s value to partition the
>> set using O(log N) comparisons.  We could eliminate all of the comparisons
>> by using the internal API.
>>
>> Intuitively, we expect that the performance of `Data.Set.take` would
>> benefit from avoiding those unnecessary comparisons and also avoiding
>> traversing the `Set`’s spine twice.  So I tested that hypothesis by
>> implementing `take` via the internal API like this:
>>
>> take :: Int -> Set a -> Set a
>> take n0 s0 = go s0 n0
>>   where
>>     go s@(Bin sz x l r) n =
>>         if sz <= n
>>         then s
>>         else
>>             let sl = size l
>>             in  if n <= sl
>>                 then go l n
>>                 else link x l (go r $! n - sl)
>>     go Tip _ = Tip
>> {-# INLINE take #-}
>>
>>
>> I then added the following benchmark to `benchmarks/Set.hs`:
>>
>> *diff --git a/benchmarks/Set.hs b/benchmarks/Set.hs*
>> *index 3a6e8aa..03c99fb 100644*
>> *--- a/benchmarks/Set.hs*
>> *+++ b/benchmarks/Set.hs*
>> @@ -31,6 +31,7 @@ main = do
>>          , bench "union" $ whnf (S.union s_even) s_odd
>>          , bench "difference" $ whnf (S.difference s) s_even
>>          , bench "intersection" $ whnf (S.intersection s) s_even
>> +        , bench "take" $ whnf (S.take (2^11)) s
>>          , bench "fromList" $ whnf S.fromList elems
>>          , bench "fromList-desc" $ whnf S.fromList (reverse elems)
>>          , bench "fromAscList" $ whnf S.fromAscList elems
>>
>>
>> Here is the performance on my machine when implementing `take` via the
>> public API:
>>
>> benchmarking take
>> time                 272.8 ns   (266.7 ns .. 278.1 ns)
>>                      0.997 R²   (0.996 R² .. 0.998 R²)
>> mean                 266.3 ns   (261.8 ns .. 270.8 ns)
>> std dev              15.44 ns   (13.26 ns .. 18.95 ns)
>> variance introduced by outliers: 75% (severely inflated)
>>
>>
>> … and the performance improved by 61% from using the internal API:
>>
>> benchmarking take
>> time                 169.2 ns   (166.1 ns .. 172.6 ns)
>>                      0.997 R²   (0.996 R² .. 0.998 R²)
>> mean                 172.1 ns   (169.4 ns .. 175.4 ns)
>> std dev              10.68 ns   (8.420 ns .. 15.34 ns)
>> variance introduced by outliers: 78% (severely inflated)
>>
>>
>> … and I’m guessing (but haven’t tested) that the performance gap would
>> only increase the more expensive the comparison function gets.
>>
>> I haven’t performed comparative performance testing for `drop`/`splitAt`
>> nor have I tested `Map` (because the benchmarks take a while for me to
>> build and run) but I can perform those additional comparisons upon requests
>> if people feel they are necessary.
>>
>> I haven’t yet written up a full patch since the maintainer asked me to
>> first run this proposal by the libraries mailing list to assess whether it
>> would be wise to expand the `containers` API to include these utilities.
>>
>> The deadline for discussion is two weeks.
>>
>> _______________________________________________
>> Libraries mailing list
>> Libraries at haskell.org
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
>>
>>
>
> _______________________________________________
> Libraries mailing list
> Libraries at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/libraries/attachments/20160307/ac5e664d/attachment.html>