Add `take`/`drop`/`splitAt` to `Data.Map`/`Data.Set`

Tue Mar 8 00:34:57 UTC 2016

I'm fine with keeping the constraint for forwards compatibility reasons
On Mon, Mar 7, 2016 at 4:33 PM David Feuer <david.feuer at gmail.com> wrote:

> Sets and maps don't inherently have orderings, but Set and Map do. I think
> you could still make an argument for retaining the constraint, but it's a
> thin one. It is possible to imagine that there could be some *other*
> Ord-based representation of sets and maps for which having the ordering
> directly available would lead to more efficient splits. Retaining the
> constraint could then be seen as forward compatibility with such a
> hypothetical reimplementation.
> On Mar 7, 2016 7:26 PM, "Dan Burton" <danburton.email at gmail.com> wrote:
>
>> I would prefer that the Ord constraint be retained in the type signature,
>> even if not used in the implementation. Sets and Maps conceptually do not
>> have an ordering; the Ord constraint indicates in which order one is
>> sequencing the values.
>>
>> -- Dan Burton
>>
>> On Mon, Mar 7, 2016 at 4:14 PM, Gabriel Gonzalez <gabriel439 at gmail.com>
>> wrote:
>>
>>> I would like to propose adding `take`/`drop`/`splitAt` to both
>>> `Data.Map` and `Data.Set` as originally requested in:
>>>
>>> https://github.com/haskell/containers/issues/135
>>>
>>> The motivation behind this proposal is three-fold:
>>>
>>> * for convenience - these functions are commonly used to implement
>>> pagination or previews of maps/sets
>>> * for type accuracy - the public API impose an unnecessary `Ord`
>>> constraint
>>> * for efficiency - these can be implemented more efficiently using the
>>> internal API
>>>
>>> Currently the only way you can implement this functionality via the
>>> public API is to use `lookupIndex`/`elemAt` + `split`.  For example, one
>>> way to implement `Data.Set.take` is:
>>>
>>>
>>> take :: Ord a => Int -> Set a -> Set a
>>> take n m
>>>     | n      <  0 = empty
>>>     | size m <= n = m
>>>     | otherwise   = lt
>>>   where
>>>     (lt, _) = split k m
>>>     k       = elemAt n m
>>> {-# INLINE take #-}
>>>
>>>
>>> This implementation incurs an unnecessary `Ord` constraint due to a
>>> roundabout way of computing `take`: this extracts the element at the given
>>> index and then works backwards from the element’s value to partition the
>>> set using O(log N) comparisons.  We could eliminate all of the comparisons
>>> by using the internal API.
>>>
>>> Intuitively, we expect that the performance of `Data.Set.take` would
>>> benefit from avoiding those unnecessary comparisons and also avoiding
>>> traversing the `Set`’s spine twice.  So I tested that hypothesis by
>>> implementing `take` via the internal API like this:
>>>
>>> take :: Int -> Set a -> Set a
>>> take n0 s0 = go s0 n0
>>>   where
>>>     go s@(Bin sz x l r) n =
>>>         if sz <= n
>>>         then s
>>>         else
>>>             let sl = size l
>>>             in  if n <= sl
>>>                 then go l n
>>>                 else link x l (go r $! n - sl)
>>>     go Tip _ = Tip
>>> {-# INLINE take #-}
>>>
>>>
>>> I then added the following benchmark to `benchmarks/Set.hs`:
>>>
>>> *diff --git a/benchmarks/Set.hs b/benchmarks/Set.hs*
>>> *index 3a6e8aa..03c99fb 100644*
>>> *--- a/benchmarks/Set.hs*
>>> *+++ b/benchmarks/Set.hs*
>>> @@ -31,6 +31,7 @@ main = do
>>>          , bench "union" $ whnf (S.union s_even) s_odd
>>>          , bench "difference" $ whnf (S.difference s) s_even
>>>          , bench "intersection" $ whnf (S.intersection s) s_even
>>> +        , bench "take" $ whnf (S.take (2^11)) s
>>>          , bench "fromList" $ whnf S.fromList elems
>>>          , bench "fromList-desc" $ whnf S.fromList (reverse elems)
>>>          , bench "fromAscList" $ whnf S.fromAscList elems
>>>
>>>
>>> Here is the performance on my machine when implementing `take` via the
>>> public API:
>>>
>>> benchmarking take
>>> time                 272.8 ns   (266.7 ns .. 278.1 ns)
>>>                      0.997 R²   (0.996 R² .. 0.998 R²)
>>> mean                 266.3 ns   (261.8 ns .. 270.8 ns)
>>> std dev              15.44 ns   (13.26 ns .. 18.95 ns)
>>> variance introduced by outliers: 75% (severely inflated)
>>>
>>>
>>> … and the performance improved by 61% from using the internal API:
>>>
>>> benchmarking take
>>> time                 169.2 ns   (166.1 ns .. 172.6 ns)
>>>                      0.997 R²   (0.996 R² .. 0.998 R²)
>>> mean                 172.1 ns   (169.4 ns .. 175.4 ns)
>>> std dev              10.68 ns   (8.420 ns .. 15.34 ns)
>>> variance introduced by outliers: 78% (severely inflated)
>>>
>>>
>>> … and I’m guessing (but haven’t tested) that the performance gap would
>>> only increase the more expensive the comparison function gets.
>>>
>>> I haven’t performed comparative performance testing for `drop`/`splitAt`
>>> nor have I tested `Map` (because the benchmarks take a while for me to
>>> build and run) but I can perform those additional comparisons upon requests
>>> if people feel they are necessary.
>>>
>>> I haven’t yet written up a full patch since the maintainer asked me to
>>> first run this proposal by the libraries mailing list to assess whether it
>>> would be wise to expand the `containers` API to include these utilities.
>>>
>>> The deadline for discussion is two weeks.
>>>
>>> _______________________________________________
>>> Libraries mailing list
>>> Libraries at haskell.org
>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
>>>
>>>
>>
>> _______________________________________________
>> Libraries mailing list
>> Libraries at haskell.org
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
>>
>>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/libraries/attachments/20160308/835ec654/attachment-0001.html>