Add `take`/`drop`/`splitAt` to `Data.Map`/`Data.Set`

Tue Mar 8 18:54:45 UTC 2016

Good idea, +1.

On 03/08/2016 02:14 AM, Gabriel Gonzalez wrote:
> I would like to propose adding `take`/`drop`/`splitAt` to both
> `Data.Map` and `Data.Set` as originally requested in:
> 
> https://github.com/haskell/containers/issues/135
> 
> The motivation behind this proposal is three-fold:
> 
> * for convenience - these functions are commonly used to implement
> pagination or previews of maps/sets
> * for type accuracy - the public API impose an unnecessary `Ord` constraint
> * for efficiency - these can be implemented more efficiently using the
> internal API
> 
> Currently the only way you can implement this functionality via the
> public API is to use `lookupIndex`/`elemAt` + `split`.  For example, one
> way to implement `Data.Set.take` is:
> 
> 
>     take :: Ord a => Int -> Set a -> Set a
>     take n m
>         | n      <  0 = empty
>         | size m <= n = m
>         | otherwise   = lt
>       where
>         (lt, _) = split k m
>         k       = elemAt n m
>     {-# INLINE take #-}
> 
> 
> This implementation incurs an unnecessary `Ord` constraint due to a
> roundabout way of computing `take`: this extracts the element at the
> given index and then works backwards from the element’s value to
> partition the set using O(log N) comparisons.  We could eliminate all of
> the comparisons by using the internal API.
> 
> Intuitively, we expect that the performance of `Data.Set.take` would
> benefit from avoiding those unnecessary comparisons and also avoiding
> traversing the `Set`’s spine twice.  So I tested that hypothesis by
> implementing `take` via the internal API like this:
> 
>     take :: Int -> Set a -> Set a
>     take n0 s0 = go s0 n0
>       where
>         go s@(Bin sz x l r) n =
>             if sz <= n
>             then s
>             else
>                 let sl = size l
>                 in  if n <= sl
>                     then go l n
>                     else link x l (go r $! n - sl)
>         go Tip _ = Tip
>     {-# INLINE take #-}
> 
> 
> I then added the following benchmark to `benchmarks/Set.hs`:
> 
>     *diff --git a/benchmarks/Set.hs b/benchmarks/Set.hs*
>     *index 3a6e8aa..03c99fb 100644*
>     *--- a/benchmarks/Set.hs*
>     *+++ b/benchmarks/Set.hs*
>     @@ -31,6 +31,7 @@main = do
>              , bench "union" $ whnf (S.union s_even) s_odd
>              , bench "difference" $ whnf (S.difference s) s_even
>              , bench "intersection" $ whnf (S.intersection s) s_even
>     +        , bench "take" $ whnf (S.take (2^11)) s
>              , bench "fromList" $ whnf S.fromList elems
>              , bench "fromList-desc" $ whnf S.fromList (reverse elems)
>              , bench "fromAscList" $ whnf S.fromAscList elems
> 
> 
> Here is the performance on my machine when implementing `take` via the
> public API:
> 
>     benchmarking take
>     time                 272.8 ns   (266.7 ns .. 278.1 ns)
>                          0.997 R²   (0.996 R² .. 0.998 R²)
>     mean                 266.3 ns   (261.8 ns .. 270.8 ns)
>     std dev              15.44 ns   (13.26 ns .. 18.95 ns)
>     variance introduced by outliers: 75% (severely inflated)
> 
> 
> … and the performance improved by 61% from using the internal API:
> 
>     benchmarking take
>     time                 169.2 ns   (166.1 ns .. 172.6 ns)
>                          0.997 R²   (0.996 R² .. 0.998 R²)
>     mean                 172.1 ns   (169.4 ns .. 175.4 ns)
>     std dev              10.68 ns   (8.420 ns .. 15.34 ns)
>     variance introduced by outliers: 78% (severely inflated)
> 
> 
> … and I’m guessing (but haven’t tested) that the performance gap would
> only increase the more expensive the comparison function gets.
> 
> I haven’t performed comparative performance testing for `drop`/`splitAt`
> nor have I tested `Map` (because the benchmarks take a while for me to
> build and run) but I can perform those additional comparisons upon
> requests if people feel they are necessary.
> 
> I haven’t yet written up a full patch since the maintainer asked me to
> first run this proposal by the libraries mailing list to assess whether
> it would be wise to expand the `containers` API to include these utilities.
> 
> The deadline for discussion is two weeks.
> 
> 
> _______________________________________________
> Libraries mailing list
> Libraries at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
>