Discussion: Implementation of fromList for Data.Set and Data.Map

Fri Feb 10 08:59:43 UTC 2017

Dear all,

yes, as Edward says, the current implementation just alleviated the need
to use fromDistinctAscList -- it was quite a common case of using
fromList (because many functions produce sorted lists).

Some adaptive algorithm depending on number of inversions (by inversion
I mean $(x, y)$ such that $x$ preceeds $y$ and $x > y$), ideally
with complexity $O(n \log (2+number_of_inversions/n))$, would definitely
be desirable. I tried to experiment with it when I implemented this
"use linear-time algorithm if sorted", but as noted in the e4e9a97
commit, I have no idea how to keep the constant factors small.

Cheers,
Milan Straka

> -----Original message-----
> From: David Feuer <david.feuer at gmail.com>
> Sent: 9 Feb 2017, 20:26
>
> Merging the sorted runs will not hurt big-O complexity. When I replaced the
> union algorithms, Ross Paterson noted that the time bounds on the new ones
> were good enough that starting from singletons we can take the unions in
> any order and build the set/map in O(n log n) time. My main concern is
> about constant factors. Like you said, I'll have to benchmark a lot.
> 
> On Feb 9, 2017 8:16 PM, "Edward Kmett" <ekmett at gmail.com> wrote:
> 
> > The current implementation was really just an attempt to improve over the
> > old implementation which did the naive construction. It did improve things
> > for the sorted list case to the point where there is very little marginal
> > benefit to using fromDistinctAscList any more, but it very much isn't the
> > final word on the topic. =)
> >
> > Finding something that handles the middle cases where you have "mostly
> > sorted" data better would be a good idea, but definitely needs to be
> > benchmarked to death, and it would help to know if the merges of runs
> > affect the overall asymptotic performance.
> >
> > Timsort is both fast and popular and includes a similar pass (it also
> > happens to include handling strictly descending runs as mentioned), so it
> > seems that there is room to do something similar here.
> >
> > -Edward
> >
> > On Thu, Feb 9, 2017 at 3:20 PM, David Feuer <david.feuer at gmail.com> wrote:
> >
> >> Currently, the implementations of fromList for Data.Set and Data.Map
> >> try to take advantage of inputs with some prefix in strictly
> >> increasing order. It uses the fast (O (n)) fromDistinctAscList
> >> algorithm for the strictly ascending prefix, and then falls back on
> >> the slower (O (n log n)) naive algorithm for the rest. For Data.Set,
> >> this works roughly like the implementation sketched at the bottom of
> >> this message.
> >>
> >> This whole thing strikes me as a bit odd: changing just the first
> >> element of the input list can have a substantial impact on the overall
> >> performance.
> >>
> >> In my opinion, there are two reasonable ways to implement this function:
> >>
> >> 1. Don't bother to try to take advantage of preexisting ordering. This
> >> is the simplest option, and will probably be fastest for lists with
> >> little preexisting order.
> >>
> >> 2. Try to take advantage of sorted ascending and descending "runs". A
> >> simple version would be to break up the input list into ascending and
> >> descending segments (merging duplicates on the fly), use
> >> fromDistinctAscList or fromDistinctDescList to convert each segment,
> >> and combine the segments using set unions. This option will be slower
> >> for totally unsorted lists, but should degrade much more gradually
> >> than the current algorithm as disorder is added. Wren suggests that
> >> there may be other variations on the same theme that we could
> >> consider.
> >>
> >> Should we
> >>
> >> A. Switch to 1 (and offer 2 by another name)
> >> B. Switch to 2 (and offer 1 by another name?)
> >> C. Leave fromList alone and offer 2 by another name
> >>
> >> Note that option 1 can be implemented quite easily outside the library
> >> using insert and foldl'; option 2 is not something users should be
> >> expected to write themselves.
> >>
> >> If we add names, what should those be?
> >>
> >> Thanks,
> >> David Feuer
> >>
> >> Current implementation sketch:
> >>
> >> fromList :: Ord a => [a] -> Set a
> >> fromList xs = foldl' (flip insert) sortedSet unsortedList
> >>   where
> >>      sortedSet = fromDistinctAscList sortedList
> >>      (sortedList, unsortedList) = spanStrictlyAscending xs
> >>
> >> spanStrictlyAscending [] = ([], [])
> >> spanStrictlyAscending (x : xs) = x : go x xs
> >>   where
> >>      go prev (n : ns)
> >>        | prev < n = first (n :) $ go n ns
> >>      go _ ns = ([], ns)
> >> _______________________________________________
> >> Libraries mailing list
> >> Libraries at haskell.org
> >> http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
> >>
> >
> >