Proposal for Data.Graph : Improve graph creation complexity when nodes have "consecutive" keys
Olivier S.
olivier.sohn at gmail.com
Mon Apr 2 23:24:35 UTC 2018
2018-04-03 0:18 GMT+02:00 Ivan Lazar Miljenovic <ivan.miljenovic at gmail.com>:
> (Sending this back to the libraries@ list as well.)
>
> On 2 April 2018 at 23:59, Olivier S. <olivier.sohn at gmail.com> wrote:
> >
> > 2018-04-02 14:34 GMT+02:00 Ivan Lazar Miljenovic
> > <ivan.miljenovic at gmail.com>:
> >>
> >> On 2 April 2018 at 21:30, Olivier S. <olivier.sohn at gmail.com> wrote:
> >> >
> >> > Hello,
> >> >
> >> > I'm resending this proposal, which is simplified w.r.t the first one,
> >> > and
> >> > where I removed a wrong analysis of a benchmark.
> >> >
> >> > - Proposal I : Optimize the time complexity of (key -> Maybe Vertex)
> >> > lookups
> >> > and graph creation when keys are Integral and consecutive.
> >> >
> >> > (The related PR for proposal I is [1], including benchmarks showing
> the
> >> > performance improvements.)
> >> >
> >> > Currently, (key -> Maybe Vertex) lookups returned by graphFromEdges
> >> > consist
> >> > of a binary search on an array, with a time complexity of O(log V) (I
> >> > will
> >> > use V for "Count of vertices", E for "Count of edges").
> >>
> >> At the risk of bikeshedding, can you please use |V| and |E| to refer
> >> to the order and size of the graph respectively?
> >
> >
> > Do you mean in the haddock documentation for complexities or here? I
> don't
> > know which is mor readable, O( (V+E) * log V ) or O( (|V|+|E|) * log |V|
> ).
> > Anyway it would be a quick change in the PR, I'm not particularly
> attached
> > to the notation.
>
> Both. |V| and |E| are more standard for this, as V and E represent
> the vertices and edges themselves.
>
> >> > When key is Integral, and keys (of nodes passed to the graph creation
> >> > function) form a set of /consecutive/ values (for example : [4,5,6,7]
> or
> >> > [5,6,4,7]), we can have an O(1) lookup by substracting the value of
> the
> >> > smallest key, and checking for bounds.
> >>
> >> I'm not sure I follow this part; are you ignoring order in these lists
> >> (you're referring to sets but using list notation)?
> >>
> >
> > I'm not ignoring the order, let me try to give a more precise definition:
> >
> > keys is a list of consecutive keys iff it verifies:
> >
> > -- (1) keys contains no duplicates
> > Set.size (Set.fromList keys) == length keys
> >
> > -- (2) there is no "gap" between values, when sorted:
> > sort keys == [minimum keys .. maximum keys]
> >
> >
> > The O(1) lookup is at line 516 of Data/Graph.hs in
> > https://github.com/haskell/containers/pull/549/files (key_vertex)
> >
> >> >
> >> > Hence, graph creation complexity is improved, and user algorithms
> using
> >> > (key
> >> > -> Maybe Vertex) lookups will see their complexity reduced by a factor
> >> > of
> >> > up-to O(log V).
> >> >
> >> > The PR introduces this lookup and uses it for functions
> >> > graphFromEdgesWithConsecutiveKeys and
> >> > graphFromEdgesWithConsecutiveAscKeys.
> >> >
> >> > Here is a summary of complexities for (graph creation, lookup
> function)
> >> > as
> >> > they stand in the current state of the PR:
> >> >
> >> > - graphFromEdges (the currently existing function):
> >> > O( (V+E) * log V ), O(log V)
> >> > - graphFromEdgesWithConsecutiveKeys (new function):
> >> > O( E + (V*log V) ), O(1)
> >> > - graphFromEdgesWithConsecutiveAscKeys (new function) :
> >> > O( V+E ), O(1)
> >> >
> >> > - Proposal II : Deprecate `graphFromEdges` taking [(node, key, [key])]
> >> > in
> >> > favor of `graphFromMap` taking (Map key (node,[key]))
> >> >
> >> > If we pass the same key twice in the list we pass to 'graphFromEdges'
> it
> >> > is
> >> > undefined which node for that key will actually be used.
> >> > Introducing 'graphFromMap', taking a (Map key (node,[key]) would
> >> > alleviate
> >> > this issue, through the type used.
> >>
> >> Off the top of my head, I'm not a big fan of this. If we're going to
> >> improve this, then I'd prefer to do so in such a way that allowed for
> >> usage with IntMap
> >
> >
> > Yes, IntMap seems to be better wrt performances than Map. Quoting the
> doc of
> > IntMap:
> >
> > This data structure performs especially well on binary operations like
> union
> > and intersection. However, my benchmarks show that it is also (much)
> faster
> > on insertions and deletions when compared to a generic size-balanced map
> > implementation (see Data.Map).
> >
> >
> >>
> >> (is there an existing type-class that covers
> >> association list-style data structures?).
> >
> >
> >
> > There is the Map type-class (which I just discovered) in :
> >
> > https://hackage.haskell.org/package/collections-api-1.0.0.
> 0/docs/Data-Collections.html#g:4
>
> Except that it's in another library ;-)
>
>
So it seems using Data.IntMap would be a good compromise?
> > With instances defined here, but only for Lazy versions Data.Map and
> > Data.IntMap:
>
> Note that the data structures for the Lazy and Strict variants of
> [Int]Map are the same, it's just the strictness of the functions that
> operate on them that differ.
>
That's interesting, I wasn't aware of this.
>
> >
> > https://hackage.haskell.org/package/collections-base-
> instances-1.0.0.0/docs/Data-Collections-BaseInstances.html
> >
> > Also, the type-class class doesn't have toAscList (or toList) functions
> > (which is what we would use in the implementation).
> >
> > So if we want to rely on this we would need to implement toAscList, and
> > probably add instances for Strict maps (Data.IntMap.Strict,
> Data.Map.Strict)
> >
> >>
> >> Ideally you could also use
> >> HashMap from unordered-containers as well, but since we ultimately
> >> want `type Vertex = Int` I'm not sure if that's worth it; IntMap,
> >> however, is.
> >>
> >
> > I see another problem with HashMap : it doesn't provide a toAscList
> function
> > where the keys are sorted, so we would have to sort them, incurring a
> fixed
> > O(V log V) cost, whereas with Map and IntMap the user has the
> possibility to
> > create the map from an ascending list (fromAscList), in O(V) time and we
> can
> > get the list back also (toAscList) in O(V) time.
> >
> >> >
> >> > Also, using a Map makes the implementation a bit more "natural" :
> there
> >> > is
> >> > no need for sorting by key, as Map.toAscList gives exactly the sorted
> >> > list
> >> > we want.
> >> >
> >> > We could also deprecate graphFromEdgesWithConsecutiveKeys and
> >> > graphFromEdgesWithConsecutiveAscKeys (introduced in proposal I) in
> favor
> >> > of
> >> > graphFromConsecutiveMap.
> >> >
> >> > About the naming, I propose two different schemes:
> >> >
> >> > Either:
> >> > - graphFromEdges (takes a List, deprecated,
> existing
> >> > function)
> >> > - graphFromEdgesInMap (takes a Map)
> >> > - graphFromEdgesInConsecutiveMap (takes a Map with consecutive
> keys)
> >> > Or:
> >> > - graphFromEdges (takes a List, deprecated,
> existing
> >> > function)
> >> > - graphFromMap
> >> > - graphFromConsecutiveMap
> >> > with these, to reflect the Map / List duality in the naming scheme:
> >> > - graphFromList (takes a List, deprecated, redirects
> >> > to
> >> > graphFromEdges)
> >> > - graphFromConsecutiveList (takes a List, deprecated, redirects
> >> > to
> >> > graphFromEdgesWithConsecutiveKeys)
> >> > - graphFromConsecutiveAscList (takes a List, deprecated, redirects
> >> > to
> >> > graphFromEdgesWithConsecutiveAscKeys)
> >> >
> >> > Cheers,
> >> > Olivier Sohn
> >> >
> >> > [1] https://github.com/haskell/containers/pull/549
> >> >
> >> >
> >> > _______________________________________________
> >> > Libraries mailing list
> >> > Libraries at haskell.org
> >> > http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
> >> >
> >>
> >>
> >>
> >> --
> >> Ivan Lazar Miljenovic
> >> Ivan.Miljenovic at gmail.com
> >> http://IvanMiljenovic.wordpress.com
> >
> >
>
>
>
> --
> Ivan Lazar Miljenovic
> Ivan.Miljenovic at gmail.com
> http://IvanMiljenovic.wordpress.com
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/libraries/attachments/20180403/86bbf31d/attachment.html>
More information about the Libraries
mailing list