<div dir="ltr"><div><span style="font-family:monospace,monospace">Hello,</span><br></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">This is a 3-part proposal for improving the time performance of Data.Graph.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">The related PR for <span style="color:rgb(34,34,34);font-family:monospace,monospace;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">proposal</span> I is <a href="https://github.com/haskell/containers/pull/549">https://github.com/haskell/containers/pull/549</a>, including some graph benchmarks.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Reading the PR may be a good introduction to this material, if you're not yet familiar with Data.Graph!</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">I'm by advance sorry for the length of this mail, but since these are all related proposals I thought it was best to keep them together. I may be wrong, if you feel the number of issues addressed herein is too big, I can start a thread with just Proposal I for example.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Note that I may or may not have time to do more work than done in the current PR, but I hope this discussion will be of use to anyone wanting to follow-up on this.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">That being said, I'm looking forward to hearing what the community thinks of these proposals!</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">- Proposal I : Optimize the time complexity of (key -> Maybe Vertex) lookups and graph creation.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Currently, (key -> Maybe Vertex) lookups returned by graphFromEdges consist of a binary search on an array, with a time complexity of O(log V) (I will use V for "Count of vertices", E for "Count of edges").</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">When key is Integral, and keys (of nodes passed to the graph creation function) form a set of /consecutive/ values, we can have an O(1) lookup by substracting the value of the smallest key, and checking for bounds.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Hence, graph creation complexity is improved, and user algorithms using (key -> Maybe Vertex) lookups will see their complexity reduced by a factor of up-to O(log V).</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">The PR introduces this lookup and uses it for functions graphFromEdgesWithConsecutiveKeys and graphFromEdgesWithConsecutiveAscKeys.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Note that in this PR, graphFromEdges in unchanged, so it doesn't benefit from these improvements (see Proposal II below).</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Here is a summary of complexities as they stand in the current state of the PR:</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">-------------------------------------------------------------------------------------</font></div><div><font face="monospace, monospace">|                                      |          Time complexities                 |</font></div><div><font face="monospace, monospace">|          Functions                   |---------------------------------------------</font></div><div><font face="monospace, monospace">|                                      | graph creation     | (key -> Maybe Vertex) |</font></div><div><font face="monospace, monospace">-------------------------------------------------------------------------------------</font></div><div><font face="monospace, monospace">| graphFromEdges                       | O( (V+E) * log V ) | O(log V)              |</font></div><div><font face="monospace, monospace">| graphFromEdgesWithConsecutiveKeys    | O( E + (V*log V) ) | O(1)                  |</font></div><div><font face="monospace, monospace">| graphFromEdgesWithConsecutiveAscKeys | O( V+E )           | O(1)                  |</font></div><div><font face="monospace, monospace">-------------------------------------------------------------------------------------</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">- Proposal II : Make users of graphFromEdges benefit from performance improvements "for free" when they use Integral keys that are consecutive.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">We could use a rewrite rule to redirect graphFromEdges (when key is Integral) to a function checking if the keys are successive or not, to decide if the optimized lookup can be used.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">The problem with this is that some potential optimizations would be hidden-away, consider this example:</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">-- 'userAlgorithm' doesn't have the (Integral a) constraint</font></div><div><font face="monospace, monospace">userAlgorithm :: (Ord a) => ...</font></div><div><font face="monospace, monospace">userAlgorithm =</font></div><div><font face="monospace, monospace">  (...)</font></div><div><font face="monospace, monospace">-- hence, the rewrite rule here will never fire:</font></div><div><font face="monospace, monospace">  graphFromEdges ...</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">userProgram :: (Integral a, Ord a) => a -> ...</font></div><div><font face="monospace, monospace">userProgram input =</font></div><div><font face="monospace, monospace">  userAlgorithm input</font></div><div><font face="monospace, monospace">-- Eventhough input /is/ Integral (and its keys may be consecutive),</font></div><div><font face="monospace, monospace">-- a potential optimization is lost because 'userAlgorithm' doesn't have the (Integral a) constraint.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">To address this, I propose to introduce the following a breaking change : instead of using a rewrite rule, just add an (Integral key) constraint on graphFromEdges, and do the check (wether optimized lookups can be used) in graphFromEdges itself.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Then, in the previously described scenario, the user would see a compilation error indicating that (Integral a) could not be deduced from the context.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Hence, the user would be more likely to just add the Integral constraint on myAlgorithm (and benefit from optimizations) than to use graphFromEdgesOld, which will have been added to do exactly what graphFromEdges does today, and which is intended to be used when the key is genuinely not Integral.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">And for completeness, note that we could have the idea of the rewrite rule on graphFromEdgesOld, eventhough I'm not sure it will benefit any real use case (but who knows...).</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">- Proposal III : Deprecate `graphFromEdges` taking [(node, key, [key])] in favor of `graphFromMap` taking (Map key (node,[key]))</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">While working on the PR, I found two arguments in favor of this change:</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">  -a) Today in graphFromEdges, if we pass the same key twice, it is undefined which node for that key will actually be used.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">  graphFromMap, taking a (Map key (node,[key]) would alleviate this issue.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">  -b) "sortBy is expensive, even on sorted input": I became aware of this as I looked</font></div><div><font face="monospace, monospace">  at these 2 benchmarks:</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">    benchmarking Degree_Zero/1000_ascending_graphFromConsecutiveAscKeys</font></div><div><font face="monospace, monospace">    time                 53.55 μs   (52.44 μs .. 54.76 μs)</font></div><div><font face="monospace, monospace">                         0.991 R²   (0.986 R² .. 0.995 R²)</font></div><div><font face="monospace, monospace">    mean                 57.41 μs   (55.02 μs .. 61.75 μs)</font></div><div><font face="monospace, monospace">    std dev              10.64 μs   (5.100 μs .. 17.09 μs)</font></div><div><font face="monospace, monospace">    variance introduced by outliers: 95% (severely inflated)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">    benchmarking Degree_Zero/1000_ascending_graphFromConsecutiveKeys</font></div><div><font face="monospace, monospace">    time                 75.34 μs   (73.79 μs .. 76.85 μs)</font></div><div><font face="monospace, monospace">                         0.992 R²   (0.987 R² .. 0.995 R²)</font></div><div><font face="monospace, monospace">    mean                 74.67 μs   (72.29 μs .. 77.47 μs)</font></div><div><font face="monospace, monospace">    std dev              8.335 μs   (6.692 μs .. 10.70 μs)</font></div><div><font face="monospace, monospace">    variance introduced by outliers: 85% (severely inflated)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">In the second benchmark, we do the same work as in the first benchmark, except for an initial</font></div><div><font face="monospace, monospace">sortBy on an already sorted list, which introduces a 30% overhead! Looking at the implementation</font></div><div><font face="monospace, monospace">of sortBy, I saw that there was no initial check to see if the list was sorted or not.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Since we use sortBy in graphFromEdges, we could check wether the input is sorted or not, before using sortBy.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">But also, if we used a Map (this is the reason I use this example here), there would be no need to sortBy because Map.toAscList gives exactly the sorted list we want.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">This constitutes another argument for this proposal.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Note that graphFromEdgesWithConsecutiveKeys and graphFromEdgesWithConsecutiveAscKeys would also</font></div><div><font face="monospace, monospace">be deprecated in favor of graphFromConsecutiveMap, because they have the same a) and b) issues.</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">Naming proposal N1:</font></div><div><font face="monospace, monospace">    - graphFromEdges                 (takes a List, deprecated, existing function)</font></div><div><font face="monospace, monospace">    - graphFromEdgesInMap            (takes a Map)</font></div><div><font face="monospace, monospace">    - graphFromEdgesInConsecutiveMap (takes a Map with consecutive keys)</font></div><div><font face="monospace, monospace">Naming proposal N2:</font></div><div><font face="monospace, monospace">    - graphFromEdges                 (takes a List, deprecated, existing function)</font></div><div><font face="monospace, monospace">    - graphFromMap</font></div><div><font face="monospace, monospace">    - graphFromConsecutiveMap</font></div><div><font face="monospace, monospace"> with these, to reflect the Map / List duality in the naming scheme:</font></div><div><font face="monospace, monospace">    - graphFromList               (takes a List, deprecated, redirects to graphFromEdges)</font></div><div><font face="monospace, monospace">    - graphFromConsecutiveList    (takes a List, deprecated, redirects to graphFromEdgesWithConsecutiveKeys)</font></div><div><font face="monospace, monospace">    - graphFromConsecutiveAscList (takes a List, deprecated, redirects to graphFromEdgesWithConsecutiveAscKeys)</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">--</font></div><div><font face="monospace, monospace">Olivier Sohn</font></div><div><font face="monospace, monospace"><br></font></div><div><font face="monospace, monospace">PS : when writing the criterion graph benchmarks, I discovered that Graphs can be</font></div><div><font face="monospace, monospace">evaluated to normal form by criterion only when they are acyclic. Cyclic graphs</font></div><div><font face="monospace, monospace">will make the evaluation loop infinitely. I wonder why that is, if anyone has an</font></div><div><font face="monospace, monospace">explanation on that subject I'd be happy to hear it! Or should I submit a bug report</font></div><div><font face="monospace, monospace">at criterion?</font></div><div><br></div></div>