Proposal : Data.Graph performance improvements (graph creation and lookups when node keys are consecutive)

Olivier S. olivier.sohn at gmail.com
Fri Mar 30 14:48:28 UTC 2018


Hello,

This is a 3-part proposal for improving the time performance of Data.Graph.

The related PR for proposal I is
https://github.com/haskell/containers/pull/549, including some graph
benchmarks.

Reading the PR may be a good introduction to this material, if you're not
yet familiar with Data.Graph!

I apologize in advance for the length of this mail, but since these
proposals are all related I thought it best to keep them together. If you
feel the number of issues addressed here is too big, I can start a thread
with just Proposal I, for example.

Note that I may not have time to do more work beyond the current PR, but I
hope this discussion will be useful to anyone wanting to follow up on it.

That being said, I'm looking forward to hearing what the community thinks
of these proposals!

- Proposal I : Optimize the time complexity of (key -> Maybe Vertex)
lookups and graph creation.

Currently, the (key -> Maybe Vertex) lookup returned by graphFromEdges
performs a binary search on an array, with a time complexity of O(log V).
(I will use V for the number of vertices and E for the number of edges.)

When key is Integral and the keys (of the nodes passed to the graph
creation function) form a set of /consecutive/ values, we can have an O(1)
lookup by subtracting the value of the smallest key and checking for bounds.
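
A minimal sketch of such a lookup (consecutiveLookup is my name for it, not
the PR's), assuming minKey and maxKey are the smallest and largest keys seen
at graph creation:

import Data.Graph (Vertex)

-- O(1): a bounds check, then an offset from the smallest key.
consecutiveLookup :: Integral key => key -> key -> key -> Maybe Vertex
consecutiveLookup minKey maxKey k
  | k < minKey || k > maxKey = Nothing
  | otherwise                = Just (fromIntegral (k - minKey))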

Hence, graph creation complexity is improved, and user algorithms using
(key -> Maybe Vertex) lookups will see their complexity reduced by a factor
of up to O(log V).

The PR introduces this lookup and uses it in the new functions
graphFromEdgesWithConsecutiveKeys and graphFromEdgesWithConsecutiveAscKeys.

Note that in this PR, graphFromEdges is unchanged, so it doesn't benefit
from these improvements (see Proposal II below).
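
For example, assuming the new functions keep the shape of the
graphFromEdges signature (please check the PR for the exact types), a graph
over the consecutive ascending keys 0..3:

import Data.Graph (Graph, Vertex)

example :: (Graph, Vertex -> (String, Int, [Int]), Int -> Maybe Vertex)
example = graphFromEdgesWithConsecutiveAscKeys
  [ ("a", 0, [1])
  , ("b", 1, [2, 3])
  , ("c", 2, [])
  , ("d", 3, [0]) ]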

Here is a summary of complexities as they stand in the current state of the
PR:

-------------------------------------------------------------------------------------
|                                      |            Time complexities               |
|              Functions               |--------------------------------------------|
|                                      | graph creation     | (key -> Maybe Vertex) |
-------------------------------------------------------------------------------------
| graphFromEdges                       | O( (V+E) * log V ) | O(log V)              |
| graphFromEdgesWithConsecutiveKeys    | O( E + (V*log V) ) | O(1)                  |
| graphFromEdgesWithConsecutiveAscKeys | O( V+E )           | O(1)                  |
-------------------------------------------------------------------------------------

- Proposal II : Make users of graphFromEdges benefit from performance
improvements "for free" when they use Integral keys that are consecutive.

We could use a rewrite rule to redirect graphFromEdges (when the key is
Integral) to a function checking whether the keys are consecutive, to
decide if the optimized lookup can be used.
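
As a side note, a rewrite rule cannot match on the Integral class itself,
so in practice there would be one rule per concrete key type. A rough
sketch for Int (graphFromEdgesChecked is a hypothetical name, not something
in the PR):

import Data.Graph (Graph, Vertex, graphFromEdges)

-- Hypothetical function that would test at runtime whether the keys are
-- consecutive, and pick the O(1) lookup when they are:
graphFromEdgesChecked
  :: [(node, Int, [Int])]
  -> (Graph, Vertex -> (node, Int, [Int]), Int -> Maybe Vertex)
graphFromEdgesChecked = undefined  -- sketch only

-- Fires when graphFromEdges is used at key ~ Int:
{-# RULES
"graphFromEdges/Int" graphFromEdges = graphFromEdgesChecked
  #-}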

The problem with this is that some potential optimizations would be hidden
away; consider this example:

-- 'userAlgorithm' doesn't have the (Integral a) constraint
userAlgorithm :: (Ord a) => ...
userAlgorithm =
  (...)
  -- hence, the rewrite rule here will never fire:
  graphFromEdges ...

userProgram :: (Integral a, Ord a) => a -> ...
userProgram input =
  userAlgorithm input
-- Even though input /is/ Integral (and its keys may be consecutive),
-- a potential optimization is lost because 'userAlgorithm' doesn't have
-- the (Integral a) constraint.

To address this, I propose to introduce a breaking change: instead of using
a rewrite rule, just add an (Integral key) constraint to graphFromEdges,
and do the check (whether the optimized lookups can be used) in
graphFromEdges itself.
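
The consecutiveness check itself is cheap; here is a minimal sketch,
assuming it runs on the already-sorted key list that graphFromEdges builds
anyway:

-- Sorted ascending keys are consecutive iff every adjacent pair differs
-- by exactly one; this also rules out duplicate keys. O(V).
consecutiveAsc :: Integral key => [key] -> Bool
consecutiveAsc ks = and (zipWith (\a b -> b == a + 1) ks (drop 1 ks))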

Then, in the previously described scenario, the user would see a
compilation error indicating that (Integral a) could not be deduced from
the context.

Hence, the user would be more likely to just add the Integral constraint to
userAlgorithm (and benefit from the optimizations) than to switch to
graphFromEdgesOld, a new function doing exactly what graphFromEdges does
today, intended for keys that are genuinely not Integral.

And for completeness, note that we could also apply the rewrite-rule idea
to graphFromEdgesOld, even though I'm not sure it would benefit any real
use case (but who knows...).

- Proposal III : Deprecate `graphFromEdges` taking [(node, key, [key])] in
favor of `graphFromMap` taking (Map key (node,[key]))

While working on the PR, I found two arguments in favor of this change:

  -a) Today in graphFromEdges, if we pass the same key twice, it is
undefined which node for that key will actually be used.

  graphFromMap, taking a (Map key (node,[key])), would alleviate this
issue, since a Map cannot contain duplicate keys.

  -b) "sortBy is expensive, even on sorted input": I became aware of this
as I looked
  at these 2 benchmarks:

    benchmarking Degree_Zero/1000_ascending_graphFromConsecutiveAscKeys
    time                 53.55 μs   (52.44 μs .. 54.76 μs)
                         0.991 R²   (0.986 R² .. 0.995 R²)
    mean                 57.41 μs   (55.02 μs .. 61.75 μs)
    std dev              10.64 μs   (5.100 μs .. 17.09 μs)
    variance introduced by outliers: 95% (severely inflated)

    benchmarking Degree_Zero/1000_ascending_graphFromConsecutiveKeys
    time                 75.34 μs   (73.79 μs .. 76.85 μs)
                         0.992 R²   (0.987 R² .. 0.995 R²)
    mean                 74.67 μs   (72.29 μs .. 77.47 μs)
    std dev              8.335 μs   (6.692 μs .. 10.70 μs)
    variance introduced by outliers: 85% (severely inflated)

In the second benchmark, we do the same work as in the first benchmark,
except for an initial sortBy on an already-sorted list, which introduces a
30% overhead! Looking at the implementation of sortBy, I saw that there is
no initial check to see whether the list is already sorted.

Since we use sortBy in graphFromEdges, we could check whether the input is
already sorted before calling sortBy.
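
For instance (sortIfNeededBy is a hypothetical wrapper, not an existing
function); the sortedness test is a single O(V) pass, negligible next to
the O(V log V) sort it can skip:

import Data.List (sortBy)

-- Hypothetical wrapper: skip the sort when the input is already
-- ascending with respect to the comparison function.
sortIfNeededBy :: (a -> a -> Ordering) -> [a] -> [a]
sortIfNeededBy cmp xs
  | isAscending xs = xs
  | otherwise      = sortBy cmp xs
  where
    isAscending ys = and (zipWith (\a b -> cmp a b /= GT) ys (drop 1 ys))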

But also, if we used a Map (this is the reason I use this example here),
there would be no need for sortBy at all, because Map.toAscList gives
exactly the sorted list we want.

This constitutes another argument for this proposal; see the sketch of
graphFromMap below.
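
To make this concrete, here is a rough sketch of what graphFromMap could
look like; the signature is my guess, mirroring graphFromEdges, and this is
not code from the PR:

import           Data.Graph (Graph, Vertex, buildG)
import qualified Data.Map.Strict as Map

graphFromMap
  :: Ord key
  => Map.Map key (node, [key])
  -> (Graph, Vertex -> (node, key, [key]), key -> Maybe Vertex)
graphFromMap m = (graph, nodeFromVertex, vertexFromKey)
  where
    -- O(log V) lookups straight from the Map: no sortBy, and duplicate
    -- keys are impossible by construction.
    vertexFromKey k  = Map.lookupIndex k m
    nodeFromVertex v = let (k, (node, ks)) = Map.elemAt v m in (node, k, ks)
    graph = buildG (0, Map.size m - 1)
      [ (v, w)
      | (v, (_, (_, ks))) <- zip [0 ..] (Map.toAscList m)
      , Just w <- map vertexFromKey ks ]  -- edges to unknown keys are dropped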

Note that graphFromEdgesWithConsecutiveKeys and
graphFromEdgesWithConsecutiveAscKeys would also be deprecated, in favor of
graphFromConsecutiveMap, because they have the same a) and b) issues.

Naming proposal N1:
    - graphFromEdges                 (takes a List, deprecated, existing function)
    - graphFromEdgesInMap            (takes a Map)
    - graphFromEdgesInConsecutiveMap (takes a Map with consecutive keys)

Naming proposal N2:
    - graphFromEdges                 (takes a List, deprecated, existing function)
    - graphFromMap
    - graphFromConsecutiveMap
 with these, to reflect the Map / List duality in the naming scheme:
    - graphFromList               (takes a List, deprecated, redirects to graphFromEdges)
    - graphFromConsecutiveList    (takes a List, deprecated, redirects to graphFromEdgesWithConsecutiveKeys)
    - graphFromConsecutiveAscList (takes a List, deprecated, redirects to graphFromEdgesWithConsecutiveAscKeys)

--
Olivier Sohn

PS : when writing the criterion graph benchmarks, I discovered that Graphs
can be evaluated to normal form by criterion only when they are acyclic:
cyclic graphs make the evaluation loop infinitely. I wonder why that is; if
anyone has an explanation on that subject I'd be happy to hear it! Or
should I submit a bug report to criterion?