[GHC] #10830: maximumBy has a space leak
GHC
ghc-devs at haskell.org
Thu Jan 5 15:27:25 UTC 2017
#10830: maximumBy has a space leak
-------------------------------------+-------------------------------------
Reporter: NeilMitchell | Owner:
Type: bug | Status: new
Priority: high | Milestone: 8.2.1
Component: Core Libraries | Version: 7.10.2
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture:
Type of failure: Runtime | Unknown/Multiple
performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: | Differential Rev(s): Phab:D1205
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by ekmett):
In response to Reid's prodding, I see two decisions to make here:
1.) There are ultimately 3 possible default implementations of
`minimumBy`/`maximumBy` based on `foldl1`, `foldr1`, or `foldMap`
respectively.
2.) We also need to decide whether we should add `maximumBy` and
`minimumBy` to the class.
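For concreteness, here is a minimal sketch (over lists only, with invented names, not the real base source) of the `foldl1`- and `foldr1`-based candidates; the `foldMap`-based variant needs a comparator-aware semigroup and is omitted here:

```haskell
import Data.List (foldl1')

-- foldl1-based: the accumulator can be forced as we go (foldl1' here),
-- so lists are traversed in constant stack space.
maximumByL :: (a -> a -> Ordering) -> [a] -> a
maximumByL cmp = foldl1' (\x y -> if cmp x y == GT then x else y)

-- foldr1-based (what base had at the time of this ticket): each step
-- suspends the comparison, building a thunk chain as long as the list.
maximumByR :: (a -> a -> Ordering) -> [a] -> a
maximumByR cmp = foldr1 (\x y -> if cmp x y == GT then x else y)
```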
`maximum` and `minimum` being in the class can improve the asymptotics of
these operations: e.g. `Set` now offers O(log n) `minimum` and `maximum`,
which it couldn't before. Moreover, in the case of `minimum` and `maximum`
the element type `a` is fully knowable for some containers, so they can
even handle some infinite cases.
Having them in the class also happened to let us paper over the
difference between the left fold and right fold in the old Foldable code
for `maximum` and `minimum` on lists, and fix the analogous space leak
there. We didn't go with a right fold, but with a monoidal fold
instead, as it was never asymptotically worse than the right fold we had in
Foldable before.
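The monoidal-fold version of plain `maximum` can be sketched like this (an illustration with an invented name, not base's actual definition), using the `Max` semigroup from `Data.Semigroup`:

```haskell
import Data.Semigroup (Max(..))

-- Sketch only: foldMap associates however the container prefers, so
-- this is never asymptotically worse than a foldr-based maximum.
maximumVia :: (Foldable t, Ord a) => t a -> a
maximumVia = maybe (error "maximumVia: empty structure") getMax
           . foldMap (Just . Max)
```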
But here the situation is a bit different. Without more knowledge about
the `b` type in the `(a -> b)` fitness function passed in, there is no
scenario in which a `foldr1`-based `maximumBy` can get early termination,
so there is no real justification for the current implementation. It is
leakier than a `foldl1` solution and never better!
Even trying to argue for some theoretical kind of snoc-list container
doesn't pan out: if our chosen semantics is to return the first element
that attains the maximum under the fitness function, you'd have to
reassociate all the way anyway.
So at the very least there is no justification for a `foldr1` based
solution.
Ultimately, `maximumBy` and `minimumBy` are both operations that only make
sense for finite containers. Here we're stuck touching all of the distinct
members of the container at least once, as we don't know what fitness
function the user will supply or any properties of the output type that
would let us, say, get early termination by reaching `minBound`/`maxBound`,
or exploit properties of the type, like the laziness of "Lazy Nat" max.
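As a hypothetical illustration of the early-termination idea that `maximumBy` cannot have (the name and code are invented here, not anything in base): when the element type is known to be `Bounded`, plain `maximum` could stop as soon as it sees `maxBound`:

```haskell
{-# LANGUAGE BangPatterns #-}

-- Hypothetical sketch: Ord implies Eq, and Bounded gives us a known
-- ceiling to bail out on, even mid-traversal of an infinite list.
maximumShortCircuit :: (Ord a, Bounded a) => [a] -> a
maximumShortCircuit []     = error "maximumShortCircuit: empty list"
maximumShortCircuit (x:xs)
  | x == maxBound = x               -- early exit: nothing can beat this
  | otherwise     = go x xs
  where
    go !acc []     = acc
    go !acc (y:ys)
      | y == maxBound = y           -- early exit mid-list
      | otherwise     = go (max acc y) ys
```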
Adding `maximumBy`/`minimumBy` to the class would in theory allow a few
obscure cases: the only real thing they can do is take advantage of
knowing which distinct `a`s they hold and send them through the scoring
function individually, without repetition. This would allow some
containers to exploit a monoidal fold and pay a cost proportional to the
number of distinct values in the container rather than the total number of
values. e.g. something like
http://hackage.haskell.org/package/compressed-3.10/docs/Data-Compressed-LZ78.html
could either just skim the tokens rather than do a full decompression, or
use a monoidal fold to get most of that benefit in practice. If we have k
distinct values, a monoidal-fold version of `maximumBy` for some
containers that track distinct values could be O(k) rather than O(n).
Somewhat weaker, the `compressed` code above could pay a cost proportional
to the size of the compressed data set, not the full data set.
Basing something on `foldMap` without the ability to redefine it in the
class would effectively result in the bad `foldr`-style stack usage for
lists.
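This is because the class default for `foldMap` goes through `foldr`; a simplified rendering of that default (with a hypothetical name, not the exact base source):

```haskell
-- Simplified rendering of the Foldable class default: a foldMap-based
-- maximumBy would inherit this foldr-style evaluation on lists.
foldMapDefault :: (Foldable t, Monoid m) => (a -> m) -> t a -> m
foldMapDefault f = foldr (\a acc -> f a <> acc) mempty
```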
And frankly, it'd be nice to have a better story than the "let's favor
monoidal implementations but hack in left folds for lists" pattern that
seems to be cropping up over and over!
Off the cuff, we could possibly address the root cause of this and a
couple of other issues by adding a `foldMap'` that uses a `foldl'`
evaluation order on lists with a strict monoidal accumulator, and basing
the (possibly default) definitions of both `maximum` and `maximumBy` on
that combinator. This appeals to me as it could reduce the pressure to add
more members of this sort to the class, while avoiding more of these
stack-space regressions.
I say "off the cuff" as I haven't had time to work through all the
consequences of such a combinator, and it isn't the only design in this
space: other options include making a single hideous Frankenstein
combinator that takes multiple folds' worth of arguments and chooses an
implementation accordingly, etc.
A straw man solution would be to add `foldMap'` to the class, switch
`maximum` and `maximumBy` to invoke it, and have it perform a monoidal
fold in `foldl'` style, either by default or just in the list case. As an
embellishment on that design we could even add `sum'` and `product'`
combinators that go through this path rather than randomly leaking space.
Working out the consequences and the implementation would take some design
effort, but it seems doable.
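A sketch of that straw man, over lists (hedged: base did later gain a `foldMap'` along these lines, but the definitions here are illustrative, not its actual source):

```haskell
import Data.List (foldl')
import Data.Monoid (Sum(..))

-- The proposed combinator: a monoidal fold in foldl' style, forcing
-- the accumulator at each step instead of building a thunk chain.
foldMap' :: Monoid m => (a -> m) -> [a] -> m
foldMap' f = foldl' (\acc a -> acc <> f a) mempty

-- A sum' riding on that path rather than leaking space.
sum' :: Num a => [a] -> a
sum' = getSum . foldMap' Sum
```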
**tl;dr** At the very least I agree that the current `foldr1`
implementation of `maximumBy` and `minimumBy` is a mistake.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10830#comment:26>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler