[Haskell-cafe] sum $ map f xs ... ghc-7.10 performance regression?

Tue Dec 15 00:20:16 UTC 2015

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

If you look at the memory usage, the version without the bang actually
contains a space leak, so that's the source of the longer runtime. I
hope you do not mind some core from ghc ; )
With the bang, we have the following main loop:

- -- strict acculumator function over a list, good (worker from sum)
Rec {
$wgo :: [Int] -> Int# -> Int#
$wgo =
  \ (w :: [Int]) (ww :: Int#) ->
    case w of _ {
      [] -> ww;
      : y ys -> case y of _ { I# y1 -> $wgo ys (+# ww y1) }
    }
end Rec }

- -- builds a list, lazy (i.e. no space leak)
main3 :: Int -> [Int] -> [Int]
main3 = \ (x :: Int) (ys :: [Int]) -> : (bitcount x) ys

main2 :: String
main2 =
  case $w$s^ main5 main4 of ww { __DEFAULT
->                              -- calculate 2^24
  case efdtIntUpFB main3 ([]) 0 4 (-# ww 1) of vx { __DEFAULT ->      --
loop function, desugared from [a, b..c]
  case $wgo vx 0 of ww1 { __DEFAULT
->                                            -- sum up the list
  case $wshowSignedInt 0 ww1 ([]) of _ { (# ww5, ww6 #) ->            --
only shows the result, not important
  : ww5 ww6
  }
  }
  }
  }

So we build a lazy list with bitcount applied to all arguments. All the
stuff is boxed for some reason, so we allocate some stuff, just to throw
it away, but it is never held on for long, so our memory usage never
gets past 1MB (and thus GC stays cheap). This list is then deconstructed
by $wgo again, forcing one element at the time and accumulated into a Int#.

Without the bang, the main loops looks something like this:

- -- This is interesting. The first argument seems to be the number from
the list and the third is our accumulator while the second seems to
start as the identity function (from main2 second case).
main4 :: Int -> (Int -> Int) -> Int -> Int
main4 =
  \ (x :: Int) (ys :: Int -> Int) (tpl :: Int) ->
    ys
      (case tpl of _ { I# x1 ->
       case x of _ { I# ww1 ->
       case $wbitcount ww1 of ww2 { __DEFAULT -> I# (+# x1 ww2) }
       }
       })

main2 :: String
main2 =
  case $w$s^ main6 main5 of ww { __DEFAULT
->                             -- calculate 2^24
  case efdtIntUpFB main4 (id) 0 4 (-# ww 1) main3 of _ { I# ww3 -> --
loop function, desugared from [a, b..c], but instead of building a list
we do something like a difference-list, but with ints (note that we now
have 7 arguments instead of 6)
  case $wshowSignedInt 0 ww3 ([]) of _ { (# ww5, ww6 #) ->           --
only shows the result, not important
  : ww5 ww6
  }
  }
  }

efdtIntUpFB seems to be something like this:
efdtIntUpFB :: (Int -> acc -> acc) -> acc -> Int# -> Int# -> Int# -> acc
efdtIntUpFB f acc x step max = if x > max then acc else f x (efdtIntUpFB
f acc (x + step) step max)

(Actual implementation:
http://hackage.haskell.org/package/base-4.8.1.0/docs/src/GHC.Enum.html#efdtIntUpFB)

But acc now resoves to Int -> Int. So if we exand this, we get:
main4 0 (main4 4 (main4 8 ... (main4 ... id) ...)))

And main4 builds a closure and applies this to ys. Then we enter ys,
which is again main4, building a closure and apply this to ys, which is
again main4, which builds a closure and finally applies this to id
(after 2^22 closures or something like that). So we leak a few hundered
MB (423 on my machine) which we have to copy every time we do a (full)
GC. On my machine 0.5 sec are spend on full GC, another 0.5 on a few
more minor GCs and 1 sec actually computing, which is somewhat similar
to your results.

So, that much about the space leak. The reason for the difference seems
to manifest itself after the first float-out pass, somehow the
simplifier rebuilds the expression differently... but right now I do not
see what exactly is happening there and why. My first though was a
different set of rules from the list fusion stuff fires, but at least
the names and order of the rule firings are exactly the same... But the
more I think about it, the more I think this is a bug. The following
program leaks a similar amount of space:

main = print bar

bar :: Int
bar = sum $ map (\x -> x) [1,2..2^24-1]

There is a special rule for map id, so we avoid it, a function doing
actual work (like bitcount) would yield the same result. Using
[1..2^24-1] yields an efficient program, so something with the
combination of sum, map and enumFromThenTo seems to be broken. So yes, I
would argue that you indeed did hit a bug.

On 12/14/2015 11:30 PM, Johannes Waldmann wrote:
> Is there a difference between "sum $ map f xs" and "sum $! map f xs"? > > I would think no, since "$!" just evaluates the argument to WHNF, >
so it would evaluate just the first cons of the list. > > > This little
program > > main = print $ sum $ map bitcount [0, 4 .. 2^24-1 ] > >
bitcount :: Int -> Int > bitcount x = >   if x > 0 then let (d,m) =
divMod x 2 in bitcount d + m else 0 > > runs in 1.6 seconds when
compiled with -O2 > on ghc-7.6.3, ghc-7.8.4, > > but takes 3.2 seconds
on ghc-7.10.[1-3]. Why? > > > when I change the main function (note: $
=> $!) to > > main = print $ sum $! map bitcount [0, 4 .. 2^24-1 ] > >
and compile with 7.10.[1-3], then it also runs in 1.6 seconds. > > > (I
can write a bug report but I want to check > whether I am missing
something obvious) > > > - J.W. >
_______________________________________________ > Haskell-Cafe mailing
list > Haskell-Cafe at haskell.org >
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBCAAGBQJWb1y3AAoJEM0PYZBmfhoBqNAH/1QNNvZ84We9b0RUtb6w5gNt
6u9yzw5DxMBsxdHgEWlbc7vXOLpeCXaPhLMKYVFnNDHwwx4c+CPWZd44YUCZtLOd
Xi9sCAHMJX5vDxfhFgCzwBP39KsMr+Euhde+thIBjlAmM/IKJLC1h1tjtoYUOMle
W9AMzIoVetHSWA5MqgX+LjpNiMifTnXNy+6yx8aytCtRKo2hiGs9tWSzCE58a8XM
EY3cFSzbqy1IHsq9xwycjfpTvpUt1Ga8jjmhwilhjdREIYwWmc6qT2vAiAGf4nPn
gx9SoMIlAmgbcV/xMSUha2YZ5QzG9deYTIZ7SYLfcaTev1/QMKmONgHX4uWTRFU=
=rHVH
-----END PGP SIGNATURE-----