[Haskell-cafe] Optimizations for list comprehension

Wed Aug 21 13:45:49 UTC 2019

Hi,

On 20/08/2019 20:12, Emilio Francesquini wrote:
> Thanks William!
>
> For me it's quite unexpected to see a unit argument making that kind 
> of difference for the optimizer...
>
> Anyway, one new trick added to the bag...
>
> Tks.
>
>
>
> On Mon, Aug 19, 2019 at 11:34 AM William Yager <will.yager at gmail.com 
> <mailto:will.yager at gmail.com>> wrote:
>
>     I'm not sure which exact optimizations are responsible, but based
>     on --ddump-simple,
>
>     * "inside" is not allocating any lists at all. It's just a couple
>     loops over unboxed ints
>     * "outside" is actually allocating a (single) list data structure
>     and has an inner loop and an outer loop, both of which traverse
>     the list
>
>     GHC seems to be too aggressive about sharing "range" in "outside".
>     Adding a unit argument to "range" makes both functions go fast.
>
In "outside", "range" is a single (monomorphic) CAF binding, so I would 
expect it to be shared aggressively.

Even if you do something like "inside" with explicit naming:

inside2 :: Int -> Int
inside2 n =
   sum [i + j | i <- range1, j <- range2]
   where
     range1 = [0..n-1]
     range2 = [0..n-1]

then "range2" is shared between all iterations over "range1", which I 
expect to be allocated and traversed.  I haven't tested to be sure.

Historically I have used crass hacks to prevent sharing by introducing 
data dependencies:

inside3 :: Int -> Int
inside3 n =
   sum [i + j | i <- [0..n-1], j <- [i+0-i..n-1]]

That "inside" can be optimized to avoid allocations of the inner loop is 
a pleasant surprise, I hope it doesn't lead to unpleasant surprises in 
other circumstances.

>
>     On Mon, Aug 19, 2019 at 8:14 PM Emilio Francesquini
>     <francesquini at gmail.com <mailto:francesquini at gmail.com>> wrote:
>
>         Hello Cafe,
>
>         While investigating a performance problem I stumbled upon what
>         I eventually reduced to the example below:
>
>         module Main where
>
>         import Data.Time.Clock
>
>         outside :: Int -> Int
>         outside n =
>           sum [i + j | i <- range, j <- range]
>           where
>             range = [0..n-1]
>
>         inside :: Int -> Int
>         inside n =
>           sum [i + j | i <- [0..n-1], j <- [0..n-1]]
>
>         main :: IO ()
>         main = do
>           t0 <- getCurrentTime
>           print $ inside 10000
>           t1 <- getCurrentTime
>           print $ outside 10000
>           t2 <-getCurrentTime
>
>           print (diffUTCTime t1 t0)
>           print (diffUTCTime t2 t1)
>
>         Compiling with -O2, up to GHC 8.2.2, both `inside` and
>         `outside` functions would take the same amount of time to
>         execute. Somewhere between GHC 8.2.2 and 8.6.4 something
>         changed (possibly some new optimization) making `inside` run
>         ~4x faster on my machine. With LLVM the difference is even bigger.
>
>         It is not that `outside` got slower, but that `inside` got
>         much faster. I'm curious to what optimizations might be
>         happening to the `inside` function that would not fire on the
>         outside function.
>
>         Any hints?
>

Claude
-- 
https://mathr.co.uk