[Haskell-cafe] Floyd Warshall performance (again)

Fri Apr 16 04:58:32 EDT 2010

For what is is worth:

$ ghc -cpp -O2 -ddump-asm Main.hs > Main.s
$ time ./a.out
Allocating ...
Allocating ... done

real	0m39.487s
user	0m39.258s
sys	0m0.150s

$ ~/Programming/Checkouts/ghc.llvm/inplace/bin/ghc-stage2 -cpp -fllvm
-O2 Main.hs

$ time ./a.out
Allocating ...
Allocating ... done

real	0m20.443s
user	0m20.281s
sys	0m0.101s

So you have an order of magnitude improvement with the LLVM backend.
It looks to me like your Core code is near-optimal, so the performance
problem is all down to the backend to fix up.

However:

$ gcc -std=c99 -O2 Main.c
$ time ./a.out

real	0m9.120s
user	0m9.030s
sys	0m0.035s

We still have a long way to go!

Cheers,
Max

On 16 April 2010 09:06, Mathieu Boespflug <mboes at tweag.net> wrote:
> Dear haskell-cafe,
>
> I implemented the Floyd Warshall algorithm for finding the shortest
> path in a dense graph in Haskell, but noted the performance was
> extremely poor compared to C. Even using mutable unboxed arrays it was
> running about 30 times slower. I rewrote the program several times, to
> arrive at the following definition:
>
> module Main where
>
> import Data.Array.Base
> import Data.Array.ST
> import Data.Array.Unboxed
> import Control.Monad.ST
> import Control.Monad
> import System.IO
> import GHC.Exts
>
> #define SIZE 1500
>
> -- Actual graph is immaterial for this benchmark.
> graph = []
>
> shortestPath :: [(Int, Int, Int)] -> UArray Int Int
> shortestPath g = runSTUArray $ do
>  let mnew = newArray (0, SIZE * SIZE) 1
>      mread arr i j = unsafeRead arr (i * SIZE + j)
>      mwrite arr i j x = unsafeWrite arr (i * SIZE + j) x
>  unsafeIOToST $ hSetBuffering stdout LineBuffering
>  unsafeIOToST $ putStrLn "Allocating ..."
>  pm <- mnew
>  unsafeIOToST $ putStrLn "Allocating ... done"
>  let loop1 SIZE = return ()
>      loop1 k = let loop2 SIZE = return ()
>                    loop2 i = let loop3 SIZE = return ()
>                                  loop3 j = do
>                                    xij <- mread pm i j
>                                    xik <- mread pm i k
>                                    xkj <- mread pm k j
>                                    mwrite pm i j (min xij (xik + xkj))
>                                    loop3 (j + 1)
>                              in loop3 0 >> loop2 (i + 1)
>                in loop2 0 >> loop1 (k + 1)
>  loop1 0
>  return pm
>
> main = shortestPath graph `seq` return ()
>
> These 3 nested loops run a lot faster than generating a list of IO
> actions and then running them using sequence_. But still this program
> runs 3 times slower than it's C counterpart:
>
> $ time ./FloydWarshall
> Allocating ...
> Allocating ... done
> ./FloydWarshall  45,75s user 0,09s system 97% cpu 46,815 total
>
> $ time ./a.out
> ./a.out  14,92s user 0,03s system 98% cpu 15,230 total
>
> The source code for the C program is given below. Using the vector
> package instead of GHC's inbuilt ST arrays doesn't improve the
> situation any, neither does compiling via C. Any idea why the rather
> unidiomatic Haskell program is still 3 times slower, despite using
> unchecked indexing and the fact that the three nested tail recursive
> functions above shoud behave exactly as the 3 nested for loops in the
> C program?
>
> Best regards,
>
> Mathieu
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
>