[Haskell-cafe] Floyd Warshall performance (again)

Fri Apr 16 04:06:06 EDT 2010

Dear haskell-cafe,

I implemented the Floyd Warshall algorithm for finding the shortest
path in a dense graph in Haskell, but noted the performance was
extremely poor compared to C. Even using mutable unboxed arrays it was
running about 30 times slower. I rewrote the program several times, to
arrive at the following definition:

module Main where

import Data.Array.Base
import Data.Array.ST
import Data.Array.Unboxed
import Control.Monad.ST
import Control.Monad
import System.IO
import GHC.Exts

#define SIZE 1500

-- Actual graph is immaterial for this benchmark.
graph = []

shortestPath :: [(Int, Int, Int)] -> UArray Int Int
shortestPath g = runSTUArray $ do
  let mnew = newArray (0, SIZE * SIZE) 1
      mread arr i j = unsafeRead arr (i * SIZE + j)
      mwrite arr i j x = unsafeWrite arr (i * SIZE + j) x
  unsafeIOToST $ hSetBuffering stdout LineBuffering
  unsafeIOToST $ putStrLn "Allocating ..."
  pm <- mnew
  unsafeIOToST $ putStrLn "Allocating ... done"
  let loop1 SIZE = return ()
      loop1 k = let loop2 SIZE = return ()
                    loop2 i = let loop3 SIZE = return ()
                                  loop3 j = do
                                    xij <- mread pm i j
                                    xik <- mread pm i k
                                    xkj <- mread pm k j
                                    mwrite pm i j (min xij (xik + xkj))
                                    loop3 (j + 1)
                              in loop3 0 >> loop2 (i + 1)
                in loop2 0 >> loop1 (k + 1)
  loop1 0
  return pm

main = shortestPath graph `seq` return ()

These 3 nested loops run a lot faster than generating a list of IO
actions and then running them using sequence_. But still this program
runs 3 times slower than it's C counterpart:

$ time ./FloydWarshall
Allocating ...
Allocating ... done
./FloydWarshall  45,75s user 0,09s system 97% cpu 46,815 total

$ time ./a.out
./a.out  14,92s user 0,03s system 98% cpu 15,230 total

The source code for the C program is given below. Using the vector
package instead of GHC's inbuilt ST arrays doesn't improve the
situation any, neither does compiling via C. Any idea why the rather
unidiomatic Haskell program is still 3 times slower, despite using
unchecked indexing and the fact that the three nested tail recursive
functions above shoud behave exactly as the 3 nested for loops in the
C program?

Best regards,

Mathieu