[Haskell-cafe] fast Eucl. dist. - Haskell vs C

Mon May 18 09:50:15 EDT 2009

kenneth.hoste:
> Hello,
>
> For a while now, I've been trying to come up with a fast Haskell-only
> function which implements Euclidean distance in n-dimensional space.
>
> So far, I've been disappointed by the performance of my best effort
> in Haskell, compared to C. I'm hoping some of the Haskell experts
> and/or performance gurus on this list can help me out on resolving this,
> and also maybe shed some light on some strange (imho) things I've run
> into.
>
> My current best try uses the uvector package, has two 'vectors' of type
> (UArr Double)  as input, and relies on the sumU and zipWithU functions
> which use streaming to compute the result:
>
> dist_fast :: UArr Double -> UArr Double -> Double
> dist_fast p1 p2 = sumDs `seq` sqrt sumDs
>         where
>                 sumDs         = sumU ds
>                 ds            = zipWithU euclidean p1 p2
>                 euclidean x y = d*d
>                         where
>                                 d = x-y

The problem in your uvector code is the use of lists, rather than uvector
generators. Replace [1..n] with enumFromTo:

    import Control.Monad
    import System.Environment
    import System.IO
    import Data.Array.Vector

    dist :: UArr Double -> UArr Double -> Double
    dist p1 p2 = sumU (zipWithU euclidean p1 p2)
        where
            euclidean x y = d*d where d = x-y

    main = do
        [dim] <- map read `fmap` getArgs

        print $
          dist
            (enumFromToFracU 1.0 dim)
            (enumFromToFracU 1.0 dim)

Now the entire thiing will fuse to a loop.

          $s$wfold_s1RR :: Double# -> Double# -> Double# -> Double#

          $s$wfold_s1RR =
            \ (sc_s1RH :: Double#)
              (sc1_s1RI :: Double#)
              (sc2_s1RJ :: Double#) ->
              case >## sc1_s1RI a5_s1QR of wild4_a1tn {
                False ->
                  case >## sc_s1RH a5_s1QR of wild5_X1vx {
                    False ->
                      let {
                        x1_a1Jg [ALWAYS Just L] :: Double#

                        x1_a1Jg = -## sc1_s1RI sc_s1RH } in
                      $s$wfold_s1RR
                        (+## sc_s1RH 1.0)
                        (+## sc1_s1RI 1.0)
                        (+## sc2_s1RJ (*## x1_a1Jg x1_a1Jg));
                    True -> sc2_s1RJ
                  };
                True -> sc2_s1RJ
              }; } in
        case $s$wfold_s1RR 1.0 1.0 0.0 of ww_s1QH { __DEFAULT ->
        a19 (D# ww_s1QH)

and this assembly:

    $ ghc -O2 -fvia-C -optc-O3

      s1T9_info:
          movsd       5(%rbx), %xmm7
          ucomisd     %xmm7, %xmm6
          ja  .L15
          ucomisd     %xmm7, %xmm5
          jbe .L18
        .L14:
        .L15:
          movsd       (%rbp), %xmm5
          leaq        8(%rbp), %rbp
          jmp *(%rbp)
        .L18:
          movapd      %xmm6, %xmm7
          subsd       %xmm5, %xmm7
          mulsd       %xmm7, %xmm7
          addsd       (%rbp), %xmm7
          movsd       %xmm7, (%rbp)
          movsd       .LC0(%rip), %xmm7
          addsd       %xmm7, %xmm5
          addsd       %xmm7, %xmm6
          jmp s1T9_info

Which I'd wager will match the C, which has to allocate the two arrays (GHC
essentially decides it doesn't need the arrays any more.

-- Don