[Haskell-cafe] fast Eucl. dist. - Haskell vs C

Kenneth Hoste kenneth.hoste at ugent.be
Mon May 18 08:37:51 EDT 2009


Hello,

For a while now, I've been trying to come up with a fast Haskell-only
function which implements Euclidean distance in n-dimensional space.

So far, I've been disappointed by the performance of my best effort
in Haskell compared to C. I'm hoping some of the Haskell experts
and/or performance gurus on this list can help me resolve this, and
also maybe shed some light on some strange (imho) things I've run
into.

My current best try uses the uvector package: it takes two 'vectors'
of type (UArr Double) as input, and relies on the sumU and zipWithU
functions, which use stream fusion to compute the result:

import Data.Array.Vector  -- uvector: UArr, sumU, zipWithU, toU

dist_fast :: UArr Double -> UArr Double -> Double
dist_fast p1 p2 = sumDs `seq` sqrt sumDs  -- force the sum before the sqrt
         where
                 sumDs         = sumU ds                   -- fold the fused stream
                 ds            = zipWithU euclidean p1 p2  -- squared differences
                 euclidean x y = d*d
                         where
                                 d = x-y
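
For reference, this is how the function gets called (a minimal usage
sketch; toU, from uvector's Data.Array.Vector module, builds a UArr
from a list):

example :: Double
example = dist_fast (toU [1.0, 2.0, 3.0]) (toU [4.0, 6.0, 8.0])
-- sqrt ((4-1)^2 + (6-2)^2 + (8-3)^2) = sqrt 50 ~= 7.07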

I've been benchmarking this function against various alternatives
using the MicroBench [1] package, which makes it possible to get
accurate timings of single function calls.
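
The benchmark driver essentially boils down to this (a sketch, not
the attached code; I'm assuming microbench's Microbenchable instance
for IO actions here, and using evaluate to force the result):

import Microbench (microbench)
import Control.Exception (evaluate)

main :: IO ()
main = do
        -- hypothetical 10D test points
        let p1 = toU (replicate 10 1.0) :: UArr Double
            p2 = toU (replicate 10 2.5)
        microbench "dist_fast, 10D" (evaluate (dist_fast p1 p2))
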
I've also mimicked the MicroBench approach in pure C, to get comparable
timings for a C-only implementation.
The C-only function is quite straightforward:

#include <math.h>

double dist(int dim, double* p1, double* p2){

         int i;
         double d = 0.0;

         /* accumulate squared differences, then take a single root */
         for(i=0; i < dim; i++){
                 d += (p1[i] - p2[i])*(p1[i] - p2[i]);
         }

         return sqrt(d);
}

(Note that the optimizer takes care of the (p1[i] - p2[i])
subexpression appearing twice in the code, so the subtraction is
only evaluated once.)
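
As an aside, a direct Haskell transliteration of this C loop, using
explicit indexing, is another alternative one could benchmark here
(a sketch, assuming uvector's indexU and lengthU; the $! keeps the
accumulator strict):

dist_index :: UArr Double -> UArr Double -> Double
dist_index p1 p2 = sqrt (go 0 0.0)
         where
                 n = lengthU p1
                 go i acc
                         | i >= n    = acc
                         | otherwise = let d = indexU p1 i - indexU p2 i
                                       in  go (i+1) $! acc + d*d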

All code is attached if you'd like to play around with it.

All numbers reported below were obtained with GHC 6.10.2 and
gcc 4.3.3 on Linux/x86.

The compilation details can be found in the attached Makefile, but
in short, I'm using -O2 -fexcess-precision or
-O2 -fexcess-precision -fvia-C -optc-O3 with GHC, and -O3 with gcc.

[Attachments scrubbed by the list archive; they can be fetched from
http://www.haskell.org/pipermail/haskell-cafe/attachments/20090518/34a658ee/ :
 Dist.hs (Dist.obj), microbench_dist.c (microbench_dist.obj),
 microbench_dist.hs (microbench_dist-0001.obj), dist_c.c (dist_c.obj),
 dist_c.h (dist_c-0001.obj), Makefile (Makefile.obj)]


Now the bad news: for small dimensions, e.g. 2D/3D-space,
the dist_fast function is 70-240x slower than a pure C implementation,
depending on the architecture.

For example, in 2D-space on an Intel Pentium 4 (3.0GHz, 1MB L2
cache), a single call to dist_fast takes about 1.75 microseconds
(0.00000175s), while a call to dist_c (the C implementation of
Euclidean distance) takes about 0.025 microseconds, a 70x slowdown.

On a Core 2 at 2.5GHz with 6MB L2 cache, the corresponding numbers
are 1.9 and 0.008 microseconds (i.e. 240x slower), while on a Core
i7 at 2.66GHz with 8MB L2 cache they are 1.53 and 0.011 microseconds
(i.e. 140x slower).

For larger dimensionalities the gap narrows, but it is still
worrying: 10D: 40-110x; 100D: 10-17x; >1kD: 2.5-6x.

I'm mostly interested in the range 10D to 100D, so seeing that
Haskell is over 10x and up to 100x slower than C is kind of
making me cry.

I've tried several things to improve on this, without much luck;
some of them even made things worse:

*) Marking dist_fast for inlining (i.e. adding an
{-# INLINE dist_fast #-} pragma) makes things worse: in general the
inlined version is 2x slower at low dimensionality, and even 5x
slower at higher dimensionality.
This was somewhat surprising to me...

*) In a moment of weakness, I used the Foreign Function Interface
to call the dist_c C-only implementation from Haskell (a sketch of
such a binding follows after this list). Unfortunately, there seems
to be a lot of overhead in calling dist_c from Haskell: most of the
performance gain from using C melts away, and sometimes the FFI'd
dist_c is even 15-30% slower than the native dist_fast version
(especially at low dimensionality).

Only for the largest dimensionalities (10k-100kD) does the FFI'd
version reach the performance of the native C approach. But since
I'm mostly interested in the 10-100D range, this is of little use
to me.
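
For reference, the binding looks roughly like this (a sketch, not
the exact code from the attachments; an 'unsafe' import avoids part
of the per-call FFI overhead, but marshalling the arrays still
costs):

{-# LANGUAGE ForeignFunctionInterface #-}

import Foreign.Ptr (Ptr)
import Foreign.C.Types (CInt, CDouble)
import Foreign.Marshal.Array (withArray)

foreign import ccall unsafe "dist"
        c_dist :: CInt -> Ptr CDouble -> Ptr CDouble -> IO CDouble

-- Copy two Haskell lists into temporary C arrays and call dist();
-- the withArray copies alone add per-call overhead.
dist_ffi :: [Double] -> [Double] -> IO Double
dist_ffi xs ys =
        withArray (map realToFrac xs) $ \px ->
        withArray (map realToFrac ys) $ \py ->
                fmap realToFrac (c_dist (fromIntegral (length xs)) px py)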

One thing I noticed is that compiling through C using
-fvia-C -optc-O3 might be a bad idea, depending on your system.

On an Intel Pentium 4 system, -fvia-C -optc-O3 was giving me a
speedup of up to 70% (at large dimensionality), while on the Core 2
and Core i7 it resulted in a 15-20% slowdown!
I was using roughly the same GCC version everywhere, i.e. a
self-bootstrapped GCC 4.3.x.


So, my question to the list is simple: how can I get better
performance out of a Haskell-only approach?
Any comments/suggestions are highly appreciated.

I'd prefer a Haskell-only approach, but my main concern is speed.
The Euclidean distance function will be used quite heavily in
various tools.

I currently have a C version of some of the tools, but the amount
of code needed for those tools is becoming ridiculously big. I
believe using Haskell will allow me to come up with an
easier-to-maintain code base. However, I don't want to pay a huge
price for this in terms of performance.

greetings,

Kenneth

[1] MicroBench: http://hackage.haskell.org/cgi-bin/hackage-scripts/package/microbench

-- 

Kenneth Hoste
Paris research group - ELIS - Ghent University, Belgium
email: kenneth.hoste at elis.ugent.be
website: http://www.elis.ugent.be/~kehoste
blog: http://boegel.kejo.be


