[Haskell-cafe] announce: Glome.hs-0.3 (Haskell raytracer)

Fri Apr 18 19:25:53 EDT 2008

jsnow:
> David Roundy wrote:
> >On Fri, Apr 18, 2008 at 02:09:28PM -0700, Jim Snow wrote:
> >  
> >>On a particular scene with one instance of the single-threaded renderer
> >>running, it takes about 19 seconds to render an image.  With two
> >>instances running, they each take about 23 seconds.  This is on an
> >>Athlon-64 3800+ dual core, with 512kB of L2 cache per core.  So, it
> >>seems my memory really is slowing things down noticeably.
> >>    
> >
> >This doesn't mean there's no hope, it just means that you'll need to be
> >extra-clever if you're to get a speedup that is close to optimal.  The key
> >to overcoming memory bandwidth issues is to think about cache use and
> >figure out how to improve it.  For instance, O(N^3) matrix multiplication
> >can be done in close to O(N^2) time provided it's memory-limited, by
> >blocking memory accesses so that you access less memory at once.
> >
> >In the case of ray-tracing I've little idea where or how you could improve
> >memory access patterns, but this is what you should be thinking about (if
> >you want to optimize this code).  Of course, improving overall scaling is
> >best (e.g. avoiding using lists if you need random access).  Next I'd ask
> >if there are more efficent or compact data structures that you could be
> >using.  If your program uses less memory, a greater fraction of that 
> >memory
> >will fit into cache.  Switching to stricter data structures and turning on
> >-funbox-strict-fields (or whatever it's called) may help localize your
> >memory access.  Even better if you could manage to use unboxed arrays, so
> >that your memory accesses really would be localized (assuming that you
> >actually do have localize memory-access patterns).
> >
> >Of course, also ask yourself how much memory your program is using in
> >total.  If it's not much more than 512kB, for instance, we may have
> >misdiagnosed your problem.
> >  
> Interestingly, switching between "Float" and "Double" doesn't make any 
> noticeable difference in speed (though I see more rendering artifacts 
> with Float).  Transformation matrices are memory hogs, at 24 floats each 
> (a 4x4 matrix and its inverse with the bottom rows omitted (they're 
> always 0 0 0 1)).  This may be one reason why many real-time ray tracers 
> just stick with triangles; a triangle can be transformed by transforming 
> its verticies, and then you can throw the matrix away.

The only differences I'd expect to see here would
be with -fvia-C -fexcess-precision -O2 -optc-O2

which might trigger some SSE stuff from the C compiler