[Haskell-cafe] announce: Glome.hs-0.3 (Haskell raytracer)

Jim Snow jsnow at cs.pdx.edu
Fri Apr 18 19:11:53 EDT 2008


David Roundy wrote:
> On Fri, Apr 18, 2008 at 02:09:28PM -0700, Jim Snow wrote:
>   
>> On a particular scene with one instance of the single-threaded renderer
>> running, it takes about 19 seconds to render an image.  With two
>> instances running, they each take about 23 seconds.  This is on an
>> Athlon-64 3800+ dual core, with 512kB of L2 cache per core.  So, it
>> seems my memory really is slowing things down noticeably.
>>     
>
> This doesn't mean there's no hope, it just means that you'll need to be
> extra-clever if you're to get a speedup that is close to optimal.  The key
> to overcoming memory bandwidth issues is to think about cache use and
> figure out how to improve it.  For instance, O(N^3) matrix multiplication
> can have its main-memory traffic brought down to close to O(N^2) by
> blocking memory accesses so that the working set stays in cache.
>
> In the case of ray-tracing I've little idea where or how you could improve
> memory access patterns, but this is what you should be thinking about (if
> you want to optimize this code).  Of course, improving overall scaling is
> best (e.g. avoiding using lists if you need random access).  Next I'd ask
> if there are more efficient or compact data structures that you could be
> using.  If your program uses less memory, a greater fraction of that memory
> will fit into cache.  Switching to stricter data structures and turning on
> -funbox-strict-fields (or whatever it's called) may help localize your
> memory access.  Even better if you could manage to use unboxed arrays, so
> that your memory accesses really would be localized (assuming that you
> actually do have localized memory-access patterns).
>
> Of course, also ask yourself how much memory your program is using in
> total.  If it's not much more than 512kB, for instance, we may have
> misdiagnosed your problem.
>   
Interestingly, switching between "Float" and "Double" doesn't make any 
noticeable difference in speed (though I see more rendering artifacts 
with Float).  Transformation matrices are memory hogs, at 24 floats each:
a 4x4 matrix and its inverse, with the bottom rows omitted since they're
always 0 0 0 1.  This may be one reason why many real-time ray tracers
just stick with triangles; a triangle can be transformed by transforming
its vertices, and then the matrix can be thrown away.
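
To make the layout concrete, the kind of representation I have in mind looks 
roughly like this (an illustrative sketch, not Glome's actual types): a 3x4 
affine transform stored next to its inverse, with strict fields so that GHC 
can unbox the doubles when -funbox-strict-fields is on.

{-# OPTIONS_GHC -funbox-strict-fields #-}

data Vec = Vec !Double !Double !Double

-- Linear part as three columns plus a translation column; the implicit
-- bottom row is 0 0 0 1.  A forward matrix plus its inverse is 24 doubles.
data Mat = Mat !Vec !Vec !Vec !Vec
data Xfm = Xfm !Mat !Mat   -- forward transform and its inverse

xfmPoint :: Mat -> Vec -> Vec
xfmPoint (Mat (Vec a b c) (Vec d e f) (Vec g h i) (Vec tx ty tz)) (Vec x y z) =
  Vec (a*x + d*y + g*z + tx)
      (b*x + e*y + h*z + ty)
      (c*x + f*y + i*z + tz)

-- A triangle, by contrast, can be transformed once up front and the
-- matrix discarded afterwards:
data Triangle = Triangle !Vec !Vec !Vec

xfmTriangle :: Mat -> Triangle -> Triangle
xfmTriangle m (Triangle p q r) =
  Triangle (xfmPoint m p) (xfmPoint m q) (xfmPoint m r)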

There are a lot of tricks for making ray tracers more memory-coherent.  
You can trace packets of rays instead of single rays against whatever 
acceleration structure you may be using (there's a rough sketch of this 
below).  Kd-tree nodes can be compacted to fit in a single cache line if 
you arrange the tree in memory in a particular way that allows you to 
omit some of the pointers.  (I use BIH trees, but the same ideas probably 
apply.)  A lot of these tricks make the resulting code more complex 
and/or uglier.
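
Here's that sketch of the packet idea (all the types and names are made up 
for the example, they're not Glome's): traverse the acceleration structure 
once per packet instead of once per ray, so that each node, and whatever 
geometry it points at, is pulled into cache once and reused for every ray 
in the packet.

data Vec  = Vec !Double !Double !Double
data Ray  = Ray !Vec !Vec            -- origin, direction
data Aabb = Aabb !Vec !Vec           -- min and max corners

-- Standard slab test: does the ray hit the box in front of its origin?
rayHitsBox :: Ray -> Aabb -> Bool
rayHitsBox (Ray (Vec ox oy oz) (Vec dx dy dz))
           (Aabb (Vec lx ly lz) (Vec hx hy hz)) =
  let slab o d lo hi = let t1 = (lo - o) / d
                           t2 = (hi - o) / d
                       in (min t1 t2, max t1 t2)
      (tx1, tx2) = slab ox dx lx hx
      (ty1, ty2) = slab oy dy ly hy
      (tz1, tz2) = slab oz dz lz hz
      tnear = maximum [tx1, ty1, tz1]
      tfar  = minimum [tx2, ty2, tz2]
  in tfar >= max tnear 0

type Object = Int                    -- stand-in for real primitives
data Tree = Leaf [Object] | Node Aabb Tree Tree

-- If any ray in the packet hits a node's box, descend with the whole
-- packet; each ray gets back the candidate objects it still has to test
-- exactly.  The win is that a node is fetched once per packet rather
-- than once per ray.
tracePacket :: [Ray] -> Tree -> [[Object]]
tracePacket rays (Leaf objs) = map (const objs) rays
tracePacket rays (Node box left right)
  | any (`rayHitsBox` box) rays =
      zipWith (++) (tracePacket rays left) (tracePacket rays right)
  | otherwise = map (const []) rays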

Useful references:
"What Every Programmer Should Know About Memory"
http://lwn.net/Articles/250967/
SIGGRAPH presentation on optimizing ray tracers (warning: ppt)
http://www.openrt.de/Siggraph05/UpdatedCourseNotes/Stoll_Realtime.ppt

Bryan O'Sullivan wrote:
> Jim Snow wrote:
>
>   
>> The concurrency bug has to do with excessive memory use, and was
>> discussed earlier here on the mailing list (search for Glome).
>> http://hackage.haskell.org/trac/ghc/ticket/2185
>>     
>
> Interesting.  I looked at your test case.  I can reproduce your problem
> when I build with the threaded runtime and run with a single core, but
> not if I use +RTS -N2.  Did you overlook the possibility that you may
> not have told GHC how many cores to use?
>
>   
I just tested it again.  Memory usage behaves differently depending on 
how many cores I tell it to run on, but it's always lowest when I compile 
without threading support.  With -N1, memory usage grows faster than with 
-N2; if I don't use parMap, memory stays smaller and doesn't grow with 
each re-render (except by a very small amount).
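
In case it helps anyone reproduce the behavior, the shape of the code is 
roughly the following (renderScanline is a made-up stand-in for the real 
renderer, not Glome's API).  One thing that might help is forcing each 
chunk of rows to normal form with parListChunk and rdeepseq from 
Control.Parallel.Strategies, rather than a bare parMap, so that no 
unevaluated pixel thunks are retained between re-renders:

import Control.Parallel.Strategies (parListChunk, rdeepseq, using)

-- Stand-in for the real per-scanline renderer.
renderScanline :: Int -> [Double]
renderScanline y = [sin (fromIntegral (x * y)) | x <- [0 .. 799 :: Int]]

-- Render rows in parallel, 16 rows per spark, forcing each row to normal
-- form so nothing lazy survives the render.
renderImage :: Int -> [[Double]]
renderImage height =
  map renderScanline [0 .. height - 1] `using` parListChunk 16 rdeepseq

main :: IO ()
main = print (sum (map sum (renderImage 600)))

-- build: ghc -O2 -threaded Render.hs   run: ./Render +RTS -N2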
> Also, your code is sprinkled with many more strictness annotations than
> it needs.
>
> 	<b
>   
That doesn't surprise me.  I haven't really figured out why some things 
are faster strict and others lazy, or where it doesn't matter and the 
annotations are redundant.
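
The rough picture I have so far, for whatever it's worth (a toy example, 
not Glome code): a bang pays off in accumulators, where laziness would 
otherwise build up a chain of thunks, and it's redundant where the value 
gets forced right away anyway.

{-# LANGUAGE BangPatterns #-}

-- Strictness helps here: without the bang, acc becomes a long chain of
-- (+) thunks that only gets forced at the very end.
sumDists :: [Double] -> Double
sumDists = go 0
  where
    go !acc []     = acc
    go !acc (d:ds) = go (acc + d) ds

-- Redundant here: (*) evaluates k regardless, so the bang changes nothing.
scale :: Double -> Double -> Double
scale !k x = k * x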

-jim


