aha! I think.

Wed Oct 26 20:32:52 EDT 2005

I think I might have found why (or partially why) GHC is so slow on x86-64.

See section 5.10 of the AMD optimization manual:
 http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF

(which has a whole lot of good info for any processor, including a
whole chapter on how to write C code that optimizes well independent of
the CPU)

"don't place code and data on the same cache line"

Doing so evicts the code line from the cache on access to the data, and
vice versa. So basically, GHC is effectively running without an L1
cache on x86-64, if I understand things properly. (Maybe on other CPUs
too; we might want to check the Intel optimization manuals as well.)
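
To make the condition concrete, here is a rough sketch of how one could
check whether a data block and a code block land on a common cache
line. The 64-byte line size, the function names and the example
addresses are all assumptions for illustration, not anything taken from
GHC or the AMD manual.

```python
CACHE_LINE = 64  # bytes; assumed line size, check the manual for your CPU


def lines_touched(start, size, line=CACHE_LINE):
    """Return the set of cache line indices covered by [start, start + size)."""
    if size <= 0:
        return set()
    first = start // line
    last = (start + size - 1) // line
    return set(range(first, last + 1))


def shares_line(data_start, data_size, code_start, code_size, line=CACHE_LINE):
    """True if the data block and the code block overlap on any cache line."""
    data_lines = lines_touched(data_start, data_size, line)
    code_lines = lines_touched(code_start, code_size, line)
    return bool(data_lines & code_lines)


# Hypothetical layout: 16 bytes of data immediately followed by 40 bytes of code.
print(shares_line(0x1000, 16, 0x1010, 40))
# Same blocks, but with the code moved to the next 64-byte boundary.
print(shares_line(0x1000, 16, 0x1040, 40))
```

Running something like this over the symbol addresses of a real binary
would show how many functions are actually affected.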

If it is too difficult to separate the code and data from each other
(which it might be, since GHC goes to specific lengths to put them next
to each other), then making sure the transition from code to data
occurs exactly on a 64-byte cache line boundary might solve this issue.
It would mean that each function takes up a minimum of 128 bytes and we
can't have more than one per cache line. Perhaps that is an acceptable
tradeoff, though we might want to inline more to get bigger functions
so we don't have to pad so much.
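
A back-of-the-envelope sketch of the padding cost, assuming a 64-byte
line, a data block and a code block that are both non-empty, and a
layout where each is rounded up to a whole number of lines so the
transition falls on a boundary. The names and the sizes in the example
are made up for illustration.

```python
CACHE_LINE = 64  # bytes; assumed line size


def round_up(n, line=CACHE_LINE):
    """Round n up to the next multiple of the cache line size."""
    return -(-n // line) * line


def padded_footprint(data_bytes, code_bytes, line=CACHE_LINE):
    """Return (total bytes, wasted padding bytes) when data and code are each
    padded to whole cache lines so that they never share a line."""
    data_part = max(round_up(data_bytes, line), line)  # at least one line
    code_part = max(round_up(code_bytes, line), line)  # at least one line
    total = data_part + code_part
    return total, total - (data_bytes + code_bytes)


# A small function: 16 bytes of data plus 40 bytes of code.
print(padded_footprint(16, 40))   # (128, 72)
# A bigger (say, more heavily inlined) function wastes proportionally less.
print(padded_footprint(16, 200))  # (320, 104)
```

The small function hits the 128-byte minimum with more than half of it
padding, while the larger one loses about a third, which is the
argument for inlining more.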

        John

-- 
John Meacham - ⑆repetae.net⑆john⑈