[Haskell-cafe] Program optimisation

Fri Apr 20 15:07:45 EDT 2007

On Fri, Apr 20, 2007 at 07:01:32PM +0100, Andrew Coppin wrote:
> I have *no idea* if this is an appropriate place to put this, but here
> goes...

Sure.

> Next, I read the GHC manual. Apparently there's a command-line switch
> "-O2" which makes GHC try harder to optimise the program. (I had
> stupidly assumed it always tries to make good code...) The result? The
> program now does the same job in 30 seconds. That's quite a big speedup
> just for typing 3 characters! (Gee... I wonder what it does differently?
> I did notice that the compilation time went from 5 seconds up to 20
> seconds. And the *.exe got smaller.)

This is standard behavior for compilers.

> Right, that seems to be no-op. Next I replaced lists with immutable
> boxed arrays. New runtime: 33 seconds.
> 
> Ooops. Now, I would have thought an array would make the GC's life
> *much* easier by replacing tens of thousands of cons cells with one nice
> big array. But, apparently, no. The profiler output says GC is almost
> identical. Hmm... weird!

The compiler is pretty smart about lists.  As long as you're doing simple
things with them (e.g. mapping), odds are pretty good there won't be any
intermediate cons cells generated (this is called deforestation).  In
contrast, no such optimization is available (yet) for arrays.  Every time
you create a new array, it needs to actually be created.

> Finally, in a fit of desperation, I changed it to *unboxed* arrays. For
> some reason, you can't have an unboxed array of complex numbers. So I
> had to make *two* unboxed arrays of doubles, and manually split/join
> myself. Similar problems happened elsewhere too. Suffice it to say, my
> program is now vomit-inducingly ugly. It doesn't even *function*
> correctly any more! (There's a byte alignment issue somewhere... I can't
> track it down.) New runtime: 15 seconds.

Indeed, unboxed arrays are much nicer, but the whole Array batch of the
standard library is pretty useless, in my experience.  You can roll your
own relatively easily using ForeignPtr, similar to Data.ByteString, but
that's a bit of a pain.

> I will mention that the ByteString module was the only thing I would
> find in all of Haskell for doing binary I/O. The final (fast) program
> uses an IOUArray Int Word8 to do the binary I/O instead. (What's the
> difference BTW? Is there one? Other than the name anyway.) Perhaps part
> of the speed was in avoiding an array type conversion? I don't know.
> Certainly this is the messiest part, and the part where I suspect my bug
> is hiding.

Binary IO is definitely an issue with Haskell.

> Finally, in the fast version, quant8 uses way less CPU (more like 50%),
> and the complex number math uses much more. Really goes look like the
> profiler being counterintuitive to me...

Profilers are always counterintuitive in any language.  (Okay, not quite
always, but pretty much whenever you actually need a profiler.)

> A short program synopsys:
> 
> module Main where
> 
> init_frame :: [[Complex Double]]
> 
> next_frame :: [[Complex Double]] -> [[Complex Double]]
> 
> process :: [[Complex Double]] -> [[Colour]]
> 
> save_file :: FilePath -> [[Colour]] -> IO ()
> 
> main = do
>  work 0 init_frame
> 
> work n xss = do
>  save_file ("Frame" ++ show n) (process xss)
>  if n < 500
>    then work (n+1) (next_frame xss)
>    else return ()

If you want actual help, you'll have to supply something more than this.
You could cut out the actual "work", and just post some code that does the
framework.  My guess (and you should have tested this) is that the runtime
won't decrease much if you cut out the actual work (or replace it by
something simple, like taking a exponential of everything), and the result
would be that we'd be able to do more than commiserate.

What happens if you cut the save_file from the process? Far better than
profiling is checking by hand the actual removal of code from the program,
as profiling has a tendancy to disable optimizations.  You did run your
timings with profiling disabled?

You've got three potentially expensive functions here (process, save_file
and next_frame), and it'd certainly help to have some idea which is the
bottleneck.

Due to lazy evaluation (less of an issue with unboxed arrays), you do need
to be careful when removing bits of code, as this could cause other bits of
code to not actually be run (e.g. if you never print the output, it might
be that the next_frame is never evaluated, so you might need to print just
the last frame, or compute the sum of each frame and print that).
-- 
David Roundy
Department of Physics
Oregon State University