Fun with GHC's optimiser

Simon Peyton-Jones simonpj@microsoft.com
Thu, 7 Dec 2000 09:17:44 -0800


Manuel

OK, I've improved the eta expander so it does a better
job of the code you found.

Try now.  On the HEAD.  Which should compile OK now, incidentally.

How's your parallel Haskell stuff coming?

Simon


| -----Original Message-----
| From: Manuel M. T. Chakravarty [mailto:chak@cse.unsw.edu.au]
| Sent: 02 November 2000 13:54
| To: Simon Peyton-Jones
| Cc: glasgow-haskell-users@haskell.org
| Subject: RE: Fun with GHC's optimiser
| 
| 
| Simon Peyton-Jones <simonpj@microsoft.com> wrote,
| 
| > I can never resist messages like these, even when I'm meant
| > to be doing other things.  
| 
| That's good to know ;-)
| 
| > It's very helpful when people offer
| > fairly precise performance-bug reports.  Thanks!
| 
| Thanks for the prompt reply!  To SimonM, too.
| 
| > | I am wondering whether there is a particular reason why the
| > | optimiser doesn't pull the
| > | 
| > |   (1)  a = NO_CCS PArray! [wild1 mba#];
| > 
| > This one is a definite bug.  It turns out that the head of the
| > before-ghci-branch doesn't have this bug, so I'm disinclined
| > to investigate it further.  
| 
| No problem, I can always build a new ghc from CVS.
| 
| > |   (2)  case w of wild3 {
| > |          I# e# ->
| > | 
| > | As for (2), the loop would be nice and straight if that
| > | unboxing where outside of the loop - as it is, we break the
| > | pipeline once per iteration it seems
| > 
| > This one is a bit harder.  Basically we want to make a wrapper
| > for a recursive function if it's sure to evaluate its free 
| variables.
| > 
| > In fact the 'liberate-case' pass (which isn't switched on in 4.08)
| > is meant to do just this. It's in simplCore/LiberateCase.lhs,
| > and it's not very complicated.  I've just tried it and it 
| doesn't seem
| > to have the desired effect, but I'm sure that's for a boring reason.
| > If anyone would like to fix it, go ahead!
| 
| Ok - something for a rainy weekend, I guess...
| 
| > Incidentally, you'll find that -ddump-simpl gives you a dump that
| > is pretty close to STG and usually much more readable.  Most
| > performance bugs show up there.  -dverbose-simpl gives you more
| > clues about what is happening where.
| 
| Good to know.  I just wasn't sure whether I wouldn't miss
| out on some optimisations.
| 
| > | Also if somebody is looking at the attached source, I was
| > | wondering why, when I use the commented out code in
| > | `newPArray', I get a lot worse code (the STG code is in a
| > | comment at the end of the file).  In particular, the lambda
| > | abstraction is not inlined, whereas `fill' gets inlined into
| > | the code of which the dump is above.  Is it because the
| > | compiler has a lot harder time with explicit recursion than
| > | with fold/build?  If so, the right RULES magic should allow
| > | me to do the same for my own recursively defined
| > | combinators, shouldn't it?
| > 
| > I couldn't figure out exactly what you meant.  The only commented
| > out code is STG code.  Maybe send a module with the actual
| > source you are bothered about.
| 
| Appended the version of the module where the code is
| activated (it was commented out with --) - you'll find the
| problematic code by searching for (**) in the code.  It
| generates
| 
| $w$snewPArray
|   = \ ww :: Int# w :: Int ->
| 	case newIntArray# @ RealWorld ww realWorld#
| 	of wild { (# s2#, mba# #) ->
| 	case $wsimpleGen
| 		 ww
| 		 (\ i :: Int ->
| 		      case i of wild1 { I# i# ->
| 		      case w of wild2 { I# e# ->
| 		      __coerce (ST RealWorld ())
| 		      (\ s# :: (State# RealWorld) ->
| 			   case writeIntArray# @ RealWorld mba# i# e# s#
| 			   of s2#1 { __DEFAULT ->
| 			   (# s2#1, () #)
| 			   })
| 		      }
| 		      })
| 		 s2#
| 	of wild1 { (# new_s, r #) ->
| 	let {
| 	  a :: Int
| 	  a = $wI# ww
| 	} in  (# a, (__coerce ByteArray# mba#) #)
| 	}
| 	}
| 
| One thing that is not nice is in the lambda abstraction
| based to $wsimpleGen.  There is one lambda, then two
| unboxing operations and another lambda for the state
| variables s#.  These two nested lambda's become in STG
| 
| 		stg_c1Op =
| 		    NO_CCS[] \r[i]
| 			case i of wild1 {
| 			  I# i# ->
| 			      case w of wild2 {
| 				I# e# ->
| 				    let {
| 				      stg_c1Sv =
| 					  NO_CCS[] \r[s#]
| 					      case 
| writeIntArray# [mba# i# e# s#] of s2#1 {
| 						DEFAULT -> 
| (#,#) [s2#1 ()]
| 					      };
| 				    } in  stg_c1Sv;
| 			      };
| 			};
| 
| So, stg_c1Sv is allocated and immediately entered.  I would
| have hoped that the two lambda abstractions are merged into
| something like this:
| 
| 		 (\ i :: Int s# :: (State# RealWorld) ->
| 		      case i of wild1 { I# i# ->
| 		      case w of wild2 { I# e# ->
| 		      case writeIntArray# @ RealWorld mba# i# e# s#
| 			   of s2#1 { __DEFAULT -> (# s2#1, () #)
| 		      })
| 		      }
| 		      })
| 
| Maybe it is the __coerce, which prevents this from
| happening?
| 
| Also, I am wondering why in the case of the 
| 
|   foldr (fill mpa e) (return $ unsafeFreezeMPArray mpa) [0..n-1])
| 
| the function `fill' gets inlined into the foldr loop, but in
| the above, the lambda abstraction gets not inlined into
| $wsimpleGen.  Maybe it is because $wsimpleGen itself gets
| not inlined...
| 
| Cheers,
| Manuel
|