Fun with GHC's optimiser

Manuel M. T. Chakravarty chak@cse.unsw.edu.au
Fri, 03 Nov 2000 00:54:25 +1100


----Next_Part(Fri_Nov__3_00:54:22_2000_559)--
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

Simon Peyton-Jones <simonpj@microsoft.com> wrote,

> I can never resist messages like these, even when I'm meant
> to be doing other things.  

That's good to know ;-)

> It's very helpful when people offer
> fairly precise performance-bug reports.  Thanks!

Thanks for the prompt reply!  To SimonM, too.

> | I am wondering whether there is a particular reason why the
> | optimiser doesn't pull the
> | 
> |   (1)  a = NO_CCS PArray! [wild1 mba#];
> 
> This one is a definite bug.  It turns out that the head of the
> before-ghci-branch doesn't have this bug, so I'm disinclined
> to investigate it further.  

No problem, I can always build a new ghc from CVS.

> |   (2)  case w of wild3 {
> |          I# e# ->
> | 
> | As for (2), the loop would be nice and straight if that
> | unboxing where outside of the loop - as it is, we break the
> | pipeline once per iteration it seems
> 
> This one is a bit harder.  Basically we want to make a wrapper
> for a recursive function if it's sure to evaluate its free variables.
> 
> In fact the 'liberate-case' pass (which isn't switched on in 4.08)
> is meant to do just this. It's in simplCore/LiberateCase.lhs,
> and it's not very complicated.  I've just tried it and it doesn't seem
> to have the desired effect, but I'm sure that's for a boring reason.
> If anyone would like to fix it, go ahead!

Ok - something for a rainy weekend, I guess...

> Incidentally, you'll find that -ddump-simpl gives you a dump that
> is pretty close to STG and usually much more readable.  Most
> performance bugs show up there.  -dverbose-simpl gives you more
> clues about what is happening where.

Good to know.  I just wasn't sure whether I wouldn't miss
out on some optimisations.

> | Also if somebody is looking at the attached source, I was
> | wondering why, when I use the commented out code in
> | `newPArray', I get a lot worse code (the STG code is in a
> | comment at the end of the file).  In particular, the lambda
> | abstraction is not inlined, whereas `fill' gets inlined into
> | the code of which the dump is above.  Is it because the
> | compiler has a lot harder time with explicit recursion than
> | with fold/build?  If so, the right RULES magic should allow
> | me to do the same for my own recursively defined
> | combinators, shouldn't it?
> 
> I couldn't figure out exactly what you meant.  The only commented
> out code is STG code.  Maybe send a module with the actual
> source you are bothered about.

Appended the version of the module where the code is
activated (it was commented out with --) - you'll find the
problematic code by searching for (**) in the code.  It
generates

$w$snewPArray
  = \ ww :: Int# w :: Int ->
	case newIntArray# @ RealWorld ww realWorld#
	of wild { (# s2#, mba# #) ->
	case $wsimpleGen
		 ww
		 (\ i :: Int ->
		      case i of wild1 { I# i# ->
		      case w of wild2 { I# e# ->
		      __coerce (ST RealWorld ())
		      (\ s# :: (State# RealWorld) ->
			   case writeIntArray# @ RealWorld mba# i# e# s#
			   of s2#1 { __DEFAULT ->
			   (# s2#1, () #)
			   })
		      }
		      })
		 s2#
	of wild1 { (# new_s, r #) ->
	let {
	  a :: Int
	  a = $wI# ww
	} in  (# a, (__coerce ByteArray# mba#) #)
	}
	}

One thing that is not nice is in the lambda abstraction
based to $wsimpleGen.  There is one lambda, then two
unboxing operations and another lambda for the state
variables s#.  These two nested lambda's become in STG

		stg_c1Op =
		    NO_CCS[] \r[i]
			case i of wild1 {
			  I# i# ->
			      case w of wild2 {
				I# e# ->
				    let {
				      stg_c1Sv =
					  NO_CCS[] \r[s#]
					      case writeIntArray# [mba# i# e# s#] of s2#1 {
						DEFAULT -> (#,#) [s2#1 ()]
					      };
				    } in  stg_c1Sv;
			      };
			};

So, stg_c1Sv is allocated and immediately entered.  I would
have hoped that the two lambda abstractions are merged into
something like this:

		 (\ i :: Int s# :: (State# RealWorld) ->
		      case i of wild1 { I# i# ->
		      case w of wild2 { I# e# ->
		      case writeIntArray# @ RealWorld mba# i# e# s#
			   of s2#1 { __DEFAULT -> (# s2#1, () #)
		      })
		      }
		      })

Maybe it is the __coerce, which prevents this from
happening?

Also, I am wondering why in the case of the 

  foldr (fill mpa e) (return $ unsafeFreezeMPArray mpa) [0..n-1])

the function `fill' gets inlined into the foldr loop, but in
the above, the lambda abstraction gets not inlined into
$wsimpleGen.  Maybe it is because $wsimpleGen itself gets
not inlined...

Cheers,
Manuel

----Next_Part(Fri_Nov__3_00:54:22_2000_559)--
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="PArrays.hs"

--  Parallel Arrays: Sequential prototype
--
--  Authors: Manuel M. T. Chakravarty
--	     Gabriele Keller
--  Created: 26 October 2000
--
--  Version $Revision$ from $Date$
--
--  Copyright (c) 2000 Chakravarty & Keller
--
--  This file is free software; you can redistribute it and/or modify
--  it under the terms of the GNU General Public License as published by
--  the Free Software Foundation; either version 2 of the License, or
--  (at your option) any later version.
--
--  This file is distributed in the hope that it will be useful,
--  but WITHOUT ANY WARRANTY; without even the implied warranty of
--  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
--  GNU General Public License for more details.
--
--- DESCRIPTION ---------------------------------------------------------------
--
--  This module provides unboxed arrays of primitive types as a sequential
--  prototype for the parallel arrays of Gabi's SCL.  They come with all the
--  necessary operations to support flattened Haskell generated from lambdaPA.
--
--- DOCU ----------------------------------------------------------------------
--
--  language: Haskell 98 + GHC extensions (unboxed types and byte arrays)
--
--  We export the immutable `PArray' only; however, we internally also use a
--  mutable variant `MPArray', which allows us an efficient incremental
--  definition of the collective structure.
--
--  The class `PAE' is quite peculiar.  It essentially serves to overload
--  array access (which depends on the size of the unboxed value stored in the
--  parallel array); however, the actual operations in the class are only to
--  be used internally in this module - and therefore, the class `PAE' is
--  exported *abstractly*.  In fact, most operations work on mutable parallel
--  arrays for efficiency.
--
--  We shamelessly steal coding tricks from GHC's `ArrayBase', `IArray' &
--  friends, but without most of that overloading mess.
--
--- TODO ----------------------------------------------------------------------
--

module PArrays (
  PAE, PArray, newPArray, (!|), loop, gen
) where

import PrelGHC
import PrelBase
import PrelST

import ST


infixl 9 !|


-- data structures
-- ---------------

-- * although all `PArray's are represented by the same structure, the
--   functions operating on them differ with the element type; hence, we have
--   to overload these functions

-- integer indexed array (EXPORTED ABSTRACTLY)
--
data PArray e = PArray Int ByteArray#

-- mutable integer indexed array
--
data MPArray s e = MPArray Int (MutableByteArray# s)

-- the class of "Parallel Array Element"s (EXPORTED ABSTRACTLY)
--
class PAE a where
  newMPArray   :: Int                     -> ST s (MPArray s a)
  readMPArray  :: MPArray s a -> Int      -> ST s a
  writeMPArray :: MPArray s a -> Int -> a -> ST s ()
  indexPArray  ::  PArray a   -> Int      -> a


-- exported functions
-- ------------------

-- create new parallel array, where all elements are initialised to the given
-- values (EXPORTED)
--
newPArray     :: PAE e => Int -> e -> PArray e
newPArray n e  = runST (do
  mpa <- newMPArray n
  simpleGen n $				-- (**) These three lines instead
    \i -> writeMPArray mpa i e		--      of the following commented
  return $ unsafeFreezeMPArray mpa)	--      out line gives worse code.
--  foldr (fill mpa e) (return $ unsafeFreezeMPArray mpa) [0..n-1])  -- a la ArrayBase

{-# SPECIALIZE newPArray :: Int -> Int -> PArray Int #-}

fill mpa e i next = writeMPArray mpa i e >> next

-- indexing of a parallel array (EXPORTED)
--
(!|) :: PAE e => PArray e -> Int -> e
(!|)  = indexPArray

-- Gabi's loop (EXPORTED)
--
loop :: () -- ??
loop  = error "loop?"

-- Gabi's gen (EXPORTED)
--
gen :: () -- ??
gen  = error "gen??"


-- instances of `PAE'
-- ------------------

instance PAE Int where
  newMPArray   = newMPArrayInt
  readMPArray  = readMPArrayInt
  writeMPArray = writeMPArrayInt
  indexPArray  = indexPArrayInt

newMPArrayInt           :: Int -> ST s (MPArray s Int)
newMPArrayInt n@(I# n#)  = ST $ \s# ->
  case (newIntArray# n# s#)	     of {(# s2#, mba# #) ->
  (# s2#, MPArray n mba# #)}

readMPArrayInt                         :: MPArray s Int -> Int -> ST s Int
{-# INLINE readMPArrayInt #-}
readMPArrayInt (MPArray _ mba#) (I# i#)  = ST $ \s# ->
  case readIntArray# mba# i# s#      of {(# s2#, r# #) ->
  (# s2#, I# r# #)}

writeMPArrayInt :: MPArray s Int -> Int -> Int -> ST s ()
{-# INLINE writeMPArrayInt #-}
writeMPArrayInt (MPArray _ mba#) (I# i#) (I# e#) = ST $ \s# ->
  case writeIntArray# mba# i# e# s#  of {s2#   ->
  (# s2#, () #)}

indexPArrayInt                        :: PArray Int -> Int -> Int
{-# INLINE indexPArrayInt #-}
indexPArrayInt (PArray _ ba#) (I# i#)  =
  case indexIntArray# ba# i# 	     of {r# ->
  (I# r#)}


-- auxilliary functions
-- --------------------

-- unsafely convert a mutable into an immutable array
--
unsafeFreezeMPArray                  :: MPArray s e -> PArray e
unsafeFreezeMPArray (MPArray n mba#)  = PArray n (unsafeCoerce# mba#)

-- simple generator abstraction
--
simpleGen     :: Monad m => Int -> (Int -> m ()) -> m ()
{-# INLINE simpleGen #-}
simpleGen 0 p  = return ()
simpleGen n p  = p (n - 1) >> simpleGen (n - 1) p


{-

-- That's what we get for a newPArray specialised for Int and using the
-- explicitly recursive simpleGen.

$wsimpleGen =
    NO_CCS srt: (0,2)[] \r[ww w]
	case ww of ds {
	  0 -> $wlvl1;
	  DEFAULT ->
	      let {
		stg_c1Qg =
		    NO_CCS srt: (0,1)[] \r[s1]
			case -# [ds 1] of a {
			  DEFAULT ->
			      let { stg_c1Lg = NO_CCS I#! [a];
			      } in 
				case w stg_c1Lg s1 of wild {
				  (#,#) new_s r -> $wsimpleGen a w new_s;
				}
			};
	      } in  stg_c1Qg
	};
SRT: [$wsimpleGen, $wlvl1]

$w$snewPArray =
    NO_CCS srt: (0,1)[] \r[ww w]
	case newIntArray# [ww realWorld#] of wild {
	  (#,#) s2# mba# ->
	      let {
		stg_c1M1 =
		    NO_CCS[] \r[i]
			case i of wild1 {
			  I# i# ->
			      case w of wild2 {
				I# e# ->
				    let {
				      stg_c1Q7 =
					  NO_CCS[] \r[s#]
					      case writeIntArray# [mba# i# e# s#] of s2#1 {
						DEFAULT -> (#,#) [s2#1 ()]
					      };
				    } in  stg_c1Q7;
			      };
			};
	      } in 
		case $wsimpleGen ww stg_c1M1 s2# of wild1 {
		  (#,#) new_s r -> let { a = NO_CCS I#! [ww]; } in  (#,#) [a mba#];
		};
	};
SRT: [$wsimpleGen]

-- the `case w of' could be pulled out of the loop
-- stg_c1Q7 builds a superfluous closure (which is immediately entered)
-- could we somehow get a specialised version of simpleGen?

-}
----Next_Part(Fri_Nov__3_00:54:22_2000_559)----