Fun with GHC's optimiser
Manuel M. T. Chakravarty
chak@cse.unsw.edu.au
Fri, 03 Nov 2000 00:54:25 +1100
----Next_Part(Fri_Nov__3_00:54:22_2000_559)--
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Simon Peyton-Jones <simonpj@microsoft.com> wrote,
> I can never resist messages like these, even when I'm meant
> to be doing other things.
That's good to know ;-)
> It's very helpful when people offer
> fairly precise performance-bug reports. Thanks!
Thanks for the prompt reply! To SimonM, too.
> | I am wondering whether there is a particular reason why the
> | optimiser doesn't pull the
> |
> | (1) a = NO_CCS PArray! [wild1 mba#];
>
> This one is a definite bug. It turns out that the head of the
> before-ghci-branch doesn't have this bug, so I'm disinclined
> to investigate it further.
No problem, I can always build a new ghc from CVS.
> | (2) case w of wild3 {
> | I# e# ->
> |
> | As for (2), the loop would be nice and straight if that
> | unboxing where outside of the loop - as it is, we break the
> | pipeline once per iteration it seems
>
> This one is a bit harder. Basically we want to make a wrapper
> for a recursive function if it's sure to evaluate its free variables.
>
> In fact the 'liberate-case' pass (which isn't switched on in 4.08)
> is meant to do just this. It's in simplCore/LiberateCase.lhs,
> and it's not very complicated. I've just tried it and it doesn't seem
> to have the desired effect, but I'm sure that's for a boring reason.
> If anyone would like to fix it, go ahead!
Ok - something for a rainy weekend, I guess...
> Incidentally, you'll find that -ddump-simpl gives you a dump that
> is pretty close to STG and usually much more readable. Most
> performance bugs show up there. -dverbose-simpl gives you more
> clues about what is happening where.
Good to know. I just wasn't sure whether I wouldn't miss
out on some optimisations.
> | Also if somebody is looking at the attached source, I was
> | wondering why, when I use the commented out code in
> | `newPArray', I get a lot worse code (the STG code is in a
> | comment at the end of the file). In particular, the lambda
> | abstraction is not inlined, whereas `fill' gets inlined into
> | the code of which the dump is above. Is it because the
> | compiler has a lot harder time with explicit recursion than
> | with fold/build? If so, the right RULES magic should allow
> | me to do the same for my own recursively defined
> | combinators, shouldn't it?
>
> I couldn't figure out exactly what you meant. The only commented
> out code is STG code. Maybe send a module with the actual
> source you are bothered about.
Appended the version of the module where the code is
activated (it was commented out with --) - you'll find the
problematic code by searching for (**) in the code. It
generates
$w$snewPArray
= \ ww :: Int# w :: Int ->
case newIntArray# @ RealWorld ww realWorld#
of wild { (# s2#, mba# #) ->
case $wsimpleGen
ww
(\ i :: Int ->
case i of wild1 { I# i# ->
case w of wild2 { I# e# ->
__coerce (ST RealWorld ())
(\ s# :: (State# RealWorld) ->
case writeIntArray# @ RealWorld mba# i# e# s#
of s2#1 { __DEFAULT ->
(# s2#1, () #)
})
}
})
s2#
of wild1 { (# new_s, r #) ->
let {
a :: Int
a = $wI# ww
} in (# a, (__coerce ByteArray# mba#) #)
}
}
One thing that is not nice is in the lambda abstraction
based to $wsimpleGen. There is one lambda, then two
unboxing operations and another lambda for the state
variables s#. These two nested lambda's become in STG
stg_c1Op =
NO_CCS[] \r[i]
case i of wild1 {
I# i# ->
case w of wild2 {
I# e# ->
let {
stg_c1Sv =
NO_CCS[] \r[s#]
case writeIntArray# [mba# i# e# s#] of s2#1 {
DEFAULT -> (#,#) [s2#1 ()]
};
} in stg_c1Sv;
};
};
So, stg_c1Sv is allocated and immediately entered. I would
have hoped that the two lambda abstractions are merged into
something like this:
(\ i :: Int s# :: (State# RealWorld) ->
case i of wild1 { I# i# ->
case w of wild2 { I# e# ->
case writeIntArray# @ RealWorld mba# i# e# s#
of s2#1 { __DEFAULT -> (# s2#1, () #)
})
}
})
Maybe it is the __coerce, which prevents this from
happening?
Also, I am wondering why in the case of the
foldr (fill mpa e) (return $ unsafeFreezeMPArray mpa) [0..n-1])
the function `fill' gets inlined into the foldr loop, but in
the above, the lambda abstraction gets not inlined into
$wsimpleGen. Maybe it is because $wsimpleGen itself gets
not inlined...
Cheers,
Manuel
----Next_Part(Fri_Nov__3_00:54:22_2000_559)--
Content-Type: Text/Plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Disposition: attachment; filename="PArrays.hs"
-- Parallel Arrays: Sequential prototype
--
-- Authors: Manuel M. T. Chakravarty
-- Gabriele Keller
-- Created: 26 October 2000
--
-- Version $Revision$ from $Date$
--
-- Copyright (c) 2000 Chakravarty & Keller
--
-- This file is free software; you can redistribute it and/or modify
-- it under the terms of the GNU General Public License as published by
-- the Free Software Foundation; either version 2 of the License, or
-- (at your option) any later version.
--
-- This file is distributed in the hope that it will be useful,
-- but WITHOUT ANY WARRANTY; without even the implied warranty of
-- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-- GNU General Public License for more details.
--
--- DESCRIPTION ---------------------------------------------------------------
--
-- This module provides unboxed arrays of primitive types as a sequential
-- prototype for the parallel arrays of Gabi's SCL. They come with all the
-- necessary operations to support flattened Haskell generated from lambdaPA.
--
--- DOCU ----------------------------------------------------------------------
--
-- language: Haskell 98 + GHC extensions (unboxed types and byte arrays)
--
-- We export the immutable `PArray' only; however, we internally also use a
-- mutable variant `MPArray', which allows us an efficient incremental
-- definition of the collective structure.
--
-- The class `PAE' is quite peculiar. It essentially serves to overload
-- array access (which depends on the size of the unboxed value stored in the
-- parallel array); however, the actual operations in the class are only to
-- be used internally in this module - and therefore, the class `PAE' is
-- exported *abstractly*. In fact, most operations work on mutable parallel
-- arrays for efficiency.
--
-- We shamelessly steal coding tricks from GHC's `ArrayBase', `IArray' &
-- friends, but without most of that overloading mess.
--
--- TODO ----------------------------------------------------------------------
--
module PArrays (
PAE, PArray, newPArray, (!|), loop, gen
) where
import PrelGHC
import PrelBase
import PrelST
import ST
infixl 9 !|
-- data structures
-- ---------------
-- * although all `PArray's are represented by the same structure, the
-- functions operating on them differ with the element type; hence, we have
-- to overload these functions
-- integer indexed array (EXPORTED ABSTRACTLY)
--
data PArray e = PArray Int ByteArray#
-- mutable integer indexed array
--
data MPArray s e = MPArray Int (MutableByteArray# s)
-- the class of "Parallel Array Element"s (EXPORTED ABSTRACTLY)
--
class PAE a where
newMPArray :: Int -> ST s (MPArray s a)
readMPArray :: MPArray s a -> Int -> ST s a
writeMPArray :: MPArray s a -> Int -> a -> ST s ()
indexPArray :: PArray a -> Int -> a
-- exported functions
-- ------------------
-- create new parallel array, where all elements are initialised to the given
-- values (EXPORTED)
--
newPArray :: PAE e => Int -> e -> PArray e
newPArray n e = runST (do
mpa <- newMPArray n
simpleGen n $ -- (**) These three lines instead
\i -> writeMPArray mpa i e -- of the following commented
return $ unsafeFreezeMPArray mpa) -- out line gives worse code.
-- foldr (fill mpa e) (return $ unsafeFreezeMPArray mpa) [0..n-1]) -- a la ArrayBase
{-# SPECIALIZE newPArray :: Int -> Int -> PArray Int #-}
fill mpa e i next = writeMPArray mpa i e >> next
-- indexing of a parallel array (EXPORTED)
--
(!|) :: PAE e => PArray e -> Int -> e
(!|) = indexPArray
-- Gabi's loop (EXPORTED)
--
loop :: () -- ??
loop = error "loop?"
-- Gabi's gen (EXPORTED)
--
gen :: () -- ??
gen = error "gen??"
-- instances of `PAE'
-- ------------------
instance PAE Int where
newMPArray = newMPArrayInt
readMPArray = readMPArrayInt
writeMPArray = writeMPArrayInt
indexPArray = indexPArrayInt
newMPArrayInt :: Int -> ST s (MPArray s Int)
newMPArrayInt n@(I# n#) = ST $ \s# ->
case (newIntArray# n# s#) of {(# s2#, mba# #) ->
(# s2#, MPArray n mba# #)}
readMPArrayInt :: MPArray s Int -> Int -> ST s Int
{-# INLINE readMPArrayInt #-}
readMPArrayInt (MPArray _ mba#) (I# i#) = ST $ \s# ->
case readIntArray# mba# i# s# of {(# s2#, r# #) ->
(# s2#, I# r# #)}
writeMPArrayInt :: MPArray s Int -> Int -> Int -> ST s ()
{-# INLINE writeMPArrayInt #-}
writeMPArrayInt (MPArray _ mba#) (I# i#) (I# e#) = ST $ \s# ->
case writeIntArray# mba# i# e# s# of {s2# ->
(# s2#, () #)}
indexPArrayInt :: PArray Int -> Int -> Int
{-# INLINE indexPArrayInt #-}
indexPArrayInt (PArray _ ba#) (I# i#) =
case indexIntArray# ba# i# of {r# ->
(I# r#)}
-- auxilliary functions
-- --------------------
-- unsafely convert a mutable into an immutable array
--
unsafeFreezeMPArray :: MPArray s e -> PArray e
unsafeFreezeMPArray (MPArray n mba#) = PArray n (unsafeCoerce# mba#)
-- simple generator abstraction
--
simpleGen :: Monad m => Int -> (Int -> m ()) -> m ()
{-# INLINE simpleGen #-}
simpleGen 0 p = return ()
simpleGen n p = p (n - 1) >> simpleGen (n - 1) p
{-
-- That's what we get for a newPArray specialised for Int and using the
-- explicitly recursive simpleGen.
$wsimpleGen =
NO_CCS srt: (0,2)[] \r[ww w]
case ww of ds {
0 -> $wlvl1;
DEFAULT ->
let {
stg_c1Qg =
NO_CCS srt: (0,1)[] \r[s1]
case -# [ds 1] of a {
DEFAULT ->
let { stg_c1Lg = NO_CCS I#! [a];
} in
case w stg_c1Lg s1 of wild {
(#,#) new_s r -> $wsimpleGen a w new_s;
}
};
} in stg_c1Qg
};
SRT: [$wsimpleGen, $wlvl1]
$w$snewPArray =
NO_CCS srt: (0,1)[] \r[ww w]
case newIntArray# [ww realWorld#] of wild {
(#,#) s2# mba# ->
let {
stg_c1M1 =
NO_CCS[] \r[i]
case i of wild1 {
I# i# ->
case w of wild2 {
I# e# ->
let {
stg_c1Q7 =
NO_CCS[] \r[s#]
case writeIntArray# [mba# i# e# s#] of s2#1 {
DEFAULT -> (#,#) [s2#1 ()]
};
} in stg_c1Q7;
};
};
} in
case $wsimpleGen ww stg_c1M1 s2# of wild1 {
(#,#) new_s r -> let { a = NO_CCS I#! [ww]; } in (#,#) [a mba#];
};
};
SRT: [$wsimpleGen]
-- the `case w of' could be pulled out of the loop
-- stg_c1Q7 builds a superfluous closure (which is immediately entered)
-- could we somehow get a specialised version of simpleGen?
-}
----Next_Part(Fri_Nov__3_00:54:22_2000_559)----