Inliner behaviour - tiny changes lead to huge performance
differences
Roman Leshchinskiy
rl at cse.unsw.edu.au
Fri Nov 13 03:19:12 EST 2009
On 13/11/2009, at 18:04, Bryan O'Sullivan wrote:
> main = do
> args <- getArgs
> forM_ args $ \a -> do
> s <- B.readFile a
> let t = T.decodeUtf8 s
> print (T.length t)
>
> The streamUtf8 function looks roughly like this:
>
> streamUtf8 :: OnDecodeError -> ByteString -> Stream Char
> streamUtf8 onErr bs = Stream next 0 (maxSize l)
> where
> l = B.length bs
> next i
> | i >= l = Done
> | U8.validate1 x1 = Yield (unsafeChr8 x1) (i+1)
> | {- etc. -}
> {-# INLINE [0] streamUtf8 #-}
>
> The values being Yielded from the inner function are, as you can
> see, themselves constructed by functions.
>
> Originally, with the inner next function manually marked as INLINE,
> I found that functions like unsafeChr8 were not being inlined by
> GHC, and performance was terrible due to the amount of boxing and
> unboxing happening in the inner loop.
Let's see if I understand this correctly. In your code, decodeUtf8
calls streamUtf8. They both get inlined into main but then unsafeChr8
does not. Correct?
If so, are you sure that unsafeChr8 is really called in the simplified
code? IIUC, this isn't necessary if you don't actually inspect the
Chars (which length presumably doesn't). So perhaps GHC removes the
call altogether? If not, what does it do with the result?
> I somehow stumbled on the idea of removing the INLINE annotation
> from next, and performance suddenly improved by a significant
> integer multiple. This caused the body of streamUtf8 to be inlined
> into my test program, as I hoped.
Or are you saying that it's streamUtf8 that isn't getting inlined into
main?
> length :: Text -> Int
> length t = Stream.length (Stream.stream t)
> {-# INLINE length #-}
>
> And the streaming length is:
>
> length :: Stream Char -> Int
> length = S.lengthI
> {-# INLINE[1] length #-}
>
> And the lengthI function is defined more generally, in the hope that
> I could use it for both Int and Int64 lengths:
>
> lengthI :: Integral a => Stream Char -> a
> lengthI (Stream next s0 _len) = loop_length 0 s0
> where
> loop_length !z s = case next s of
> Done -> z
> Skip s' -> loop_length z s'
> Yield _ s' -> loop_length (z + 1) s'
> {-# INLINE[0] lengthI #-}
>
> Unfortunately, although lengthI is inlined into the Int-typed
> streaming length function, that function is not in turn marked with
> __inline_me in simplifier output, so the length/decodeUtf8 loops do
> not fuse. The code is pretty fast, but there's still a lot of boxing
> and unboxing happening for all the Yields.
Does changing the definition of length to
length = id S.lengthI
help? GHC used to have a bug in this area but I haven't been bitten by
it for quite some time.
Also, I wonder how Stream.stream is defined. Is it strict in Text? If
it isn't, does making it strict help?
> All of these flip-flops in inliner behaviour are very difficult to
> understand, and they seem to be exceedingly fragile. Should I expect
> the situation to be better with the new inliner in 6.12?
I suspect that the fragility you are seeing is just a symptom of a
problem in how the UTF-8 library implements stream fusion. It's a bit
tricky to get everything right. Generally, I've found the simplifier
to be quite stable and predictable in the last year or so. Simon is
working hard on making it even better. If you have a spare minute,
perhaps you could try the HEAD with the new inliner and see if that
helps? Although I somewhat doubt it, to be honest.
Roman
More information about the Glasgow-haskell-users
mailing list