Inliner behaviour - tiny changes lead to huge performance differences

Roman Leshchinskiy rl at cse.unsw.edu.au
Fri Nov 13 03:19:12 EST 2009


On 13/11/2009, at 18:04, Bryan O'Sullivan wrote:

> main = do
>   args <- getArgs
>   forM_ args $ \a -> do
>     s <- B.readFile a
>     let t = T.decodeUtf8 s
>     print (T.length t)
>
> The streamUtf8 function looks roughly like this:
>
> streamUtf8 :: OnDecodeError -> ByteString -> Stream Char
> streamUtf8 onErr bs = Stream next 0 (maxSize l)
>     where
>       l = B.length bs
>       next i
>           | i >= l =          Done
>           | U8.validate1 x1 = Yield (unsafeChr8 x1) (i+1)
>           | {- etc. -}
> {-# INLINE [0] streamUtf8 #-}
>
> The values being Yielded from the inner function are, as you can  
> see, themselves constructed by functions.
>
> Originally, with the inner next function manually marked as INLINE,  
> I found that functions like unsafeChr8 were not being inlined by  
> GHC, and performance was terrible due to the amount of boxing and  
> unboxing happening in the inner loop.

Let's see if I understand this correctly. In your code, decodeUtf8  
calls streamUtf8. They both get inlined into main but then unsafeChr8  
does not. Correct?

If so, are you sure that unsafeChr8 is really called in the simplified  
code? IIUC, this isn't necessary if you don't actually inspect the  
Chars (which length presumably doesn't). So perhaps GHC removes the  
call altogether? If not, what does it do with the result?

> I somehow stumbled on the idea of removing the INLINE annotation  
> from next, and performance suddenly improved by a significant  
> integer multiple. This caused the body of streamUtf8 to be inlined  
> into my test program, as I hoped.

Or are you saying that it's streamUtf8 that isn't getting inlined into  
main?

> length :: Text -> Int
> length t = Stream.length (Stream.stream t)
> {-# INLINE length #-}
>
> And the streaming length is:
>
> length :: Stream Char -> Int
> length = S.lengthI
> {-# INLINE[1] length #-}
>
> And the lengthI function is defined more generally, in the hope that  
> I could use it for both Int and Int64 lengths:
>
> lengthI :: Integral a => Stream Char -> a
> lengthI (Stream next s0 _len) = loop_length 0 s0
>     where
>       loop_length !z s  = case next s of
>                            Done       -> z
>                            Skip    s' -> loop_length z s'
>                            Yield _ s' -> loop_length (z + 1) s'
> {-# INLINE[0] lengthI #-}
>
> Unfortunately, although lengthI is inlined into the Int-typed  
> streaming length function, that function is not in turn marked with  
> __inline_me in simplifier output, so the length/decodeUtf8 loops do  
> not fuse. The code is pretty fast, but there's still a lot of boxing  
> and unboxing happening for all the Yields.

Does changing the definition of length to

length = id S.lengthI

help? GHC used to have a bug in this area but I haven't been bitten by  
it for quite some time.

Also, I wonder how Stream.stream is defined. Is it strict in Text? If  
it isn't, does making it strict help?

> All of these flip-flops in inliner behaviour are very difficult to  
> understand, and they seem to be exceedingly fragile. Should I expect  
> the situation to be better with the new inliner in 6.12?

I suspect that the fragility you are seeing is just a symptom of a  
problem in how the UTF-8 library implements stream fusion. It's a bit  
tricky to get everything right. Generally, I've found the simplifier  
to be quite stable and predictable in the last year or so. Simon is  
working hard on making it even better. If you have a spare minute,  
perhaps you could try the HEAD with the new inliner and see if that  
helps? Although I somewhat doubt it, to be honest.

Roman




More information about the Glasgow-haskell-users mailing list