debugging why we end up calling the wrapper rather than the worker

Fri Jun 1 16:20:01 EDT 2007

Hia all,

I'm trying to figure out why this piece of code does not optimise the
way I expect. It's binary serialisation again. Yes, again.

The crux is the write function

write :: Int -> (Ptr Word8 -> IO ()) -> Put ()
write !n body = Put $ \c buf@(Buffer fp o u l) ->
  if n <= l
    then write' c fp o u l
    else write' (flushOld c n fp o u) (newBuffer c n) 0 0 0

  where {-# NOINLINE write' #-}
        write' c !fp !o !u !l =
          -- warning: this is a tad hardcore
          B.inlinePerformIO
            (withForeignPtr fp
              (\p -> body $! (p `plusPtr` (o+u))))
          `seq` c () (Buffer fp o (u+n) (l-n))

In the if condition, the first path is the fast path. We do want this
write' join point, we want to make a call to write' at runtime rather
than inlining it into both the fast and slow paths. This is why we tag
it with NOINLINE. Simon kindly recently fixed a bug where this NOINLINE
was not remembered in the .hi files, so it now really does not inline
write' into both branches.

However, looking at the core/stg we can see that we're always calling
the write' wrapper function that takes boxed arguments and calls the
wrapper. It seems to me that the calls in both branches ought to be
calls directly to the worker rather than the wrapper. We do have all the
unboxed arguments available but they get boxed up to make the call to
the write' wrapper. This isn't going to do good things for performance.

Here's the STG code:

So the important question is whether we have unboxed versions of all the
args available. First of all lets look at the wrapper and see what args
it unboxes when it calls the worker ($wwrite'_s17v in this case).

write'_s17W = sat-only \r [w1_s17P w2_s17C w3_s17G w4_s17J w5_s17M]
  case w2_s17C of w6_s18S {
    GHC.ForeignPtr.ForeignPtr ww6_s17Q ww7_s17R ->
      case w3_s17G of w7_s18T {
        GHC.Base.I# ww8_s17S ->
          case w4_s17J of w8_s18U {
            GHC.Base.I# ww9_s17T ->
              case w5_s17M of w9_s18V {
                GHC.Base.I# ww10_s17U ->
                  $wwrite'_s17v
                  w1_s17P
                  ww6_s17Q
                  ww7_s17R
                  ww8_s17S
                  ww9_s17T
                  ww10_s17U;
        };
      };
    };
  };

So out of the args [w1_s17P w2_s17C w3_s17G w4_s17J w5_s17M] all but the
first (which is a continuation function) get unboxed and the bits passed
on to $wwrite'_s17v. So we should expect that if we have all those
components available that we could make a direct call to the worker
$wwrite'_s17v rather than going via the wrapper write'_s17W.

So lets look at the STG code that we get from this bit of source:

  if n <= l
    then write' c fp o u l
    else write' (flushOld c n fp o u) (newBuffer c n) 0 0 0

(The STG code here is from a use when n was 3):

We want to see if all the components that the worker $wwrite'_s17v needs
are available:

case <=# [3 ww5_s17X] of wild1_s18W {
  GHC.Base.False ->
  case Put.newBuffer w_s17Z n1_r16x of sat_s18d {
    __DEFAULT ->
      let { sat_s189 = NO_CCS GHC.Base.I#! [ww4_s187]; } in
      let { sat_s186 = NO_CCS GHC.Base.I#! [ww3_s184]; } in
      let { sat_s183 = NO_CCS GHC.ForeignPtr.ForeignPtr! [ww1_s180 ww2_s181]; } in
      let { sat_s18b = \u [] Put.flushOld w_s17Z n1_r16x sat_s183 sat_s186 sat_s189;
      } in  write'_s17W sat_s18b sat_s18d lvl_r16z lvl_r16z lvl_r16z;
      };
  GHC.Base.True ->
    let { sat_s18l = NO_CCS GHC.Base.I#! [ww5_s17X]; } in
    let { sat_s18j = NO_CCS GHC.Base.I#! [ww4_s187]; } in
    let { sat_s18h = NO_CCS GHC.Base.I#! [ww3_s184]; } in
    let { sat_s18f = NO_CCS GHC.ForeignPtr.ForeignPtr! [ww1_s180 ww2_s181];
    } in  write'_s17W w_s17Z sat_s18f sat_s18h sat_s18j sat_s18l;
};

In the False case it looks like there is an excuse for calling the
wrapper, because we're calling the wrapper for both Put.flushOld and
Put.newBuffer. For Put.flushOld it's just to construct the continuation
which is a boxed arg anyway, so that's not stopping us calling the
worker. The Put.newBuffer returns a ForeignPtr and the write' worker
takes the components of the ForeignPtr unboxed. However we are doing a
case analysis on the result of Put.newBuffer so we could easily extract
the components and pass them on to the write' worker.

In the True case there is no excuse at all as far as I can figure out.
We are explicitly boxing up exactly the arguments that the wrapper
unboxes. So we really should be able to call the worker directly with
the appropriate unboxed values as args.

So what is going on here? Does NOINLINE prevent calling the worker or
something?

>From a quick experiment it would appear so:

{-# OPTIONS_GHC -fbang-patterns #-}
module Foo (foo, bar) where

foo :: Int -> Int
foo n = bar n

bar :: Int -> Int
bar !n = bar (n+1)

As is, bar will not be inlined because it's recursive and looking at the
STG code we see that foo makes a call to bar's worker:

Foo.$wbar = \r [ww_sc8] case +# [ww_sc8 1] of sat_sca { __DEFAULT -> Foo.$wbar sat_sca; };
Foo.bar = \r [w_scd] case w_scd of w1_scl { GHC.Base.I# ww_scg -> Foo.$wbar ww_scg; };
Foo.foo = \r [eta_sck] Foo.bar eta_sck;

however if we add in what you'd think is a redundant pragma
{-# NOINLINE bar #-}

and look at the stg code again:

Foo.bar = \r [w_sc9] case w_sc9 of w1_sco { GHC.Base.I# ww_scc -> Foo.$wbar ww_scc; };
Foo.$wbar =
    \r [ww_scf]
        case +# [ww_scf 1] of sat_sch {
          __DEFAULT ->
              let { sat_scj = NO_CCS GHC.Base.I#! [sat_sch]; } in  Foo.bar sat_scj;
        };
Foo.foo = \r [eta_scn] Foo.bar eta_scn;

then we see that foo is now calling bar's wrapper, and what's worse, bar
is calling it's wrapper in the recursive call! Oh noes!

So it seems to me that NOINLINE should prevent inlining but not prevent
calling the worker rather than the wrapper. I don't fully understand how
NOINLINE interacts with the worker/wrapper transform (or I wouldn't have
been surprised by this behaviour). I'm guessing that it works by doing
the worker/wrapper split and then trying to inline the wrapper into as
many call sites as possible. If this is indeed how it works then it'd
explain why attaching NOINLINE to the function causes the observed
behaviour since looking at the .hi file we see that the NOINLINE is
attached to the wrapper function and not the worker.

So perhaps the solution is to attach the NOINLINE to the worker rather
than the wrapper when doing the worker/wrapper split. Would that work or
cause other problems?

Seems otherwise I'm stuck. I thought I could use NOINLINE to control the
creation of join points like in my original example.

Duncan