Proposal: Non-allocating way to iterate over a Data.Map: traverseWithKey_

Sat Aug 3 23:31:45 CEST 2013

For example the STG for the following leaves the allocation of the tuple and two Ints:

foo :: Int -> IO (Int,Int)
foo x | x < 10 = return (x, 2*x)
foo x = foo (x-1)

Fixing this involves *nested* CPR analysis, which I am working on at the moment.

The Data.Map.Base.foldRWithKey function discussed in this thread is another example.  That's a place where even after it inlines the provided function into the fold, we end up with the below STG with an allocation of an IO () function inside the inner loop:

This one I do not understand. Could you pull out the two alternative ways of phrasing this algorithm into a standalone form, in which one allocates more than t'other?  Then I could investigate more easily.

thanks!

Simon

From: Libraries [mailto:libraries-bounces at haskell.org] On Behalf Of Ryan Newton
Sent: 03 August 2013 08:45
To: Roman Cheplyaka
Cc: Haskell Libraries
Subject: Re: Proposal: Non-allocating way to iterate over a Data.Map: traverseWithKey_

> Re: expectations.  You don't get a funny feeling when monadic values are
> used as first class rather than second class ;-)?  Whether in the
> accumulator of a fold, or in cases like this:
>
>   do act <- f x
>        act
I don't. I don't even believe that the compiler can spot the difference
between the two. (But maybe it's just my ignorance.)

Do you have a specific example where this is a problem?

Well, I just generally assumed that anywhere where inlining is foiled would, any kind of allocation-eliminating optimization is unlikely to happen.

As one example, in the monad-par scheduler(s) <https://github.com/simonmar/monad-par/blob/0e0401ea2cfbab787c67c2ed826c123c5d83112f/monad-par/Control/Monad/Par/Scheds/TraceInternal.hs#L144> we do inscrutable computations to retrieve a monadic-value, and then run it.  This works fine, but it does mean that we pay the full cost of an abstract monadic action at that point (indirect jump, etc), rather than the highly discounted rate we are used to paying when we chain together monadic actions in a syntactically adjacent way within a function.

Personally, I don't understand all the GHC optimization steps well.  But I have verified certain things that it doesn't do.  For example, if I have an allocating action in the tail (only exiting case) of a recursive monadic function, it cannot do "loop peeling" to eliminate that allocation against whatever context is calling the recursive function.  For example the STG for the following leaves the allocation of the tuple and two Ints:

foo :: Int -> IO (Int,Int)
foo x | x < 10 = return (x, 2*x)
foo x = foo (x-1)

main = do
  (x,y) <- foo 1000
  print x
  print y

Which is normally fine if the trip count is high.  But consider something like compare-and-swap where there's a loop, but the expected trip count is very low.  Perhaps this is just one example of how GHC tends not to explicitly represent and optimize loops?

The Data.Map.Base.foldRWithKey function discussed in this thread is another example.  That's a place where even after it inlines the provided function into the fold, we end up with the below STG with an allocation of an IO () function inside the inner loop:

go10_r3TD :: IO () -> Map Int Int -> IO ()
[GblId, Arity=2, Str=DmdType LS, Unf=OtherCon []] =
    sat-only \r srt:(0,*bitmap*) [z'_s3Yk ds_s3Y7]
        case ds_s3Y7 of _ {
          Bin rb_s43z kx_s3Ye x_s43A l_s3Ys r_s3Yl ->
              case kx_s3Ye of _ {
                I# x1_s3Yi ->
                    let {
                      sat_s43x :: IO ()
                      [LclId] =
                          \r srt:(0,*bitmap*) [eta_s3Ym]
                              case x1_s3Yi of _ {
                                __DEFAULT -> go10_r3TD z'_s3Yk r_s3Yl eta_s3Ym;
                                500000 ->
                                    case hPutStr2 stdout lvl_r3Ty True eta_s3Ym of _ {
                                      (#,#) ipv_s3Yq _ -> go10_r3TD z'_s3Yk r_s3Yl ipv_s3Yq;
                                    };
                              };
                    } in  go10_r3TD sat_s43x l_s3Ys;
              };
          Tip -> z'_s3Yk;
        };
SRT(go10_r3TD): [hPutStr2, stdout, lvl_r3Ty, go10_r3TD]

The specific problem in this example seems to be that -- based on a literal reading of the above -- it's creating a closure that closes over the Int#, x1_s3Yi.  Shachaf's version that uses a newtype seems to avoid this trouble by not allowing IO's (>>=) into it at all.  The traverseWithKey_ version is attached below [1], and it manages to get rid of the IO newtype in the loop and resolves to a State#.

Perhaps there is a missing optimization tweak that would help GHC get rid of the IO type in the above STG?

Cheers,
  -Ryan

[1] P.S.  Here's the non-allocating loop produced by traverseWithKey_:

a_r3X7
  :: Map Int Int -> State# RealWorld -> (# State# RealWorld, () #)
[GblId, Arity=2, Str=DmdType SL, Unf=OtherCon []] =
    sat-only \r srt:(0,*bitmap*) [ds_s41Z eta_s42c]
        case ds_s41Z of _ {
          Bin rb_s47H k_s426 v_s47I l_s42b r_s42g ->
              case k_s426 of _ {
                I# x_s429 ->
                    case x_s429 of _ {
                      __DEFAULT ->
                          case a_r3X7 l_s42b eta_s42c of _ {
                            (#,#) ipv_s42h _ -> a_r3X7 r_s42g ipv_s42h;
                          };
                      500000 ->
                          case hPutStr2 stdout lvl_r3X1 True eta_s42c of _ {
                            (#,#) ipv_s42l _ ->
                                case a_r3X7 l_s42b ipv_s42l of _ {
                                  (#,#) ipv2_s42p _ -> a_r3X7 r_s42g ipv2_s42p;
                                };
                          };
                    };
              };
          Tip -> (#,#) [eta_s42c ()];
        };
SRT(a_r3X7): [hPutStr2, stdout, lvl_r3X1, a_r3X7]

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://www.haskell.org/pipermail/libraries/attachments/20130803/f41c7f5a/attachment.htm>