[Haskell-beginners] The cost of generality, or how expensive is realToFrac?

Sun Sep 19 10:56:21 EDT 2010

On Sunday 19 September 2010 02:41:36, Greg wrote:
> > It would be interesting to see what core GHC produces for that (you
> > can get the core with the `-ddump-simpl' command line flag [redirect
> > stdout to a file] or with the ghc-core tool [available on hackage]).
> > If it runs as fast as realToFrac :: Double -> Float (with
> > optimisations), GHC must have rewritten realToFrac to double2Float#
> > and it should only do that if there are rewrite rules for GLclampf.
>
> I'm not sure if you literally meant you wanted to see the output or not,

Yes, but only if you were willing to take the trouble of producing it.
I actually was more interested in the core for the real app, but the core 
for the toy benchmark is already interesting (see below).

> but I've attached a zip of the dump files and my simple source file. 
> The dump file naming is cryptic, but the first letters refer to the
> definition of 'convert' where:
>
> fTF:   use the floatToFloat function in the source file
> rTF:   use the standard realToFrac
> fRtR: use (fromRational . toRational)
>
> The next three characters indicate the type signature of convert:
>
> d2f: Double -> Float
> d2g: Double -> GL.GLclampf
>
> I'd summarize the results, but apparently I took the blue pill and can't
> make heads or tails of what I'm seeing in the dump format...
>

Okay, for the results for the Double -> Float conversion,
fromRational . toRational took ~3.35 seconds
floatToFloat took 32 ms
realToFrac took 8 ms

(always compiled with -O2; the times are slightly higher than the criterion 
benchmarking results from Wednesday/Thursday, that's probably because those 
ran pre-warmed while today's run-once started up cold [and included a call 
to getCPUTime]).
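
For reference, a minimal sketch of that kind of run-once timing. The workload (summing the converted reciprocals 1/k for k = 1..100000) is only my reconstruction from the core further down, not the actual source file, and the names `convert` and `workload` are made up for illustration. Swap the body of `convert` to compare variants; floatToFloat is not reproduced here.

```haskell
module Main where

import System.CPUTime (getCPUTime)
import Text.Printf (printf)

-- The conversion under test. Swap in the other variant to compare.
convert :: Double -> Float
convert = realToFrac
-- convert = fromRational . toRational

-- Assumed workload: sum the converted reciprocals 1/1 .. 1/n.
workload :: Int -> Float
workload n = sum (map (convert . (1 /) . fromIntegral) [1 .. n])

main :: IO ()
main = do
  start <- getCPUTime                 -- CPU time in picoseconds
  let s = workload 100000
  s `seq` return ()                   -- force the sum before stopping the clock
  end <- getCPUTime
  let ms = fromIntegral (end - start) / 1e9 :: Double
  printf "sum = %f, cpu time = %.3f ms\n" (realToFrac s :: Double) ms
```

Compile with -O2, otherwise the rewrite rules for realToFrac are not applied at all and both variants go through Rational.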

Now to GLclampf. I remembered that I had installed OpenGL with one of my 
old GHCs (turned out to be 6.10.3), so I could also run the tests for 
Double -> GLclampf.
Unsurprisingly, fromRational . toRational and floatToFloat had the same 
performance as for Double -> Float. Equally unsurprisingly, were it not for 
your results and the core you sent, realToFrac had the same performance as 
fromRational . toRational.

In the core you sent for realToFrac :: Double -> GLclampf, we find the loop 
for summing a list of GLclampf:

Rec {
$wlgo_r1wv :: GHC.Prim.Float#
              -> [Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf]
              -> GHC.Prim.Float#
GblId
[Arity 2
 NoCafRefs
 Str: DmdType LS]
$wlgo_r1wv =
  \ (ww_s1vV :: GHC.Prim.Float#)
    (w_s1vX :: [Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf]) ->
    case w_s1vX of _ {
      [] -> ww_s1vV;
      : x_aVE xs_aVF ->
        case x_aVE of _ { GHC.Types.F# y_a13F ->
        $wlgo_r1wv (GHC.Prim.plusFloat# ww_s1vV y_a13F) xs_aVF
        }
    }
end Rec }

Wow, did you remove the casting annotations or does it really match a 
GLclampf against the Float constructor F# without any ado?
If the latter, which compiler version have you?
Just for the record, 6.10.3 produced the same code, but with several levels 
of casting from Float to GLclampf.

More interesting is the generation of the list:

Rec {
go_r1wx :: GHC.Prim.Int#
           -> [Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf]
GblId
[Arity 1
 NoCafRefs
 Str: DmdType L]
go_r1wx =
  \ (x_a13o :: GHC.Prim.Int#) ->
    GHC.Types.:
      @ Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf
      (case GHC.Prim./## 1.0 (GHC.Prim.int2Double# x_a13o)
       of wild2_a14i { __DEFAULT ->
       GHC.Types.F# (GHC.Prim.double2Float# wild2_a14i)
       })
      (case x_a13o of wild_B1 {
         __DEFAULT -> go_r1wx (GHC.Prim.+# wild_B1 1);
         100000 ->
           GHC.Types.[] @ Graphics.Rendering.OpenGL.GL.BasicTypes.GLclampf
       })
end Rec }

Wowwowwow, it conses a Float to a list of GLclampf without even mentioning 
a cast. Since it feels free to do that, no wonder that it uses 
double2Float#.
Hrm, okay, perhaps a new version of OpenGL[Raw]? Nope, 2.4.0.1 and 1.1.0.1, 
what I have with 6.10.3.
So, perhaps it's 6.12 vs. 6.10? Install OpenGL for 6.12.3, try, nope, same 
as 6.10.3, the summing is identical except for the casting annotations, but 
the generation goes through fromRational and toRational [expected, because 
there are no rewrite rules in OpenGLRaw].

What compiler are you using? HEAD? The core doesn't look like HEAD's core 
to me, but that might be because nothing except main is exported.

Okay, so I threw a couple of rewrite rules into OpenGLRaw, reinstalled and 
reran, now realToFrac gets properly rewritten to double2Float# (with 
casts).
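
Such rules can be sketched roughly like this. Note that `Clampf` is a hypothetical stand-in newtype defined locally, not the real GL.GLclampf, and this is not the exact text I put into OpenGLRaw; it only mirrors the way base specialises realToFrac for Double and Float.

```haskell
{-# LANGUAGE GeneralizedNewtypeDeriving #-}
module Main where

import GHC.Float (double2Float)

-- Hypothetical stand-in for GL.GLclampf: a newtype over Float.
newtype Clampf = Clampf Float
  deriving (Eq, Ord, Show, Num, Fractional, Real)

-- Monomorphic helpers used as the right-hand sides of the rules.
doubleToClampf :: Double -> Clampf
doubleToClampf x = Clampf (double2Float x)

floatToClampf :: Float -> Clampf
floatToClampf = Clampf

-- The types of the right-hand sides pin down which instance of
-- realToFrac each rule applies to.
{-# RULES
"realToFrac/Double->Clampf" realToFrac = doubleToClampf
"realToFrac/Float->Clampf"  realToFrac = floatToClampf
  #-}

convert :: Double -> Clampf
convert = realToFrac

main :: IO ()
main = print (convert 0.25)
```

With -O the rules should show up in the -ddump-simpl-stats output if they fire; without -O the code still works, it just takes the slow path through Rational.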

> > In that case, the problem is probably that GHC doesn't see the
> > realToFrac applications because they're too deeply wrapped in your
> > coordToCoord2D calls.

Okay, your compiler *does* rewrite realToFrac :: Double -> GLclampf to 
double2Float#, at least when the situation is simple enough, although there 
are no rewrite rules in the package for that.
Looks like a fortuitous bug.
But it doesn't do the rewriting in the real app, so it's probably indeed 
too deeply wrapped there.

> >
> > If that is the problem, it might help to use {-# INLINE #-} pragmas on
> > coordToCoord2D, fromCartesian2D and toCartesian2D.
> > Can you try with realToFrac and the {-# INLINE #-} pragmas?
>
> I tried inlining the functions you suggest with little effect.  The
> realToFrac version (in this case I just set floatToFloat=realToFrac to
> save the search and replace effort) is just too heavily loaded to see
> any difference at all (98+% of CPU is spent in realToFrac).  The same
> inlining using my definition of floatToFloat gave me a 10% improvement
> from 50% -> 46% of the CPU spent in floatToFloat and an inverse change
> in allocation to match.
>
> Best I can tell, the inlining is being recognized, but just not changing
> much.
>

Looking at the Coord stuff more closely, you'd probably need much more 
inlining to get a good effect. And you probably need a bit more strictness 
too.

============================================================

--Coord2D is a typeclass I created to hold 2D data
data Cartesian2D a = Cartesian2D a a deriving (Show, Eq, Read)

-- Needs testing, but I suspect
{-
data Cartesian2D a = Cartesian2D !a !a deriving (...)

or even

data Cartesian2D a = Cartesian2D {-# UNPACK #-} !a {-# UNPACK #-} !a
    deriving (...)
-}
-- would have a beneficial effect.

{- Pair instances -}
instance (RealFloat a, RealFloat b) => Coord2D (a,b) where
  xComponent = realToFrac . fst
  yComponent = realToFrac . snd
  fromCartesian2D p = ((xComponent p),(yComponent p))

-- That might be too lazy, perhaps
{-
  xComponent (x,_) = realToFrac x
  yComponent (_,y) = realToFrac y
  fromCartesian2D (Cartesian2D x y) = (x,y)
-}
-- will be better

-- anyhow, maybe you need to inline all methods of Coord2D to get the rules
-- to fire:

class Coord2D a where
  {-# INLINE xComponent #-}
  xComponent :: (RealFloat b) => a -> b
  {-# INLINE yComponent #-}
  yComponent :: (RealFloat b) => a -> b
  {-# INLINE toCartesian2D #-}
  toCartesian2D :: (RealFloat b) => a -> Cartesian2D b
  toCartesian2D p = Cartesian2D (xComponent p) (yComponent p)
  {-# INLINE fromCartesian2D #-}
  fromCartesian2D :: (RealFloat b) => Cartesian2D b -> a

-- I'm rather convinced inlining the component functions will be good, but
-- there's a good chance that they're small enough to be inlined anyway.

-- The inlining of the to/fromCartesian2D functions is doubtful, because

--and this function allows conversion between coordinate representations
coordToCoord2D :: (Coord2D a, Coord2D b) => a -> b
coordToCoord2D = fromCartesian2D . toCartesian2D

-- cries loudly for

{-# RULES
"toCart/fromCart"   forall p. toCartesian2D (fromCartesian2D p) = p
  #-}

-- whenever that's possible

-- so, perhaps first try to rewrite, whenever that's possible, afterwards
-- inline, hence

-- {-# INLINE [2] toCartesian2D #-}
-- {-# INLINE [2] fromCartesian2D #-}
-- {-# RULES
-- "toCart/fromCart" [~2]   forall p. toCartesian2D (fromCartesian2D p) = p
--    #-}

-- dunno whether that works, but -ddump-simpl-stats should tell
============================================================

Finally, there's one other thing to try, with or without rules/inlining:

coordToVertex2 :: Coord2D a => a -> (GL.Vertex2  GL.GLclampf)
coordToVertex2 = coordToCoord2D

GLclampf is a newtype wrapper around a newtype wrapper around Float.
Coercing between newtype and original is supposed to be safe, so

import Unsafe.Coerce

floatToGLclampf :: Float -> GL.GLclampf
floatToGLclampf = unsafeCoerce

coordToVertex2 c =
  case coordToCoord2D c of
    (x,y) -> GL.Vertex2 (floatToGLclampf x) (floatToGLclampf y)

That way, we circumvent a potentially expensive call to
realToFrac :: a -> GLclampf
for a = Double or a = Float and split it into a no-op (unsafeCoerce) and a 
hopefully cheap conversion to Float.
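
To illustrate the no-op part in isolation, here is a small self-contained sketch. `InnerF` and `Clampf` are hypothetical stand-ins for the two newtype layers, not the real OpenGL types. If the constructors of the real types are exported, applying them directly is the safer spelling and costs nothing at runtime either.

```haskell
module Main where

import Unsafe.Coerce (unsafeCoerce)

-- Hypothetical two-layer newtype stack, mimicking a newtype around a
-- newtype around Float.
newtype InnerF = InnerF Float  deriving (Eq, Show)
newtype Clampf = Clampf InnerF deriving (Eq, Show)

-- Reinterprets the Float; only valid because both layers are newtypes
-- and therefore share Float's runtime representation.
viaCoerce :: Float -> Clampf
viaCoerce = unsafeCoerce

-- Same result with the type checker's blessing; needs the constructors.
viaConstructors :: Float -> Clampf
viaConstructors = Clampf . InnerF

main :: IO ()
main = print (viaCoerce 0.5 == viaConstructors 0.5)
```

The unsafeCoerce version silently breaks if one of the layers ever becomes a data type, so it is worth a comment in the real code.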

> >> And still ran faster than floatToFloat.  However there's no denying
> >> that floatToFloat runs *much* faster than realToFrac in the larger
> >> application.  Profiling shows floatToFloat taking about 50% of my CPU
> >
> > That's too much for my liking, a simple conversion from Double to
> > Float shouldn't take long, even if the Float is wrapped in newtypes
> > (after all, the newtypes don't exist at runtime).
>
> Agreed.  The rest of the application right now isn't doing a lot of work
> yet though-- I'm generating (pre-calculating, if Haskell is doing its
> job) a list of 360*180 points on a sphere and dumping that to OpenGL
> which should be doing most of the dirty work in hardware.  I'm not
> entirely sure why floatToFloat recalculates every iteration and isn't
> just cached,

Code? Maybe you have to give a name for it to be cached.
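
What I mean by giving it a name, as a rough sketch: bind the converted list at top level, so it is computed once on first use and shared by every later call of the callback. The grid, `spherePoints` and `frame` are invented for illustration, I haven't seen your code.

```haskell
module Main where

-- Hypothetical 360*180 grid of points on the unit sphere, already
-- converted to Float. As a top-level constant it is evaluated at most
-- once and then shared.
spherePoints :: [(Float, Float, Float)]
spherePoints =
  [ ( realToFrac (cos la * cos lo)
    , realToFrac (cos la * sin lo)
    , realToFrac (sin la) )
  | lonDeg <- [0 .. 359 :: Int]
  , latDeg <- [0 .. 179 :: Int]
  , let lo = deg lonDeg
  , let la = deg (latDeg - 90)
  ]
  where
    deg d = fromIntegral d * pi / 180 :: Double

-- Stand-in for a display callback: it only walks the shared list.
frame :: IO Int
frame = return $! length spherePoints

main :: IO ()
main = do
  n1 <- frame   -- first call forces the spine of the list
  n2 <- frame   -- later calls reuse it
  print (n1, n2)
```

The price is that the whole list stays in memory for the lifetime of the program, and it stops helping once the data become time-varying.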

> but I'm guessing it's because the floatToFloat is being
> done in an OpenGL callback within the IO monad.  Eventually I'll be
> providing time-varying data anyway, so the conversions will have to be
> continuously recalculated in the end.
>
> That comes out to 65000 conversions every 30ms, or about 2 million
> conversions a second.  I'd probably just leave it at that except, as
> you've demonstrated, there is at least a factor of 3 or 4 to be gained
> somehow-- realToFrac can provide it under the right conditions.
>
> >> {-# RULES
> >> "floatToFloat/id" floatToFloat=id
> >> "floatToFloat x2" floatToFloat . floatToFloat = floatToFloat
> >>  #-}

I'm not sure how the rule-spotting works with compositions, whether it 
matches `foo . bar' with `foo (bar x)' [one in the code, the other in the 
rule], it might be necessary to give the rule in both forms.
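
Giving the rule in both forms could look like the sketch below. `clamp01` is a made-up idempotent function standing in for floatToFloat, so that the rewrite is obviously valid. GHC may warn that the composition form might never fire because (.) can be inlined first.

```haskell
module Main where

-- Hypothetical idempotent function: clamps to the interval [0, 1].
clamp01 :: Double -> Double
clamp01 = max 0 . min 1
{-# NOINLINE [1] clamp01 #-}   -- keep it visible to the rules early on

{-# RULES
"clamp01/compose" clamp01 . clamp01 = clamp01
"clamp01/apply"   forall x. clamp01 (clamp01 x) = clamp01 x
  #-}

-- One use site per form.
twiceComposed :: Double -> Double
twiceComposed = clamp01 . clamp01

twiceApplied :: Double -> Double
twiceApplied x = clamp01 (clamp01 x)

main :: IO ()
main = print (twiceComposed 1.5, twiceApplied (-2))
```

Which of the two rules actually fires, if any, is something only -ddump-simpl-stats can tell.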

> >>
> >> Neither of which seems to fire in this application,
> >
> > GHC reports fired rules with -ddump-simpl-stats.
> > Getting rules to fire is a little brittle, GHC does not try too hard
> > to match expressions with rules, and if several rules match, it
> > chooses one arbitrarily, so your rules may have been missed because
> > the actual code looked different (perhaps because other rewrite rules
> > fired first).
>
> Yeah, I've been looking at the -ddump-simpl-stats output.  If I'm reading
> the documentation right, rules are enabled simply by invoking ghc with
> -O or -O2, right?

Right, -O implies -fenable-rewrite-rules (and hence -O2 too).
On the other hand, you can't have rewrite-rules without -O [that is, you 
can pass -fenable-rewrite-rules on the command line without -O, it will 
just have no effect]. Presumably the flag exists for its negation, so you 
can invoke GHC with -O -fno-enable-rewrite-rules to have the rules not 
firing.

> I'm now not convinced any of my rewrite rules are
> firing-- or at least I can't seem to get them to again.
>

If they fire, -ddump-simpl-stats tells you; the dump contains a piece like

9 RuleFired                 
    1 ==#->case             
    1 >#
    1 eftInt
    1 fold/build
    1 fromIntegral/Int->Double
    1 int2Float#
    1 realToFrac/Double->Float
    1 unpack
    1 unpack-list

If that piece of the dump lists the name of one of your rules, the rule 
fired as many times as the number in front of it; otherwise it didn't fire.

> No, that doesn't do it.  I tried a few variations on that and it always
> chokes on the => symbol or whatever other syntax I try to use.  The Num
> constraint was added because it was needed on related functions (3
> element vertices where the z was stuffed with 0, for example), so I got
> rid of those and the Num constraint.  Doesn't matter, the rule still
> doesn't fire...  =(

Might have been inlined before the rule got a chance to fire.
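
If that is what happens, phase annotations are the usual remedy: keep the function from being inlined while the rule is active. A minimal sketch, with `toFloat` as a made-up stand-in for your function (I don't know its actual definition):

```haskell
module Main where

import GHC.Float (double2Float)

-- Hypothetical slow generic definition, standing in for floatToFloat.
toFloat :: Double -> Float
toFloat = fromRational . toRational
{-# INLINE [1] toFloat #-}   -- not inlined before phase 1

-- Active only before phase 1, i.e. while toFloat is still un-inlined.
-- The two sides agree on finite values; NaN and infinities are not
-- handled sensibly by the Rational route anyway.
{-# RULES
"toFloat/double2Float" [~1] toFloat = double2Float
  #-}

convertAll :: [Double] -> [Float]
convertAll = map toFloat

-- Compile with: ghc -O2 -ddump-simpl-stats
-- and look for "toFloat/double2Float" among the fired rules.
main :: IO ()
main = print (convertAll [1, 0.25])
```

Phases count down (2, 1, 0), so INLINE [1] together with a rule marked [~1] gives the rule the earlier phases to itself.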

>
> Cheers--
>  Greg