Proposal: Roundtrip serialization of Cmm (parser-compatible pretty-printer output)

Mon Jul 28 00:16:04 UTC 2025

Hello GHC devs,

I'm currently working on Cmm documentation and tooling improvements as part
of my Google Summer of Code project. One of my core goals is to make Cmm
roundtrip serializable.

Right now, the in-memory Cmm data structure (generated programmatically,
e.g. from STG by GHC) can be pretty-printed, and Cmm source can also be
parsed. However, the pretty-printed output is not accepted by the parser:
we cannot take the output of the pretty printer and feed it directly back
into the parser.

Example:

Parseable version:

sum {
 cr:
  bits64 x;
  x = R1 + R2;
  R1 = x;
  jump %ENTRY_CODE(Sp(0))[R1];
}

Pretty-printed version:

sum() { // []
  { info_tbls: []
    stack_info: arg_space: 8
  }
  {offset
    cf: // global
      _ce::I64 = R1 + R2;
      R1 = _ce::I64;
      call (I64[Sp + 0 * 8])(R1) args: 8, res: 0, upd: 8;
  }
}

Another example:

Parseable version:

simple_sum_4 { // [R2, R1]
  cr: // global
    bits64 _cq;
    _cq = R2;
    bits64 _cp;
    _cp = R1;
    R1 = _cq + _cp;
    jump (bits64[Sp])[R1];
}

Pretty-printed version:

simple_sum_4() { // []
  { info_tbls: []
    stack_info: arg_space: 8
  }
  {offset
    cs: // global
      _cq::I64 = R2;
      _cr::I64 = R1;
      R1 = _cq::I64 + _cr::I64;
      call (I64[Sp])(R1) args: 8, res: 0, upd: 8;
  }
}

While it is possible to write parseable Cmm that resembles the
pretty-printed version (and hence the internal ADT), the two do not fully
match, mainly because the parser inserts inferred fields using convenience
functions.

Proposal:

To make roundtrip serialization possible, I propose supporting a new syntax
that matches the pretty printer output exactly.

There are a couple of design options:

   1. Create a separate parser that accepts the pretty-printed syntax. Files
      could then use either the current parser or the new strict one.

   2. Extend the current parser with a dedicated block syntax like:

low_level_unwrapped {
  ...
}

This second option is the one my mentor recommends, as it may better
reflect GHC developers' preferences. In this mode, the parser would not
insert any inferred data and would expect the input to match the
pretty-printed form exactly.

This would enable a true roundtrip:

   - Compile Haskell to Cmm (in-memory AST)
   - Pretty-print and write it to disk (wrapped in low_level_unwrapped { ... })
   - Later read it back using the parser and continue with codegen
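
To make the property concrete, here is a minimal sketch of what "roundtrip"
means, written against a toy AST rather than the real Cmm types. Everything
in it (Expr, Stmt, ppStmt, parseStmt, roundTrips) is hypothetical and only
illustrates the idea that parsing the printer's output should give back the
original value; it is not how GHC's Cmm parser or printer is implemented.
It uses only ReadP from base.

```haskell
import Data.Char (isAlpha, isAlphaNum, isDigit)
import Text.ParserCombinators.ReadP
  (ReadP, (+++), eof, munch, munch1, readP_to_S, satisfy, skipSpaces, string)

-- Toy stand-in for an AST (hypothetical; NOT the real GHC.Cmm types).
data Expr
  = Reg String      -- a register or local name, e.g. R1 or _ce
  | Lit Word        -- a non-negative literal
  | Add Expr Expr
  deriving (Eq, Show)

data Stmt = Assign String Expr
  deriving (Eq, Show)

-- Printer: every Add is fully parenthesised, so the output is unambiguous.
ppExpr :: Expr -> String
ppExpr (Reg r)   = r
ppExpr (Lit n)   = show n
ppExpr (Add a b) = "(" ++ ppExpr a ++ " + " ++ ppExpr b ++ ")"

ppStmt :: Stmt -> String
ppStmt (Assign r e) = r ++ " = " ++ ppExpr e ++ ";"

-- Parser: accepts exactly the printer's syntax and inserts nothing inferred.
lexeme :: ReadP a -> ReadP a
lexeme p = p <* skipSpaces

sym :: String -> ReadP String
sym = lexeme . string

name :: ReadP String
name = lexeme ((:) <$> satisfy isStart <*> munch isRest)
  where
    isStart c = isAlpha c || c == '_'
    isRest c  = isAlphaNum c || c == '_'

expr :: ReadP Expr
expr = lit +++ reg +++ add
  where
    lit = Lit . read <$> lexeme (munch1 isDigit)
    reg = Reg <$> name
    add = Add <$> (sym "(" *> expr) <*> (sym "+" *> expr) <* sym ")"

stmt :: ReadP Stmt
stmt = Assign <$> name <*> (sym "=" *> expr) <* sym ";"

parseStmt :: String -> Maybe Stmt
parseStmt s = case readP_to_S (skipSpaces *> stmt <* eof) s of
  [(x, "")] -> Just x
  _         -> Nothing

-- The roundtrip property: parsing the printed form recovers the same value.
-- It is only expected to hold when the names are valid identifiers.
roundTrips :: Stmt -> Bool
roundTrips s = parseStmt (ppStmt s) == Just s
```

Loaded in GHCi, roundTrips (Assign "R1" (Add (Reg "R1") (Reg "R2"))) should
hold. The analogous check for real Cmm is what currently fails, because the
printer emits syntax the parser does not accept.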

Optional future direction:

As a side note: currently the parser has both a “high-level” and a
“low-level” mode. The low-level mode resembles the AST more closely but
still inserts some inferred data.

If we introduce this new “exact” low-level form, it's possible the existing
low-level mode could become redundant. We might then have:

   - High-level syntax
   - New low-level (exact) syntax
   - The current low-level variant, possibly deprecated

I’d be interested in your thoughts on whether that direction makes sense.

Serialization libraries?

One technically possible, but likely unacceptable, alternative would be to
derive serialization via a library like aeson. That would enable
serializing and deserializing the Cmm AST directly. However, I understand
that aeson adds a large dependency footprint and likely wouldn't be
suitable for inclusion in GHC.
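
For readers unfamiliar with the idea, derived serialization looks roughly
like the sketch below. Since aeson is not part of base, the sketch swaps in
the derived Show and Read classes as a dependency-free stand-in; the types
and function names (Width, ToyExpr, encodeToy, decodeToy) are hypothetical
and are not the real Cmm AST. I am not suggesting Show/Read as the actual
mechanism, only illustrating what "derive it instead of writing a parser"
would mean.

```haskell
import Text.Read (readMaybe)

-- Toy stand-in types (hypothetical; NOT the real Cmm AST).
data Width = W32 | W64
  deriving (Eq, Show, Read)

data ToyExpr
  = ToyReg String
  | ToyLit Integer Width
  | ToyAdd ToyExpr ToyExpr
  deriving (Eq, Show, Read)

-- Serialization comes for free from the derived instances.
encodeToy :: ToyExpr -> String
encodeToy = show

-- readMaybe returns Nothing instead of throwing on malformed input.
decodeToy :: String -> Maybe ToyExpr
decodeToy = readMaybe
```

With derived instances, decodeToy (encodeToy e) == Just e is expected for
any such value, but the on-disk format is tied to the constructor names
rather than to Cmm syntax, so the result is not something the Cmm parser
could read.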

Final question:

Lastly, I've heard that parts of the Cmm pipeline may currently be
undergoing refactoring. If that's the case, could you point me to which
parts (parser, pretty printer, internal representation, etc.) are being
modified? I'd like to align my efforts accordingly and avoid conflicts.

Thanks very much for your time and input! I'm happy to iterate on this
based on your feedback.

Best regards,
Diego Antonio Rosario Palomino
GSoC 2025 – Cmm Documentation & Tooling