Fwd: Proposal: Roundtrip serialization of Cmm (parser-compatible pretty-printer output)

Mon Jul 28 18:55:19 UTC 2025

Diego Antonio Rosario Palomino <diegorosario2013 at gmail.com> writes:

> ---------- Forwarded message ---------
> From: Diego Antonio Rosario Palomino <diegorosario2013 at gmail.com>
> Date: Mon, Jul 28, 2025 at 12:56 PM
> Subject: Re: Proposal: Roundtrip serialization of Cmm (parser-compatible
> pretty-printer output)
> To: Hécate <hecate at glitchbra.in>
>
>
> Hello all,
>
> Thank you for the thoughtful responses so far, and thank you Simon for
> summarizing Andreas's comments.
>
Hi Diego,

In the future it would make things easier if you could use one of the
common email quoting conventions (i.e. starting lines with >). It is
otherwise a bit hard to distinguish your replies from the questions
you are responding to.

> > *"Do you have any use-cases in mind? Suppose you were 100% successful —
> > would anyone use it?"*
>
> Yes — my mentor, *Csaba Hruska*, would. He's currently working on a custom
> STG optimizer that uses experimental techniques to enable whole-program
> optimizations for Haskell code. The intended pipeline is:
>
> > *GHC STG → custom optimizer → textual Cmm → code generation*
>
> However, the current *parseable* Cmm is not sufficient for his use case,
> because it *cannot represent everything the Cmm AST can express*.
>
> Beyond this specific use case, achieving *roundtrip serializability* for
> Cmm could make it a *viable alternative to LLVM* for Haskell projects.
> Native code generation via Cmm is much faster than through LLVM. And while
> outputting LLVM from Cmm currently produces *less performant* code than
> directly targeting LLVM, I believe the inefficiencies could be fixed
> relatively easily. Enabling such improvements is part of the motivation for
> my documentation work — to help developers understand and work with Cmm and
> its infrastructure.
>
> > *"You need a compelling reason to change the input language (understood by
> > the parser) since libraries may include .cmm files, which will break. (It'd
> > be interesting to audit Hackage to see how many libraries do include such
> > .cmm files.)"*
>
> To clarify, this proposal would *not* break backwards compatibility. There
> are two implementation paths:
>
>    1. Introduce a *second parser* that accepts a syntax 100% identical to the
>       pretty printer output.
>
>    2. Extend the *current parser* by adding a mode (or block) that uses a
>       distinct keyword (e.g., low_level_unwrapped) to indicate: "expect exact
>       syntax, no convenience fills."
>
> In either case, existing .cmm files would continue to be supported as-is.
> The current parser wouldn't need features removed or changed — the new
> syntax would *only add capabilities*.
>
Duplicating the parser seems like a very heavy cost to pay here. Do we
have a concrete list of places where the parsed grammar differs from
that which is produced? I feel it might be useful to get a sense of how
much divergence there is before we entertain such drastic steps.

> > *"It’s unclear from your example how those blocks would work exactly. Is
> > low_level_unwrapped a label? If so can we goto it? Is it a keyword?
> > Something else entirely?"* — Andreas
>
> Apologies for the confusion — I’m not well-versed in the formal terminology.
>
> To clarify: low_level_unwrapped (or very_low_level, or another name) would
> be a *keyword or syntactic construct* that tells the parser to interpret
> the contents of the block { ... } using a syntax *identical to what the
> pretty printer emits*.
>
> For example:
>
> function1 { }            // existing low-level syntax
> function2() { }          // existing high-level syntax
>
> very_low_level { ... }   // new mode: code with exact pretty-printed
>                          // syntax inside the block
>
> > *"Rather than change the language understood by the parser, would it not be
> > easier to change the language spat out by the pretty-printer to be
> > compatible with the parser?"*
>
> Unfortunately, that’s not a practical path forward.
>
> At the start of the project, Csaba (my mentor) recommended leaving the
> parser mostly untouched and focusing instead on extending the pretty
> printer. However, we’ve realized that the differences between the parser
> and the pretty printer are not trivial. The parser — even in its current
> “low-level” mode — *inserts inferred data* via convenience functions.
> It *abstracts part of the structure*, meaning we cannot fully recover
> the original Cmm ADT just by parsing.
>
Sure, but instead of adding a whole new branch to the grammar, why don't
we start by enumerating the specific places where the Cmm
parser elaborates? We can then introduce specific productions
to allow expression of those particular cases. Ideally the existing
productions would be special cases of the new, more expressive
productions.
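
To make the shape of that idea concrete, here is a minimal sketch in
Haskell. It is a toy model only: the types Width, Alignment, Load and
SurfaceLoad and the functions elaborate and render are hypothetical names
invented for illustration, not GHC's Cmm AST or parser. It assumes one
point of elaboration (an alignment that the parser fills in by default)
and shows an extended production whose omitted-annotation case behaves
exactly like the existing one.

```haskell
import Data.Maybe (fromMaybe)

-- Toy model only: none of these types are GHC's real Cmm definitions.
data Width = W8 | W32 | W64 deriving (Eq, Show)

data Alignment = Aligned | Unaligned deriving (Eq, Show)

-- Hypothetical AST node: every load carries an explicit alignment.
data Load = Load Width Alignment String deriving (Eq, Show)

-- Hypothetical surface syntax: the alignment annotation is optional.
-- 'Nothing' corresponds to the existing production, 'Just a' to the
-- new, more expressive one.
data SurfaceLoad = SurfaceLoad Width (Maybe Alignment) String
  deriving (Eq, Show)

-- The parser's elaboration step: fill in the default when the
-- annotation is absent, otherwise keep what was written.
elaborate :: SurfaceLoad -> Load
elaborate (SurfaceLoad w ma addr) = Load w (fromMaybe Aligned ma) addr

-- A printer that always emits the annotation, so no information in
-- the AST node is dropped.
render :: Load -> SurfaceLoad
render (Load w a addr) = SurfaceLoad w (Just a) addr
```

With this shape, elaborate (render l) == l holds for every l, and
existing input (the Nothing case) keeps its current meaning, so nothing
that parses today would change.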

> In other words, *modifying the pretty printer to match the parser would
> require it to lose information* — which I strongly oppose. If Cmm is
> generated programmatically, the pretty-printed version would lack
> structural information present in the internal data structure. And
> parseable Cmm would still be *incapable of expressing all features of the
> AST*.
>
> I hope that also addresses your concern, Hécate.
>
> This GSoC project runs until *November 10th*. I was granted extra time
> since, unlike most participants, I’m not working through summer vacation —
> I’m in the Southern Hemisphere.
>
> (Also, I realize I previously used the wrong project name in this thread —
> the correct title of my GSoC project is *“Documenting and improving Cmm.”*)
>
> Regarding the risk of *bitrot* in a new parser or new syntax mode: one
> possible mitigation would be to add *regression tests* that check whether
> parsing a file and pretty-printing it results in compatible output.
>
Yes, this would alert us to some cases of bitrot (specifically, those
cases that we think to test, although that set can be very large with
property testing). Nevertheless, fixing it still requires effort, and
maintenance effort is something that we must weigh.
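
For reference, the kind of check being discussed could look roughly like
the sketch below. It is written against a toy expression language, not
Cmm: Expr, pretty, parse, roundTrips and samples are all hypothetical
names, and the hand-written sample list stands in for what a
property-testing library would generate.

```haskell
import Data.Char (isDigit, isSpace)

-- Toy language standing in for Cmm; literals are assumed non-negative.
data Expr = Lit Int | Add Expr Expr | Mul Expr Expr
  deriving (Eq, Show)

-- Printer: fully parenthesised, so the output is unambiguous.
pretty :: Expr -> String
pretty (Lit n)   = show n
pretty (Add a b) = "(" ++ pretty a ++ " + " ++ pretty b ++ ")"
pretty (Mul a b) = "(" ++ pretty a ++ " * " ++ pretty b ++ ")"

-- Parser for exactly the syntax that 'pretty' emits; returns the
-- parsed expression together with the unconsumed input.
parseExpr :: String -> Maybe (Expr, String)
parseExpr s = case dropWhile isSpace s of
  '(' : rest -> do
    (a, r1)  <- parseExpr rest
    (op, r2) <- parseOp r1
    (b, r3)  <- parseExpr r2
    case dropWhile isSpace r3 of
      ')' : r4 -> Just (op a b, r4)
      _        -> Nothing
  cs@(c : _) | isDigit c ->
    let (ds, rest) = span isDigit cs in Just (Lit (read ds), rest)
  _ -> Nothing

parseOp :: String -> Maybe (Expr -> Expr -> Expr, String)
parseOp s = case dropWhile isSpace s of
  '+' : r -> Just (Add, r)
  '*' : r -> Just (Mul, r)
  _       -> Nothing

-- Succeeds only if the whole input is consumed.
parse :: String -> Maybe Expr
parse s = case parseExpr s of
  Just (e, rest) | all isSpace rest -> Just e
  _                                 -> Nothing

-- The round-trip property: parsing the printed form gives back the
-- original value.
roundTrips :: Expr -> Bool
roundTrips e = parse (pretty e) == Just e

-- Hand-picked cases; a generator would replace this list.
samples :: [Expr]
samples =
  [ Lit 0
  , Add (Lit 1) (Lit 2)
  , Mul (Add (Lit 1) (Lit 2)) (Lit 3)
  ]

-- Any sample listed here has been broken by a printer or parser change.
failures :: [Expr]
failures = filter (not . roundTrips) samples
```

A test along these lines tells us when the printer and parser drift
apart, but not how to fix the drift, which is the part that costs
maintenance time.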

> On a related note, I’ve noticed that *some Cmm examples in the
> documentation and even in source code comments are incorrect or outdated*.
> Part of my work includes identifying and correcting these inconsistencies.

That is great. Do open merge requests as you find these. It would be
great to get these into the tree now rather than build up a large
backlog for review at the end of the project.

Cheers,

- Ben
