Fwd: Proposal: Roundtrip serialization of Cmm (parser-compatible pretty-printer output)
Hécate
hecate at glitchbra.in
Mon Jul 28 21:30:47 UTC 2025
Thanks a lot Diego, that indeed addresses my concerns. :)
On 28/07/2025 at 20:26, Diego Antonio Rosario Palomino wrote:
>
>
> ---------- Forwarded message ---------
> From: *Diego Antonio Rosario Palomino* <diegorosario2013 at gmail.com>
> Date: Mon, 28 Jul 2025 at 12:56 p.m.
> Subject: Re: Proposal: Roundtrip serialization of Cmm
> (parser-compatible pretty-printer output)
> To: Hécate <hecate at glitchbra.in>
>
>
> Hello all,
>
> Thank you for the thoughtful responses so far, and thank you Simon for
> summarizing Andreas's comments.
>
> /"Do you have any use-cases in mind? Suppose you were 100%
> successful — would anyone use it?"/
>
> Yes — my mentor, *Csaba Hruska*, would. He's currently working on a
> custom STG optimizer that uses experimental techniques to enable
> whole-program optimizations for Haskell code. The intended pipeline is:
>
> *GHC STG → custom optimizer → textual Cmm → code generation*
>
> However, the current /parseable/ Cmm is not sufficient for his use
> case, because it *cannot represent everything the Cmm AST can express*.
>
> Beyond this specific use case, achieving *roundtrip serializability*
> for Cmm could make it a *viable alternative to LLVM* for Haskell
> projects. Native code generation via Cmm is much faster than through
> LLVM. And while outputting LLVM from Cmm currently produces /less
> performant/ code than directly targeting LLVM, I believe the
> inefficiencies could be fixed relatively easily. Enabling such
> improvements is part of the motivation for my documentation work — to
> help developers understand and work with Cmm and its infrastructure.
>
> /"You need a compelling reason to change the input language
> (understood by the parser) since libraries may include .cmm files,
> which will break. (It'd be interesting to audit Hackage to see how
> many libraries do include such .cmm files.)"/
>
> To clarify, this proposal would *not* break backwards compatibility.
> There are two implementation paths:
>
> 1. Introduce a *second parser* that accepts a syntax 100% identical
>    to the pretty printer output.
>
> 2. Extend the *current parser* by adding a mode (or block) that uses
>    a distinct keyword (e.g., |low_level_unwrapped|) to indicate:
>    "expect exact syntax, no convenience fills."
>
> In either case, existing |.cmm| files would continue to be supported
> as-is. The current parser wouldn't need features removed or changed —
> the new syntax would *only add capabilities*.
>
> /"It’s unclear from your example how those blocks would work
> exactly. Is |low_level_unwrapped| a label? If so can we |goto| it?
> Is it a keyword? Something else entirely?"/ — Andreas
>
> Apologies for the confusion — I’m not well-versed in the formal
> terminology.
>
> To clarify: |low_level_unwrapped| (or |very_low_level|, or another
> name) would be a *keyword or syntactic construct* that tells the
> parser to interpret the contents of the block |{ ... }| using a syntax
> *identical to what the pretty printer emits*.
>
> For example:
>
> |function1 { }          // existing low-level syntax
> function2() { }         // existing high-level syntax
> very_low_level { ... }  // new mode: code with exact pretty-printed
>                         // syntax inside the block|
>
> /"Rather than change the language understood by the parser, would
> it not be easier to change the language spat out by the
> pretty-printer to be compatible with the parser?"/
>
> Unfortunately, that’s not a practical path forward.
>
> At the start of the project, Csaba (my mentor) recommended leaving the
> parser mostly untouched and focusing instead on extending the pretty
> printer. However, we’ve realized that the differences between the
> parser and the pretty printer are not trivial. The parser — even in
> its current “low-level” mode — *inserts inferred data* via convenience
> functions. It *abstracts part of the structure*, meaning we cannot
> fully recover the original Cmm ADT just by parsing.
>
> In other words, *modifying the pretty printer to match the parser
> would require it to /lose information/* — which I strongly oppose. If
> Cmm is generated programmatically, the pretty-printed version would
> lack structural information present in the internal data structure.
> And parseable Cmm would still be *incapable of expressing all features
> of the AST*.
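
The information-loss argument above can be illustrated with a minimal Haskell sketch. This is not GHC code: the |Call| record, the default values, and the functions |printConvenient|, |parseConvenient| and |survives| are all hypothetical, and only the field names are borrowed from the |args: 8, res: 0, upd: 8| annotation in the pretty-printed examples quoted later in this thread.

```haskell
-- Hypothetical toy node, NOT GHC's Cmm type. The field names mirror the
-- "args / res / upd" annotations seen in pretty-printed Cmm.
data Call = Call
  { target :: String
  , args   :: Int
  , res    :: Int
  , upd    :: Int
  } deriving (Eq, Show)

-- A "convenient" surface syntax that only writes down the call target.
printConvenient :: Call -> String
printConvenient = target

-- The matching parser has to infer the remaining fields.
-- The defaults 8 / 0 / 8 are assumptions made for this illustration.
parseConvenient :: String -> Call
parseConvenient t = Call { target = t, args = 8, res = 0, upd = 8 }

-- Does a value survive printing followed by parsing?
survives :: Call -> Bool
survives c = parseConvenient (printConvenient c) == c
```

Under these assumptions, |survives (Call "f" 8 0 8)| holds because the value happens to equal the inferred defaults, while |survives (Call "f" 16 0 8)| does not: the syntax has no way to write down the non-default field.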
>
> I hope that also addresses your concern, Hécate.
>
> This GSoC project runs until *November 10th*. I was granted extra time
> since, unlike most participants, I’m not working through summer
> vacation — I’m in the Southern Hemisphere.
>
> (Also, I realize I previously used the wrong project name in this
> thread — the correct title of my GSoC project is *“Documenting and
> improving Cmm.”*)
>
> Regarding the risk of *bitrot* in a new parser or new syntax mode: one
> possible mitigation would be to add *regression tests* that check
> whether parsing a file and pretty-printing it results in compatible
> output.
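
A regression check of this kind could look roughly like the following Haskell sketch. It assumes a toy expression type in place of the real Cmm AST; |Expr|, |pretty|, |parse|, |roundtripsAst|, |roundtripsText| and |samples| are hypothetical names and do not correspond to any GHC API.

```haskell
import Data.Char (isDigit)
import Data.List (stripPrefix)
import Numeric.Natural (Natural)

-- Hypothetical toy AST standing in for the Cmm AST (NOT GHC's type).
data Expr
  = Reg Int          -- a register such as R1
  | Lit Natural      -- a non-negative integer literal
  | Add Expr Expr    -- addition of two sub-expressions
  deriving (Eq, Show)

-- Exact printer: every Add is fully parenthesised, so no structure is lost.
pretty :: Expr -> String
pretty (Reg n)   = "R" ++ show n
pretty (Lit i)   = show i
pretty (Add a b) = "(" ++ pretty a ++ " + " ++ pretty b ++ ")"

-- Strict parser: accepts exactly what 'pretty' emits and nothing else.
-- Returns the parsed expression together with the unconsumed input.
parseExpr :: String -> Maybe (Expr, String)
parseExpr ('R' : rest) =
  case span isDigit rest of
    ([], _)     -> Nothing
    (ds, rest') -> Just (Reg (read ds), rest')
parseExpr ('(' : rest) = do
  (a, r1) <- parseExpr rest
  r2      <- stripPrefix " + " r1
  (b, r3) <- parseExpr r2
  r4      <- stripPrefix ")" r3
  Just (Add a b, r4)
parseExpr s =
  case span isDigit s of
    ([], _)     -> Nothing
    (ds, rest') -> Just (Lit (read ds), rest')

-- A parse succeeds only if the whole input is consumed.
parse :: String -> Maybe Expr
parse s = case parseExpr s of
  Just (e, "") -> Just e
  _            -> Nothing

-- AST -> text -> AST gives back the same AST.
roundtripsAst :: Expr -> Bool
roundtripsAst e = parse (pretty e) == Just e

-- text -> AST -> text gives back the same text.
roundtripsText :: String -> Bool
roundtripsText s = fmap pretty (parse s) == Just s

-- A few hand-written cases a regression test could iterate over.
samples :: [Expr]
samples =
  [ Reg 1
  , Lit 8
  , Add (Reg 1) (Reg 2)
  , Add (Add (Reg 1) (Lit 0)) (Reg 2)
  ]
```

Loading this in GHCi and evaluating |all roundtripsAst samples| exercises the AST direction; |roundtripsText| covers the file direction mentioned above and rejects input such as |R1 + R2| that is not in the exact printed form.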
>
> On a related note, I’ve noticed that *some Cmm examples in the
> documentation and even in source code comments are incorrect or
> outdated*. Part of my work includes identifying and correcting these
> inconsistencies.
>
> Thanks again to everyone for your time and input — I greatly
> appreciate the discussion and feedback.
>
> Best regards,
> *Diego Antonio Rosario Palomino*
> GSoC 2025 – Documenting and improving Cmm
>
>
> On Mon, 28 Jul 2025 at 11:04 a.m., Hécate via ghc-devs
> (ghc-devs at haskell.org) wrote:
>
> Hi Diego,
>
> Thank you very much for your work in this direction, it's sorely
> needed.
>
> I'm all for having proper roundtrip correctness for Cmm, but I am
> not sure altering the parser is the way to go.
> In my opinion, GHC should produce valid textual Cmm, that can be
> ingested by the parser as it is today.
>
> Have a nice day,
> Hécate
>
> On 28/07/2025 at 02:16, Diego Antonio Rosario Palomino wrote:
>>
>> Hello GHC devs,
>>
>> I'm currently working on Cmm documentation and tooling
>> improvements as part of my Google Summer of Code project. One of
>> my core goals is to make Cmm roundtrip serializable.
>>
>> Right now, the in-memory Cmm data structure—generated
>> programmatically (e.g., from STG via GHC)—can be pretty-printed,
>> and Cmm can also be parsed. However, the pretty-printed version
>> is not compatible with the parser. That is, we cannot take the
>> output of the pretty printer and feed it directly back into the
>> parser.
>>
>> Example:
>>
>> Parseable version:
>>
>> |sum {
>>   cr:
>>     bits64 x;
>>     x = R1 + R2;
>>     R1 = x;
>>     jump %ENTRY_CODE(Sp(0))[R1];
>> }|
>>
>> Pretty-printed version:
>>
>> |sum() { // []
>>         { info_tbls: []
>>           stack_info: arg_space: 8
>>         }
>>     {offset
>>       cf: // global
>>           _ce::I64 = R1 + R2;
>>           R1 = _ce::I64;
>>           call (I64[Sp + 0 * 8])(R1) args: 8, res: 0, upd: 8;
>>     }
>> }|
>>
>> Another example:
>>
>> Parseable version:
>>
>> |simple_sum_4 { // [R2, R1]
>>   cr: // global
>>     bits64 _cq;
>>     _cq = R2;
>>     bits64 _cp;
>>     _cp = R1;
>>     R1 = _cq + _cp;
>>     jump (bits64[Sp])[R1];
>> }|
>>
>> Pretty-printed version:
>>
>> |simple_sum_4() { // []
>>         { info_tbls: []
>>           stack_info: arg_space: 8
>>         }
>>     {offset
>>       cs: // global
>>           _cq::I64 = R2;
>>           _cr::I64 = R1;
>>           R1 = _cq::I64 + _cr::I64;
>>           call (I64[Sp])(R1) args: 8, res: 0, upd: 8;
>>     }
>> }|
>>
>> While it’s possible to write parseable Cmm that resembles the
>> pretty-printed version (and hence the internal ADT), they don’t
>> fully match—mainly because the parser inserts inferred fields
>> using convenience functions.
>>
>> Proposal:
>>
>> To make roundtrip serialization possible, I propose supporting a
>> new syntax that matches the pretty printer output exactly.
>>
>> There are a couple of design options:
>>
>> 1. Create a separate parser that accepts the pretty-printed
>>    syntax. Files could then use either the current parser or the
>>    new strict one.
>>
>> 2. Extend the current parser with a dedicated block syntax like:
>>
>>    |low_level_unwrapped { ... }|
>>
>> This second option is the one my mentor recommends, as it may
>> better reflect GHC developers' preferences. In this mode, the
>> parser would not insert any inferred data and would expect the
>> input to match the pretty-printed form exactly.
>>
>> This would enable a true roundtrip:
>>
>> * Compile Haskell to Cmm (in-memory AST)
>> * Pretty-print and write it to disk (wrapped in
>>   low_level_unwrapped { ... })
>> * Later read it back using the parser and continue with codegen
>>
>> Optional future direction:
>>
>> As a side note: currently the parser has both a “high-level” and
>> a “low-level” mode. The low-level mode resembles the AST more
>> closely but still inserts some inferred data.
>>
>> If we introduce this new “exact” low-level form, it's possible
>> the existing low-level mode could become redundant. We might then
>> have:
>>
>> * High-level syntax
>> * New low-level (exact)
>> * And possibly deprecate the current low-level variant
>>
>> I’d be interested in your thoughts on whether that direction
>> makes sense.
>>
>> Serialization libraries?
>>
>> One technically possible—but likely unacceptable—alternative
>> would be to derive serialization via a library like |aeson|. That
>> would enable serializing and deserializing the Cmm AST directly.
>> However, I understand that |aeson| adds a large dependency
>> footprint, and likely wouldn't be suitable for inclusion in GHC.
>>
>> Final question:
>>
>> Lastly—I’ve heard that parts of the Cmm pipeline may currently be
>> under refactoring. If that’s the case, could you point me to
>> which parts (parser, pretty printer, internal representation,
>> etc.) are being modified? I’d like to align my efforts
>> accordingly and avoid conflicts.
>>
>> Thanks very much for your time and input! I'm happy to iterate on
>> this based on your feedback.
>>
>> Best regards,
>> Diego Antonio Rosario Palomino
>> GSoC 2025 – Cmm Documentation & Tooling
>>
>>
>> _______________________________________________
>> ghc-devs mailing list
>> ghc-devs at haskell.org
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
> --
> Hécate ✨
> 🐦: @TechnoEmpress
> IRC: Hecate
> WWW:https://glitchbra.in
> RUN: BSD
>
--
Hécate ✨
🐦: @TechnoEmpress
IRC: Hecate
WWW:https://glitchbra.in
RUN: BSD