Fwd: Proposal: Roundtrip serialization of Cmm (parser-compatible pretty-printer output)

Mon Jul 28 21:30:47 UTC 2025

Thanks a lot Diego, that indeed addresses my concerns. :)

On 28/07/2025 at 20:26, Diego Antonio Rosario Palomino wrote:
>
>
> ---------- Forwarded message ---------
> From: *Diego Antonio Rosario Palomino* <diegorosario2013 at gmail.com>
> Date: Mon, 28 Jul 2025 at 12:56 p.m.
> Subject: Re: Proposal: Roundtrip serialization of Cmm 
> (parser-compatible pretty-printer output)
> To: Hécate <hecate at glitchbra.in>
>
>
> Hello all,
>
> Thank you for the thoughtful responses so far, and thank you Simon for 
> summarizing Andreas's comments.
>
>     /"Do you have any use-cases in mind? Suppose you were 100%
>     successful — would anyone use it?"/
>
> Yes — my mentor, *Csaba Hruska*, would. He's currently working on a 
> custom STG optimizer that uses experimental techniques to enable 
> whole-program optimizations for Haskell code. The intended pipeline is:
>
> *GHC STG → custom optimizer → textual Cmm → code generation*
>
> However, the current /parseable/ Cmm is not sufficient for his use 
> case, because it *cannot represent everything the Cmm AST can express*.
>
> Beyond this specific use case, achieving *roundtrip serializability* 
> for Cmm could make it a *viable alternative to LLVM* for Haskell 
> projects. Native code generation via Cmm is much faster than through 
> LLVM. And while outputting LLVM from Cmm currently produces /less 
> performant/ code than directly targeting LLVM, I believe the 
> inefficiencies could be fixed relatively easily. Enabling such 
> improvements is part of the motivation for my documentation work — to 
> help developers understand and work with Cmm and its infrastructure.
>
>     /"You need a compelling reason to change the input language
>     (understood by the parser) since libraries may include .cmm files,
>     which will break. (It'd be interesting to audit Hackage to see how
>     many libraries do include such .cmm files.)"/
>
> To clarify, this proposal would *not* break backwards compatibility. 
> There are two implementation paths:
>
> 1. Introduce a *second parser* that accepts a syntax 100% identical
>    to the pretty printer output.
>
> 2. Extend the *current parser* by adding a mode (or block) that uses
>    a distinct keyword (e.g., |low_level_unwrapped|) to indicate:
>    "expect exact syntax, no convenience fills."
>
> In either case, existing |.cmm| files would continue to be supported 
> as-is. The current parser wouldn't need features removed or changed — 
> the new syntax would *only add capabilities*.
>
>     /"It’s unclear from your example how those blocks would work
>     exactly. Is |low_level_unwrapped| a label? If so can we |goto| it?
>     Is it a keyword? Something else entirely?"/ — Andreas
>
> Apologies for the confusion — I’m not well-versed in the formal 
> terminology.
>
> To clarify: |low_level_unwrapped| (or |very_low_level|, or another 
> name) would be a *keyword or syntactic construct* that tells the 
> parser to interpret the contents of the block |{ ... }| using a syntax 
> *identical to what the pretty printer emits*.
>
> For example:
>
>     function1 { }           // existing low-level syntax
>     function2() { }         // existing high-level syntax
>     very_low_level { ... }  // new mode: code with exact
>                             // pretty-printed syntax inside the block
>
>     /"Rather than change the language understood by the parser, would
>     it not be easier to change the language spat out by the
>     pretty-printer to be compatible with the parser?"/
>
> Unfortunately, that’s not a practical path forward.
>
> At the start of the project, Csaba (my mentor) recommended leaving the 
> parser mostly untouched and focusing instead on extending the pretty 
> printer. However, we’ve realized that the differences between the 
> parser and the pretty printer are not trivial. The parser — even in 
> its current “low-level” mode — *inserts inferred data* via convenience 
> functions. It *abstracts part of the structure*, meaning we cannot 
> fully recover the original Cmm ADT just by parsing.
>
> In other words, *modifying the pretty printer to match the parser 
> would require it to /lose information/* — which I strongly oppose. If 
> Cmm is generated programmatically, the pretty-printed version would 
> lack structural information present in the internal data structure. 
> And parseable Cmm would still be *incapable of expressing all features 
> of the AST*.
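
The information-loss argument above can be illustrated with a small toy
model. This is only a sketch in Python, not GHC code: the Call record,
print_exact, print_lossy and parse_lossy are hypothetical names, and the
assumption that args/res/upd are the fields a convenience parser would
infer is taken from the examples in this thread, not from the GHC sources.

```python
from dataclasses import dataclass


# Toy stand-in for a Cmm call node. The field names echo the dump output
# quoted in this thread; this is NOT GHC's real data type.
@dataclass(frozen=True)
class Call:
    target: str
    args: int
    res: int
    upd: int


def print_exact(c: Call) -> str:
    # Prints every field, like the pretty-printer output quoted in the thread.
    return f"call {c.target} args: {c.args}, res: {c.res}, upd: {c.upd};"


def print_lossy(c: Call) -> str:
    # Prints only what a convenience-filling parser would accept.
    return f"jump {c.target};"


def parse_lossy(s: str) -> Call:
    # Hypothetical convenience parser: the numeric fields are inferred
    # (hard-coded here), not read from the input.
    if not (s.startswith("jump ") and s.endswith(";")):
        raise ValueError(f"cannot parse: {s!r}")
    target = s[len("jump "):-1]
    return Call(target, args=8, res=0, upd=8)


a = Call("(I64[Sp])", args=8, res=0, upd=8)
b = Call("(I64[Sp])", args=16, res=0, upd=16)

# Two different nodes collapse to the same text in the lossy syntax ...
assert print_lossy(a) == print_lossy(b)
# ... so only the one that happens to match the inferred values survives.
assert parse_lossy(print_lossy(a)) == a
assert parse_lossy(print_lossy(b)) != b
# The exact syntax keeps them apart.
assert print_exact(a) != print_exact(b)
```

In this toy, any node whose fields differ from what the parser infers cannot
be written down in the lossy syntax at all, which is the situation described
above.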
>
> I hope that also addresses your concern, Hécate.
>
> This GSoC project runs until *November 10th*. I was granted extra time 
> since, unlike most participants, I’m not working through summer 
> vacation — I’m in the Southern Hemisphere.
>
> (Also, I realize I previously used the wrong project name in this 
> thread — the correct title of my GSoC project is *“Documenting and 
> improving Cmm.”*)
>
> Regarding the risk of *bitrot* in a new parser or new syntax mode: one 
> possible mitigation would be to add *regression tests* that check 
> whether parsing a file and pretty-printing it results in compatible 
> output.
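
As a rough illustration of such a regression test, here is a minimal sketch
in Python over a toy statement language. It is not tied to GHC's test suite
or to the real Cmm grammar: Assign, pretty, parse and roundtrips are
hypothetical names chosen for the example.

```python
from dataclasses import dataclass


# Toy AST: a single assignment statement "dest = src;".
@dataclass(frozen=True)
class Assign:
    dest: str
    src: str


def pretty(stmts):
    # One statement per line, in a fixed canonical layout.
    return "".join(f"{s.dest} = {s.src};\n" for s in stmts)


def parse(text):
    # Accepts exactly the layout emitted by pretty (blank lines are skipped).
    stmts = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if not line.endswith(";") or " = " not in line:
            raise ValueError(f"cannot parse: {line!r}")
        dest, src = line[:-1].split(" = ", 1)
        stmts.append(Assign(dest, src))
    return stmts


def roundtrips(text):
    # Parsing, pretty-printing and parsing again must give the same AST.
    ast = parse(text)
    return parse(pretty(ast)) == ast


sample = "_cq::I64 = R2;\n_cr::I64 = R1;\nR1 = _cq::I64 + _cr::I64;\n"
assert roundtrips(sample)
# For input already in canonical layout the text itself is reproduced.
assert pretty(parse(sample)) == sample
```

A real test would run the same two checks over dumped .cmm files, so that a
change to either the printer or the parser that breaks compatibility fails
immediately.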
>
> On a related note, I’ve noticed that *some Cmm examples in the 
> documentation and even in source code comments are incorrect or 
> outdated*. Part of my work includes identifying and correcting these 
> inconsistencies.
>
> Thanks again to everyone for your time and input — I greatly 
> appreciate the discussion and feedback.
>
> Best regards,
> *Diego Antonio Rosario Palomino*
> GSoC 2025 – Documenting and improving Cmm
>
>
> On Mon, 28 Jul 2025 at 11:04 a.m., Hécate via ghc-devs 
> (ghc-devs at haskell.org) wrote:
>
>     Hi Diego,
>
>     Thank you very much for your work in this direction, it's sorely
>     needed.
>
>     I'm all for having proper roundtrip correctness for Cmm, but I am
>     not sure altering the parser is the way to go.
>     In my opinion, GHC should produce valid textual Cmm, that can be
>     ingested by the parser as it is today.
>
>     Have a nice day,
>     Hécate
>
>     On 28/07/2025 at 02:16, Diego Antonio Rosario Palomino wrote:
>>
>>     Hello GHC devs,
>>
>>     I'm currently working on Cmm documentation and tooling
>>     improvements as part of my Google Summer of Code project. One of
>>     my core goals is to make Cmm roundtrip serializable.
>>
>>     Right now, the in-memory Cmm data structure—generated
>>     programmatically (e.g., from STG via GHC)—can be pretty-printed,
>>     and Cmm can also be parsed. However, the pretty-printed version
>>     is not compatible with the parser. That is, we cannot take the
>>     output of the pretty printer and feed it directly back into the
>>     parser.
>>
>>     Example:
>>
>>     Parseable version:
>>
>>         sum {
>>           cr:
>>             bits64 x;
>>             x = R1 + R2;
>>             R1 = x;
>>             jump %ENTRY_CODE(Sp(0))[R1];
>>         }
>>
>>     Pretty-printed version:
>>
>>         sum() { // []
>>                 { info_tbls: []
>>                   stack_info: arg_space: 8
>>                 }
>>             {offset
>>               cf: // global
>>                   _ce::I64 = R1 + R2;
>>                   R1 = _ce::I64;
>>                   call (I64[Sp + 0 * 8])(R1) args: 8, res: 0, upd: 8;
>>             }
>>         }
>>
>>     Another example:
>>
>>     Parseable version:
>>
>>         simple_sum_4 { // [R2, R1]
>>           cr: // global
>>             bits64 _cq;
>>             _cq = R2;
>>             bits64 _cp;
>>             _cp = R1;
>>             R1 = _cq + _cp;
>>             jump (bits64[Sp])[R1];
>>         }
>>
>>     Pretty-printed version:
>>
>>         simple_sum_4() { // []
>>                 { info_tbls: []
>>                   stack_info: arg_space: 8
>>                 }
>>             {offset
>>               cs: // global
>>                   _cq::I64 = R2;
>>                   _cr::I64 = R1;
>>                   R1 = _cq::I64 + _cr::I64;
>>                   call (I64[Sp])(R1) args: 8, res: 0, upd: 8;
>>             }
>>         }
>>
>>     While it’s possible to write parseable Cmm that resembles the
>>     pretty-printed version (and hence the internal ADT), they don’t
>>     fully match—mainly because the parser inserts inferred fields
>>     using convenience functions.
>>
>>     Proposal:
>>
>>     To make roundtrip serialization possible, I propose supporting a
>>     new syntax that matches the pretty printer output exactly.
>>
>>     There are a couple of design options:
>>
>>     1. Create a separate parser that accepts the pretty-printed
>>        syntax. Files could then use either the current parser or the
>>        new strict one.
>>
>>     2. Extend the current parser with a dedicated block syntax like:
>>
>>            low_level_unwrapped { ... }
>>
>>     This second option is the one my mentor recommends, as it may
>>     better reflect GHC developers' preferences. In this mode, the
>>     parser would not insert any inferred data and would expect the
>>     input to match the pretty-printed form exactly.
>>
>>     This would enable a true roundtrip:
>>
>>      * Compile Haskell to Cmm (in-memory AST)
>>      * Pretty-print and write it to disk (wrapped in
>>        low_level_unwrapped { ... })
>>      * Later read it back using the parser and continue with codegen
>>
>>     Optional future direction:
>>
>>     As a side note: currently the parser has both a “high-level” and
>>     a “low-level” mode. The low-level mode resembles the AST more
>>     closely but still inserts some inferred data.
>>
>>     If we introduce this new “exact” low-level form, it's possible
>>     the existing low-level mode could become redundant. We might then
>>     have:
>>
>>      * High-level syntax
>>      * New low-level (exact)
>>      * And possibly deprecate the current low-level variant
>>
>>     I’d be interested in your thoughts on whether that direction
>>     makes sense.
>>
>>     Serialization libraries?
>>
>>     One technically possible—but likely unacceptable—alternative
>>     would be to derive serialization via a library like |aeson|. That
>>     would enable serializing and deserializing the Cmm AST directly.
>>     However, I understand that |aeson| adds a large dependency
>>     footprint, and likely wouldn't be suitable for inclusion in GHC.
>>
>>     Final question:
>>
>>     Lastly—I’ve heard that parts of the Cmm pipeline may currently be
>>     under refactoring. If that’s the case, could you point me to
>>     which parts (parser, pretty printer, internal representation,
>>     etc.) are being modified? I’d like to align my efforts
>>     accordingly and avoid conflicts.
>>
>>     Thanks very much for your time and input! I'm happy to iterate on
>>     this based on your feedback.
>>
>>     Best regards,
>>     Diego Antonio Rosario Palomino
>>     GSoC 2025 – Cmm Documentation & Tooling
>>
>>
>>     _______________________________________________
>>     ghc-devs mailing list
>>     ghc-devs at haskell.org
>>     http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
>     -- 
>     Hécate ✨
>     🐦: @TechnoEmpress
>     IRC: Hecate
>     WWW:https://glitchbra.in
>     RUN: BSD

-- 
Hécate ✨
🐦: @TechnoEmpress
IRC: Hecate
WWW:https://glitchbra.in
RUN: BSD