<div dir="ltr"><br><br><div class="gmail_quote gmail_quote_container"><div dir="ltr" class="gmail_attr">---------- Forwarded message ---------<br>From: <b class="gmail_sendername" dir="auto">Diego Antonio Rosario Palomino</b> <span dir="auto">&lt;<a href="mailto:diegorosario2013@gmail.com">diegorosario2013@gmail.com</a>&gt;</span><br>Date: Mon, 28 Jul 2025 at 12:56 PM<br>Subject: Re: Proposal: Roundtrip serialization of Cmm (parser-compatible pretty-printer output)<br>To: Hécate &lt;<a href="mailto:hecate@glitchbra.in">hecate@glitchbra.in</a>&gt;<br></div><br><br><div dir="ltr"><p>Hello all,</p>

<p>Thank you for the thoughtful responses so far, and thank you Simon for summarizing Andreas's comments.</p>

<blockquote>

<p><i>"Do you have any use-cases in mind? Suppose you were 100% successful — would anyone use it?"</i></p>

</blockquote>

<p>Yes — my mentor, <b>Csaba Hruska</b>, would. He's currently working on a custom STG optimizer that uses experimental techniques to enable whole-program optimizations for Haskell code. The intended pipeline is:</p>

<p><b>GHC STG → custom optimizer → textual Cmm → code generation</b></p>

<p>However, the current <i>parseable</i> Cmm is not sufficient for his use case, because it <b>cannot represent everything the Cmm AST can express</b>.</p>

<p>Beyond this specific use case, achieving <b>roundtrip serializability</b> for Cmm could make it a <b>viable alternative to LLVM</b> for Haskell projects. Native code generation via Cmm is much faster than through LLVM. And while outputting LLVM from Cmm currently produces <i>less performant</i> code than directly targeting LLVM, I believe the inefficiencies could be fixed relatively easily. Enabling such improvements is part of the motivation for my documentation work, which aims to help developers understand and work with Cmm and its infrastructure.</p>

<blockquote>

<p><i>"You need a compelling reason to change the input language (understood by the parser) since libraries may include .cmm files, which will break. (It'd be interesting to audit Hackage to see how many libraries do include such .cmm files.)"</i></p>

</blockquote>

<p>To clarify, this proposal would <b>not</b> break backwards compatibility. There are two implementation paths:</p>

<ol><li>

<p>Introduce a <b>second parser</b> that accepts a syntax 100% identical to the pretty printer output.</p>

</li><li>

<p>Extend the <b>current parser</b> by adding a mode (or block) introduced by a distinct keyword (e.g., <code>low_level_unwrapped</code>) that tells it: "expect exact syntax, and do not fill in any inferred fields."</p>

</li></ol>

<p>In either case, existing <code>.cmm</code> files would continue to be supported as-is. The current parser wouldn't need features removed or changed — the new syntax would <b>only add capabilities</b>.</p>

<blockquote>

<p><i>"It’s unclear from your example how those blocks would work exactly. Is <code>low_level_unwrapped</code> a label? If so can we <code>goto</code> it? Is it a keyword? Something else entirely?"</i> — Andreas</p>

</blockquote>

<p>Apologies for the confusion — I’m not well-versed in the formal terminology.</p>

<p>To clarify: <code>low_level_unwrapped</code> (or <code>very_low_level</code>, or another name) would be a <b>keyword or syntactic construct</b> that tells the parser to interpret the contents of the block <code>{ ... }</code> using a syntax <b>identical to what the pretty printer emits</b>.</p>

<p>For example:</p>

<pre><code>function1 { }            // existing low-level syntax

function2() { }          // existing high-level syntax

very_low_level { ... }   // new mode: code with exact pretty-printed syntax inside the block

</code></pre>

<blockquote>

<p><i>"Rather than change the language understood by the parser, would it not be easier to change the language spat out by the pretty-printer to be compatible with the parser?"</i></p>

</blockquote>

<p>Unfortunately, that’s not a practical path forward.</p>

<p>At the start of the project, Csaba (my mentor) recommended leaving the parser mostly untouched and focusing instead on extending the pretty printer. However, we’ve realized that the differences between the parser and the pretty printer are not trivial. The parser — even in its current “low-level” mode — <b>inserts inferred data</b> via convenience functions. It <b>abstracts part of the structure</b>, meaning we cannot fully recover the original Cmm ADT just by parsing.</p>

<p>In other words, <b>modifying the pretty printer to match the parser would require it to <i>lose information</i></b> — which I strongly oppose. If Cmm is generated programmatically, the pretty-printed version would lack structural information present in the internal data structure. And parseable Cmm would still be <b>incapable of expressing all features of the AST</b>.</p>
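
<p>A minimal sketch of that information loss, written as a toy Haskell model. The types <code>Call</code> and <code>Surface</code> and the functions <code>elaborate</code> and <code>forget</code> are hypothetical names made up for illustration; they are not GHC's Cmm AST or its parser, and the default values are assumptions:</p>

<pre><code>-- Toy model only: these types are hypothetical and are NOT GHC's Cmm AST.
module Main where

-- A call node carrying stack-layout fields, loosely modelled on the
-- "args: 8, ... upd: 8" annotations seen in pretty-printed Cmm.
data Call = Call
  { target    :: String
  , argsBytes :: Int
  , updFrame  :: Int
  } deriving (Eq, Show)

-- A surface syntax that has no way to spell the layout fields.
newtype Surface = Jump String
  deriving (Eq, Show)

-- "Convenience" parser step: fills in the layout fields with fixed defaults.
elaborate :: Surface -&gt; Call
elaborate (Jump t) = Call { target = t, argsBytes = 8, updFrame = 8 }

-- A printer restricted to the surface syntax: it has to drop the fields.
forget :: Call -&gt; Surface
forget = Jump . target

main :: IO ()
main = do
  let generated = Call "ret" 16 24          -- e.g. produced by an optimizer
      reparsed  = elaborate (forget generated)
  print generated
  print reparsed
  putStrLn (if generated == reparsed
              then "roundtrip ok"
              else "roundtrip lost information")
</code></pre>

<p>In this sketch the roundtrip only succeeds when the generated node happens to use the same values the parser would have inferred; any other value is silently replaced.</p>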

<p>I hope that also addresses your concern, Hécate.</p>

<p>This GSoC project runs until <b>November 10th</b>. I was granted extra time since, unlike most participants, I’m not working through summer vacation — I’m in the Southern Hemisphere.</p>

<p>(Also, I realize I previously used the wrong project name in this thread — the correct title of my GSoC project is <b>“Documenting and improving Cmm.”</b>)</p>

<p>Regarding the risk of <b>bitrot</b> in a new parser or new syntax mode: one possible mitigation would be to add <b>regression tests</b> that check whether parsing a file and pretty-printing it results in compatible output.</p>
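
<p>Such a check could look roughly like the following sketch. It uses a hypothetical two-constructor statement type (<code>Stmt</code>) with a toy printer and parser, all invented for illustration; a real test would call GHC's actual Cmm parser and pretty printer instead, and would run inside the GHC testsuite:</p>

<pre><code>-- Toy roundtrip check; Stmt, pretty and parseStmt are hypothetical stand-ins.
module Main where

data Stmt
  = Assign String String   -- printed as: dst = src;
  | Goto String            -- printed as: goto lbl;
  deriving (Eq, Show)

pretty :: Stmt -&gt; String
pretty (Assign d s) = d ++ " = " ++ s ++ ";"
pretty (Goto l)     = "goto " ++ l ++ ";"

-- Exact-mode parser: accepts only what `pretty` emits and infers nothing.
-- (Toy limitation: names must not contain whitespace.)
parseStmt :: String -&gt; Maybe Stmt
parseStmt line = case words line of
  [d, "=", s] | Just s' &lt;- stripSemi s -&gt; Just (Assign d s')
  ["goto", l] | Just l' &lt;- stripSemi l -&gt; Just (Goto l')
  _                                      -&gt; Nothing
  where
    -- Remove a trailing ';', failing if it is missing or nothing precedes it.
    stripSemi w = case reverse w of
      (';' : rest) | not (null rest) -&gt; Just (reverse rest)
      _                              -&gt; Nothing

prettyProgram :: [Stmt] -&gt; String
prettyProgram = unlines . map pretty

parseProgram :: String -&gt; Maybe [Stmt]
parseProgram = traverse parseStmt . lines

-- AST-level property: parsing the printed program gives back the same AST.
roundTrips :: [Stmt] -&gt; Bool
roundTrips prog = parseProgram (prettyProgram prog) == Just prog

-- Text-level property: parsing a file and printing it reproduces the text.
stableText :: String -&gt; Bool
stableText src = fmap prettyProgram (parseProgram src) == Just src

main :: IO ()
main = do
  let prog = [Assign "_cq" "R2", Assign "R1" "_cq", Goto "cs"]
      src  = prettyProgram prog
  putStr src
  print (roundTrips prog)
  print (stableText src)
</code></pre>

<p>Both properties are worth checking: the first catches a parser that infers or drops fields, the second catches a printer whose output drifts away from what the parser accepts.</p>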

<p>On a related note, I’ve noticed that <b>some Cmm examples in the documentation and even in source code comments are incorrect or outdated</b>. Part of my work includes identifying and correcting these inconsistencies.</p>

<p>Thanks again to everyone for your time and input — I greatly appreciate the discussion and feedback.</p>

<p>Best regards,<br>

<b>Diego Antonio Rosario Palomino</b><br>

GSoC 2025 – Documenting and improving Cmm</p></div><br><div class="gmail_quote"><div dir="ltr" class="gmail_attr">On Mon, 28 Jul 2025 at 11:04 AM, Hécate via ghc-devs (<a href="mailto:ghc-devs@haskell.org" target="_blank">ghc-devs@haskell.org</a>) wrote:<br></div><blockquote class="gmail_quote" style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><u></u>

  <div>

    <p>Hi Diego,</p>

    <p>Thank you very much for your work in this direction, it's sorely

      needed.<br>

    </p>

    <p>I'm all for having proper roundtrip correctness for Cmm, but I am

      not sure altering the parser is the way to go.<br>

      In my opinion, GHC should produce valid textual Cmm that can be
      ingested by the parser as it is today.<br>

      <br>

      Have a nice day,<br>

      Hécate<br>

    </p>

    <div>On 28/07/2025 at 02:16, Diego Antonio Rosario Palomino wrote:<br>

    </div>

    <blockquote type="cite">

      <div dir="ltr">

        <p>Hello GHC devs,</p>

        <p>I'm currently working on Cmm documentation and tooling

          improvements as part of my Google Summer of Code project. One

          of my core goals is to make Cmm roundtrip serializable.</p>

        <p>Right now, the in-memory Cmm data structure—generated

          programmatically (e.g., from STG via GHC)—can be

          pretty-printed, and Cmm can also be parsed. However, the

          pretty-printed version is not compatible with the parser. That

          is, we cannot take the output of the pretty printer and feed

          it directly back into the parser.</p>

        <p>Example:</p>

        <p>Parseable version:</p>

        <pre><code>sum {
 cr:
  bits64 x;
  x = R1 + R2;
  R1 = x;
  jump %ENTRY_CODE(Sp(0))[R1];
}
</code></pre>

        <p>Pretty-printed version:</p>

        <pre><code>sum() { // []
  { info_tbls: []
    stack_info: arg_space: 8
  }
  {offset
    cf: // global
      _ce::I64 = R1 + R2;
      R1 = _ce::I64;
      call (I64[Sp + 0 * 8])(R1) args: 8, res: 0, upd: 8;
  }
}
</code></pre>

        <p>Another example:</p>

        <p>Parseable version:</p>

        <pre><code>simple_sum_4 { // [R2, R1]
  cr: // global
    bits64 _cq;
    _cq = R2;
    bits64 _cp;
    _cp = R1;
    R1 = _cq + _cp;
    jump (bits64[Sp])[R1];
}
</code></pre>

        <p>Pretty-printed version:</p>

        <pre><code>simple_sum_4() { // []
  { info_tbls: []
    stack_info: arg_space: 8
  }
  {offset
    cs: // global
      _cq::I64 = R2;
      _cr::I64 = R1;
      R1 = _cq::I64 + _cr::I64;
      call (I64[Sp])(R1) args: 8, res: 0, upd: 8;
  }
}
</code></pre>

        <p>While it’s possible to write parseable Cmm that resembles the

          pretty-printed version (and hence the internal ADT), they

          don’t fully match—mainly because the parser inserts inferred

          fields using convenience functions.</p>

        <p>Proposal:</p>

        <p>To make roundtrip serialization possible, I propose

          supporting a new syntax that matches the pretty printer output

          exactly.</p>

        <p>There are a couple of design options:</p>

        <ol>

          <li>

            <p>Create a separate parser that accepts the pretty-printed

              syntax. Files could then use either the current parser or

              the new strict one.</p>

          </li>

          <li>

            <p>Extend the current parser with a dedicated block syntax

              like:</p>

          </li>

        </ol>

        <pre><code>low_level_unwrapped {
  ...
}
</code></pre>

        <p>This second option is the one my mentor recommends, as it may

          better reflect GHC developers' preferences. In this mode, the

          parser would not insert any inferred data and would expect the

          input to match the pretty-printed form exactly.</p>

        <p>This would enable a true roundtrip:</p>

        <ul>

          <li>

            <p>Compile Haskell to Cmm (in-memory AST)</p>

          </li>

          <li>

            <p>Pretty-print and write it to disk (wrapped in

              low_level_unwrapped { ... })</p>

          </li>

          <li>

            <p>Later read it back using the parser and continue with

              codegen</p>

          </li>

        </ul>

        <p>Optional future direction:</p>

        <p>As a side note: currently the parser has both a “high-level”

          and a “low-level” mode. The low-level mode resembles the AST

          more closely but still inserts some inferred data.</p>

        <p>If we introduce this new “exact” low-level form, it's

          possible the existing low-level mode could become redundant.

          We might then have:</p>

        <ul>

          <li>

            <p>High-level syntax</p>

          </li>

          <li>

            <p>New low-level (exact)</p>

          </li>

          <li>

            <p>And possibly deprecate the current low-level variant</p>

          </li>

        </ul>

        <p>I’d be interested in your thoughts on whether that direction

          makes sense.</p>

        <p>Serialization libraries?</p>

        <p>One technically possible—but likely unacceptable—alternative

          would be to derive serialization via a library like <code>aeson</code>.

          That would enable serializing and deserializing the Cmm AST

          directly. However, I understand that <code>aeson</code> adds

          a large dependency footprint, and likely wouldn't be suitable

          for inclusion in GHC.</p>

        <p>Final question:</p>

        <p>Lastly—I’ve heard that parts of the Cmm pipeline may

          currently be under refactoring. If that’s the case, could you

          point me to which parts (parser, pretty printer, internal

          representation, etc.) are being modified? I’d like to align my

          efforts accordingly and avoid conflicts.</p>

        <p>Thanks very much for your time and input! I'm happy to

          iterate on this based on your feedback.</p>

        <p>Best regards,<br>

          Diego Antonio Rosario Palomino<br>

          GSoC 2025 – Cmm Documentation & Tooling</p>

      </div>

      <br>


      <pre>_______________________________________________

ghc-devs mailing list

<a href="mailto:ghc-devs@haskell.org" target="_blank">ghc-devs@haskell.org</a>

<a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs</a>

</pre>

    </blockquote>

    <pre cols="72">-- 
Hécate ✨
🐦: @TechnoEmpress
IRC: Hecate
WWW: <a href="https://glitchbra.in" target="_blank">https://glitchbra.in</a>
RUN: BSD</pre>

  </div>


</blockquote></div>

</div></div>