<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Thanks a lot Diego, that indeed addresses my concerns. :) <br>
    </p>
    <div class="moz-cite-prefix">On 28/07/2025 at 20:26, Diego Antonio
      Rosario Palomino wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CAONcbWKcTEjv6HskEyWRjjXHYhuXyDuFMkFwWmS9n84bSDmjvQ@mail.gmail.com">
      <div dir="ltr"><br>
        <br>
        <div class="gmail_quote gmail_quote_container">
          <div dir="ltr" class="gmail_attr">---------- Forwarded message
            ---------<br>
            From: <b class="gmail_sendername" dir="auto">Diego Antonio
              Rosario Palomino</b> <span dir="auto">&lt;<a
                href="mailto:diegorosario2013@gmail.com"
                moz-do-not-send="true" class="moz-txt-link-freetext">diegorosario2013@gmail.com</a>&gt;</span><br>
            Date: Mon, 28 Jul 2025 at 12:56 p.m.<br>
            Subject: Re: Proposal: Roundtrip serialization of Cmm
            (parser-compatible pretty-printer output)<br>
            To: Hécate &lt;<a href="mailto:hecate@glitchbra.in"
              moz-do-not-send="true" class="moz-txt-link-freetext">hecate@glitchbra.in</a>&gt;<br>
          </div>
          <br>
          <br>
          <div dir="ltr">
            <p>Hello all,</p>
            <p>Thank you for the thoughtful responses so far, and thank
              you Simon for summarizing Andreas's comments.</p>
            <blockquote>
              <p><i>"Do you have any use-cases in mind? Suppose you were
                  100% successful — would anyone use it?"</i></p>
            </blockquote>
            <p>Yes — my mentor, <b>Csaba Hruska</b>, would. He's
              currently working on a custom STG optimizer that uses
              experimental techniques to enable whole-program
              optimizations for Haskell code. The intended pipeline is:</p>
            <p><b>GHC STG → custom optimizer → textual Cmm → code
                generation</b></p>
            <p>However, the current <i>parseable</i> Cmm is not
              sufficient for his use case, because it <b>cannot
                represent everything the Cmm AST can express</b>.</p>
            <p>Beyond this specific use case, achieving <b>roundtrip
                serializability</b> for Cmm could make it a <b>viable
                alternative to LLVM</b> for Haskell projects. Native
              code generation via Cmm is much faster than through LLVM.
              And while outputting LLVM from Cmm currently produces <i>less
                performant</i> code than directly targeting LLVM, I
              believe the inefficiencies could be fixed relatively
              easily. Enabling such improvements is part of the
              motivation for my documentation work — to help developers
              understand and work with Cmm and its infrastructure.</p>
            <blockquote>
              <p><i>"You need a compelling reason to change the input
                  language (understood by the parser) since libraries
                  may include .cmm files, which will break. (It'd be
                  interesting to audit Hackage to see how many libraries
                  do include such .cmm files.)"</i></p>
            </blockquote>
            <p>To clarify, this proposal would <b>not</b> break
              backwards compatibility. There are two implementation
              paths:</p>
            <ol>
              <li>
                <p>Introduce a <b>second parser</b> that accepts a
                  syntax 100% identical to the pretty printer output.</p>
              </li>
              <li>
                <p>Extend the <b>current parser</b> by adding a mode
                  (or block) that uses a distinct keyword (e.g., <code>low_level_unwrapped</code>)
                  to indicate: "expect exact syntax, no convenience
                  fills."</p>
              </li>
            </ol>
            <p>In either case, existing <code>.cmm</code> files would
              continue to be supported as-is. The current parser
              wouldn't need features removed or changed — the new syntax
              would <b>only add capabilities</b>.</p>
            <blockquote>
              <p><i>"It’s unclear from your example how those blocks
                  would work exactly. Is <code>low_level_unwrapped</code>
                  a label? If so can we <code>goto</code> it? Is it a
                  keyword? Something else entirely?"</i> — Andreas</p>
            </blockquote>
            <p>Apologies for the confusion — I’m not well-versed in the
              formal terminology.</p>
            <p>To clarify: <code>low_level_unwrapped</code> (or <code>very_low_level</code>,
              or another name) would be a <b>keyword or syntactic
                construct</b> that tells the parser to interpret the
              contents of the block <code>{ ... }</code> using a syntax
              <b>identical to what the pretty printer emits</b>.</p>
            <p>For example:</p>
            <pre><code>function1 { }            // existing low-level syntax
function2() { }          // existing high-level syntax

very_low_level { ... }   // new mode: code with exact pretty-printed syntax inside the block
</code></pre>
            <blockquote>
              <p><i>"Rather than change the language understood by the
                  parser, would it not be easier to change the language
                  spat out by the pretty-printer to be compatible with
                  the parser?"</i></p>
            </blockquote>
            <p>Unfortunately, that’s not a practical path forward.</p>
            <p>At the start of the project, Csaba (my mentor)
              recommended leaving the parser mostly untouched and
              focusing instead on extending the pretty printer. However,
              we’ve realized that the differences between the parser and
              the pretty printer are not trivial. The parser — even in
              its current “low-level” mode — <b>inserts inferred data</b>
              via convenience functions. It <b>abstracts part of the
                structure</b>, meaning we cannot fully recover the
              original Cmm ADT just by parsing.</p>
            <p>In other words, <b>modifying the pretty printer to match
                the parser would require it to <i>lose information</i></b>
              — which I strongly oppose. If Cmm is generated
              programmatically, the pretty-printed version would lack
              structural information present in the internal data
              structure. And parseable Cmm would still be <b>incapable
                of expressing all features of the AST</b>.</p>
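            <p>A toy sketch of this point, in Python rather than Haskell
              and with made-up types: the <code>Call</code> record, the
              word size, and the four functions below are hypothetical
              stand-ins, not GHC's actual Cmm AST or parser. The field
              names only echo the <code>args</code>, <code>res</code>,
              and <code>upd</code> labels seen in the pretty-printed
              examples in this thread. A syntax that omits fields and
              lets the parser infer them cannot reproduce every AST
              value, while a syntax that prints every field can.</p>
<pre>
```python
from dataclasses import dataclass


# Toy stand-in for one AST node; NOT GHC's real data type.
@dataclass(frozen=True)
class Call:
    target: str
    args: int  # number printed after "args:"
    res: int   # number printed after "res:"
    upd: int   # number printed after "upd:"


WORD = 8  # assumed value used for the inferred defaults


def print_lossy(node):
    """Print in a convenient syntax that omits the numeric fields."""
    return f"jump {node.target};"


def parse_lossy(text):
    """Parse the convenient syntax, filling the omitted fields by inference."""
    target = text.removeprefix("jump ").removesuffix(";")
    return Call(target, args=WORD, res=0, upd=WORD)


def print_exact(node):
    """Print every field, as an exact syntax would."""
    return f"call {node.target} args: {node.args}, res: {node.res}, upd: {node.upd};"


def parse_exact(text):
    """Parse the exact syntax without inferring anything."""
    head, fields = text.removesuffix(";").split(" args: ")
    args, rest = fields.split(", res: ")
    res, upd = rest.split(", upd: ")
    return Call(head.removeprefix("call "), int(args), int(res), int(upd))


default_call = Call("f", args=8, res=0, upd=8)
unusual_call = Call("f", args=16, res=0, upd=24)

# The lossy syntax only roundtrips values that equal the inferred defaults.
print(parse_lossy(print_lossy(default_call)) == default_call)
print(parse_lossy(print_lossy(unusual_call)) == unusual_call)
# The exact syntax roundtrips both.
print(parse_exact(print_exact(unusual_call)) == unusual_call)
```
</pre>
            <p>In this toy model the second check is the one that fails,
              which is the kind of information loss I would like to
              avoid.</p>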
            <p>I hope that also addresses your concern, Hécate.</p>
            <p>This GSoC project runs until <b>November 10th</b>. I was
              granted extra time since, unlike most participants, I’m
              not working through summer vacation — I’m in the Southern
              Hemisphere.</p>
            <p>(Also, I realize I previously used the wrong project name
              in this thread — the correct title of my GSoC project is <b>“Documenting
                and improving Cmm.”</b>)</p>
            <p>Regarding the risk of <b>bitrot</b> in a new parser or
              new syntax mode: one possible mitigation would be to add <b>regression
                tests</b> that parse a file, pretty-print the result,
              and check that the printed text parses back to the same
              AST.</p>
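            <p>The shape of such a test could look roughly like the
              sketch below. It is written in Python for brevity, and
              everything in it is hypothetical: <code>roundtrip_failures</code>,
              the miniature list-of-assignments "AST", and its <code>pretty</code>
              and <code>parse</code> functions are placeholders for the
              real Cmm pretty printer and parser, which a GHC test would
              call instead.</p>
<pre>
```python
def roundtrip_failures(pretty, parse, samples):
    """Return the samples that do not survive print-then-parse.

    A sample fails if the reparsed value differs from the original,
    or if printing the reparsed value changes the text.
    """
    failures = []
    for ast in samples:
        text = pretty(ast)
        reparsed = parse(text)
        if reparsed != ast or pretty(reparsed) != text:
            failures.append(ast)
    return failures


# Hypothetical miniature "AST": a list of (register, value) assignments.
def pretty(assignments):
    return "".join(f"{reg} = {val};\n" for reg, val in assignments)


def parse(text):
    result = []
    for line in text.splitlines():
        reg, val = line.removesuffix(";").split(" = ")
        result.append((reg, int(val)))
    return result


samples = [[], [("R1", 1)], [("R1", 1), ("R2", 42)]]
print(roundtrip_failures(pretty, parse, samples))  # prints []
```
</pre>
            <p>A parser that drops or rewrites a field would show up
              here as a non-empty list of failing samples.</p>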
            <p>On a related note, I’ve noticed that <b>some Cmm
                examples in the documentation and even in source code
                comments are incorrect or outdated</b>. Part of my work
              includes identifying and correcting these inconsistencies.</p>
            <p>Thanks again to everyone for your time and input — I
              greatly appreciate the discussion and feedback.</p>
            <p>Best regards,<br>
              <b>Diego Antonio Rosario Palomino</b><br>
              GSoC 2025 – Documenting and improving Cmm</p>
          </div>
          <br>
          <div class="gmail_quote">
            <div dir="ltr" class="gmail_attr">On Mon, 28 Jul 2025 at
              11:04 a.m., Hécate via ghc-devs (<a
                href="mailto:ghc-devs@haskell.org" target="_blank"
                moz-do-not-send="true" class="moz-txt-link-freetext">ghc-devs@haskell.org</a>)
              wrote:<br>
            </div>
            <blockquote class="gmail_quote"
style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex">
              <div>
                <p>Hi Diego,</p>
                <p>Thank you very much for your work in this direction;
                  it's sorely needed.<br>
                </p>
                <p>I'm all for having proper roundtrip correctness for
                  Cmm, but I am not sure altering the parser is the way
                  to go.<br>
                  In my opinion, GHC should produce valid textual Cmm
                  that can be ingested by the parser as it is today.<br>
                  <br>
                  Have a nice day,<br>
                  Hécate<br>
                </p>
                <div>On 28/07/2025 at 02:16, Diego Antonio Rosario
                  Palomino wrote:<br>
                </div>
                <blockquote type="cite">
                  <div dir="ltr">
                    <p>Hello GHC devs,</p>
                    <p>I'm currently working on Cmm documentation and
                      tooling improvements as part of my Google Summer
                      of Code project. One of my core goals is to make
                      Cmm roundtrip serializable.</p>
                    <p>Right now, the in-memory Cmm data
                      structure—generated programmatically (e.g., from
                      STG via GHC)—can be pretty-printed, and Cmm can
                      also be parsed. However, the pretty-printed
                      version is not compatible with the parser. That
                      is, we cannot take the output of the pretty
                      printer and feed it directly back into the parser.</p>
                    <p>Example:</p>
                    <p>Parseable version:</p>
                    <pre><code>sum {
 cr:
  bits64 x;
  x = R1 + R2;
  R1 = x;
  jump %ENTRY_CODE(Sp(0))[R1];
}
</code></pre>
                    <p>Pretty-printed version:</p>
                    <pre><code>sum() { // []
  { info_tbls: []
    stack_info: arg_space: 8
  }
  {offset
    cf: // global
      _ce::I64 = R1 + R2;
      R1 = _ce::I64;
      call (I64[Sp + 0 * 8])(R1) args: 8, res: 0, upd: 8;
  }
}
</code></pre>
                    <p>Another example:</p>
                    <p>Parseable version:</p>
                    <pre><code>simple_sum_4 { // [R2, R1]
  cr: // global
    bits64 _cq;
    _cq = R2;
    bits64 _cp;
    _cp = R1;
    R1 = _cq + _cp;
    jump (bits64[Sp])[R1];
}
</code></pre>
                    <p>Pretty-printed version:</p>
                    <pre><code>simple_sum_4() { // []
  { info_tbls: []
    stack_info: arg_space: 8
  }
  {offset
    cs: // global
      _cq::I64 = R2;
      _cr::I64 = R1;
      R1 = _cq::I64 + _cr::I64;
      call (I64[Sp])(R1) args: 8, res: 0, upd: 8;
  }
}
</code></pre>
                    <p>While it’s possible to write parseable Cmm that
                      resembles the pretty-printed version (and hence
                      the internal ADT), they don’t fully match—mainly
                      because the parser inserts inferred fields using
                      convenience functions.</p>
                    <p>Proposal:</p>
                    <p>To make roundtrip serialization possible, I
                      propose supporting a new syntax that matches the
                      pretty printer output exactly.</p>
                    <p>There are a couple of design options:</p>
                    <ol>
                      <li>
                        <p>Create a separate parser that accepts the
                          pretty-printed syntax. Files could then use
                          either the current parser or the new strict
                          one.</p>
                      </li>
                      <li>
                        <p>Extend the current parser with a dedicated
                          block syntax like:</p>
                      </li>
                    </ol>
                    <pre><code>low_level_unwrapped {
  ...
}
</code></pre>
                    <p>This second option is the one my mentor
                      recommends, as it may better reflect GHC
                      developers' preferences. In this mode, the parser
                      would not insert any inferred data and would
                      expect the input to match the pretty-printed form
                      exactly.</p>
                    <p>This would enable a true roundtrip:</p>
                    <ul>
                      <li>
                        <p>Compile Haskell to Cmm (in-memory AST)</p>
                      </li>
                      <li>
                        <p>Pretty-print and write it to disk (wrapped in
                          low_level_unwrapped { ... })</p>
                      </li>
                      <li>
                        <p>Later read it back using the parser and
                          continue with codegen</p>
                      </li>
                    </ul>
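                    <p>As a rough illustration of the write-to-disk and
                      read-back steps only, here is a small Python
                      sketch. The helper names <code>wrap</code> and <code>unwrap</code>,
                      the file name <code>sum.cmm</code>, and the
                      one-line body are placeholders chosen for this
                      example; no real Cmm parsing or code generation
                      happens here.</p>
<pre>
```python
import tempfile
from pathlib import Path

KEYWORD = "low_level_unwrapped"  # keyword name proposed in this thread


def wrap(body):
    """Wrap pretty-printed text in the proposed block."""
    return KEYWORD + " {\n" + body + "}\n"


def unwrap(text):
    """Return the block body, or raise ValueError if the wrapper is missing."""
    prefix, suffix = KEYWORD + " {\n", "}\n"
    if not (text.startswith(prefix) and text.endswith(suffix)):
        raise ValueError("not a " + KEYWORD + " block")
    return text[len(prefix):len(text) - len(suffix)]


body = "  R1 = R1 + R2;\n"  # placeholder for real pretty-printer output

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "sum.cmm"
    path.write_text(wrap(body))          # step 2: write the wrapped text
    restored = unwrap(path.read_text())  # step 3: read it back

print(restored == body)
```
</pre>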
                    <p>Optional future direction:</p>
                    <p>As a side note: currently the parser has both a
                      “high-level” and a “low-level” mode. The low-level
                      mode resembles the AST more closely but still
                      inserts some inferred data.</p>
                    <p>If we introduce this new “exact” low-level form,
                      it's possible the existing low-level mode could
                      become redundant. We might then have:</p>
                    <ul>
                      <li>
                        <p>High-level syntax</p>
                      </li>
                      <li>
                        <p>New low-level (exact)</p>
                      </li>
                      <li>
                        <p>And possibly deprecate the current low-level
                          variant</p>
                      </li>
                    </ul>
                    <p>I’d be interested in your thoughts on whether
                      that direction makes sense.</p>
                    <p>Serialization libraries?</p>
                    <p>One technically possible—but likely
                      unacceptable—alternative would be to derive
                      serialization via a library like <code>aeson</code>.
                      That would enable serializing and deserializing
                      the Cmm AST directly. However, I understand that <code>aeson</code>
                      adds a large dependency footprint, and likely
                      wouldn't be suitable for inclusion in GHC.</p>
                    <p>Final question:</p>
                    <p>Lastly—I’ve heard that parts of the Cmm pipeline
                      may currently be under refactoring. If that’s the
                      case, could you point me to which parts (parser,
                      pretty printer, internal representation, etc.) are
                      being modified? I’d like to align my efforts
                      accordingly and avoid conflicts.</p>
                    <p>Thanks very much for your time and input! I'm
                      happy to iterate on this based on your feedback.</p>
                    <p>Best regards,<br>
                      Diego Antonio Rosario Palomino<br>
                      GSoC 2025 – Cmm Documentation &amp; Tooling</p>
                  </div>
                  <br>
                  <fieldset></fieldset>
                  <pre>_______________________________________________
ghc-devs mailing list
<a href="mailto:ghc-devs@haskell.org" target="_blank"
                  moz-do-not-send="true" class="moz-txt-link-freetext">ghc-devs@haskell.org</a>
<a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs"
                  target="_blank" moz-do-not-send="true"
                  class="moz-txt-link-freetext">http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs</a>
</pre>
                </blockquote>
                <pre cols="72">-- 
Hécate ✨
🐦: @TechnoEmpress
IRC: Hecate
WWW: <a href="https://glitchbra.in" target="_blank"
                moz-do-not-send="true" class="moz-txt-link-freetext">https://glitchbra.in</a>
RUN: BSD</pre>
              </div>
            </blockquote>
          </div>
        </div>
      </div>
      <br>
    </blockquote>
    <pre class="moz-signature" cols="72">-- 
Hécate ✨
🐦: @TechnoEmpress
IRC: Hecate
WWW: <a class="moz-txt-link-freetext" href="https://glitchbra.in">https://glitchbra.in</a>
RUN: BSD</pre>
  </body>
</html>