Help implementing Multiline String Literals

Thu Feb 8 14:45:37 UTC 2024

Hi Brandon,

I'm not following all of the details here, but from my naïve 
understanding, I would definitely tweak the lexer, do the 
post-processing and then have a canonical string representation rather 
than waiting until desugaring.
If you like 1.4 best, give it a try. You will see soon enough if some 
performance regression test gets worse. It can't hurt to write a few 
yourself either.
I don't think that post-processing the strings would incur too much of a 
hit compared to compiling those strings and serialising them into an 
executable.
I also bet that you can get rid of some of the performance problems with 
list fusion.
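
To make the "pure post-processing" idea from 1.4 concrete, here is a rough 
sketch. All names (stripCommonIndent, resolveEscapes, processMultiline) are 
made up for illustration, this is not the actual Lexer.x code, and it only 
handles a handful of escapes and a simplified notion of indentation. The 
point is just the ordering: layout is handled on the raw text first, and 
escapes are resolved afterwards, so a typed newline and a typed `\n` stay 
distinguishable for as long as needed.

```haskell
import Data.Char (isSpace)
import Data.List (intercalate)

-- Hypothetical helper: split on '\n', keeping empty trailing lines
-- (unlike 'lines', which drops a final empty line).
splitLines :: String -> [String]
splitLines s = case break (== '\n') s of
  (l, [])       -> [l]
  (l, _ : rest) -> l : splitLines rest

-- Hypothetical step: remove the indentation shared by all non-blank lines.
-- Only spaces count as indentation in this simplified sketch.
stripCommonIndent :: String -> String
stripCommonIndent s = intercalate "\n" (map (drop n) ls)
  where
    ls      = splitLines s
    indents = [length (takeWhile (== ' ') l) | l <- ls, not (all isSpace l)]
    n       = if null indents then 0 else minimum indents

-- Hypothetical step: resolve a small subset of escapes on the whole string.
-- The real escape grammar is much larger; this is an assumption-laden toy.
resolveEscapes :: String -> String
resolveEscapes ('\\' : 'n'  : rest) = '\n' : resolveEscapes rest
resolveEscapes ('\\' : 't'  : rest) = '\t' : resolveEscapes rest
resolveEscapes ('\\' : '\\' : rest) = '\\' : resolveEscapes rest
resolveEscapes ('\\' : '"'  : rest) = '"'  : resolveEscapes rest
resolveEscapes ('\\' : '&'  : rest) = resolveEscapes rest  -- empty escape
resolveEscapes (c : rest)           = c : resolveEscapes rest
resolveEscapes []                   = []

-- The pipeline: layout first (on raw text), escapes last.
processMultiline :: String -> String
processMultiline = resolveEscapes . stripCommonIndent

main :: IO ()
main = print (processMultiline "  a\n    b\\n  c")
```

In the example input, the real newline after `a` takes part in the 
indentation stripping, while the typed `\n` after `b` does not, so the two 
spaces before `c` survive. Resolving escapes first would lose that 
distinction.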

Cheers,
Sebastian

------ Original Message ------
From: "Brandon Chinn" <brandonchinn178 at gmail.com>
To: ghc-devs at haskell.org
Sent: 04.02.2024 19:24:19
Subject: Help implementing Multiline String Literals

>  Hello!
>
>I'm trying to implement #24390 
><https://gitlab.haskell.org/ghc/ghc/-/issues/24390>, which implements 
>the multiline string literals proposal 
><https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0569-multiline-strings.rst> 
>(existing work done in wip/multiline-strings 
><https://gitlab.haskell.org/ghc/ghc/-/compare/master...wip%2Fmultiline-strings?from_project_id=1&straight=false>). 
>I originally suggested adding HsMultilineString to HsLit and 
>translating it to HsString in renaming, then Matthew Pickering 
>suggested I translate it in desugaring instead. I tried going down this 
>approach, but I'm running into two main issues: Escaped characters and 
>Overloaded strings.
>
>Apologies in advance for a long email. TL;DR - The best implementation 
>I could think of involves a complete rewrite of how strings are lexed 
>and modifying HsString instead of adding a new HsMultilineString 
>constructor. If this is absolutely crazy talk, please dissuade me from 
>this :)
>
>===== Problem 1: Escaped characters =====
>Currently, Lexer.x resolves escaped characters for string literals. In 
>the Note [Literal source text], we see that this is intentional; 
>HsString should contain a normalized internal representation. However, 
>multiline string literals have a post-processing step that requires 
>distinguishing between the user typing a newline vs the user typing 
>literally a backslash + an `n` (and other things like knowing if a user 
>typed in `\&`, which currently goes away in lexing as well).
>
>Fundamentally, the current logic to resolve escaped characters is 
>specific to the Lexer monad and operates on a per-character basis. But 
>the multiline string literals proposal requires post-processing the 
>whole string, then resolving escaped characters all at once.
>
>Possible solutions:
>
>(1.1) Duplicate the logic for resolving escaped characters
>     * Pro: Leaves normal string lexing untouched
>     * Con: Two sources of truth, possibly divergent behaviors between 
>multiline and normal strings
>
>(1.2) Stick the post-processed string back into P, then rerun normal 
>string lexing to resolve escaped characters
>     * Pro: Leaves normal string lexing untouched
>     * Con: Seems roundabout, inefficient, and hacky
>
>(1.3) Refactor the resolve-escaped-characters logic to work in both the 
>P monad and as a pure function `String -> String`
>     * Pro: Reuses same escaped-characters logic for both normal + 
>multiline strings
>     * Con: Different overall behavior between the two string types: 
>Normal string still lexed per-character, Multiline strings would lex 
>everything
>     * Con: Small refactor of lexing normal strings, which could 
>introduce regressions
>
>(1.4) Read entire string (both normal + multiline) with no 
>preprocessing (including string gaps or anything, except escaping quote 
>delimiters), and define all post-processing steps as pure `String -> 
>String` functions
>     * Pro: Gets out of monadic code quickly, turn bulk of string logic 
>into pure code
>     * Pro: Processes normal + multiline strings exactly the same
>     * Pro: Opens the door for future string behaviors, e.g. raw string 
>could do the same "read entire string" logic, and just not do any 
>post-processing.
>     * Con: Could be less performant
>     * Con: Major refactor of lexing normal strings, which could 
>introduce regressions
>
>I like solution 1.4 the best, as it generalizes string processing 
>behavior the best and is more pipeline-style vs the currently more 
>imperative style. But I recognize possible performance or behavior 
>regressions are a real thing, so if anyone has any thoughts here, I'd 
>love to hear them.
>
>===== Problem 2: Overloaded strings =====
>Currently, `HsString s` is converted into `HsOverLit (HsIsString s)` in 
>the renaming phase. Following Matthew's suggestion of resolving 
>multiline string literals in the desugar step, this would mean that 
>multiline string literals are post-processed after OverloadedStrings 
>has already been applied.
>
>I don't like any of the solutions this approach brings up:
>* Do post processing both when Desugaring HsMultilineString AND when 
>Renaming HsMultilineString to HsOverLit - seems wrong to process 
>multiline strings in two different phases
>* Add HsIsStringMultiline and post process when desugaring both 
>HsMultilineString and HsIsStringMultiline - would ideally like to avoid 
>adding a multiline variant of HsIsString
>
>Instead, I propose we throw away the HsMultilineString idea and reuse 
>HsString. The multiline syntax would still be preserved in the 
>SourceText, and this also leaves the door open for future string 
>features. For example, if we went with HsMultilineString, then adding 
>raw strings would require adding both HsRawString and 
>HsMultilineRawString.
>
>Here are two possible solutions for reusing HsString:
>
>(2.1) Add a HsStringType parameter to HsString
>     * HsStringType would define the format of the FastString stored in 
>HsString: Normal => processed, Multiline => stores raw string, needs 
>post-processing
>     * Post processing could occur in desugaring, with or without 
>OverloadedStrings
>     * Pro: Shows the parsed multiline string before processing in 
>-ddump-parsed
>     * Con: HsString containing Multiline strings would not contain the 
>normalized representation mentioned in Note [Literal source text]
>     * Con: Breaking change in the GHC API
>
>(2.2) Post-process multiline strings in lexer
>     * Lexer would do all the post processing (for example, in 
>conjunction with solution 1.4) and just return a normal HsString
>     * Pro: Multiline string is immediately desugared and behaves as 
>expected for OverloadedStrings (and any other behaviors of string 
>literals, existing or future) for free
>     * Pro: HsString would still always contain the normalized 
>representation
>     * Con: No way of inspecting the raw multiline parse output before 
>processing, e.g. via -ddump-parsed
>
>I'm leaning towards solution 2.1, but curious what people's thoughts 
>are.
>
>===== Closing remarks =====
>Again, sorry for the long email. My head is spinning trying to figure 
>out this feature. Any help would be greatly appreciated.
>
>As an aside, I last worked on GHC back in 2020 or 2021, and my 
>goodness. The Hadrian build is so much smoother (and faster!? Not sure 
>if it's just my new laptop though) than what it was last time I touched 
>the codebase. Huge thanks to the maintainers, both for the tooling and 
>the docs in the wiki. This is a much more enjoyable experience.
>
>Thanks,
>Brandon