Help implementing Multiline String Literals

Thu Feb 8 15:54:53 UTC 2024

I don't think that is the right way to go.

They are different syntactic forms so they should be distinguished in
the syntax tree.

If I want to generate HsSyn directly, and print it out, how does the
compiler know whether I meant to print a normal string literal or a
multi-line string literal? And when the compiler prints an expression
containing a string literal in an error message, should it print the
multi-line form or the normal form?
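
To make the printing concern concrete, here is a minimal sketch. The
`Lit` type and both functions are hypothetical, simplified stand-ins, not
GHC's actual `HsLit` or its pretty-printer, and the multi-line printer
makes no attempt to escape the contents of each line.

```haskell
import Data.List (intercalate)

-- Hypothetical, simplified stand-in for a literal type; not GHC's HsLit.
data Lit
  = LitString String            -- written as "a\nb"
  | LitMultilineString [String] -- one entry per line of the literal
  deriving (Eq, Show)

-- Both forms denote the same string value.
litValue :: Lit -> String
litValue (LitString s)           = s
litValue (LitMultilineString ls) = intercalate "\n" ls

-- The printer can only reproduce the original syntactic form because
-- the constructor records which form was written.
-- Assumption: lines are printed verbatim, with no escaping.
pprLit :: Lit -> String
pprLit (LitString s)           = show s
pprLit (LitMultilineString ls) =
  "\"\"\"\n" ++ intercalate "\n" ls ++ "\n\"\"\""
```

With a single constructor, `pprLit` would have to pick one form for every
literal, whichever way it was written in the source.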

Matt

On Thu, Feb 8, 2024 at 3:35 PM Brandon Chinn <brandonchinn178 at gmail.com> wrote:
>
> Thanks Sebastian and Matt!
>
> Matt - can you elaborate, I don't understand your comment. A multiline string is just syntax sugar for a normal string, so if the lexer does the post processing, it can be treated as a normal string the rest of the way. Why does anything else in the compiler need to know if the string was written as a multiline string?
>
> Or, to rephrase, a multiline string _should_ be semantically indistinguishable from a normal string with \n characters typed in.
>
> On Thu, Feb 8, 2024, 7:09 AM Matthew Pickering <matthewtpickering at gmail.com> wrote:
>>
>> I would imagine you modify the lexer like you describe, but it's not
>> clear to me you want to use the same constructor `HsString` to
>> represent them all the way through the compiler.
>>
>> If you reuse HsString then how do you distinguish between a string
>> which contains a newline and a multi-line string for example? It just
>> seems simpler to me to explicitly represent a multi-line string,
>> perhaps `HsMultiLineString [String]` rather than trying to shoehorn
>> them together and run into subtle bugs like this.
>>
>> Matt
>>
>> On Thu, Feb 8, 2024 at 2:45 PM Sebastian Graf <sgraf1337 at gmail.com> wrote:
>> >
>> > Hi Brandon,
>> >
>> > I'm not following all of the details here, but from my naïve understanding, I would definitely tweak the lexer, do the post-processing and then have a canonical string representation rather than waiting until desugaring.
>> > If you like 1.4 best, give it a try. You will see soon enough if some performance regression test gets worse. It can't hurt to write a few yourself either.
>> > I don't think that post-processing the strings would incur too much of a hit compared to compiling those strings and serialising them into an executable.
>> > I also bet that you can get rid of some of the performance problems with list fusion.
>> >
>> > Cheers,
>> > Sebastian
>> >
>> > ------ Original Message ------
>> > From: "Brandon Chinn" <brandonchinn178 at gmail.com>
>> > To: ghc-devs at haskell.org
>> > Sent: 04.02.2024 19:24:19
>> > Subject: Help implementing Multiline String Literals
>> >
>> >  Hello!
>> >
>> > I'm trying to implement #24390, which implements the multiline string literals proposal (existing work done in wip/multiline-strings). I originally suggested adding HsMultilineString to HsLit and translating it to HsString in renaming, then Matthew Pickering suggested I translate it in desugaring instead. I tried going down this approach, but I'm running into two main issues: Escaped characters and Overloaded strings.
>> >
>> > Apologies in advance for a long email. TL;DR - The best implementation I could think of involves a complete rewrite of how strings are lexed and modifying HsString instead of adding a new HsMultilineString constructor. If this is absolutely crazy talk, please dissuade me from this :)
>> >
>> > ===== Problem 1: Escaped characters =====
>> > Currently, Lexer.x resolves escaped characters for string literals. In the Note [Literal source text], we see that this is intentional; HsString should contain a normalized internal representation. However, multiline string literals have a post-processing step that requires distinguishing between the user typing a newline vs the user typing literally a backslash followed by an `n` (and other things like knowing if a user typed in `\&`, which currently goes away in lexing as well).
>> >
>> > Fundamentally, the current logic to resolve escaped characters is specific to the Lexer monad and operates on a per-character basis. But the multiline string literals proposal requires post-processing the whole string, then resolving escaped characters all at once.
>> >
>> > Possible solutions:
>> >
>> > (1.1) Duplicate the logic for resolving escaped characters
>> >     * Pro: Leaves normal string lexing untouched
>> >     * Con: Two sources of truth, possibly divergent behaviors between multiline and normal strings
>> >
>> > (1.2) Stick the post-processed string back into P, then rerun normal string lexing to resolve escaped characters
>> >     * Pro: Leaves normal string lexing untouched
>> >     * Con: Seems roundabout, inefficient, and hacky
>> >
>> > (1.3) Refactor the resolve-escaped-characters logic to work in both the P monad and as a pure function `String -> String`
>> >     * Pro: Reuses same escaped-characters logic for both normal + multiline strings
>> >     * Con: Different overall behavior between the two string types: Normal string still lexed per-character, Multiline strings would lex everything
>> >     * Con: Small refactor of lexing normal strings, which could introduce regressions
>> >
>> > (1.4) Read entire string (both normal + multiline) with no preprocessing (including string gaps or anything, except escaping quote delimiters), and define all post-processing steps as pure `String -> String` functions
>> >     * Pro: Gets out of monadic code quickly, turns the bulk of the string logic into pure code
>> >     * Pro: Processes normal + multiline strings exactly the same
>> >     * Pro: Opens the door for future string behaviors, e.g. raw string could do the same "read entire string" logic, and just not do any post-processing.
>> >     * Con: Could be less performant
>> >     * Con: Major refactor of lexing normal strings, which could introduce regressions
>> >
>> > I like solution 1.4 the best, as it generalizes string processing behavior the best and is more pipeline-style vs the currently more imperative style. But I recognize possible performance or behavior regressions are a real thing, so if anyone has any thoughts here, I'd love to hear them.
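
A rough sketch of the pipeline style that solution 1.4 describes, under
heavy simplifying assumptions: all function names here are hypothetical,
only a handful of escapes are handled, string gaps are ignored, and the
indentation step is a reduced version of what the proposal specifies. The
point is only that post-processing runs on the raw text first and escapes
are resolved last.

```haskell
import Data.Char (isSpace)
import Data.List (intercalate)

-- Resolve a small subset of escapes in raw literal text.
-- Assumption: the real lexer handles many more escape forms.
resolveEscapes :: String -> String
resolveEscapes ('\\' : 'n'  : rest) = '\n' : resolveEscapes rest
resolveEscapes ('\\' : 't'  : rest) = '\t' : resolveEscapes rest
resolveEscapes ('\\' : '\\' : rest) = '\\' : resolveEscapes rest
resolveEscapes ('\\' : '"'  : rest) = '"'  : resolveEscapes rest
resolveEscapes ('\\' : '&'  : rest) = resolveEscapes rest
resolveEscapes (c : rest)           = c : resolveEscapes rest
resolveEscapes []                   = []

-- Split on real newline characters, keeping a trailing empty line.
splitLines :: String -> [String]
splitLines s = case break (== '\n') s of
  (l, [])       -> [l]
  (l, _ : rest) -> l : splitLines rest

-- Remove the leading spaces shared by all non-blank lines.
stripCommonIndent :: String -> String
stripCommonIndent raw = intercalate "\n" (map (drop n) ls)
  where
    ls      = splitLines raw
    indents = [length (takeWhile (== ' ') l) | l <- ls, not (all isSpace l)]
    n       = if null indents then 0 else minimum indents

-- Normal strings only need escapes resolved.
processNormal :: String -> String
processNormal = resolveEscapes

-- Multiline strings are post-processed on the raw text, then escapes
-- are resolved, so a typed backslash-n never counts as a line break.
processMultiline :: String -> String
processMultiline = resolveEscapes . stripCommonIndent
```

Swapping the order to `stripCommonIndent . resolveEscapes` would treat a
typed backslash-n as a real line break when computing the indentation,
which is exactly the distinction Problem 1 is about.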
>> >
>> > ===== Problem 2: Overloaded strings =====
>> > Currently, `HsString s` is converted into `HsOverLit (HsIsString s)` in the renaming phase. Following Matthew's suggestion of resolving multiline string literals in the desugar step, this would mean that multiline string literals are post-processed after OverloadedStrings has already been applied.
>> >
>> > I don't like any of the solutions this approach brings up:
>> > * Do post processing both when Desugaring HsMultilineString AND when Renaming HsMultilineString to HsOverLit - seems wrong to process multiline strings in two different phases
>> > * Add HsIsStringMultiline and post process when desugaring both HsMultilineString and HsIsStringMultiline - would ideally like to avoid adding an HsIsStringMultiline variant
>> >
>> > Instead, I propose we throw away the HsMultilineString idea and reuse HsString. The multiline syntax would still be preserved in the SourceText, and this also leaves the door open for future string features. For example, if we went with HsMultilineString, then adding raw strings would require adding both HsRawString and HsMultilineRawString.
>> >
>> > Here are two possible solutions for reusing HsString:
>> >
>> > (2.1) Add a HsStringType parameter to HsString
>> >     * HsStringType would define the format of the FastString stored in HsString: Normal => processed, Multiline => stores raw string, needs post-processing
>> >     * Post processing could occur in desugaring, with or without OverloadedStrings
>> >     * Pro: Shows the parsed multiline string before processing in -ddump-parsed
>> >     * Con: HsString containing Multiline strings would not contain the normalized representation mentioned in Note [Literal source text]
>> >     * Con: Breaking change in the GHC API
>> >
>> > (2.2) Post-process multiline strings in lexer
>> >     * Lexer would do all the post processing (for example, in conjunction with solution 1.4) and just return a normal HsString
>> >     * Pro: Multiline string is immediately desugared and behaves as expected for OverloadedStrings (and any other behaviors of string literals, existing or future) for free
>> >     * Pro: HsString would still always contain the normalized representation
>> >     * Con: No way of inspecting the raw multiline parse output before processing, e.g. via -ddump-parsed
>> >
>> > I'm leaning towards solution 2.1, but curious what people's thoughts are.
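
For illustration, solution 2.1 might be modelled roughly as below. This is
a simplified sketch: `StringType`, `Lit` and `desugarLit` are hypothetical
names, the real HsString stores a FastString alongside a SourceText, and
the post-processing function is passed in as a parameter rather than
being the proposal's actual algorithm.

```haskell
-- Hypothetical tag describing how the stored text should be read.
data StringType
  = Normal    -- text is already the processed, normalized string
  | Multiline -- text is the raw literal and still needs post-processing
  deriving (Eq, Show)

-- Hypothetical, simplified stand-in for HsString with the extra tag.
data Lit = LitString StringType String
  deriving (Eq, Show)

-- Desugaring yields the final string value. The post-processing step is
-- supplied by the caller, so the same function could be applied on both
-- the plain and the OverloadedStrings path.
desugarLit :: (String -> String) -> Lit -> String
desugarLit _           (LitString Normal s)      = s
desugarLit postProcess (LitString Multiline raw) = postProcess raw
```

The cost listed above is visible here: a `LitString Multiline` value holds
raw text, so any code that inspects the literal before desugaring must
check the tag first.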
>> >
>> > ===== Closing remarks =====
>> > Again, sorry for the long email. My head is spinning trying to figure out this feature. Any help would be greatly appreciated.
>> >
>> > As an aside, I last worked on GHC back in 2020 or 2021, and my goodness. The Hadrian build is so much smoother (and faster!? Not sure if it's just my new laptop though) than what it was last time I touched the codebase. Huge thanks to the maintainers, both for the tooling and the docs in the wiki. This is a much more enjoyable experience.
>> >
>> > Thanks,
>> > Brandon