Help implementing Multiline String Literals

Brandon Chinn brandonchinn178 at gmail.com
Thu Feb 8 15:35:10 UTC 2024


Thanks Sebastian and Matt!

Matt - can you elaborate? I don't understand your comment. A multiline
string is just syntactic sugar for a normal string, so if the lexer does the
post-processing, it can be treated as a normal string the rest of the way.
Why does anything else in the compiler need to know if the string was
written as a multiline string?

Or, to rephrase, a multiline string _should_ be semantically
indistinguishable from a normal string with \n characters typed in.
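
To make that concrete, here is a minimal sketch of what I mean. Everything
in it is an assumption for illustration: `Lit` is a toy stand-in for HsLit,
`lexNormal`/`lexMultiline` are hypothetical names, and `stripCommonIndent`
is a simplified stand-in for the proposal's post-processing, not the real
algorithm.

```haskell
import Data.Char (isSpace)
import Data.List (intercalate)

-- Toy stand-in for GHC's HsLit; only the string case is modelled.
newtype Lit = LitString String
  deriving (Eq, Show)

-- Simplified, hypothetical post-processing: drop the indentation shared
-- by all non-blank lines, then join the lines with '\n'.
stripCommonIndent :: [String] -> String
stripCommonIndent ls = intercalate "\n" (map (drop n) ls)
  where
    indents = [length (takeWhile (== ' ') l) | l <- ls, not (all isSpace l)]
    n = if null indents then 0 else minimum indents

-- A normal string literal goes straight into the one constructor.
lexNormal :: String -> Lit
lexNormal = LitString

-- A multiline literal is post-processed in the "lexer" and then stored in
-- the very same constructor, so later phases cannot tell the two apart.
lexMultiline :: [String] -> Lit
lexMultiline = LitString . stripCommonIndent
```

In GHCi, `lexMultiline ["  foo", "    bar", "  baz"] == lexNormal "foo\n  bar\nbaz"`
evaluates to True, which is the property I'd expect the real implementation
to have.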

On Thu, Feb 8, 2024, 7:09 AM Matthew Pickering <matthewtpickering at gmail.com>
wrote:

> I would imagine you modify the lexer like you describe, but it's not
> clear to me you want to use the same constructor `HsString` to
> represent them all the way through the compiler.
>
> If you reuse HsString then how do you distinguish between a string
> which contains a newline and a multi-line string, for example? It just
> seems simpler to me to explicitly represent a multi-line string,
> perhaps as `HsMultiLineString [String]`, rather than trying to shoehorn
> them together and run into subtle bugs like this.
>
> Matt
>
> On Thu, Feb 8, 2024 at 2:45 PM Sebastian Graf <sgraf1337 at gmail.com> wrote:
> >
> > Hi Brandon,
> >
> > I'm not following all of the details here, but from my naïve
> understanding, I would definitely tweak the lexer, do the post-processing
> and then have a canonical string representation rather than waiting until
> desugaring.
> > If you like 1.4 best, give it a try. You will see soon enough if some
> performance regression test gets worse. It can't hurt to write a few
> yourself either.
> > I don't think that post-processing the strings would incur too much of a
> hit compared to compiling those strings and serialising them into an
> executable.
> > I also bet that you can get rid of some of the performance problems with
> list fusion.
> >
> > Cheers,
> > Sebastian
> >
> > ------ Original Message ------
> > From: "Brandon Chinn" <brandonchinn178 at gmail.com>
> > To: ghc-devs at haskell.org
> > Sent: 04.02.2024 19:24:19
> > Subject: Help implementing Multiline String Literals
> >
> >  Hello!
> >
> > I'm trying to implement #24390, which implements the multiline string
> literals proposal (existing work done in wip/multiline-strings). I
> originally suggested adding HsMultilineString to HsLit and translating it
> to HsString in renaming, then Matthew Pickering suggested I translate it in
> desugaring instead. I tried going down this approach, but I'm running into
> two main issues: Escaped characters and Overloaded strings.
> >
> > Apologies in advance for a long email. TL;DR - The best implementation I
> could think of involves a complete rewrite of how strings are lexed and
> modifying HsString instead of adding a new HsMultilineString constructor.
> If this is absolutely crazy talk, please dissuade me from this :)
> >
> > ===== Problem 1: Escaped characters =====
> > Currently, Lexer.x resolves escaped characters for string literals. In
> the Note [Literal source text], we see that this is intentional; HsString
> should contain a normalized internal representation. However, multiline
> string literals have a post-processing step that requires distinguishing
> between the user typing a newline vs the user typing literally a backslash
> + an `n` (and other things like knowing if a user typed in `\&`, which
> currently goes away in lexing as well).
> >
> > Fundamentally, the current logic to resolve escaped characters is
> specific to the Lexer monad and operates on a per-character basis. But the
> multiline string literals proposal requires post-processing the whole
> string, then resolving escaped characters all at once.
> >
> > Possible solutions:
> >
> > (1.1) Duplicate the logic for resolving escaped characters
> >     * Pro: Leaves normal string lexing untouched
> >     * Con: Two sources of truth, possibly divergent behaviors between
> multiline and normal strings
> >
> > (1.2) Stick the post-processed string back into P, then rerun normal
> string lexing to resolve escaped characters
> >     * Pro: Leaves normal string lexing untouched
> >     * Con: Seems roundabout, inefficient, and hacky
> >
> > (1.3) Refactor the resolve-escaped-characters logic to work in both the
> P monad and as a pure function `String -> String`
> >     * Pro: Reuses same escaped-characters logic for both normal +
> multiline strings
> >     * Con: Different overall behavior between the two string types:
> Normal string still lexed per-character, Multiline strings would lex
> everything
> >     * Con: Small refactor of lexing normal strings, which could
> introduce regressions
> >
> > (1.4) Read entire string (both normal + multiline) with no preprocessing
> (including string gaps or anything, except escaping quote delimiters), and
> define all post-processing steps as pure `String -> String` functions
> >     * Pro: Gets out of monadic code quickly, turns the bulk of the string
> logic into pure code
> >     * Pro: Processes normal + multiline strings exactly the same
> >     * Pro: Opens the door for future string behaviors, e.g. raw strings
> could do the same "read entire string" logic, and just not do any
> post-processing.
> >     * Con: Could be less performant
> >     * Con: Major refactor of lexing normal strings, which could
> introduce regressions
> >
> > I like solution 1.4 the best, as it generalizes string processing
> behavior the best and is more pipeline-style vs the currently more
> imperative style. But I recognize possible performance or behavior
> regressions are a real thing, so if anyone has any thoughts here, I'd love
> to hear them.
> >
> > ===== Problem 2: Overloaded strings =====
> > Currently, `HsString s` is converted into `HsOverLit (HsIsString s)` in
> the renaming phase. Following Matthew's suggestion of resolving multiline
> string literals in the desugar step, this would mean that multiline string
> literals are post-processed after OverloadedStrings has already been
> applied.
> >
> > I don't like any of the solutions this approach brings up:
> > * Do post-processing both when desugaring HsMultilineString AND when
> renaming HsMultilineString to HsOverLit - seems wrong to process multiline
> strings in two different phases
> > * Add HsIsStringMultiline and post-process when desugaring both
> HsMultilineString and HsIsStringMultiline - would ideally like to avoid
> adding an HsIsStringMultiline variant of HsIsString
> >
> > Instead, I propose we throw away the HsMultilineString idea and reuse
> HsString. The multiline syntax would still be preserved in the SourceText,
> and this also leaves the door open for future string features. For example,
> if we went with HsMultilineString, then adding raw strings would require
> adding both HsRawString and HsMultilineRawString.
> >
> > Here are two possible solutions for reusing HsString:
> >
> > (2.1) Add a HsStringType parameter to HsString
> >     * HsStringType would define the format of the FastString stored in
> HsString: Normal => processed, Multiline => stores raw string, needs
> post-processing
> >     * Post processing could occur in desugaring, with or without
> OverloadedStrings
> >     * Pro: Shows the parsed multiline string before processing in
> -ddump-parsed
> >     * Con: HsString containing Multiline strings would not contain the
> normalized representation mentioned in Note [Literal source text]
> >     * Con: Breaking change in the GHC API
> >
> > (2.2) Post-process multiline strings in lexer
> >     * Lexer would do all the post processing (for example, in
> conjunction with solution 1.4) and just return a normal HsString
> >     * Pro: Multiline string is immediately desugared and behaves as
> expected for OverloadedStrings (and any other behaviors of string literals,
> existing or future) for free
> >     * Pro: HsString would still always contain the normalized
> representation
> >     * Con: No way of inspecting the raw multiline parse output before
> processing, e.g. via -ddump-parsed
> >
> > I'm leaning towards solution 2.1, but curious what people's thoughts are.
> >
> > ===== Closing remarks =====
> > Again, sorry for the long email. My head is spinning trying to figure
> out this feature. Any help would be greatly appreciated.
> >
> > As an aside, I last worked on GHC back in 2020 or 2021, and my goodness.
> The Hadrian build is so much smoother (and faster!? Not sure if it's just
> my new laptop though) than what it was last time I touched the codebase.
> Huge thanks to the maintainers, both for the tooling and the docs in the
> wiki. This is a much more enjoyable experience.
> >
> > Thanks,
> > Brandon
> >
>
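
For reference, the pipeline shape behind options (1.3)/(1.4) and (2.2) in
my original email above could be sketched roughly as follows. This is only
an illustration under loud assumptions: all function names are
hypothetical, `resolveEscapes` handles just a handful of escapes, and
`dropCommonIndent` is a simplified stand-in for the proposal's layout
rules. The point is the ordering: layout processing has to see the raw
text, and escapes are resolved last.

```haskell
import Data.Char (isSpace)
import Data.List (intercalate)

-- Hypothetical pure escape resolver; covers only a few escapes.
resolveEscapes :: String -> String
resolveEscapes ('\\' : 'n' : rest)  = '\n' : resolveEscapes rest
resolveEscapes ('\\' : 't' : rest)  = '\t' : resolveEscapes rest
resolveEscapes ('\\' : '\\' : rest) = '\\' : resolveEscapes rest
resolveEscapes ('\\' : '"' : rest)  = '"' : resolveEscapes rest
resolveEscapes ('\\' : '&' : rest)  = resolveEscapes rest  -- empty escape
resolveEscapes (c : rest)           = c : resolveEscapes rest
resolveEscapes []                   = []

-- Simplified layout step: drop indentation shared by all non-blank lines.
dropCommonIndent :: [String] -> [String]
dropCommonIndent ls = map (drop n) ls
  where
    indents = [length (takeWhile (== ' ') l) | l <- ls, not (all isSpace l)]
    n = if null indents then 0 else minimum indents

-- Normal strings: only escapes need resolving.
processNormal :: String -> String
processNormal = resolveEscapes

-- Multiline strings: layout on the raw text first, escapes last.
processMultiline :: String -> String
processMultiline = resolveEscapes . intercalate "\n" . dropCommonIndent . lines

-- Wrong order, for contrast: once escapes are resolved, a typed `\n`
-- is indistinguishable from a real line break in the source.
processWrongOrder :: String -> String
processWrongOrder = intercalate "\n" . dropCommonIndent . lines . resolveEscapes
```

For the raw text `"  a\\nb\n  c"` (an escaped `\n` on the first line, then
a real line break), `processMultiline` gives `"a\nb\nc"`, while
`processWrongOrder` leaves the indentation in place and gives
`"  a\nb\n  c"`, because the escaped newline is mistaken for an unindented
source line.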