Help implementing Multiline String Literals
Sebastian Graf
sgraf1337 at gmail.com
Thu Feb 8 14:45:37 UTC 2024
Hi Brandon,
I'm not following all of the details here, but from my naïve
understanding, I would definitely tweak the lexer, do the
post-processing and then have a canonical string representation rather
than waiting until desugaring.
If you like 1.4 best, give it a try. You will see soon enough if some
performance regression test gets worse. It can't hurt to write a few
yourself either.
I don't think that post-processing the strings would incur too much of
a hit compared to compiling those strings and serialising them into an
executable.
I also bet that you can get rid of some of the performance problems
with list fusion.
Cheers,
Sebastian
------ Original Message ------
From: "Brandon Chinn" <brandonchinn178 at gmail.com>
To: ghc-devs at haskell.org
Sent: 04.02.2024 19:24:19
Subject: Help implementing Multiline String Literals
> Hello!
>
>I'm trying to implement #24390
><https://gitlab.haskell.org/ghc/ghc/-/issues/24390>, which implements
>the multiline string literals proposal
><https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0569-multiline-strings.rst>
>(existing work done in wip/multiline-strings
><https://gitlab.haskell.org/ghc/ghc/-/compare/master...wip%2Fmultiline-strings?from_project_id=1&straight=false>).
>I originally suggested adding HsMultilineString to HsLit and
>translating it to HsString in renaming, then Matthew Pickering
>suggested I translate it in desugaring instead. I tried going down this
>approach, but I'm running into two main issues: Escaped characters and
>Overloaded strings.
>
>Apologies in advance for a long email. TL;DR - The best implementation
>I could think of involves a complete rewrite of how strings are lexed
>and modifying HsString instead of adding a new HsMultilineString
>constructor. If this is absolutely crazy talk, please dissuade me from
>this :)
>
>===== Problem 1: Escaped characters =====
>Currently, Lexer.x resolves escaped characters for string literals. In
>the Note [Literal source text], we see that this is intentional;
>HsString should contain a normalized internal representation. However,
>multiline string literals have a post-processing step that requires
>distinguishing between the user typing a newline vs the user typing
>literally a backslash + an `n` (and other things like knowing if a user
>typed in `\&`, which currently goes away in lexing as well).
>
>Fundamentally, the current logic to resolve escaped characters is
>specific to the Lexer monad and operates on a per-character basis. But
>the multiline string literals proposal requires post-processing the
>whole string, then resolving escaped characters all at once.
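To illustrate why the order matters, here is a minimal sketch. The names
(`resolveNewlines`, `stripIndent`, `splitLines`) are hypothetical and not
taken from GHC, and the indentation rule is a simplified stand-in for the
proposal's actual algorithm. It only shows that resolving `\n` before
stripping indentation gives a different result than stripping first.

```haskell
import Data.List (intercalate)

-- Hypothetical helper: resolves only the two-character escape \n.
-- (Simplified: it does not treat an escaped backslash specially.)
resolveNewlines :: String -> String
resolveNewlines ('\\' : 'n' : cs) = '\n' : resolveNewlines cs
resolveNewlines (c : cs)          = c : resolveNewlines cs
resolveNewlines []                = []

-- Split on real newline characters, keeping empty lines.
splitLines :: String -> [String]
splitLines s = case break (== '\n') s of
  (l, [])       -> [l]
  (l, _ : rest) -> l : splitLines rest

-- Simplified stand-in for the proposal's post-processing:
-- drop the common leading-space prefix from every line.
stripIndent :: String -> String
stripIndent s = intercalate "\n" (map (drop n) ls)
  where
    ls      = splitLines s
    indents = [length (takeWhile (== ' ') l) | l <- ls, any (/= ' ') l]
    n       = if null indents then 0 else minimum indents

-- What the user typed: an escaped \n on the first line, then a real newline.
rawText :: String
rawText = "  a\\n b\n  c"

-- Post-process first, resolve escapes last.
stripThenResolve :: String
stripThenResolve = resolveNewlines (stripIndent rawText)

-- Resolve escapes first (as a per-character lexer would), then post-process.
resolveThenStrip :: String
resolveThenStrip = stripIndent (resolveNewlines rawText)
```

In this sketch, `stripThenResolve` sees two source lines with a common
indent of two spaces, while `resolveThenStrip` sees three lines with a
common indent of one, so the two results differ.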
>
>Possible solutions:
>
>(1.1) Duplicate the logic for resolving escaped characters
> * Pro: Leaves normal string lexing untouched
> * Con: Two sources of truth, possibly divergent behaviors between
>multiline and normal strings
>
>(1.2) Stick the post-processed string back into P, then rerun normal
>string lexing to resolve escaped characters
> * Pro: Leaves normal string lexing untouched
> * Con: Seems roundabout, inefficient, and hacky
>
>(1.3) Refactor the resolve-escaped-characters logic to work in both the
>P monad and as a pure function `String -> String`
> * Pro: Reuses same escaped-characters logic for both normal +
>multiline strings
> * Con: Different overall behavior between the two string types:
>Normal string still lexed per-character, Multiline strings would lex
>everything
> * Con: Small refactor of lexing normal strings, which could
>introduce regressions
>
>(1.4) Read entire string (both normal + multiline) with no
>preprocessing (including string gaps or anything, except escaping quote
>delimiters), and define all post-processing steps as pure `String ->
>String` functions
> * Pro: Gets out of monadic code quickly, turns the bulk of the string
>logic into pure code
> * Pro: Processes normal + multiline strings exactly the same
> * Pro: Opens the door for future string behaviors, e.g. raw string
>could do the same "read entire string" logic, and just not do any
>post-processing.
> * Con: Could be less performant
> * Con: Major refactor of lexing normal strings, which could
>introduce regressions
>
>I like solution 1.4 the best, as it generalizes string processing
>behavior the best and is more pipeline-style vs the currently more
>imperative style. But I recognize possible performance or behavior
>regressions are a real thing, so if anyone has any thoughts here, I'd
>love to hear them.
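As a rough sketch of the pipeline style in 1.4, assuming the lexer hands
over the raw text between the quotes: the functions below are hypothetical
(`collapseGaps`, `resolveEscapes`, `processString` are not GHC names) and
cover only a small subset of Haskell's escapes, so this is an illustration
of the shape rather than an implementation.

```haskell
import Data.Char (isSpace)

-- Hypothetical step 1: remove string gaps, i.e. a backslash, some
-- whitespace, and a closing backslash. Escaped backslashes are skipped
-- as a pair so they are never mistaken for the start of a gap.
collapseGaps :: String -> String
collapseGaps ('\\' : '\\' : cs) = '\\' : '\\' : collapseGaps cs
collapseGaps ('\\' : c : cs)
  | isSpace c
  , ('\\' : rest) <- dropWhile isSpace cs = collapseGaps rest
collapseGaps (c : cs) = c : collapseGaps cs
collapseGaps []       = []

-- Hypothetical step 2: resolve a few escapes. Anything this sketch does
-- not know about is passed through unchanged.
resolveEscapes :: String -> String
resolveEscapes ('\\' : c : cs) = case c of
  'n'  -> '\n' : resolveEscapes cs
  't'  -> '\t' : resolveEscapes cs
  '\\' -> '\\' : resolveEscapes cs
  '"'  -> '"'  : resolveEscapes cs
  '&'  -> resolveEscapes cs              -- \& stands for the empty string
  _    -> '\\' : c : resolveEscapes cs   -- unsupported here, left as-is
resolveEscapes (c : cs) = c : resolveEscapes cs
resolveEscapes []       = []

-- The whole pipeline for a normal string. A multiline string would slot
-- its extra post-processing between the two steps; a raw string would
-- skip both.
processString :: String -> String
processString = resolveEscapes . collapseGaps
```

With this shape, normal and multiline strings would differ only in which
pure steps are composed, which is the "processes exactly the same" point
above.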
>
>===== Problem 2: Overloaded strings =====
>Currently, `HsString s` is converted into `HsOverLit (HsIsString s)` in
>the renaming phase. Following Matthew's suggestion of resolving
>multiline string literals in the desugar step, this would mean that
>multiline string literals are post-processed after OverloadedStrings
>has already been applied.
>
>I don't like any of the solutions this approach brings up:
>* Do post processing both when Desugaring HsMultilineString AND when
>Renaming HsMultilineString to HsOverLit - seems wrong to process
>multiline strings in two different phases
>* Add HsIsStringMultiline and post process when desugaring both
>HsMultilineString and HsIsStringMultiline - would ideally like to avoid
>adding a multiline variant of HsIsString
>
>Instead, I propose we throw away the HsMultilineString idea and reuse
>HsString. The multiline syntax would still be preserved in the
>SourceText, and this also leaves the door open for future string
>features. For example, if we went with HsMultilineString, then adding
>raw strings would require adding both HsRawString and
>HsMultilineRawString.
>
>Here are two possible solutions for reusing HsString:
>
>(2.1) Add a HsStringType parameter to HsString
> * HsStringType would define the format of the FastString stored in
>HsString: Normal => processed, Multiline => stores raw string, needs
>post-processing
> * Post processing could occur in desugaring, with or without
>OverloadedStrings
> * Pro: Shows the parsed multiline string before processing in
>-ddump-parsed
> * Con: HsString containing Multiline strings would not contain the
>normalized representation mentioned in Note [Literal source text]
> * Con: Breaking change in the GHC API
>
>(2.2) Post-process multiline strings in lexer
> * Lexer would do all the post processing (for example, in
>conjunction with solution 1.4) and just return a normal HsString
> * Pro: Multiline string is immediately desugared and behaves as
>expected for OverloadedStrings (and any other behaviors of string
>literals, existing or future) for free
> * Pro: HsString would still always contain the normalized
>representation
> * Con: No way of inspecting the raw multiline parse output before
>processing, e.g. via -ddump-parsed
>
>I'm leaning towards solution 2.1, but curious what people's thoughts
>are.
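For concreteness, option 2.1 could look roughly like the sketch below. The
types are simplified stand-ins and not GHC's real definitions (the real
HsString carries a SourceText and a FastString), and the constructor names
and `normalise` are hypothetical.

```haskell
-- Hypothetical tag describing what the stored string contains.
data HsStringType
  = HsStringTypeSingle  -- already processed, normalized representation
  | HsStringTypeMulti   -- raw multiline text, still needs post-processing
  deriving (Eq, Show)

-- Simplified stand-in for the HsString constructor of HsLit.
data Lit = LitString HsStringType String
  deriving (Eq, Show)

-- Hypothetical normalisation step, run wherever 2.1 does its
-- post-processing (e.g. in desugaring). The post-processing function is
-- passed in so the sketch does not commit to a particular algorithm.
normalise :: (String -> String) -> Lit -> Lit
normalise postProcess (LitString HsStringTypeMulti s) =
  LitString HsStringTypeSingle (postProcess s)
normalise _ lit = lit
```

The same function would apply whether or not OverloadedStrings has already
wrapped the literal, at the cost of every consumer of HsString having to
check the tag.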
>
>===== Closing remarks =====
>Again, sorry for the long email. My head is spinning trying to figure
>out this feature. Any help would be greatly appreciated.
>
>As an aside, I last worked on GHC back in 2020 or 2021, and my
>goodness. The Hadrian build is so much smoother (and faster!? Not sure
>if it's just my new laptop though) than what it was last time I touched
>the codebase. Huge thanks to the maintainers, both for the tooling and
>the docs in the wiki. This is a much more enjoyable experience.
>
>Thanks,
>Brandon