Help implementing Multiline String Literals

Sun Feb 4 18:24:19 UTC 2024

Hello!

I'm trying to implement #24390
<https://gitlab.haskell.org/ghc/ghc/-/issues/24390>, which implements
the multiline string literals proposal
<https://github.com/ghc-proposals/ghc-proposals/blob/master/proposals/0569-multiline-strings.rst>
(existing work done in wip/multiline-strings
<https://gitlab.haskell.org/ghc/ghc/-/compare/master...wip%2Fmultiline-strings?from_project_id=1&straight=false>).
I originally suggested adding HsMultilineString to HsLit and translating it
to HsString in renaming, but Matthew Pickering suggested translating it in
desugaring instead. I tried that approach, but I'm running into two main
issues: escaped characters and overloaded strings.

Apologies in advance for a long email. *TL;DR* - The best implementation I
could think of involves a complete rewrite of how strings are lexed and
modifying HsString instead of adding a new HsMultilineString constructor.
If this is absolutely crazy talk, please dissuade me from this :)

===== Problem 1: Escaped characters =====
Currently, Lexer.x resolves escaped characters for string literals. In
the Note [Literal source text], we see that this is intentional; HsString
should contain a normalized internal representation. However, multiline
string literals have a post-processing step that requires distinguishing
between the user typing a newline vs the user typing literally a
backslash + an `n` (and other things like knowing if a user typed in `\&`,
which currently goes away in lexing as well).

Fundamentally, the current logic to resolve escaped characters is specific
to the Lexer monad and operates on a per-character basis. But the multiline
string literals proposal requires post-processing the whole string, then
resolving escaped characters all at once.
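
To make the ordering issue concrete, here is a toy sketch (not GHC code).
`resolveEscapes` and `stripCommonIndent` are hypothetical names; the first
handles only a few escapes, and the second stands in for the proposal's
post-processing by removing indentation shared by all non-blank lines.

```haskell
import Data.Char (isSpace)
import Data.List (intercalate)

-- Toy escape resolver: handles only \n, \\ and \" (GHC handles many more).
resolveEscapes :: String -> String
resolveEscapes ('\\' : 'n' : rest) = '\n' : resolveEscapes rest
resolveEscapes ('\\' : '\\' : rest) = '\\' : resolveEscapes rest
resolveEscapes ('\\' : '"' : rest) = '"' : resolveEscapes rest
resolveEscapes (c : rest) = c : resolveEscapes rest
resolveEscapes [] = []

-- Like 'lines', but keeps a trailing empty line.
splitLines :: String -> [String]
splitLines s = case break (== '\n') s of
  (l, []) -> [l]
  (l, _ : rest) -> l : splitLines rest

-- Toy post-processing: drop the indentation shared by all non-blank lines.
stripCommonIndent :: String -> String
stripCommonIndent s = intercalate "\n" (map (drop n) ls)
  where
    ls = splitLines s
    indents = [length (takeWhile (== ' ') l) | l <- ls, not (all isSpace l)]
    n = if null indents then 0 else minimum indents

-- Order wanted for multiline strings: post-process, then resolve escapes.
postThenEscapes :: String -> String
postThenEscapes = resolveEscapes . stripCommonIndent

-- Order forced by resolving escapes per character in the lexer.
escapesThenPost :: String -> String
escapesThenPost = stripCommonIndent . resolveEscapes
```

Take a body whose first line is two spaces, `a`, a typed backslash + `n`,
then `b`, and whose second line is two spaces and `c` (as a Haskell value,
`"  a\\nb\n  c"`). `postThenEscapes` yields `"a\nb\nc"`, while
`escapesThenPost` yields `"  a\nb\n  c"`: once the escape has become a real
newline, `b` looks like an unindented line and no indentation is stripped.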

Possible solutions:

(1.1) Duplicate the logic for resolving escaped characters
    * Pro: Leaves normal string lexing untouched
    * Con: Two sources of truth, possibly divergent behaviors between
      multiline and normal strings

(1.2) Stick the post-processed string back into P, then rerun normal string
      lexing to resolve escaped characters
    * Pro: Leaves normal string lexing untouched
    * Con: Seems roundabout, inefficient, and hacky

(1.3) Refactor the resolve-escaped-characters logic to work in both the P
      monad and as a pure function `String -> String`
    * Pro: Reuses same escaped-characters logic for both normal + multiline
      strings
    * Con: Different overall behavior between the two string types: Normal
      string still lexed per-character, Multiline strings would lex
      everything
    * Con: Small refactor of lexing normal strings, which could introduce
      regressions

(1.4) Read entire string (both normal + multiline) with no preprocessing
      (including string gaps or anything, except escaping quote delimiters),
      and define all post-processing steps as pure `String -> String`
      functions
    * Pro: Gets out of monadic code quickly, turn bulk of string logic into
      pure code
    * Pro: Processes normal + multiline strings exactly the same
    * Pro: Opens the door for future string behaviors, e.g. raw string
      could do the same "read entire string" logic, and just not do any
      post-processing.
    * Con: Could be less performant
    * Con: Major refactor of lexing normal strings, which could introduce
      regressions

I like solution 1.4 the best, as it generalizes string processing the most
and replaces the current imperative style with a pipeline of pure steps.
But I recognize possible performance or behavior regressions are a real
thing, so if anyone has any thoughts here, I'd love to hear them.
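
As a rough illustration of what 1.4 could look like, here is a sketch that
assumes the lexer hands over the raw body between the quotes. All names
(`collapseGaps`, `resolveEscapes`, `processNormal`, `processMultiline`,
`processRaw`) are hypothetical, only a handful of escapes are covered, and
the multiline layout step is left as a parameter rather than spelled out.

```haskell
import Data.Char (isSpace)

-- Remove string gaps: a backslash, whitespace, then a closing backslash.
-- Backslash pairs such as \\ are skipped together so they are not
-- mistaken for the start of a gap.
collapseGaps :: String -> String
collapseGaps ('\\' : c : rest)
  | isSpace c = case dropWhile isSpace rest of
      '\\' : rest' -> collapseGaps rest'
      -- Malformed gap: left untouched here; a real lexer would report it.
      _ -> '\\' : c : collapseGaps rest
  | otherwise = '\\' : c : collapseGaps rest
collapseGaps (c : rest) = c : collapseGaps rest
collapseGaps [] = []

-- Toy escape resolver: \& disappears, a few escapes are translated.
resolveEscapes :: String -> String
resolveEscapes ('\\' : '&' : rest) = resolveEscapes rest
resolveEscapes ('\\' : 'n' : rest) = '\n' : resolveEscapes rest
resolveEscapes ('\\' : 't' : rest) = '\t' : resolveEscapes rest
resolveEscapes ('\\' : '\\' : rest) = '\\' : resolveEscapes rest
resolveEscapes ('\\' : '"' : rest) = '"' : resolveEscapes rest
resolveEscapes (c : rest) = c : resolveEscapes rest
resolveEscapes [] = []

-- Normal strings: collapse gaps, then resolve escapes.
processNormal :: String -> String
processNormal = resolveEscapes . collapseGaps

-- Multiline strings: same ends, with a layout step in the middle.
processMultiline :: (String -> String) -> String -> String
processMultiline layout = resolveEscapes . layout . collapseGaps

-- A possible future raw string: no post-processing at all.
processRaw :: String -> String
processRaw = id
```

The point of the sketch is that each string flavor becomes a different
composition of the same pure steps, so normal and multiline strings share
one escape resolver by construction.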

===== Problem 2: Overloaded strings =====
Currently, `HsString s` is converted into `HsOverLit (HsIsString s)` in the
renaming phase. Following Matthew's suggestion of resolving multiline
string literals in the desugar step, this would mean that multiline string
literals are post-processed after OverloadedStrings has already been
applied.
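
A toy model of the situation, with stand-in types that only borrow GHC's
constructor names (the real ones carry SourceText, pass indices, and more).
`renameLit` is a hypothetical name, and the `Bool` stands for whether
OverloadedStrings is enabled.

```haskell
-- Toy stand-ins for GHC's AST; not the real definitions.
data Lit
  = HsString String
  | HsMultilineString String -- hypothetical; holds the unprocessed body
  deriving (Eq, Show)

newtype OverLitVal = HsIsString String
  deriving (Eq, Show)

data Expr
  = HsLit Lit
  | HsOverLit OverLitVal
  deriving (Eq, Show)

-- Toy renamer step: with OverloadedStrings on, a plain string literal
-- becomes an overloaded literal. A multiline literal is passed through
-- unchanged, waiting for the desugarer to post-process it.
renameLit :: Bool -> Lit -> Expr
renameLit True (HsString s) = HsOverLit (HsIsString s)
renameLit _ lit = HsLit lit
```

In this model `renameLit True (HsMultilineString body)` stays an `HsLit`,
so it never becomes overloaded unless the renamer also learns about
multiline strings, which is where the two unattractive options below come
from.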

I don't like any of the solutions this approach brings up:
* Do post processing both when Desugaring HsMultilineString AND when
  Renaming HsMultilineString to HsOverLit - seems wrong to process multiline
  strings in two different phases
* Add HsIsStringMultiline and post process when desugaring both
  HsMultilineString and HsIsStringMultiline - would ideally like to avoid
  adding a multiline variant of HsIsString

Instead, I propose we throw away the HsMultilineString idea and reuse
HsString. The multiline syntax would still be preserved in the SourceText,
and this also leaves the door open for future string features. For example,
if we went with HsMultilineString, then adding raw strings would require
adding both HsRawString and HsMultilineRawString.

Here are two possible solutions for reusing HsString:

(2.1) Add a HsStringType parameter to HsString
    * HsStringType would define the format of the FastString stored in
      HsString: Normal => processed, Multiline => stores raw string, needs
      post-processing
    * Post processing could occur in desugaring, with or without
      OverloadedStrings
    * Pro: Shows the parsed multiline string before processing in
      -ddump-parsed
    * Con: HsString containing Multiline strings would not contain the
      normalized representation mentioned in Note [Literal source text]
    * Con: Breaking change in the GHC API

(2.2) Post-process multiline strings in lexer
    * Lexer would do all the post processing (for example, in conjunction
      with solution 1.4) and just return a normal HsString
    * Pro: Multiline string is immediately desugared and behaves as
      expected for OverloadedStrings (and any other behaviors of string
      literals, existing or future) for free
    * Pro: HsString would still always contain the normalized
      representation
    * Con: No way of inspecting the raw multiline parse output before
      processing, e.g. via -ddump-parsed

I'm leaning towards solution 2.1, but curious what people's thoughts are.
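
For reference, a minimal sketch of the shape 2.1 might take, using toy
types rather than GHC's (a plain `String` stands in for FastString, and
SourceText is omitted). `dsString` is a hypothetical name, and the
post-processing function is passed in as a parameter instead of being
defined here.

```haskell
-- Tag describing how the stored string should be interpreted.
data HsStringType
  = Normal    -- already processed by the lexer
  | Multiline -- raw body, still needs post-processing
  deriving (Eq, Show)

-- Toy stand-in for the HsString constructor with the extra tag.
data Lit = HsString HsStringType String
  deriving (Eq, Show)

-- Toy desugaring step: only Multiline strings are post-processed.
dsString :: (String -> String) -> Lit -> String
dsString _ (HsString Normal s) = s
dsString postProcess (HsString Multiline s) = postProcess s
```

Because the tag lives on HsString itself, a later string flavor would be
one more HsStringType constructor rather than a new literal constructor.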

===== Closing remarks =====
Again, sorry for the long email. My head is spinning trying to figure out
this feature. Any help would be greatly appreciated.

As an aside, I last worked on GHC back in 2020 or 2021, and my goodness.
The Hadrian build is so much smoother (and faster!? Not sure if it's just
my new laptop though) than what it was last time I touched the codebase.
Huge thanks to the maintainers, both for the tooling and the docs in the
wiki. This is a much more enjoyable experience.

Thanks,
Brandon