[Git][ghc/ghc][wip/T25375] Fix CRLF in multiline strings (#25375)

Wed Oct 16 04:17:02 UTC 2024


Brandon Chinn pushed to branch wip/T25375 at Glasgow Haskell Compiler / GHC


Commits:
838b5c67 by Brandon Chinn at 2024-10-15T21:15:50-07:00
Fix CRLF in multiline strings (#25375)

- - - - -


6 changed files:

- .gitattributes
- compiler/GHC/Parser/String.hs
- docs/users_guide/exts/multiline_strings.rst
- + testsuite/tests/parser/should_run/T25375.hs
- + testsuite/tests/parser/should_run/T25375.stdout
- testsuite/tests/parser/should_run/all.T


Changes:

=====================================
.gitattributes
=====================================
@@ -2,3 +2,4 @@
 # don't convert anything on checkout
 * text=auto eol=lf
 mk/win32-tarballs.md5sum text=auto eol=LF
+testsuite/tests/parser/should_run/T25375.hs text=auto eol=crlf


=====================================
compiler/GHC/Parser/String.hs
=====================================
@@ -262,6 +262,7 @@ lexMultilineString = lexStringWith processChars processChars
     processChars =
           collapseGaps             -- Step 1
       >>> expandLeadingTabs        -- Step 3
+      >>> normalizeEOL
       >>> rmCommonWhitespacePrefix -- Step 4
       >>> collapseOnlyWsLines      -- Step 5
       >>> rmFirstNewline           -- Step 7a
@@ -280,6 +281,18 @@ lexMultilineString = lexStringWith processChars processChars
             [] -> []
        in go 0
 
+    -- Normalize line endings to LF. The spec dictates that lines should be
+    -- split on EOL and rejoined with LF always, even if originally CRLF. But
+    -- because we aren't actually splitting/rejoining, we'll manually convert
+    -- CRLF here
+    normalizeEOL :: HasChar c => [c] -> [c]
+    normalizeEOL =
+      let go = \case
+            Char '\r' : c@(Char '\n') : cs -> c : go cs
+            c : cs -> c : go cs
+            [] -> []
+       in go
+
     rmCommonWhitespacePrefix :: HasChar c => [c] -> [c]
     rmCommonWhitespacePrefix cs0 =
       let commonWSPrefix = getCommonWsPrefix (map getChar cs0)
@@ -354,14 +367,14 @@ the same behavior as HsString, which contains the normalized string
 
 The canonical steps for post processing a multiline string are:
 1. Collapse string gaps
-2. Split the string by newlines
+2. Split the string by EOL
 3. Convert leading tabs into spaces
     * In each line, any tabs preceding non-whitespace characters are replaced with spaces up to the next tab stop
 4. Remove common whitespace prefix in every line except the first (see below)
 5. If a line contains only whitespace, remove all of the whitespace
 6. Join the string back with `\n` delimiters
-7a. If the first character of the string is a newline, remove it
-7b. If the last character of the string is a newline, remove it
+7a. If the first character of the string is an EOL, remove it
+7b. If the last character of the string is an EOL, remove it
 8. Interpret escaped characters
 
 The common whitespace prefix can be informally defined as "The longest
@@ -372,7 +385,7 @@ It's more precisely defined with the following algorithm:
 
 1. Take a list representing the lines in the string
 2. Ignore the following elements in the list:
-    * The first line (we want to ignore everything before the first newline)
+    * The first line (we want to ignore everything before the first EOL)
     * Empty lines
     * Lines with only whitespace characters
 3. Calculate the longest prefix of whitespace shared by all lines in the remaining list


=====================================
docs/users_guide/exts/multiline_strings.rst
=====================================
@@ -14,7 +14,7 @@ With this extension, GHC now recognizes multiline string literals with ``"""`` d
 
 Normal string literals are lexed, then string gaps are collapsed, then escape characters are resolved. Multiline string literals add the following post-processing steps between collapsing string gaps and resolving escape characters:
 
-#. Split the string by newlines
+#. Split the string by EOL
 
 #. Replace leading tabs with spaces up to the next tab stop
 
@@ -22,9 +22,11 @@ Normal string literals are lexed, then string gaps are collapsed, then escape ch
 
 #. If a line only contains whitespace, remove all of the whitespace
 
-#. Join the string back with ``\n`` delimiters
+#. Join the string back with ``\n`` delimiters -- even if file uses CRLF
 
-#. If the first character of the string is a newline, remove it
+#. If the first character of the string is an EOL, remove it
+
+#. If the last character of the string is an EOL, remove it
 
 Examples
 ~~~~~~~~


=====================================
testsuite/tests/parser/should_run/T25375.hs
=====================================
@@ -0,0 +1,38 @@
+{-# LANGUAGE MultilineStrings #-}
+
+str1 = unlines
+  [ "aaa"
+  , "bbb"
+  , "ccc"
+  ]
+
+str2 = "aaa\n\
+       \bbb\n\
+       \ccc\n"
+
+str3 = """
+       aaa
+       bbb
+       ccc
+       """
+
+str4 = """
+
+       aaa
+       bbb
+       ccc
+
+       """
+
+str5 = """
+       aaa
+       bbb
+       ccc\n
+       """
+
+main = do
+  print str1
+  print str2
+  print str3
+  print str4
+  print str5


=====================================
testsuite/tests/parser/should_run/T25375.stdout
=====================================
@@ -0,0 +1,5 @@
+"aaa\nbbb\nccc\n"
+"aaa\nbbb\nccc\n"
+"aaa\nbbb\nccc"
+"\naaa\nbbb\nccc\n"
+"aaa\nbbb\nccc\n"


=====================================
testsuite/tests/parser/should_run/all.T
=====================================
@@ -23,3 +23,4 @@ test('RecordDotSyntax5', normal, compile_and_run, [''])
 test('ListTuplePunsConstraints', extra_files(['ListTuplePunsConstraints.hs']), ghci_script, ['ListTuplePunsConstraints.script'])
 test('MultilineStrings', normal, compile_and_run, [''])
 test('MultilineStringsOverloaded', normal, compile_and_run, [''])
+test('T25375', normal, compile_and_run, [''])
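
For readers who want to try the new normalization step outside the lexer, here is a minimal standalone sketch. It assumes plain `String` input rather than the `HasChar c => [c]` stream that `GHC.Parser.String` operates on; the `main` function and sample inputs are illustrative and not part of the commit.

```haskell
module Main where

-- Sketch of the CRLF normalization from the patch, specialised to String.
-- Assumption: plain characters stand in for the lexer's HasChar elements.
-- Every "\r\n" pair becomes a single '\n'; a lone '\r' is left untouched,
-- matching the pattern match in the patched normalizeEOL.
normalizeEOL :: String -> String
normalizeEOL ('\r' : '\n' : cs) = '\n' : normalizeEOL cs
normalizeEOL (c : cs)           = c : normalizeEOL cs
normalizeEOL []                 = []

main :: IO ()
main = do
  print (normalizeEOL "aaa\r\nbbb\r\nccc")  -- prints "aaa\nbbb\nccc"
  print (normalizeEOL "lone\rCR")           -- prints "lone\rCR"
```

In the patch this step runs after `expandLeadingTabs` and before `rmCommonWhitespacePrefix`, so the later steps only ever see LF line endings.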



View it on GitLab: https://gitlab.haskell.org/ghc/ghc/-/commit/838b5c679d02ed26e2fbca69c1b1b7b0274ebf84
