[RFC] Support Unicode characters in instance Show String
Oleg Grenrus
oleg.grenrus at iki.fi
Thu Jul 8 15:53:38 UTC 2021
Here is a simple patch, which I hope is close to what

> 1. Modify a few guards in GHC.Show.showLitChar to not escape _readable_
> Unicode characters out of the range of ASCII.

of a proposed change will look like:
diff --git a/libraries/base/GHC/Show.hs b/libraries/base/GHC/Show.hs
index 84077e473b..24569168d4 100644
--- a/libraries/base/GHC/Show.hs
+++ b/libraries/base/GHC/Show.hs
@@ -364,7 +364,10 @@ showCommaSpace = showString ", "
-- > showLitChar '\n' s = "\\n" ++ s
--
showLitChar :: Char -> ShowS
-showLitChar c s | c > '\DEL' = showChar '\\' (protectEsc isDec (shows (ord c)) s)
+showLitChar c s | c > '\DEL' =
+ if isPrint c
+ then showChar c s
+ else showChar '\\' (protectEsc isDec (shows (ord c)) s)
showLitChar '\DEL' s = showString "\\DEL" s
showLitChar '\\' s = showString "\\\\" s
showLitChar c s | c >= ' ' = showChar c s
@@ -380,6 +383,13 @@ showLitChar c s = showString ('\\' : asciiTab!!ord c) s
-- I've done manual eta-expansion here, because otherwise it's
-- impossible to stop (asciiTab!!ord) getting floated out as an MFE
+-- Local definition of isPrint to avoid fighting with cycles for now.
+isPrint :: Char -> Bool
+isPrint c = iswprint (ord c) /= 0
+
+foreign import ccall unsafe "u_iswprint"
+ iswprint :: Int -> Int
+
showLitString :: String -> ShowS
-- | Same as 'showLitChar', but for strings
-- It converts the string to a string using Haskell escape conventions
I applied it to the ghc-8.10 branch:
% _build/stage1/bin/ghc --interactive
GHCi, version 8.10.5: https://www.haskell.org/ghc/ :? for help
Prelude> "äiti"
"äiti"
Prelude> "мир"
"мир"
Prelude> print "мир"
"мир"
Prelude> "😀"
"😀"
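For anyone who wants to try this behaviour without building GHC, the changed guard can be approximated in user code. This is only a sketch under assumptions: showLitCharU and showStringU are hypothetical names, not part of base, and Data.Char.isPrint stands in for the direct u_iswprint call used in the patch.

```haskell
import Data.Char (isPrint, showLitChar)

-- Hypothetical helper (not in base): like showLitChar, but printable
-- characters above '\DEL' are emitted as-is instead of as numeric escapes.
showLitCharU :: Char -> ShowS
showLitCharU c s
  | c > '\DEL' && isPrint c = showChar c s
  | otherwise               = showLitChar c s

-- Hypothetical counterpart of `show` for String, structured like
-- GHC.Show.showLitString: the double quote is escaped at string level.
showStringU :: String -> String
showStringU str = '"' : go str "\""
  where
    go []         s = s
    go ('"' : cs) s = showString "\\\"" (go cs s)
    go (c : cs)   s = showLitCharU c (go cs s)
```

With this, putStrLn (showStringU "мир") prints the quoted string without numeric escapes, while control characters such as '\n' are still escaped.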
Then I ran the test suites of aeson, dhall and pandoc.
The aeson test suite passed.
The dhall test suites passed too.
However, the pandoc test suite failed:
78 out of 2819 tests failed (35.88s)
An example failure is:
3587.md
#1:
FAIL (0.01s)
--- test/command/3587.md
+++ pandoc -f latex -t native
+ 1 [Para [Str "1 m",Space,Str "is",Space,Str
"equal",Space,Str "to",Space,Str "1000 mm"]]
- 1 [Para [Str "1\160m",Space,Str "is",Space,Str
"equal",Space,Str "to",Space,Str "1000\160mm"]]
Str is a constructor of the Inline type and takes a Text:
data Inline = Str Text | ...
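The character behind this failure is U+00A0 (no-break space, code 160). It passes the printability check, so with the patch it is shown literally and looks just like an ordinary space in test output. A small sketch for spotting such characters in expected outputs follows; the function names are hypothetical, and Data.Char.isPrint is assumed to agree with u_iswprint here.

```haskell
import Data.Char (isPrint, isSpace)

-- Hypothetical helper: characters above ASCII that an isPrint-based guard
-- would stop escaping even though they render as blank space.
blankWhenUnescaped :: Char -> Bool
blankWhenUnescaped c = c > '\DEL' && isPrint c && isSpace c

-- Positions and code points of such characters in a string.
findBlank :: String -> [(Int, Int)]
findBlank str =
  [ (i, fromEnum c) | (i, c) <- zip [0 ..] str, blankWhenUnescaped c ]
```

Running findBlank over the expected strings of a golden test gives a rough idea of where the old and new output would differ invisibly.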
As discussed in the GHC issue [1], the Show instances of Text and
ByteString piggyback on the String instance. Bodigrim said that Text will
eventually migrate to behave like the new Show String [2], so this issue
will resurface.
Please explain the compatibility story. How should library writers write
code (in test suites) that relies on Show String or Show Text, so
that it supports GHC base (and/or text) versions on both sides of
this breaking change?
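One possible shape for such a story, offered only as a sketch and not something settled in this thread: test suites could stop relying on the Show instance for golden output and pin the escaping themselves. The helper below (showAscii is a hypothetical name) is intended to reproduce the current escaping for String independently of the base version, which is roughly what the showEscaped mentioned in the proposal would provide.

```haskell
import Data.Char (isDigit, ord, showLitChar)

-- Hypothetical helper: escape every character above '\DEL' numerically,
-- as `show` for String does before the proposed change.
showAscii :: String -> String
showAscii str = '"' : go str
  where
    go []         = "\""
    go ('"' : cs) = "\\\"" ++ go cs
    go (c : cs)
      | c > '\DEL' = '\\' : shows (ord c) (protect (go cs))
      | otherwise  = showLitChar c (go cs)  -- ASCII handling is unchanged

    -- A numeric escape followed by a digit needs "\&" to stay unambiguous.
    protect s@(d : _) | isDigit d = "\\&" ++ s
    protect s = s
```

A test that compares against showAscii output instead of show output would not change when the Show instance does. It does not help with derived Show instances of larger types, which call the String instance internally.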
I agree with Julian that the engineering effort required for migration
across the ecosystem (even just the open-source part) is non-trivial.
Having a good plan would hopefully make it easier to accept that cost.
The fact that this change is not detectable at compile time
makes me very anxious about it, even though I don't disagree with the
motivation.
I have very little idea if and where I depend on Show String behavior.
It would also be interesting to see the test-suite results for all of
Stackage, but I leave that for someone else to do.
- Oleg
[1]: https://gitlab.haskell.org/ghc/ghc/-/issues/20027
[2]: https://gitlab.haskell.org/ghc/ghc/-/issues/20027#note_363519
On 8.7.2021 15.25, Julian Ospald wrote:
> Hi,
>
> I think most seemed to agree on the motivation, but would it be a lot
> of work to ping a few large open-source/industry projects about this
> and get a feel for what they think or how much effort they expect a
> migration to be? I'm afraid that we might take this too lightly and
> possibly cause a lot of engineering effort here. Our expectations of how,
> or how often, people use "show" might or might not be accurate.
>
> I'm aware of e.g. the cardano wallet test suite (open source) and
> other cardano projects that are very large open-source codebases and
> may be affected.
>
> CCing duncan
>
> On July 8, 2021 10:11:28 AM UTC, Kai Ma <justksqsf at gmail.com> wrote:
>
> Hi all
>
> Two weeks ago, I proposed “Support Unicode characters in instance Show
> String” [0] in the GHC issue tracker, and chessai asked me to post it
> here for wider feedback. The proposal posted here is edited to reflect
> new ideas proposed and insights accumulated over the days:
>
> 1. (Proposal) The proposal itself is now modeled after Python.
> 2. (Alternative Options) Alternative 2 is the original proposal.
> 3. (Downsides) New. About breakage.
> 4. (Prior Art) New.
> 5. (Unresolved Problems) New. Included for discussion.
>
> Even though I wanted to summarize everything here, some insightful
> comments are perhaps not included or misunderstood. These original
> comments can be found at the original feature request.
>
> [0] https://gitlab.haskell.org/ghc/ghc/-/issues/20027
>
>
> Motivation
> ------------------------------------------------------------------------
> Unicode has been widely adopted and people around the world rely on
> Unicode to write in their native languages. Haskell, however, has been
> stuck in ASCII, and escapes all non-ASCII characters in String's
> instance of the Show class, despite the fact that each element of a
> String is typically a Unicode code point, and putStrLn actually works as
> expected. Consider the following examples:
>
> ghci> print "Hello, 世界"
> "Hello, \19990\30028"
>
> ghci> print "Hello, мир"
> "Hello, \1084\1080\1088"
>
> ghci> print "Hello, κόσμος"
> "Hello, \954\972\963\956\959\962"
>
> ghci> "Hello, 世界" -- ghci calls `show`, so string literals are also escaped
> "Hello, \19990\30028"
>
> ghci> "😀" -- Not only human scripts, but also emojis!
> "\128512"
>
>
> This status quo is unsatisfactory for a number of reasons:
>
> 1. Even though it's small, it somehow creates an unwelcoming atmosphere
> for native speakers of languages whose scripts are not representable
> in ASCII.
> 2. This is an actual annoyance when debugging localized software, or
> strings with emojis.
> 3. Following 1, Haskell teachers are forced to use other languages
> instead of the students' mother tongues, or to rely on I/O functions
> like putStrLn, creating a rather unnecessary burden.
> 4. Other string types, like Text [1], rely on this Show instance.
>
> Moreover, `read` can already handle Unicode strings today, so relaxing
> constraints on `show` doesn't affect `read . show == id`.
>
>
> Proposal
> ------------------------------------------------------------------------
> It's proposed here to change the Show instance of String, to achieve the following output:
>
> ghci> print "Hello, 世界"
> "Hello, 世界"
>
> ghci> print "Hello, мир"
> "Hello, мир"
>
> ghci> print "Hello, κόσμος"
> "Hello, κόσμος"
>
> ghci> "Hello, 世界"
> "Hello, 世界"
>
> ghci> "😀"
> "😀"
>
> More concretely, it means:
>
> 1. Modify a few guards in GHC.Show.showLitChar to not escape _readable_
> Unicode characters out of the range of ASCII.
> 2. Provide a function showEscaped or newtype Escaped = Escaped String to
> obtain the current escaping behavior, in case anyone wants the
> current behavior back.
>
> This proposal isn't about unescaping everything, but only readable
> Unicode characters. u_iswprint (GHC.Unicode.isPrint) seems to do the
> job, and indeed, there was a similar proposal before [2]. In summary,
> the behavior is similar to what Python `repr` does.
>
>
> Alternative Options
> ------------------------------------------------------------------------
> 1. Always use putStrLn.
>
> This is viable today but unsatisfactory as it requires stdout. In
> some cases, stdout is not accessible, e.g. Telegram or Discord bots.
>
> 2. Don't escape anything.
>
> `show` itself refrains from escaping most characters, and lets
> ghci do the job instead.
>
> 3. Customize ghci instead.
>
> ghci intercepts output strings and checks if they can be converted
> back to readable characters. This potentially allows for better
> compatibility with a variety of strangely behaving terminals, and
> finer-grained user control.
>
> Tom Ellis proposed `-interactive-print`-based solutions in the
> comment section.
>
> 4. A new language extension, e.g. ShowStringUnicode.
>
> Proposed by Julian Ospald. When enabled, readable Unicode characters
> are not escaped, and the extension is enabled by default in ghci. There
> are concerns about how this would affect cross-module behavior.
>
>
> Downsides
> ------------------------------------------------------------------------
> This is definitely a breaking change, but the breakage, to our current
> understanding, is limited.
>
> First, use of `show` in production code is discouraged. Even if someone
> really does that, the breakage only happens when one tries to send the
> "serialized" data over the wire:
>
> Suppose Machine A `show`-ed a string and saved it into a UTF-8-encoded
> file, and sends it to Machine B, which expects another encoding. This
> would be surprising for those who are used to the old behavior.
>
> Second, though the breakage is not likely to be catastrophic for correct
> production code, test suites could be badly affected, as pointed out by
> Oleg Grenrus and vdukhovni in the comment section. Some test suites
> compare `show` results with expected results. vdukhovni further
> commented that Haskell escapes are not universally supported by
> non-Haskell tools, so the impact would be confined to Haskell.
>
>
> Prior Art
> ------------------------------------------------------------------------
> Python has supported Unicode natively since version 3. Python's approach
> is intuitive and capable. Its `repr`, which is equivalent to Haskell's
> `show`, automatically escapes unreadable characters, but leaves readable
> characters unescaped. The criteria for "readable" can be found in
> CPython's code [3]. If we were to realize this proposal, Python could
> be a source of inspiration.
>
>
> Unresolved Problems
> ------------------------------------------------------------------------
> There are some currently unresolved (not discussed enough) issues.
>
> + Locales.
>
> What if the specified locale does not support Unicode? Hécate
> Moonlight pointed out PEP-538 [4] could be a reference.
>
> + Unicode versions.
>
> Javran Cheng pointed out u_iswprint is generated from a Unicode table,
> which is manually updated. This raises a concern that the definition
> of "printable" characters could change from version to version.
>
> + Definition of "readable".
>
> Unicode already defines "printability". It's good, but it is not
> necessarily what we want here.
>
> - Should we support RTL?
> - Should we design a Haskell-specific definition of readability, to
> avoid new Unicode versions silently introducing breakage?
>
> (More?)
>
> Some issues here perhaps require better answers to: What is our
> expectation of Show? Where should it be used? Should we expect it to
> break on every Unicode update?
>
>
> [1] https://hackage.haskell.org/package/text-1.2.4.1/docs/src/Data.Text.Show.html#line-37
> [2] https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122874.html
> [3] https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Objects/unicodectype.c#L147
> [4] https://www.python.org/dev/peps/pep-0538/
> ------------------------------------------------------------------------
> Libraries mailing list
> Libraries at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries