<html><head></head><body>Hi,<br><br>I think most seemed to agree on the motivation, but would it be a lot of work to ping a few large open-source/industry projects about this and get a feel for what they think, or how much effort they would expect a migration to take? I'm afraid that we might take this too lightly and possibly cause a lot of engineering effort here. Our expectations of how, or how often, people use "show" might or might not be accurate.<br><br>I'm aware of e.g. the cardano wallet test suite (open source) and other cardano projects that are very large open-source codebases and may be affected. <br><br>CCing duncan<br><br><div class="gmail_quote">On July 8, 2021 10:11:28 AM UTC, Kai Ma <justksqsf@gmail.com> wrote:<blockquote class="gmail_quote" style="margin: 0pt 0pt 0pt 0.8ex; border-left: 1px solid rgb(204, 204, 204); padding-left: 1ex;">
<pre class="k9mail">Hi all<br><br>Two weeks ago, I proposed “Support Unicode characters in instance Show<br>String” [0] in the GHC issue tracker, and chessai asked me to post it<br>here for wider feedback. The proposal posted here is edited to reflect<br>new ideas proposed and insights accumulated over the days:<br><br>1. (Proposal) The proposal itself is now modeled after Python.<br>2. (Alternative Options) Alternative 2 is the original proposal.<br>3. (Downsides) New. About breakage.<br>4. (Prior Art) New.<br>5. (Unresolved Problems) New. Included for discussion.<br><br>Even though I wanted to summarize everything here, some insightful<br>comments may have been omitted or misunderstood. The original<br>comments can be found at the original feature request.<br><br>[0] <a href="https://gitlab.haskell.org/ghc/ghc/-/issues/20027">https://gitlab.haskell.org/ghc/ghc/-/issues/20027</a><br><br><br>Motivation<hr>Unicode has been widely adopted and people around the world rely on<br>Unicode to write in their native languages. Haskell, however, has been<br>stuck in ASCII, and escapes all non-ASCII characters in the String<br>instance of the Show class, despite the fact that each element of a<br>String is typically a Unicode code point, and putStrLn actually works as<br>expected. Consider the following examples:<br><br> ghci> print "Hello, 世界"<br> "Hello, \19990\30028"<br> <br> ghci> print "Hello, мир"<br> "Hello, \1084\1080\1088"<br> <br> ghci> print "Hello, κόσμος"<br> "Hello, \954\972\963\956\959\962"<br> <br> ghci> "Hello, 世界" -- ghci calls `show`, so string literals are also escaped<br> "Hello, \19990\30028"<br> <br> ghci> "😀" -- Not only human scripts, but also emojis!<br> "\128512"<br><br><br>This status quo is unsatisfactory for a number of reasons:<br><br>1. Even though it's small, it somehow creates an unwelcoming atmosphere<br> for native speakers of languages whose scripts are not representable<br> in ASCII.<br>2. 
This is an actual annoyance when debugging localized software or<br> strings with emojis.<br>3. Following 1, Haskell teachers are forced to use other languages<br> instead of the students' mother tongues, or to rely on I/O functions<br> like putStrLn, creating a rather unnecessary burden.<br>4. Other string types, like Text [1], rely on this Show instance.<br><br>Moreover, `read` can already handle Unicode strings today, so relaxing<br>constraints on `show` doesn't affect `read . show == id`.<br><br><br>Proposal<hr>It's proposed here to change the Show instance of String, to achieve the following output:<br><br> ghci> print "Hello, 世界"<br> "Hello, 世界"<br> <br> ghci> print "Hello, мир"<br> "Hello, мир"<br> <br> ghci> print "Hello, κόσμος"<br> "Hello, κόσμος"<br> <br> ghci> "Hello, 世界"<br> "Hello, 世界"<br> <br> ghci> "😀"<br> "😀"<br><br>More concretely, it means:<br><br>1. Modify a few guards in GHC.Show.showLitChar to not escape _readable_<br> Unicode characters outside the ASCII range.<br>2. Provide a function showEscaped or newtype Escaped = Escaped String to<br> obtain the current escaping behavior, in case anyone wants the<br> current behavior back.<br><br>This proposal isn't about unescaping everything, but only readable<br>Unicode characters. u_iswprint (GHC.Unicode.isPrint) seems to do the<br>job, and indeed, there was a similar proposal before [2]. In summary,<br>the behavior is similar to what Python's `repr` does.<br><br><br>Alternative Options<hr>1. Always use putStrLn.<br><br> This is viable today but unsatisfactory as it requires stdout. In<br> some cases, stdout is not accessible, e.g. Telegram or Discord bots.<br><br>2. Don't escape anything.<br><br> `show` itself refrains from escaping most of the characters, and lets<br> ghci do the job instead.<br><br>3. Customize ghci instead.<br><br> ghci intercepts output strings and checks if they can be converted<br> back to readable characters. 
This potentially allows for better<br> compatibility with a variety of strangely behaving terminals, and<br> finer-grained user control.<br><br> Tom Ellis proposed `-interactive-print`-based solutions in the<br> comment section.<br><br>4. A new language extension, e.g. ShowStringUnicode.<br><br> Proposed by Julian Ospald. When enabled, readable Unicode characters<br> are not escaped, and this is enabled by default in ghci. There are<br> concerns about how this would affect cross-module behavior.<br><br><br>Downsides<hr>This is definitely a breaking change, but the breakage, to our current<br>understanding, is limited.<br><br>First, use of `show` in production code is discouraged. Even if someone<br>really does that, the breakage only happens when one tries to send the<br>"serialized" data over the wire:<br><br>Suppose Machine A `show`s a string, saves it into a UTF-8-encoded<br>file, and sends it to Machine B, which expects another encoding. This<br>would be surprising for those who are used to the old behavior.<br><br>Second, though the breakage is not likely to be catastrophic for correct<br>production code, test suites could be badly affected, as pointed out by<br>Oleg Grenrus and vdukhovni in the comment section. Some test suites<br>compare `show` results with expected results. vdukhovni further<br>commented that Haskell escapes are not universally supported by<br>non-Haskell tools, so the impact would be confined to Haskell.<br><br><br>Prior Art<hr>Python has supported Unicode natively since version 3. Python's approach is<br>intuitive and capable. Its `repr`, which is equivalent to Haskell's<br>`show`, automatically escapes unreadable characters, but leaves readable<br>characters unescaped. The criteria for "readable" can be found in<br>CPython's code [3]. 
If we were to realize this proposal, Python could<br>be a source of inspiration.<br><br><br>Unresolved Problems<hr>There are some currently unresolved (not discussed enough) issues.<br><br>+ Locales.<br><br> What if the specified locale does not support Unicode? Hécate<br> Moonlight pointed out PEP-538 [4] could be a reference.<br><br>+ Unicode versions.<br><br> Javran Cheng pointed out u_iswprint is generated from a Unicode table,<br> which is manually updated. This raises a concern that the definition<br> of "printable" characters could change from version to version.<br><br>+ Definition of "readable".<br><br> Unicode already defines "printability". It's good, but it is not<br> necessarily what we want here.<br><br> - Should we support RTL?<br> - Should we design a Haskell-specific definition of readability, to<br> avoid new Unicode versions silently introducing breakage?<br><br>(More?)<br><br>Some issues here perhaps require better answers to: What is our<br>expectation of Show? Where should it be used? 
Should we expect it to<br>break on every Unicode update?<br><br><br>[1] <a href="https://hackage.haskell.org/package/text-1.2.4.1/docs/src/Data.Text.Show.html#line-37">https://hackage.haskell.org/package/text-1.2.4.1/docs/src/Data.Text.Show.html#line-37</a><br>[2] <a href="https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122874.html">https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122874.html</a><br>[3] <a href="https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Objects/unicodectype.c#L147">https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Objects/unicodectype.c#L147</a><br>[4] <a href="https://www.python.org/dev/peps/pep-0538/">https://www.python.org/dev/peps/pep-0538/</a><hr>Libraries mailing list<br>Libraries@haskell.org<br><a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries">http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries</a><br></pre></blockquote></div></body></html>