[RFC] Support Unicode characters in instance Show String

Thu Jul 8 17:20:43 UTC 2021

It would also be good to have a summary of the previous discussion that the OP
kindly linked [1], e.g. the comment by David Turner [2]:

> One of the most visible uses of Show is that it's how values are shown in
> GHCi. As mentioned earlier in this thread, if you're teaching in a
> non-ASCII language then the user experience is pretty poor.
>
> On the other hand, I see Show (like .ToString() in C# etc.) as a debugging
> tool: not for seriously robust serialisation but useful if you need to dump
> a value into a log message or email or similar. And in that situation it's
> very useful if it sticks to ASCII: non-ASCII content just isn't resilient
> enough to being passed around the network, truncated and generally
> mutilated on the way through.
>
> These are definitely two different concerns and they pull in opposite
> directions in this discussion. It's a matter of opinion which you think is
> more important. Me, I think the latter, but then I do a lot of logging and
> speak a language that fits into ASCII. YMMV!

This proposal is motivated by the first point, but doesn't mention debugging
other than

> 2. This is an actual annoyance during debugging localized software, or
>    strings with emojis

which I don't agree with.

For example, look at the failing pandoc test case in my previous message.
\160 is a non-breaking space, which looks like a normal space when rendered
normally. I have had my share of bad experiences with it. So, indeed, YMMV.
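
To make that concrete, here is a minimal sketch of what the current
Show String gives you for such a value. The names sample, shownSample and
oddSpaces are mine, made up purely for illustration; only show, isSpace and
ord come from base.

```haskell
import Data.Char (isSpace, ord)

-- "1", then U+00A0 (no-break space), then "m", as in the pandoc test case.
sample :: String
sample = "1\160m"

-- With today's Show String the no-break space is rendered as the
-- escape \160, so it cannot be mistaken for an ordinary space.
shownSample :: String
shownSample = show sample

-- Illustrative helper: code points of characters that count as
-- whitespace but are not the plain ASCII space.
oddSpaces :: String -> [Int]
oddSpaces s = [ord c | c <- s, isSpace c, c /= ' ']
```

With an unescaped rendering, shownSample and the rendering of "1 m" would be
hard to tell apart by eye, which is exactly the debugging problem I mean.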

- Oleg

[1]: https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122874.html
[2]: https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122899.html

On 8.7.2021 18.53, Oleg Grenrus wrote:
>
> Here is a simple patch, which I hope is close to what item
>
> 1. Modify a few guards in GHC.Show.showLitChar to not escape _readable_
>    Unicode characters out of the range of ASCII.
>
> of the proposed change will look like:
>
>     diff --git a/libraries/base/GHC/Show.hs b/libraries/base/GHC/Show.hs
>     index 84077e473b..24569168d4 100644
>     --- a/libraries/base/GHC/Show.hs
>     +++ b/libraries/base/GHC/Show.hs
>     @@ -364,7 +364,10 @@ showCommaSpace = showString ", "
>      -- > showLitChar '\n' s  =  "\\n" ++ s
>      --
>      showLitChar                :: Char -> ShowS
>     -showLitChar c s | c > '\DEL' =  showChar '\\' (protectEsc isDec (shows (ord c)) s)
>     +showLitChar c s | c > '\DEL' =
>     +    if isPrint c
>     +    then showChar c s
>     +    else  showChar '\\' (protectEsc isDec (shows (ord c)) s)
>      showLitChar '\DEL'         s =  showString "\\DEL" s
>      showLitChar '\\'           s =  showString "\\\\" s
>      showLitChar c s | c >= ' '   =  showChar c s
>     @@ -380,6 +383,13 @@ showLitChar c              s =  showString ('\\' : asciiTab!!ord c) s
>              -- I've done manual eta-expansion here, because otherwise it's
>              -- impossible to stop (asciiTab!!ord) getting floated out as an MFE
>     
>     +-- Local definition of isPrint to avoid fighting with cycles for now.
>     +isPrint                 :: Char -> Bool
>     +isPrint    c = iswprint (ord c) /= 0
>     +
>     +foreign import ccall unsafe "u_iswprint"
>     +  iswprint :: Int -> Int
>     +
>      showLitString :: String -> ShowS
>      -- | Same as 'showLitChar', but for strings
>      -- It converts the string to a string using Haskell escape conventions
>
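
For anyone who wants to experiment without rebuilding GHC, roughly the same
behaviour can be sketched as an ordinary function outside of base. This is
only an approximation of the patch above, under the assumption that
Data.Char.isPrint is an acceptable stand-in for the local u_iswprint binding.
showLitCharU, showLitStringU and showU are hypothetical names, not anything
in base.

```haskell
import Data.Char (isPrint, showLitChar)

-- Like showLitChar, but printable characters above '\DEL' are kept as-is.
-- Everything else is delegated to the existing showLitChar.
showLitCharU :: Char -> ShowS
showLitCharU c
  | c > '\DEL' && isPrint c = showChar c
  | otherwise               = showLitChar c

-- Mirrors the shape of showLitString: a double quote is escaped,
-- every other character goes through showLitCharU.
showLitStringU :: String -> ShowS
showLitStringU []         s = s
showLitStringU ('"' : cs) s = showString "\\\"" (showLitStringU cs s)
showLitStringU (c : cs)   s = showLitCharU c (showLitStringU cs s)

-- A show-like function for String with the relaxed escaping.
showU :: String -> String
showU str = '"' : showLitStringU str "\""
```

Non-printable characters above '\DEL' still go through showLitChar, so they
keep their numeric escapes.
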
> I applied it to the ghc-8.10 branch:
>
>     % _build/stage1/bin/ghc --interactive
>     GHCi, version 8.10.5: https://www.haskell.org/ghc/  :? for help
>     Prelude> "äiti"
>     "äiti"
>     Prelude> "мир"
>     "мир"
>     Prelude> print "мир"
>     "мир"
>     Prelude> "😀"
>     "😀"
>
> And then I ran the test suites of aeson, dhall and pandoc.
>
> The aeson test suite passed.
> The dhall test suites passed too.
> However, the pandoc test suite failed:
>
> 78 out of 2819 tests failed (35.88s)
>
> An example failure is:
>
>     3587.md
>       #1: FAIL (0.01s)
>         --- test/command/3587.md
>         +++ pandoc -f latex -t native
>         +   1 [Para [Str "1 m",Space,Str "is",Space,Str "equal",Space,Str "to",Space,Str "1000 mm"]]
>         -   1 [Para [Str "1\160m",Space,Str "is",Space,Str "equal",Space,Str "to",Space,Str "1000\160mm"]]
>
> Str is a constructor of the Inline type, and takes Text:
> data Inline = Str Text | ...
> As discussed on the GHC issue [1], the Text and ByteString Show instances
> piggyback on the String instance. Bodigrim said that Text will eventually
> migrate to do the same as the new Show String [2], so this issue will
> resurface.
>
> Please explain the compatibility story: how should library writers write
> their code (in test suites) which relies on Show String or Show Text, such
> that they can support versions of GHC's base (and/or text) on both sides
> of this breaking change?
>
> I agree with Julian that the required migration engineering effort across
> (even just the open source) ecosystem is non-trivial.
> Having a good plan would hopefully make it easier to accept that cost.
>
> The fact that it's a change which is not detectable at compile time
> makes me very anxious about this, even though I don't disagree with
> the motivation.
> I have very little idea if and where I depend on Show String behavior.
>
> It would also be interesting to see the results of the test suites of all of
> Stackage, but I leave that for someone else to do.
>
> - Oleg
>
> [1]: https://gitlab.haskell.org/ghc/ghc/-/issues/20027
> [2]: https://gitlab.haskell.org/ghc/ghc/-/issues/20027#note_363519
>
> On 8.7.2021 15.25, Julian Ospald wrote:
>> Hi,
>>
>> I think most seemed to agree on the motivation, but would it be a lot
>> of work to ping a few large open source/industry projects about this
>> and get a feel for what they think or how much effort they expect a
>> migration to be? I'm afraid that we might take this too lightly
>> and possibly cause a lot of engineering effort here. Our expectations
>> of how or how often people use "show" might or might not be accurate.
>>
>> I'm aware of e.g. the cardano wallet test suite (open source) and
>> other cardano projects that are very large open source codebases and
>> may be affected.
>>
>> CCing duncan
>>
>> On July 8, 2021 10:11:28 AM UTC, Kai Ma <justksqsf at gmail.com> wrote:
>>
>>     Hi all
>>
>>     Two weeks ago, I proposed “Support Unicode characters in instance Show
>>     String” [0] in the GHC issue tracker, and chessai asked me to post it
>>     here for wider feedback.  The proposal posted here is edited to reflect
>>     new ideas proposed and insights accumulated over the days:
>>
>>     1. (Proposal) The proposal itself is now modeled after Python.
>>     2. (Alternative Options) Alternative 2 is the original proposal.
>>     3. (Downsides) New.  About breakage.
>>     4. (Prior Art) New.
>>     5. (Unresolved Problems) New.  Included for discussion.
>>
>>     Even though I wanted to summarize everything here, some insightful
>>     comments are perhaps not included or misunderstood.  These original
>>     comments can be found at the original feature request.
>>
>>     [0] https://gitlab.haskell.org/ghc/ghc/-/issues/20027
>>
>>
>>     Motivation
>>     ------------------------------------------------------------------------
>>     Unicode has been widely adopted and people around the world rely on
>>     Unicode to write in their native languages. Haskell, however, has been
>>     stuck in ASCII, and escapes all non-ASCII characters in String's
>>     instance of the Show class, despite the fact that each element of a
>>     String is typically a Unicode code point, and putStrLn actually works as
>>     expected. Consider the following examples:
>>
>>         ghci> print "Hello, 世界"
>>         "Hello, \19990\30028"
>>
>>         ghci> print "Hello, мир"
>>         "Hello, \1084\1080\1088"
>>
>>         ghci> print "Hello, κόσμος"
>>         "Hello, \954\972\963\956\959\962"
>>
>>         ghci> "Hello, 世界"       -- ghci calls `show`, so string literals are also escaped
>>         "Hello, \19990\30028"
>>
>>         ghci> "😀"  -- Not only human scripts, but also emojis!
>>         "\128512"
>>
>>
>>     This status quo is unsatisfactory for a number of reasons:
>>
>>     1. Even though it's small, it somehow creates an unwelcoming atmosphere
>>        for native speakers of languages whose scripts are not representable
>>        in ASCII.
>>     2. This is an actual annoyance during debugging localized software, or
>>        strings with emojis.
>>     3. Following 1, Haskell teachers are forced to use other languages
>>        instead of the students' mother tongues, or to rely on I/O functions
>>        like putStrLn, creating a rather unnecessary burden.
>>     4. Other string types, like Text [1], rely on this Show instance.
>>
>>     Moreover, `read` already can handle Unicode strings today, so relaxing
>>     constraints on `show` doesn't affect `read . show == id`.
>>
>>
>>     Proposal
>>     ------------------------------------------------------------------------
>>     It's proposed here to change the Show instance of String, to achieve the following output:
>>
>>         ghci> print "Hello, 世界"
>>         "Hello, 世界"
>>
>>         ghci> print "Hello, мир"
>>         "Hello, мир"
>>
>>         ghci> print "Hello, κόσμος"
>>         "Hello, κόσμος"
>>
>>         ghci> "Hello, 世界"
>>         "Hello, 世界"
>>
>>         ghci> "😀"
>>         "😀"
>>
>>     More concretely, it means:
>>
>>     1. Modify a few guards in GHC.Show.showLitChar to not escape _readable_
>>        Unicode characters out of the range of ASCII.
>>     2. Provide a function showEscaped or newtype Escaped = Escaped String to
>>        obtain the current escaping behavior, in case anyone wants the
>>        current behavior back.
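
One way item 2 could look, as a rough sketch only: the proposal does not
specify an implementation, so everything below (escapeChar, escapeString, and
the exact shape of showEscaped and Escaped) is an assumption. The escaping of
characters above '\DEL' is written out by hand so that the result would not
depend on what Show String does in a given base version.

```haskell
import Data.Char (isDigit, ord, showLitChar)

-- Escape one character. Anything above '\DEL' becomes a decimal escape,
-- written out by hand; ASCII is delegated to showLitChar.
escapeChar :: Char -> ShowS
escapeChar c rest
  | c > '\DEL' = '\\' : shows (ord c) (protect rest)
  | otherwise  = showLitChar c rest
  where
    -- Insert \& when the next output character is an ASCII digit, so the
    -- numeric escape is not read back as a longer number.
    protect s@(d : _) | isDigit d = "\\&" ++ s
    protect s = s

-- Escape a whole string body (without the surrounding quotes).
escapeString :: String -> ShowS
escapeString []         rest = rest
escapeString ('"' : cs) rest = showString "\\\"" (escapeString cs rest)
escapeString (c : cs)   rest = escapeChar c (escapeString cs rest)

-- ASCII-only rendering of a String, quotes included.
showEscaped :: String -> String
showEscaped s = '"' : escapeString s "\""

-- Wrapper whose Show instance always uses the ASCII-only rendering.
newtype Escaped = Escaped String

instance Show Escaped where
  showsPrec _ (Escaped s) = showString (showEscaped s)
```

A test suite that compares rendered output could then call showEscaped (or
wrap values in Escaped) and expect the same text on either side of the change.
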
>>
>>     This proposal isn't about unescaping everything, but only readable
>>     Unicode characters.  u_iswprint (GHC.Unicode.isPrint) seems to do the
>>     job, and indeed, there was a similar proposal before [2].  In summary,
>>     the behavior is similar to what Python `repr` does.
>>
>>
>>     Alternative Options
>>     ------------------------------------------------------------------------
>>     1. Always use putStrLn.
>>
>>        This is viable today but unsatisfactory as it requires stdout.  In
>>        some cases, stdout is not accessible, e.g. Telegram or Discord bots.
>>
>>     2. Don't escape anything.
>>
>>        `show` itself refrains from escaping most of the characters, and lets
>>        ghci do the job instead.
>>
>>     3. Customize ghci instead.
>>
>>        ghci intercepts output strings and checks if they can be converted
>>        back to readable characters.  This potentially allows for better
>>        compatibility with a variety of strangely behaving terminals, and
>>        finer-grained user control.
>>
>>        Tom Ellis proposed `-interactive-print`-based solutions in the
>>        comment section.
>>
>>     4. A new language extension, e.g. ShowStringUnicode.
>>
>>        Proposed by Julian Ospald.  When enabled, readable Unicode characters
>>        are not escaped, and this is enabled by default by ghci.  There are
>>        concerns about how this would affect cross-module behavior.
>>
>>
>>     Downsides
>>     ------------------------------------------------------------------------
>>     This is definitely a breaking change, but the breakage, to our current
>>     understanding, is limited.
>>
>>     First, use of `show` in production code is discouraged.  Even if someone
>>     really does that, the breakage only happens when one tries to send the
>>     "serialized" data over the wire:
>>
>>     Suppose Machine A `show`-ed a string and saved it into a UTF-8-encoded
>>     file, and sends it to Machine B, which expects another encoding.  This
>>     would be surprising for those who are used to the old behavior.
>>
>>     Second, though the breakage is not likely to be catastrophic for correct
>>     production code, test suites could be badly affected, as pointed out by
>>     Oleg Grenrus and vdukhovni in the comment section.  Some test suites
>>     compare `show` results with expected results.  vdukhovni further
>>     commented that Haskell escapes are not universally supported by
>>     non-Haskell tools, so the impact would be confined to Haskell.
>>
>>
>>     Prior Art
>>     ------------------------------------------------------------------------
>>     Python has supported Unicode natively since version 3.  The approach is
>>     intuitive and capable.  Its `repr`, which is equivalent to Haskell's
>>     `show`, automatically escapes unreadable characters, but leaves readable
>>     characters unescaped.  The criteria for "readable" can be found in
>>     CPython's code [3].  If we were to realize this proposal, Python could
>>     be a source of inspiration.
>>
>>
>>     Unresolved Problems
>>     ------------------------------------------------------------------------
>>     There are some currently unresolved (not discussed enough) issues.
>>
>>     + Locales.
>>
>>       What if the specified locale does not support Unicode?  Hécate
>>       Moonlight pointed out PEP-538 [4] could be a reference.
>>
>>     + Unicode versions.
>>
>>       Javran Cheng pointed out u_iswprint is generated from a Unicode table,
>>       which is manually updated.  This raises a concern that the definition
>>       of "printable" characters could change from version to version.
>>
>>     + Definition of "readable".
>>
>>       Unicode already defines "printability".  It's good, but it is not
>>       necessarily what we want here.
>>
>>       - Should we support RTL?
>>       - Should we design a Haskell-specific definition of readability, to
>>         avoid Unicode versions silently introducing breakage?
>>
>>     (More?)
>>
>>     Some issues here perhaps require better answers to: What is our
>>     expectation of Show?  Where should it be used?  Should we expect it to
>>     break on every Unicode update?
>>
>>
>>     [1] https://hackage.haskell.org/package/text-1.2.4.1/docs/src/Data.Text.Show.html#line-37
>>     [2] https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122874.html
>>     [3] https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Objects/unicodectype.c#L147
>>     [4] https://www.python.org/dev/peps/pep-0538/
>>     ------------------------------------------------------------------------
>>     Libraries mailing list
>>     Libraries at haskell.org
>>     http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries