[RFC] Support Unicode characters in instance Show String

Thu Jul 8 15:53:38 UTC 2021

Here is a simple patch, which I hope is close to what item 1 of the
proposal,

> 1. Modify a few guards in GHC.Show.showLitChar to not escape _readable_
>    Unicode characters out of the range of ASCII.

will look like:

    diff --git a/libraries/base/GHC/Show.hs b/libraries/base/GHC/Show.hs
    index 84077e473b..24569168d4 100644
    --- a/libraries/base/GHC/Show.hs
    +++ b/libraries/base/GHC/Show.hs
    @@ -364,7 +364,10 @@ showCommaSpace = showString ", "
     -- > showLitChar '\n' s  =  "\\n" ++ s
     --
     showLitChar                :: Char -> ShowS
    -showLitChar c s | c > '\DEL' =  showChar '\\' (protectEsc isDec (shows (ord c)) s)
    +showLitChar c s | c > '\DEL' =
    +    if isPrint c
    +    then showChar c s
    +    else  showChar '\\' (protectEsc isDec (shows (ord c)) s)
     showLitChar '\DEL'         s =  showString "\\DEL" s
     showLitChar '\\'           s =  showString "\\\\" s
     showLitChar c s | c >= ' '   =  showChar c s
    @@ -380,6 +383,13 @@ showLitChar c              s =  showString ('\\' : asciiTab!!ord c) s
             -- I've done manual eta-expansion here, because otherwise it's
             -- impossible to stop (asciiTab!!ord) getting floated out as an MFE

    +-- Local definition of isPrint to avoid fighting with cycles for now.
    +isPrint                 :: Char -> Bool
    +isPrint    c = iswprint (ord c) /= 0
    +
    +foreign import ccall unsafe "u_iswprint"
    +  iswprint :: Int -> Int
    +
     showLitString :: String -> ShowS
     -- | Same as 'showLitChar', but for strings
     -- It converts the string to a string using Haskell escape conventions
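
For readers who want to try the behaviour without rebuilding GHC, here is a
minimal user-level sketch of the same idea. It is not the patch itself: the
names showLitCharU and showStringU are made up for illustration, and it uses
Data.Char.isPrint instead of the local foreign import above.

```haskell
import Data.Char (isPrint, showLitChar)

-- Hypothetical helper: like showLitChar, but printable characters
-- above '\DEL' are emitted as-is instead of as numeric escapes.
showLitCharU :: Char -> ShowS
showLitCharU c
  | c > '\DEL' && isPrint c = showChar c
  | otherwise               = showLitChar c

-- Hypothetical helper: a String renderer built on showLitCharU.
-- The double-quote case mirrors what showLitString does in base.
showStringU :: String -> String
showStringU s = '"' : go s
  where
    go []         = "\""
    go ('"' : cs) = "\\\"" ++ go cs
    go (c : cs)   = showLitCharU c (go cs)
```

With this loaded in ghci, putStrLn (showStringU "мир") prints "мир" including
the quotes, while control characters such as '\n' are still escaped.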

I applied it to the ghc-8.10 branch:

    % _build/stage1/bin/ghc --interactive
    GHCi, version 8.10.5: https://www.haskell.org/ghc/  :? for help
    Prelude> "äiti"
    "äiti"
    Prelude> "мир"
    "мир"
    Prelude> print "мир"
    "мир"
    Prelude> "😀"
    "😀"

I then ran the test suites of aeson, dhall and pandoc.

The aeson test suite passed.
The dhall test suites passed too.
However, the pandoc test suite failed:

78 out of 2819 tests failed (35.88s)

An example failure is:

    3587.md
      #1: FAIL (0.01s)
        --- test/command/3587.md
        +++ pandoc -f latex -t native
        +   1 [Para [Str "1 m",Space,Str "is",Space,Str "equal",Space,Str "to",Space,Str "1000 mm"]]
        -   1 [Para [Str "1\160m",Space,Str "is",Space,Str "equal",Space,Str "to",Space,Str "1000\160mm"]]

Str is a constructor of the Inline type and takes a Text:
data Inline = Str Text | ...
As discussed on the GHC issue [1], the Text and ByteString Show instances
piggyback on the String instance. Bodigrim said that Text will eventually
migrate to match the new Show String behaviour [2], so this issue will
resurface.

Please explain the compatibility story. How should library writers write
code (in test suites) that relies on Show String or Show Text, so that
it supports base (and/or text) versions on both sides of this breaking
change?
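
To make the question concrete, here is a rough sketch of one workaround a
test suite might use: post-process the output of show so that every
non-ASCII character is escaped again, which should give the same text on
both sides of the change. The names escapeNonAscii and showStable are
hypothetical, and the sketch assumes the only difference between the old and
new output is whether printable non-ASCII characters are escaped.

```haskell
import Data.Char (isDigit, ord)

-- Hypothetical helper: re-escape every character above '\DEL' in the
-- output of show, using the decimal escapes the current instance emits.
escapeNonAscii :: String -> String
escapeNonAscii [] = []
escapeNonAscii (c : cs)
  | c > '\DEL' = '\\' : shows (ord c) (protect rest)
  | otherwise  = c : rest
  where
    rest = escapeNonAscii cs
    -- A numeric escape followed by a digit needs the empty escape "\&",
    -- otherwise the digit would be read as part of the number.
    protect s@(d : _) | isDigit d = "\\&" ++ s
    protect s = s

-- Hypothetical helper: show with non-ASCII characters always escaped.
showStable :: Show a => a -> String
showStable = escapeNonAscii . show
```

Comparing showStable results in golden tests, instead of show results, is
one option. It still means touching every affected test suite, which is the
cost I am worried about.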

I agree with Julian that the migration engineering effort required across
the ecosystem (even just the open source part) is non-trivial.
Having a good plan would hopefully make it easier to accept that cost.

The fact that this change is not detectable at compile time
makes me very anxious about it, even though I don't disagree with the
motivation.
I have very little idea if and where I depend on Show String behavior.

It would also be interesting to see the test-suite results for all of
Stackage, but I leave that for someone else to do.

- Oleg

[1]: https://gitlab.haskell.org/ghc/ghc/-/issues/20027
[2]: https://gitlab.haskell.org/ghc/ghc/-/issues/20027#note_363519

On 8.7.2021 15.25, Julian Ospald wrote:
> Hi,
>
> I think most seemed to agree on the motivation, but would it be a lot
> of work to ping a few large opensource/industry projects about this
> and get a feel what they think or how much of an expected effort a
> migration would be? I'm afraid that we might take this too lightly and
> possibly cause a lot of engineering effort here. Our expectations how
> or how often people use "show" might or might not be accurate.
>
> I'm aware of e.g. the cardano wallet test suite (open source) and
> other cardano projects that are very large open source codebases and
> may be affected.
>
> CCing duncan
>
> On July 8, 2021 10:11:28 AM UTC, Kai Ma <justksqsf at gmail.com> wrote:
>
>     Hi all
>
>     Two weeks ago, I proposed “Support Unicode characters in instance Show
>     String” [0] in the GHC issue tracker, and chessai asked me to post it
>     here for wider feedback.  The proposal posted here is edited to reflect
>     new ideas proposed and insights accumulated over the days:
>
>     1. (Proposal) The proposal itself is now modeled after Python.
>     2. (Alternative Options) Alternative 2 is the original proposal.
>     3. (Downsides) New.  About breakage.
>     4. (Prior Art) New.
>     5. (Unresolved Problems) New.  Included for discussion.
>
>     Even though I wanted to summarize everything here, some insightful
>     comments are perhaps not included or misunderstood.  These original
>     comments can be found at the original feature request.
>
>     [0] https://gitlab.haskell.org/ghc/ghc/-/issues/20027
>
>
>     Motivation
>     ------------------------------------------------------------------------
>     Unicode has been widely adopted and people around the world rely on
>     Unicode to write in their native languages. Haskell, however, has been
>     stuck in ASCII, and escapes all non-ASCII characters in the String
>     instance of the Show class, despite the fact that each element of a
>     String is typically a Unicode code point, and putStrLn actually works as
>     expected. Consider the following examples:
>
>         ghci> print "Hello, 世界"
>         "Hello, \19990\30028"
>
>         ghci> print "Hello, мир"
>         "Hello, \1084\1080\1088"
>
>         ghci> print "Hello, κόσμος"
>         "Hello, \954\972\963\956\959\962"
>
>         ghci> "Hello, 世界"       -- ghci calls `show`, so string literals are also escaped
>         "Hello, \19990\30028"
>
>         ghci> "😀"  -- Not only human scripts, but also emojis!
>         "\128512"
>
>
>     This status quo is unsatisfactory for a number of reasons:
>
>     1. Even though it's small, it somehow creates an unwelcoming atmosphere
>        for native speakers of languages whose scripts are not representable
>        in ASCII.
>     2. This is an actual annoyance when debugging localized software, or
>        strings with emojis.
>     3. Following 1, Haskell teachers are forced to use other languages
>        instead of the students' mother tongues, or to rely on I/O functions
>        like putStrLn, creating a rather unnecessary burden.
>     4. Other string types, like Text [1], rely on this Show instance.
>
>     Moreover, `read` can already handle Unicode strings today, so relaxing
>     constraints on `show` doesn't affect `read . show == id`.
>
>
>     Proposal
>     ------------------------------------------------------------------------
>     It's proposed here to change the Show instance of String, to achieve the following output:
>
>         ghci> print "Hello, 世界"
>         "Hello, 世界"
>
>         ghci> print "Hello, мир"
>         "Hello, мир"
>
>         ghci> print "Hello, κόσμος"
>         "Hello, κόσμος"
>
>         ghci> "Hello, 世界"
>         "Hello, 世界"
>
>         ghci> "😀"
>         "😀"
>
>     More concretely, it means:
>
>     1. Modify a few guards in GHC.Show.showLitChar to not escape _readable_
>        Unicode characters out of the range of ASCII.
>     2. Provide a function showEscaped or newtype Escaped = Escaped String to
>        obtain the current escaping behavior, in case anyone wants the
>        current behavior back.
>
>     This proposal isn't about unescaping everything, but only readable
>     Unicode characters.  u_iswprint (GHC.Unicode.isPrint) seems to do the
>     job, and indeed, there was a similar proposal before [2].  In summary,
>     the behavior is similar to what Python `repr` does.
>
>
>     Alternative Options
>     ------------------------------------------------------------------------
>     1. Always use putStrLn.
>
>        This is viable today but unsatisfactory as it requires stdout.  In
>        some cases, stdout is not accessible, e.g. Telegram or Discord bots.
>
>     2. Don't escape anything.
>
>        `show` itself refrains from escaping most of the characters, and lets
>        ghci do the job instead.
>
>     3. Customize ghci instead.
>
>        ghci intercepts output strings and checks whether they can be converted
>        back to readable characters.  This potentially allows for better
>        compatibility with a variety of strangely behaving terminals, and
>        finer-grained user control.
>
>        Tom Ellis proposed `-interactive-print`-based solutions in the
>        comment section.
>
>     4. A new language extension, e.g. ShowStringUnicode.
>
>        Proposed by Julian Ospald.  When enabled, readable Unicode characters
>        are not escaped, and this is enabled by default by ghci.  There are
>        concerns about how this would affect cross-module behavior.
>
>
>     Downsides
>     ------------------------------------------------------------------------
>     This is definitely a breaking change, but the breakage, to our current
>     understanding, is limited.
>
>     First, use of `show` in production code is discouraged.  Even if someone
>     really does that, the breakage only happens when one tries to send the
>     "serialized" data over the wire:
>
>     Suppose Machine A `show`-ed a string and saved it into a UTF-8-encoded
>     file, and sends it to Machine B, which expects another encoding.  This
>     would be surprising for those who are used to the old behavior.
>
>     Second, though the breakage is not likely to be catastrophic for correct
>     production code, test suites could be badly affected, as pointed out by
>     Oleg Grenrus and vdukhovni in the comment section.  Some test suites
>     compare `show` results with expected results.  vdukhovni further
>     commented that Haskell escapes are not universally supported by
>     non-Haskell tools, so the impact would be confined to Haskell.
>
>
>     Prior Art
>     ------------------------------------------------------------------------
>     Python has supported Unicode natively since version 3.  Python's approach is
>     intuitive and capable.  Its `repr`, which is equivalent to Haskell's
>     `show`, automatically escapes unreadable characters, but leaves readable
>     characters unescaped.  The criteria of "readable" can be found in
>     CPython's code [3].  If we were to realize this proposal, Python could
>     be a source of inspiration.
>
>
>     Unresolved Problems
>     ------------------------------------------------------------------------
>     There are some currently unresolved (not discussed enough) issues.
>
>     + Locales.
>
>       What if the specified locale does not support Unicode?  Hécate
>       Moonlight pointed out PEP-538 [4] could be a reference.
>
>     + Unicode versions.
>
>       Javran Cheng pointed out u_iswprint is generated from a Unicode table,
>       which is manually updated.  This raises a concern that the definition
>       of "printable" characters could change from version to version.
>
>     + Definition of "readable".
>
>       Unicode already defines "printability".  It's good, but it is not
>       necessarily what we want here.
>
>       - Should we support RTL?
>       - Should we design a Haskell-specific definition of readability, to
>         avoid a new Unicode version silently introducing breakage?
>
>     (More?)
>
>     Some issues here perhaps require better answers to: What is our
>     expectation of Show?  Where should it be used?  Should we expect it to
>     break on every Unicode update?
>
>
>     [1] https://hackage.haskell.org/package/text-1.2.4.1/docs/src/Data.Text.Show.html#line-37
>     [2] https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122874.html
>     [3] https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Objects/unicodectype.c#L147
>     [4] https://www.python.org/dev/peps/pep-0538/
>     ------------------------------------------------------------------------
>     Libraries mailing list
>     Libraries at haskell.org
>     http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries <http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries>
>