[RFC] Support Unicode characters in instance Show String

Julian Ospald hasufell at posteo.de
Thu Jul 8 12:25:11 UTC 2021


Hi,

I think most people seemed to agree on the motivation, but would it be a lot of work to ping a few large open-source/industry projects about this and get a feel for what they think, or how much effort they would expect a migration to take? I'm afraid that we might take this too lightly and possibly cause a lot of engineering effort here. Our expectations of how, or how often, people use "show" might or might not be accurate.

I'm aware of e.g. the cardano wallet test suite (open source) and other cardano projects that are very large open source codebases and may be affected.

CCing duncan

On July 8, 2021 10:11:28 AM UTC, Kai Ma <justksqsf at gmail.com> wrote:
>Hi all
>
>Two weeks ago, I proposed “Support Unicode characters in instance Show
>String” [0] in the GHC issue tracker, and chessai asked me to post it
>here for wider feedback.  The proposal posted here is edited to reflect
>new ideas proposed and insights accumulated over the days:
>
>1. (Proposal) The proposal itself is now modeled after Python.
>2. (Alternative Options) Alternative 2 is the original proposal.
>3. (Downsides) New.  About breakage.
>4. (Prior Art) New.
>5. (Unresolved Problems) New.  Included for discussion.
>
>Even though I wanted to summarize everything here, some insightful
>comments are perhaps not included or misunderstood.  These original
>comments can be found at the original feature request.
>
>[0] https://gitlab.haskell.org/ghc/ghc/-/issues/20027
>
>
>Motivation
>==========
>
>Unicode has been widely adopted and people around the world rely on
>Unicode to write in their native languages. Haskell, however, has been
>stuck in ASCII, and escapes all non-ASCII characters in String's
>instance of the Show class, despite the fact that each element of a
>String is typically a Unicode code point, and putStrLn actually works
>as expected. Consider the following examples:
>
>    ghci> print "Hello, 世界"
>    "Hello, \19990\30028"
>    
>    ghci> print "Hello, мир"
>    "Hello, \1084\1080\1088"
>    
>    ghci> print "Hello, κόσμος"
>    "Hello, \954\972\963\956\959\962"
>    
>    ghci> "Hello, 世界"  -- ghci calls `show`, so string literals are also escaped
>    "Hello, \19990\30028"
>    
>    ghci> "😀"  -- Not only human scripts, but also emojis!
>    "\128512"
>
>
>This status quo is unsatisfactory for a number of reasons:
>
>1. Even though it's small, it somehow creates an unwelcoming atmosphere
>   for native speakers of languages whose scripts are not representable
>   in ASCII.
>2. This is an actual annoyance when debugging localized software, or
>   strings with emojis.
>3. Following 1, Haskell teachers are forced to use other languages
>   instead of the students' mother tongues, or to rely on I/O functions
>   like putStrLn, creating a rather unnecessary burden.
>4. Other string types, like Text [1], rely on this Show instance.
>
>Moreover, `read` can already handle Unicode strings today, so relaxing
>constraints on `show` doesn't affect `read . show == id`.
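
A quick way to check the round-trip claim above on a current GHC is sketched below. The names `roundTrips` and `readsUnescaped` are hypothetical, chosen only for illustration; the second definition feeds `read` a quoted literal whose non-ASCII characters are not escaped.

```haskell
-- Hypothetical helper: does `read . show` give back the same String?
roundTrips :: String -> Bool
roundTrips s = read (show s) == s

-- `read` is given a quoted literal containing raw (unescaped) Unicode.
readsUnescaped :: Bool
readsUnescaped = read "\"Hello, 世界\"" == "Hello, 世界"
```

Both are expected to evaluate to True, which is the property the proposal relies on: `read` does not care whether a readable character arrives escaped or not.
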
>
>
>Proposal
>========
>
>It's proposed here to change the Show instance of String, to achieve
>the following output:
>
>    ghci> print "Hello, 世界"
>    "Hello, 世界"
>    
>    ghci> print "Hello, мир"
>    "Hello, мир"
>    
>    ghci> print "Hello, κόσμος"
>    "Hello, κόσμος"
>    
>    ghci> "Hello, 世界"
>    "Hello, 世界"
>    
>    ghci> "😀"
>    "😀"
>
>More concretely, it means:
>
>1. Modify a few guards in GHC.Show.showLitChar to not escape _readable_
>   Unicode characters out of the range of ASCII.
>2. Provide a function showEscaped or newtype Escaped = Escaped String
>   to obtain the current escaping behavior, in case anyone wants the
>   current behavior back.
>
>This proposal isn't about unescaping everything, but only readable
>Unicode characters.  u_iswprint (GHC.Unicode.isPrint) seems to do the
>job, and indeed, there was a similar proposal before [2].  In summary,
>the behavior is similar to what Python `repr` does.
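
The two points above can be sketched outside of base, as a rough illustration rather than the actual patch to GHC.Show. Everything named here (`showLitCharUnicode`, `showStringWith`, `showUnicode`) is hypothetical, and `showEscaped` is only one possible reading of the function mentioned in point 2; the only library functions used are `isAscii`, `isPrint` and `showLitChar` from Data.Char.

```haskell
import Data.Char (isAscii, isPrint, showLitChar)

-- Hypothetical variant of showLitChar: printable non-ASCII characters
-- are emitted as-is, everything else is escaped as it is today.
showLitCharUnicode :: Char -> ShowS
showLitCharUnicode c
  | not (isAscii c) && isPrint c = showChar c
  | otherwise                    = showLitChar c

-- Render a String as a quoted literal using the given per-character
-- renderer. Double quotes are escaped here, because showLitChar
-- leaves them alone.
showStringWith :: (Char -> ShowS) -> String -> String
showStringWith f s = '"' : go s
  where
    go []         = "\""
    go ('"' : cs) = "\\\"" ++ go cs
    go (c : cs)   = f c (go cs)

-- Proposed behavior (sketch): readable Unicode stays unescaped.
showUnicode :: String -> String
showUnicode = showStringWith showLitCharUnicode

-- Current behavior, kept available under an explicit name.
showEscaped :: String -> String
showEscaped = showStringWith showLitChar
```

Passing the rest of the output as the continuation lets `showLitChar` keep inserting `\&` where a numeric escape is followed by a digit, so the result of `showUnicode` should still be readable by `read`.
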
>
>
>Alternative Options
>===================
>
>1. Always use putStrLn.
>
>   This is viable today but unsatisfactory as it requires stdout.  In
>   some cases, stdout is not accessible, e.g. Telegram or Discord bots.
>
>2. Don't escape anything.
>
>   `show` itself refrains from escaping most of the characters, and
>   lets ghci do the job instead.
>
>3. Customize ghci instead.
>
>   ghci intercepts output strings and checks whether they can be
>   converted back to readable characters.  This potentially allows for better
>   compatibility with a variety of strangely behaving terminals, and
>   finer-grained user control.
>
>   Tom Ellis proposed `-interactive-print`-based solutions in the
>   comment section.
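
For illustration, a printer of the kind that `-interactive-print` accepts could look roughly like the sketch below. This is not Tom Ellis's actual suggestion; `unescape` and `uprint` are hypothetical names, and the approach (post-processing the output of `show` with `readLitChar`) is an assumption made for this example. It rewrites every escape that decodes to a printable non-ASCII character, including any that appear outside string literals, so it is deliberately naive.

```haskell
import Data.Char (isAscii, isPrint, readLitChar)

-- Replace escapes that decode to printable non-ASCII characters by the
-- characters themselves; every other escape is copied through verbatim.
unescape :: String -> String
unescape [] = []
unescape s@('\\' : afterSlash) =
  case readLitChar s of
    [(c, rest)]
      | not (isAscii c) && isPrint c -> c : unescape rest
      | otherwise -> take (length s - length rest) s ++ unescape rest
    -- Not a character escape (e.g. the empty escape \&): keep the backslash.
    _ -> '\\' : unescape afterSlash
unescape (c : cs) = c : unescape cs

-- Candidate for -interactive-print: show, then undo readable escapes.
uprint :: Show a => a -> IO ()
uprint = putStrLn . unescape . show
```

If these definitions lived in a module named, say, UPrint, ghci could be started with `ghci -interactive-print=UPrint.uprint UPrint`. Only ghci output changes under this option; `show` itself and every program using it stay as they are.
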
>
>4. A new language extension, e.g. ShowStringUnicode.
>
>   Proposed by Julian Ospald.  When enabled, readable Unicode characters
>   are not escaped, and this is enabled by default by ghci.  There are
>   concerns about how this would affect cross-module behavior.
>
>
>Downsides
>=========
>
>This is definitely a breaking change, but the breakage, to our current
>understanding, is limited.
>
>First, use of `show` in production code is discouraged.  Even if
>someone really does that, the breakage only happens when one tries to
>send the "serialized" data over the wire:
>
>Suppose Machine A `show`s a string, saves it into a UTF-8-encoded
>file, and sends it to Machine B, which expects another encoding.  This
>would be surprising for those who are used to the old behavior.
>
>Second, though the breakage is not likely to be catastrophic for
>correct production code, test suites could be badly affected, as
>pointed out by Oleg Grenrus and vdukhovni in the comment section.  Some
>test suites compare `show` results with expected results.  vdukhovni
>further commented that Haskell escapes are not universally supported by
>non-Haskell tools, so the impact would be confined to Haskell.
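
The kind of test that would be affected can be sketched as follows. The `User` type and the names `goldenToday` and `matchesToday` are made up for this example and do not come from any of the projects mentioned in this thread.

```haskell
-- Hypothetical record with a derived Show instance.
data User = User { userName :: String } deriving (Show, Eq)

-- Expected output as a test suite might have recorded it under the
-- current escaping rules.
goldenToday :: String
goldenToday = "User {userName = \"\\1084\\1080\\1088\"}"

-- Holds with the current Show String instance; under the proposal,
-- show would produce the unescaped name and this comparison would fail.
matchesToday :: Bool
matchesToday = show (User "\1084\1080\1088") == goldenToday
```

Derived instances go through the String instance, so such comparisons would be affected even in code that never calls `show` on a String directly.
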
>
>
>Prior Art
>=========
>
>Python has supported Unicode natively since version 3.  Python's
>approach is intuitive and capable.  Its `repr`, which is equivalent to
>Haskell's `show`, automatically escapes unreadable characters, but
>leaves readable characters unescaped.  The criteria for "readable" can
>be found in CPython's code [3].  If we were to realize this proposal,
>Python could be a source of inspiration.
>
>
>Unresolved Problems
>===================
>
>There are some currently unresolved (not discussed enough) issues.
>
>+ Locales.
>
>  What if the specified locale does not support Unicode?  Hécate
>  Moonlight pointed out PEP-538 [4] could be a reference.
>
>+ Unicode versions.
>
>  Javran Cheng pointed out u_iswprint is generated from a Unicode table,
>  which is manually updated.  This raises a concern that the definition
>  of "printable" characters could change from version to version.
>
>+ Definition of "readable".
>
>  Unicode already defined "printability".  It's good, but it is not
>  necessarily what we want here.
>
>  - Should we support RTL?
>  - Should we design a Haskell-specific definition of readability, to
>    avoid new Unicode versions silently introducing breakage?
>
>(More?)
>
>Some issues here perhaps require better answers to: What is our
>expectation of Show?  Where should it be used?  Should we expect it to
>break on every Unicode update?
>
>
>[1] https://hackage.haskell.org/package/text-1.2.4.1/docs/src/Data.Text.Show.html#line-37
>[2] https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122874.html
>[3] https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Objects/unicodectype.c#L147
>[4] https://www.python.org/dev/peps/pep-0538/
>