[RFC] Support Unicode characters in instance Show String
Oleg Grenrus
oleg.grenrus at iki.fi
Thu Jul 8 17:20:43 UTC 2021
It would also be good to have a summary of the previous discussion that the OP
kindly linked [1], e.g. the comment by David Turner [2]:
> One of the most visible uses of Show is that it's how values are shown in
> GHCi. As mentioned earlier in this thread, if you're teaching in a
> non-ASCII language then the user experience is pretty poor.
>
> On the other hand, I see Show (like .ToString() in C# etc.) as a debugging
> tool: not for seriously robust serialisation but useful if you need to dump
> a value into a log message or email or similar. And in that situation it's
> very useful if it sticks to ASCII: non-ASCII content just isn't resilient
> enough to being passed around the network, truncated and generally
> mutilated on the way through.
>
> These are definitely two different concerns and they pull in opposite
> directions in this discussion. It's a matter of opinion which you think is
> more important. Me, I think the latter, but then I do a lot of logging and
> speak a language that fits into ASCII. YMMV!
This proposal is motivated by the first point, but doesn't mention debugging
other than
> 2. This is an actual annoyance during debugging localized software, or
> strings with emojis
which I don't agree with.
For example, look at the failing pandoc test case in my previous message:
\160 is a non-breaking space, which looks like a normal space when rendered.
I have had my share of bad experiences with it. So, indeed, YMMV.
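To make the point concrete, here is a minimal sketch (the name `sample` is only illustrative). It prints the same string raw and via `show`. With the escaping instance discussed in this thread, the second line makes the non-breaking space visible as \160, while the first usually looks like an ordinary space.

```haskell
-- A string containing U+00A0 NO-BREAK SPACE between "1" and "m".
sample :: String
sample = "1\160m"

main :: IO ()
main = do
  putStrLn sample          -- raw: typically looks just like "1 m" on screen
  putStrLn (show sample)   -- rendered through the Show String instance
  print (sample == "1 m")  -- compares against a string with an ordinary space
```

The raw output assumes a locale that can encode U+00A0; the comparison on the last line is what a test suite would silently get wrong if the two spaces were confused.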
- Oleg
[1]: https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122874.html
[2]: https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122899.html
On 8.7.2021 18.53, Oleg Grenrus wrote:
>
> Here is a simple patch, which I hope is close to what
>
> 1. Modify a few guards in GHC.Show.showLitChar to not escape _readable_
> Unicode characters out of the range of ASCII.
>
> of a proposed change will look like:
>
> diff --git a/libraries/base/GHC/Show.hs b/libraries/base/GHC/Show.hs
> index 84077e473b..24569168d4 100644
> --- a/libraries/base/GHC/Show.hs
> +++ b/libraries/base/GHC/Show.hs
> @@ -364,7 +364,10 @@ showCommaSpace = showString ", "
> -- > showLitChar '\n' s = "\\n" ++ s
> --
> showLitChar :: Char -> ShowS
> -showLitChar c s | c > '\DEL' = showChar '\\' (protectEsc isDec (shows (ord c)) s)
> +showLitChar c s | c > '\DEL' =
> + if isPrint c
> + then showChar c s
> + else showChar '\\' (protectEsc isDec (shows (ord c)) s)
> showLitChar '\DEL' s = showString "\\DEL" s
> showLitChar '\\' s = showString "\\\\" s
> showLitChar c s | c >= ' ' = showChar c s
> @@ -380,6 +383,13 @@ showLitChar c s = showString ('\\' : asciiTab!!ord c) s
> -- I've done manual eta-expansion here, because otherwise it's
> -- impossible to stop (asciiTab!!ord) getting floated out as an MFE
>
> +-- Local definition of isPrint to avoid fighting with cycles for now.
> +isPrint :: Char -> Bool
> +isPrint c = iswprint (ord c) /= 0
> +
> +foreign import ccall unsafe "u_iswprint"
> + iswprint :: Int -> Int
> +
> showLitString :: String -> ShowS
> -- | Same as 'showLitChar', but for strings
> -- It converts the string to a string using Haskell escape conventions
>
> I applied it to ghc-8.10 branch,
>
> % _build/stage1/bin/ghc --interactive
> GHCi, version 8.10.5: https://www.haskell.org/ghc/ :? for help
> Prelude> "äiti"
> "äiti"
> Prelude> "мир"
> "мир"
> Prelude> print "мир"
> "мир"
> Prelude> "😀"
> "😀"
>
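The effect of the patch can be approximated in user code without rebuilding GHC. The following is only a sketch under stated assumptions: `showLitCharU` and `showStringU` are hypothetical names, and it uses `Data.Char.isPrint` rather than the local `u_iswprint` binding from the diff, so results may differ in corner cases.

```haskell
import Data.Char (isPrint, showLitChar)

-- Hypothetical variant of showLitChar: printable non-ASCII characters
-- are emitted as-is, everything else is delegated to base's showLitChar.
showLitCharU :: Char -> ShowS
showLitCharU c
  | c > '\DEL' && isPrint c = showChar c
  | otherwise               = showLitChar c

-- Hypothetical counterpart of `show` for String built on showLitCharU.
showStringU :: String -> String
showStringU s = '"' : go s "\""
  where
    go []         = id
    go ('"' : cs) = showString "\\\"" . go cs   -- double quotes are still escaped
    go (c : cs)   = showLitCharU c . go cs

main :: IO ()
main = do
  putStrLn (showStringU "Hello, мир")   -- Cyrillic stays readable
  putStrLn (showStringU "1\160m")       -- NBSP counts as printable, so it is not escaped
  putStrLn (showStringU "line\nbreak")  -- ASCII control characters are escaped as before
```

Note that the non-breaking space passes the `isPrint` check, which matches the pandoc failure shown further down.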
> Then I ran the test suites of aeson, dhall and pandoc.
>
> The aeson test suite passed.
> The dhall test suites passed too.
> However, the pandoc test suite failed:
>
> 78 out of 2819 tests failed (35.88s)
>
> An example failure is:
>
> 3587.md
> #1:
> FAIL (0.01s)
> --- test/command/3587.md
> +++ pandoc -f latex -t native
> + 1 [Para [Str "1 m",Space,Str "is",Space,Str
> "equal",Space,Str "to",Space,Str "1000 mm"]]
> - 1 [Para [Str "1\160m",Space,Str "is",Space,Str
> "equal",Space,Str "to",Space,Str "1000\160mm"]]
>
> Str is a constructor of the Inline type and takes Text: data Inline = Str Text | ...
> As discussed on the GHC issue [1], the Show instances of Text and ByteString
> piggyback on the String instance. Bodigrim said that Text will eventually
> migrate to do the same as the new Show String [2], so this issue will resurface.
>
> Please explain the compatibility story. How should library writers write
> their code (in test-suites) which relies on Show String or Show Text, such
> that they can support GHC base (and/or text) versions
> on both sides of this breaking change?
>
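One possible answer for test suites, offered as a sketch rather than an agreed plan: render expected values with an explicit escaping function that does not depend on what `show` does for non-ASCII characters. The name `showAsciiEscaped` is hypothetical; it relies only on `showLitChar` for ASCII input, which the patch above leaves unchanged.

```haskell
import Data.Char (isAscii, isDigit, ord, showLitChar)

-- Hypothetical helper: always escapes non-ASCII characters as decimal
-- escapes, regardless of how the Show String instance behaves.
showAsciiEscaped :: String -> String
showAsciiEscaped s = '"' : go s
  where
    go []         = "\""
    go ('"' : cs) = "\\\"" ++ go cs
    go (c : cs)
      | isAscii c = showLitChar c (go cs)              -- ASCII path, same on both sides
      | otherwise = '\\' : show (ord c) ++ protect (go cs)
    -- Insert the empty escape \& when a numeric escape is followed by a digit.
    protect rest@(d : _) | isDigit d = "\\&" ++ rest
    protect rest = rest

main :: IO ()
main = do
  putStrLn (showAsciiEscaped "1\160m")
  putStrLn (showAsciiEscaped "Hello, мир")
```

A test suite comparing against golden files could call such a helper instead of `show`, at the cost of touching every call site, which is exactly the migration effort in question.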
> I agree with Julian that the migration engineering effort required across
> (even just the open source) ecosystem is non-trivial.
> Having a good plan would hopefully make it easier to accept that cost.
>
> The fact that it's a change which is not detectable at compile time
> makes me very anxious about this, even though I don't disagree with
> the motivation.
> I have very little idea if and where I depend on Show String behavior.
>
> It would also be interesting to see results of test-suites of all
> Stackage, but I leave it for someone else to do.
>
> - Oleg
>
> [1]: https://gitlab.haskell.org/ghc/ghc/-/issues/20027
> [2]: https://gitlab.haskell.org/ghc/ghc/-/issues/20027#note_363519
>
> On 8.7.2021 15.25, Julian Ospald wrote:
>> Hi,
>>
>> I think most seemed to agree on the motivation, but would it be a lot
>> of work to ping a few large opensource/industry projects about this
>> and get a feel for what they think or how much effort they expect a
>> migration to be? I'm afraid that we might take this too lightly
>> and possibly cause a lot of engineering effort here. Our expectations
>> of how or how often people use "show" might or might not be accurate.
>>
>> I'm aware of e.g. the cardano wallet test suite (open source) and
>> other cardano projects that are very large open source codebases and
>> may be affected.
>>
>> CCing duncan
>>
>> On July 8, 2021 10:11:28 AM UTC, Kai Ma <justksqsf at gmail.com> wrote:
>>
>> Hi all
>>
>> Two weeks ago, I proposed “Support Unicode characters in instance Show
>> String” [0] in the GHC issue tracker, and chessai asked me to post it
>> here for wider feedback. The proposal posted here is edited to reflect
>> new ideas proposed and insights accumulated over the days:
>>
>> 1. (Proposal) The proposal itself is now modeled after Python.
>> 2. (Alternative Options) Alternative 2 is the original proposal.
>> 3. (Downsides) New. About breakage.
>> 4. (Prior Art) New.
>> 5. (Unresolved Problems) New. Included for discussion.
>>
>> Even though I wanted to summarize everything here, some insightful
>> comments are perhaps not included or misunderstood. These original
>> comments can be found at the original feature request.
>>
>> [0] https://gitlab.haskell.org/ghc/ghc/-/issues/20027
>>
>>
>> Motivation
>> ------------------------------------------------------------------------
>> Unicode has been widely adopted and people around the world rely on
>> Unicode to write in their native languages. Haskell, however, has been
>> stuck in ASCII, and escapes all non-ASCII characters in String's
>> instance of the Show class, despite the fact that each element of a
>> String is typically a Unicode code point, and putStrLn actually works as
>> expected. Consider the following examples:
>>
>> ghci> print "Hello, 世界"
>> "Hello, \19990\30028"
>>
>> ghci> print "Hello, мир"
>> "Hello, \1084\1080\1088"
>>
>> ghci> print "Hello, κόσμος"
>> "Hello, \954\972\963\956\959\962"
>>
>> ghci> "Hello, 世界" -- ghci calls `show`, so string literals are also escaped
>> "Hello, \19990\30028"
>>
>> ghci> "😀" -- Not only human scripts, but also emojis!
>> "\128512"
>>
>>
>> This status quo is unsatisfactory for a number of reasons:
>>
>> 1. Even though it's small, it somehow creates an unwelcoming atmosphere
>> for native speakers of languages whose scripts are not representable
>> in ASCII.
>> 2. This is an actual annoyance during debugging localized software, or
>> strings with emojis.
>> 3. Following 1, Haskell teachers are forced to use other languages
>> instead of the students' mother tongues, or to rely on I/O functions
>> like putStrLn, creating a rather unnecessary burden.
>> 4. Other string types, like Text [1], rely on this Show instance.
>>
>> Moreover, `read` already can handle Unicode strings today, so relaxing
>> constraints on `show` doesn't affect `read . show == id`.
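The round-trip claim can be checked directly. This is a small sketch with illustrative names (`roundTrips`, `readsUnescaped`): the first property holds for whatever `show` currently produces, and the second shows `read` accepting a literal whose non-ASCII characters are not escaped.

```haskell
-- read . show should be the identity on strings.
roundTrips :: String -> Bool
roundTrips s = read (show s) == s

-- read also accepts a string literal containing unescaped Unicode.
readsUnescaped :: Bool
readsUnescaped = (read "\"Hello, мир\"" :: String) == "Hello, мир"

main :: IO ()
main = do
  print (all roundTrips ["Hello, 世界", "1\160m", "😀", "tab\there"])
  print readsUnescaped
```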
>>
>>
>> Proposal
>> ------------------------------------------------------------------------
>> It's proposed here to change the Show instance of String, to achieve the following output:
>>
>> ghci> print "Hello, 世界"
>> "Hello, 世界"
>>
>> ghci> print "Hello, мир"
>> "Hello, мир"
>>
>> ghci> print "Hello, κόσμος"
>> "Hello, κόσμος"
>>
>> ghci> "Hello, 世界"
>> "Hello, 世界"
>>
>> ghci> "😀"
>> "😀"
>>
>> More concretely, it means:
>>
>> 1. Modify a few guards in GHC.Show.showLitChar to not escape _readable_
>> Unicode characters out of the range of ASCII.
>> 2. Provide a function showEscaped or newtype Escaped = Escaped String to
>> obtain the current escaping behavior, in case anyone wants the
>> current behavior back.
>>
>> This proposal isn't about unescaping everything, but only readable
>> Unicode characters. u_iswprint (GHC.Unicode.isPrint) seems to do the
>> job, and indeed, there was a similar proposal before [2]. In summary,
>> the behavior is similar to what Python `repr` does.
>>
>>
>> Alternative Options
>> ------------------------------------------------------------------------
>> 1. Always use putStrLn.
>>
>> This is viable today but unsatisfactory as it requires stdout. In
>> some cases, stdout is not accessible, e.g. Telegram or Discord bots.
>>
>> 2. Don't escape anything.
>>
>> `show` itself refrains from escaping most of the characters, and lets
>> ghci do the job instead.
>>
>> 3. Customize ghci instead.
>>
>> ghci intercepts output strings and checks if they can be converted
>> back to readable characters. This potentially allows for better
>> compatibility with a variety of strangely behaving terminals, and
>> finer-grained user control.
>>
>> Tom Ellis proposed `-interactive-print`-based solutions in the
>> comment section.
>>
>> 4. A new language extension, e.g. ShowStringUnicode.
>>
>> Proposed by Julian Ospald. When enabled, readable Unicode characters
>> are not escaped, and this is enabled by default by ghci. There are
>> concerns about how this would affect cross-module behavior.
>>
>>
>> Downsides
>> ------------------------------------------------------------------------
>> This is definitely a breaking change, but the breakage, to our current
>> understanding, is limited.
>>
>> First, use of `show` in production code is discouraged. Even if someone
>> really does that, the breakage only happens when one tries to send the
>> "serialized" data over the wire:
>>
>> Suppose Machine A `show`-ed a string and saved it into a UTF-8-encoded
>> file, and sends it to Machine B, which expects another encoding. This
>> would be surprising for those who are used to the old behavior.
>>
>> Second, though the breakage is not likely to be catastrophic for correct
>> production code, test suites could be badly affected, as pointed out by
>> Oleg Grenrus and vdukhovni in the comment section. Some test suites
>> compare `show` results with expected results. vdukhovni further
>> commented that Haskell escapes are not universally supported by
>> non-Haskell tools, so the impact would be confined to Haskell.
>>
>>
>> Prior Art
>> ------------------------------------------------------------------------
>> Python has supported Unicode natively since version 3. Python's approach is
>> intuitive and capable. Its `repr`, which is equivalent to Haskell's
>> `show`, automatically escapes unreadable characters, but leaves readable
>> characters unescaped. The criteria of "readable" can be found in
>> CPython's code [3]. If we were to realize this proposal, Python could
>> be a source of inspiration.
>>
>>
>> Unresolved Problems
>> ------------------------------------------------------------------------
>> There are some currently unresolved (not discussed enough) issues.
>>
>> + Locales.
>>
>> What if the specified locale does not support Unicode? Hécate
>> Moonlight pointed out PEP-538 [4] could be a reference.
>>
>> + Unicode versions.
>>
>> Javran Cheng pointed out u_iswprint is generated from a Unicode table,
>> which is manually updated. This raises a concern that the definition
>> of "printable" characters could change from version to version.
>>
>> + Definition of "readable".
>>
>> Unicode already defined "printability". It's good, but it is not
>> necessarily what we want here.
>>
>> - Should we support RTL?
>> - Should we design a Haskell-specific definition of readability, to
>> avoid Unicode versions silently introducing breakage?
>>
>> (More?)
>>
>> Some issues here perhaps require better answers to: What is our
>> expectation of Show? Where should it be used? Should we expect it to
>> break on every Unicode update?
>>
>>
>> [1] https://hackage.haskell.org/package/text-1.2.4.1/docs/src/Data.Text.Show.html#line-37
>> [2] https://mail.haskell.org/pipermail/haskell-cafe/2016-February/122874.html
>> [3] https://github.com/python/cpython/blob/bb3e0c240bc60fe08d332ff5955d54197f79751c/Objects/unicodectype.c#L147
>> [4] https://www.python.org/dev/peps/pep-0538/
>> ------------------------------------------------------------------------
>> Libraries mailing list
>> Libraries at haskell.org
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries