[GHC] #15118: Printing non-ASCII characters to console on Windows
GHC
ghc-devs at haskell.org
Thu May 3 23:32:10 UTC 2018
#15118: Printing non-ASCII characters to console on Windows
-------------------------------------+-------------------------------------
Reporter: lehins | Owner: (none)
Type: bug | Status: new
Priority: normal | Milestone: 8.6.1
Component: Compiler | Version: 8.2.2
Keywords: | Operating System: Unknown/Multiple
Architecture: Unknown/Multiple | Type of failure: None/Unknown
Test Case: | Blocked By:
Blocking: | Related Tickets:
Differential Rev(s): | Wiki Page:
-------------------------------------+-------------------------------------
As part of an initiative to get stack working properly on Windows for
users with international names
(https://github.com/commercialhaskell/stack/issues/3988), and while trying
to find a fix for {{{ghc-pkg}}} (#15021), I ran into a weird behavior that
has been known for a while and affects other languages too, not only
Haskell.
First of all, here is the default behavior on Windows with a locale that
isn't Cyrillic for this program:
{{{
main :: IO ()
main = putStrLn "Алексей Кулешевич"
}}}
{{{
PS C:\phab\windows-console> stack exec -- console
console.EXE: <stdout>: commitBuffer: invalid argument (invalid character)
}}}
Now consider this program:
{{{
import System.IO (hSetEncoding, stdout, utf8)

main :: IO ()
main = do
  hSetEncoding stdout utf8
  putStrLn "Алексей Кулешевич"
}}}
Compiling and running it on Windows 7 with English locale results in:
{{{
PS C:\phab\windows-console> stack exec -- console
Алексей Кулешевич
PS C:\phab\windows-console> chcp 65001
Active code page: 65001
PS C:\phab\windows-console> stack exec -- console
Алексей Кулешевич
лешевич
�ич
}}}
No knowledge of Russian is necessary to see that after the code page is
set to {{{65001}}}, characters are printed to the console that don't
belong there. That seems to be a bug in Windows' handling of Unicode
characters, since the result is exactly the same in `cmd` as in
PowerShell, and it has been reported with other languages such as Perl
and Java.
It is worth noting that this also directly affects `ghc` whenever the
{{{GHC_CHARENC}}} environment variable is set to {{{"UTF-8"}}}.
Besides the bug described above, it is sad that we need to rely on both
the code page and the handle encoding being set correctly in order to see
even the semi-correct output without a total program crash.
The fix proposed here is to use the {{{WriteConsoleW}}} API call instead
of writing to a handle, but only when the handle is actually a console and
not a pipe. This allows us to print Unicode characters correctly without
changing or relying on the current code page setting. Here is sample
output from my recent experiments:
{{{
PS C:\phab\windows-console> chcp
Active code page: 437
PS C:\phab\windows-console> stack exec -- console
Алексей Кулешевич
}}}
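For illustration only, the approach described above (call
{{{WriteConsoleW}}} when stdout is a real console, otherwise write to the
handle as usual) could be sketched roughly as follows. This is not the
proposed patch. The name `putStrConsole` is hypothetical, using
{{{GetConsoleMode}}} to tell a console from a pipe is an assumption, and
the `ccall` convention assumes 64-bit Windows (32-bit builds would need
`stdcall`). Partial writes and error codes are ignored for brevity.

```haskell
{-# LANGUAGE CPP #-}
module Main (main) where

import System.IO

#if defined(mingw32_HOST_OS)
import Data.Word (Word32)
import Foreign.C.String (withCWStringLen)
import Foreign.C.Types (CInt (..), CWchar)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (Ptr, nullPtr)

-- Raw Win32 bindings (HANDLE as Ptr (), DWORD as Word32, BOOL as CInt).
-- Assumption: 64-bit Windows, where 'ccall' is the right convention.
foreign import ccall unsafe "windows.h GetStdHandle"
  c_GetStdHandle :: Word32 -> IO (Ptr ())
foreign import ccall unsafe "windows.h GetConsoleMode"
  c_GetConsoleMode :: Ptr () -> Ptr Word32 -> IO CInt
foreign import ccall unsafe "windows.h WriteConsoleW"
  c_WriteConsoleW
    :: Ptr () -> Ptr CWchar -> Word32 -> Ptr Word32 -> Ptr () -> IO CInt

-- STD_OUTPUT_HANDLE is defined as ((DWORD)-11).
stdOutputHandle :: Word32
stdOutputHandle = 0xFFFFFFF5

-- | Hypothetical helper: write a String to stdout, using WriteConsoleW
-- when stdout is attached to a console and the ordinary Handle otherwise.
putStrConsole :: String -> IO ()
putStrConsole str = do
  hFlush stdout  -- keep ordering with anything already buffered
  h <- c_GetStdHandle stdOutputHandle
  -- GetConsoleMode fails (returns 0) for pipes and files.
  isConsole <- alloca $ \modePtr -> (/= 0) <$> c_GetConsoleMode h modePtr
  if isConsole
    then withCWStringLen str $ \(buf, len) ->   -- UTF-16 on Windows
           alloca $ \writtenPtr -> do
             -- Sketch only: the result and partial writes are ignored.
             _ <- c_WriteConsoleW h buf (fromIntegral len) writtenPtr nullPtr
             return ()
    else hPutStr stdout str
#else
-- On other platforms there is nothing special to do.
putStrConsole :: String -> IO ()
putStrConsole = hPutStr stdout
#endif

main :: IO ()
main = do
  -- Only matters for the fallback (pipe or file) path.
  hSetEncoding stdout utf8
  putStrConsole "Алексей Кулешевич\n"
```

Because the console path passes UTF-16 code units straight to the
console API, it does not depend on the active code page. The handle
encoding still applies whenever output is redirected.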
I'll add some code examples of the proposed solution in the upcoming days.
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/15118>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler