[GHC] #15118: Printing non-ASCII characters to console on Windows

GHC ghc-devs at haskell.org
Thu May 3 23:32:10 UTC 2018


#15118: Printing non-ASCII characters to console on Windows
-------------------------------------+-------------------------------------
           Reporter:  lehins         |             Owner:  (none)
               Type:  bug            |            Status:  new
           Priority:  normal         |         Milestone:  8.6.1
          Component:  Compiler       |           Version:  8.2.2
           Keywords:                 |  Operating System:  Unknown/Multiple
       Architecture:                 |   Type of failure:  None/Unknown
  Unknown/Multiple                   |
          Test Case:                 |        Blocked By:
           Blocking:                 |   Related Tickets:
Differential Rev(s):                 |         Wiki Page:
-------------------------------------+-------------------------------------
 As part of an initiative to get stack working properly on Windows for
 users with international names
 (https://github.com/commercialhaskell/stack/issues/3988), and while
 trying to find a fix for {{{ghc-pkg}}} (#15021), I came across a weird
 behavior that has been known for a while and affects other languages,
 not only Haskell.

 First of all, here is the default behavior of this program on Windows
 with a locale that isn't Cyrillic:

 {{{
 main :: IO ()
 main = putStrLn "Алексей Кулешевич"
 }}}


 {{{
 PS C:\phab\windows-console> stack exec -- console
 console.EXE: <stdout>: commitBuffer: invalid argument (invalid character)
 }}}
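
 This kind of error is raised when the text encoding attached to
 {{{stdout}}} cannot represent a character being written. A minimal
 diagnostic sketch for checking which encodings are in effect follows; the
 helper name {{{describeEncoding}}} is made up for illustration and is not
 part of any library:

```haskell
import GHC.IO.Encoding (getLocaleEncoding)
import System.IO (TextEncoding, hGetEncoding, stdout)

-- Hypothetical helper: render a handle's encoding for display.
-- 'Nothing' means the handle is in binary mode (no text encoding).
describeEncoding :: Maybe TextEncoding -> String
describeEncoding = maybe "none (binary mode)" show

main :: IO ()
main = do
  -- Encoding GHC picked up from the environment at startup.
  locale <- getLocaleEncoding
  -- Encoding currently attached to stdout.
  handleEnc <- hGetEncoding stdout
  putStrLn ("locale encoding: " ++ show locale)
  putStrLn ("stdout encoding: " ++ describeEncoding handleEnc)
```

 Running this before and after {{{chcp}}} should show whether the handle
 encoding follows the active code page on a given machine.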

 Now consider this program:

 {{{
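 import System.IO (hSetEncoding, stdout, utf8)

 main :: IO ()
 main = do
   -- Force UTF-8 on stdout regardless of the locale encoding.
   hSetEncoding stdout utf8
   putStrLn "Алексей Кулешевич"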

 Compiling and running it on Windows 7 with an English locale results in:
 {{{
 PS C:\phab\windows-console> stack exec -- console
 Алексей Кулешевич
 PS C:\phab\windows-console> chcp 65001
 Active code page: 65001
 PS C:\phab\windows-console> stack exec -- console
 Алексей Кулешевич
 лешевич
 �ич

 }}}

 No knowledge of Russian is necessary to see that once the code
 page is set to {{{65001}}}, characters are printed to the console
 that don't belong there. This seems to be a bug in how Windows handles
 Unicode characters, since the result is exactly the same in `cmd` and
 PowerShell, and it has been reported for other languages such as Perl and Java.

 It is worth noting that this also directly affects `ghc` whenever the
 {{{GHC_CHARENC}}} environment variable is set to {{{"UTF-8"}}}.

 Besides the bug described above, it is sad that both the code page and
 the handle encoding must be set correctly just to see semi-correct
 output without a total program crash.

 The fix proposed here is to use the {{{WriteConsoleW}}} API call instead
 of writing to a handle, but only when the handle is actually a console and
 not a pipe. This allows us to print Unicode characters correctly without
 changing or relying on the current code page setting. Here is sample
 output from my recent experiments:

 {{{
 PS C:\phab\windows-console> chcp
 Active code page: 437
 PS C:\phab\windows-console> stack exec -- console
 Алексей Кулешевич
 }}}
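
 One possible shape of such a fix is sketched below. It is a minimal
 illustration under stated assumptions, not the implementation referred to
 in this ticket: the name {{{putStrConsole}}} is hypothetical, console
 detection is done with {{{GetConsoleMode}}} (which fails for pipes and
 files), and partial writes and very long strings are not handled. On
 non-Windows platforms the sketch simply falls back to {{{hPutStr}}}.

```haskell
{-# LANGUAGE CPP #-}
module Main (main) where

import System.IO

#if defined(mingw32_HOST_OS)
import Control.Monad (when)
import Data.Int (Int32)
import Data.Word (Word32)
import Foreign.C.String (withCWStringLen)
import Foreign.C.Types (CWchar)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (Ptr, nullPtr)

-- Win32 API functions use stdcall on 32-bit x86 and the C convention elsewhere.
# if defined(i386_HOST_ARCH)
#  define WINAPI_CCONV stdcall
# else
#  define WINAPI_CCONV ccall
# endif

foreign import WINAPI_CCONV unsafe "windows.h GetStdHandle"
  c_GetStdHandle :: Word32 -> IO (Ptr ())

foreign import WINAPI_CCONV unsafe "windows.h GetConsoleMode"
  c_GetConsoleMode :: Ptr () -> Ptr Word32 -> IO Int32

foreign import WINAPI_CCONV unsafe "windows.h WriteConsoleW"
  c_WriteConsoleW
    :: Ptr () -> Ptr CWchar -> Word32 -> Ptr Word32 -> Ptr () -> IO Int32

-- STD_OUTPUT_HANDLE is defined as ((DWORD)-11).
stdOutputHandle :: Word32
stdOutputHandle = 0xFFFFFFF5

-- Hypothetical helper: write a String to stdout, using WriteConsoleW
-- when stdout is a real console and the ordinary handle otherwise.
putStrConsole :: String -> IO ()
putStrConsole s = do
  hFlush stdout  -- keep ordering with earlier buffered handle output
  h <- c_GetStdHandle stdOutputHandle
  -- GetConsoleMode returns zero when the handle is not a console.
  isConsole <- alloca $ \modePtr -> (/= 0) <$> c_GetConsoleMode h modePtr
  if isConsole
    then withCWStringLen s $ \(buf, len) ->   -- UTF-16 code units on Windows
           alloca $ \writtenPtr -> do
             ok <- c_WriteConsoleW h buf (fromIntegral len) writtenPtr nullPtr
             when (ok == 0) $ ioError (userError "WriteConsoleW failed")
    else hPutStr stdout s >> hFlush stdout
#else
-- Fallback for other platforms: the ordinary handle write.
putStrConsole :: String -> IO ()
putStrConsole s = hPutStr stdout s >> hFlush stdout
#endif

main :: IO ()
main = putStrConsole "Алексей Кулешевич\n"
```

 Because {{{WriteConsoleW}}} takes UTF-16 text directly, the console branch
 does not depend on the active code page or on the encoding set on the
 handle; the pipe branch still goes through the handle encoding.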

 I'll add some code examples of the proposed solution in the upcoming days.

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/15118>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler

