[GHC] #10762: On Windows, out-of-codepage characters can cause GHC build to fail

GHC ghc-devs at haskell.org
Sun Aug 9 10:58:01 UTC 2015


#10762: On Windows, out-of-codepage characters can cause GHC build to fail
-----------------------------------------+---------------------------------
              Reporter:  snoyberg        |             Owner:
                  Type:  bug             |            Status:  new
              Priority:  normal          |         Milestone:
             Component:  Compiler        |           Version:  7.10.2
              Keywords:                  |  Operating System:  Windows
          Architecture:  x86_64 (amd64)  |   Type of failure:  None/Unknown
             Test Case:                  |        Blocked By:
              Blocking:                  |   Related Tickets:
Differential Revisions:                  |
-----------------------------------------+---------------------------------
 This recently hit us in stack; see issues
 [https://github.com/commercialhaskell/stack/issues/738 738] and
 [https://github.com/commercialhaskell/stack/issues/734 734]. To
 demonstrate, I'm attaching a UTF-8 encoded Haskell program that uses
 Hebrew characters in an identifier and triggers a warning. The contents
 of that file are:

 {{{#!hs
 module Main
     ( main
     , שלום
     ) where

 main :: IO ()
 main = putStrLn שלום

 שלום = "shalom"
 }}}

 If I first set my codepage to 65001 (UTF-8), everything works as expected:

 {{{
 C:\Users\Michael\Desktop>chcp 65001
 Active code page: 65001

 C:\Users\Michael\Desktop>ghc -fforce-recomp -Wall -ddump-hi -ddump-to-file
 shalom.hs
 [1 of 1] Compiling Main             ( shalom.hs, shalom.o )

 shalom.hs:9:1: Warning:
     Top-level binding with no type signature: שלום :: [Char]
 Linking shalom.exe ...
 }}}

 However, if I set my codepage to 437 (US), GHC exits prematurely both when
 it writes the warning to the console and when it writes the .hi dump file:

 {{{
 C:\Users\Michael\Desktop>chcp 437
 Active code page: 437

 C:\Users\Michael\Desktop>ghc -fforce-recomp -Wall shalom.hs
 [1 of 1] Compiling Main             ( shalom.hs, shalom.o )

 shalom.hs:9:1: Warning:
     Top-level binding with no type signature: <stderr>: commitBuffer:
 invalid argument (invalid character)
 }}}

 {{{
 C:\Users\Michael\Desktop>chcp 437
 Active code page: 437

 C:\Users\Michael\Desktop>ghc -fforce-recomp -ddump-hi -ddump-to-file
 shalom.hs
 [1 of 1] Compiling Main             ( shalom.hs, shalom.o )
 shalom.dump-hi: commitBuffer: invalid argument (invalid character)
 }}}

 At the very least, I would argue that -ddump-to-file should always dump to
 the output files as UTF-8, as this is the most useful for tooling. Beyond
 that, there are a few options here:

 * Have all output, including to the console, go out as UTF-8. This may not
 play nicely with consoles unless the output codepage is also set.
 * Provide a command line option or environment variable to specify "output
 as UTF-8."
 * More radical: make UTF-8 the default encoding for all Handles, instead of
 paying attention to code pages and environment variables. Honestly, this is
 my preference, but it's a bigger discussion than this one bug.
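
 To make the first two options concrete, here is a minimal sketch of forcing
 UTF-8 on individual Handles with `hSetEncoding` from `System.IO`. It is only
 an illustration and not how GHC writes its output internally; the helper
 name `writeUtf8File` and the file name used in `main` are assumptions made
 for the example.

```haskell
module Main (main) where

import System.IO

-- | Hypothetical helper (not a GHC internal): write a file as UTF-8
-- regardless of the active code page or locale.
writeUtf8File :: FilePath -> String -> IO ()
writeUtf8File path contents =
  withFile path WriteMode $ \h -> do
    -- Override the default encoding, which is derived from the
    -- code page / locale, for this one Handle only.
    hSetEncoding h utf8
    hPutStr h contents

main :: IO ()
main = do
  -- Switch stderr to UTF-8 as well. On a console that is not in
  -- codepage 65001 the glyphs may render incorrectly, but the write
  -- should not be rejected as an invalid character.
  hSetEncoding stderr utf8
  hPutStrLn stderr "Top-level binding with no type signature: שלום :: [Char]"
  writeUtf8File "shalom.dump-hi" "שלום :: [Char]\n"
```

 Setting the encoding per Handle keeps the change local to the process,
 unlike changing the console codepage.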

 The workaround we've implemented in stack for now is setting the codepage
 to 65001 for the console while running stack. This is not ideal, since
 this is essentially a global setting for the entire console.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/10762>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler

