DBCS encoding support on Windows

Simon Peyton-Jones simonpj at microsoft.com
Thu Apr 25 12:17:38 CEST 2013


I was thinking of people who don't know what DBCS or a code page is.  But maybe they are going to be too clueless for comments to help!

S

From: omega.theta at gmail.com [mailto:omega.theta at gmail.com] On Behalf Of Max Bolingbroke
Sent: 24 April 2013 21:04
To: Simon Peyton-Jones
Cc: ghc-devs at haskell.org
Subject: Re: DBCS encoding support on Windows

The algorithm in the new module (GHC.IO.Encoding.CodePage.API) is rather intricate, so I've commented it quite thoroughly. The changes to other modules are minimal: when GHC doesn't have the code page built in, we now use a real code page encoding instead of incorrectly falling back to latin1, so there isn't much of a change to document.

Max

On 24 April 2013 08:12, Simon Peyton-Jones <simonpj at microsoft.com> wrote:
Great stuff.

One thing: have you left enough documentation in the code that, when someone comes along in 3 years' time, they can understand the problem and how you have dealt with it?  Lots of "Note [Blah]" stuff?  Or something.

Thanks

Simon

From: ghc-devs-bounces at haskell.org [mailto:ghc-devs-bounces at haskell.org] On Behalf Of Max Bolingbroke
Sent: 23 April 2013 21:29
To: ghc-devs at haskell.org
Subject: DBCS encoding support on Windows

Hi GHCers,

I've implemented support in GHC for extra Windows code pages on the branch "dbcs" of the base library.

The problem this solves is that users of Haskell on a Windows machine whose locale uses a double-byte code page, such as CP936 (GBK) or CP950 (Big5), currently cannot interact properly with the Windows console in their native language. Unfortunately, code page support is a prerequisite for getting this to work correctly: for all Microsoft's fine talk about Unicode being the future, the Windows console does not seem to support it properly, so code pages are the only way to go for console input and output.

As the standard Windows locale encodings in many regions, these code pages are also the predominant method of encoding text files in many countries, so they are useful outside the console.
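To make the problem concrete, here is a small illustration. It is written in Python only because its standard library ships a cp936 codec; it is not part of the GHC change. It shows that a two-character string occupies four bytes in CP936, and that reading those bytes as latin1 yields four unrelated characters rather than the original text.

```python
# Illustration only: Python's "cp936" codec stands in for the Windows code page.
text = "中文"

# Each of these characters takes two bytes in CP936 (GBK).
raw = text.encode("cp936")

# Decoding with the matching code page recovers the original text.
right = raw.decode("cp936")

# Decoding as latin1 never fails, but it maps every byte to its own
# character, so the double-byte pairs are torn apart.
wrong = raw.decode("latin-1")

print(len(text), len(raw), len(right), len(wrong))
```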

The solution is along the lines suggested in http://hackage.haskell.org/trac/ghc/ticket/3977, i.e. we create an iconv-like interface to the Windows MultiByteToWideChar and WideCharToMultiByte APIs by the judicious use of binary search. In my branch, these APIs are used whenever we don't have a built-in native Haskell TextEncoding for the code page (we used to fall back on latin1 for such code pages).

Unless there are any objections I'll merge this into the base library main branch next week.

Cheers,
Max

