[Haskell-cafe] How to reverse ghc encoding of command line arguments

Ben Franksen ben.franksen at online.de
Mon Nov 17 23:41:57 UTC 2014


Donn Cave wrote:
> [... I said earlier ...]
>> I may be confused here - trying this out, I seem to be getting
>> garbage I don't understand from System.Environment getArgs.
> 
> So I returned to this out of curiosity, and specifically,
> System.Environment getArgs converts common accented characters
> in ISO-8859-1 command line arguments, into values in the
> high 0xDC00's.  Lower case umlaut u, for example, is 0xDCFC.
> These values, fed into Data.Text pack and encodeUtf8, seem
> to be garbage ... I get 3-byte UTF-8 that I highly doubt
> has anything to do with accented latin characters, actually
> the same "\239\191\189" even for different chars.
> 
> But the lower bytes looked like Unicode values, and if the
> upper 0xDC00 is cleared, Data.Text pack and encodeUtf8 works.
> 
> I'm no Unicode whiz, maybe this all makes sense?  I'm not
> inconvenienced by this myself, my interest is only academic,
> just wondering what the extra 0xDC00 bits are for.  And I
> should note that as far as I can make out, this doesn't match
> the remark at the beginning of this thread:  "... does *not*
> contain the Unicode code points of the characters the user has
> entered.  Instead the input bytes are mapped one-to-one to Char."
> I have GHC 7.8.3.

Hi Donn

I am sorry, I should have replied here earlier to say that I was *wrong*:
GHC/base does not by default do what I claimed it does, as I learned later
and as you now confirm. It maps input bytes one-to-one to Char only if the
program expressly demands it by selecting the so-called "char8" encoding,
which is done by initializing the global variable localeEncoding before the
base library does it for you. In this way a program can override the user's
locale as seen by GHC/base. I was working on Darcs, and this is what Darcs
does. I was not aware of this hack, though, being used to local reasoning
in Haskell (doesn't Haskell claim to be a purely functional language?).

Sorry for the confusion. And thanks for confirming that GHC and the base 
library do the right thing (if we let them).
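
As for what the extra 0xDC00 bits are for: as far as I understand it, when
GHC decodes an argument with the locale's encoding and hits a byte it cannot
decode, it escapes that byte as the code point 0xDC00 + byte, so that the
original bytes can be recovered later. These code points are lone
surrogates, and I believe Data.Text replaces them with U+FFFD, whose UTF-8
encoding is exactly the "\239\191\189" you saw. Below is a minimal sketch of
reversing the escape by hand. The names unescapeByte and latin1FromEscapes
are made up for illustration, and reading the recovered bytes as ISO-8859-1
is an assumption that matches your example, not something GHC can know.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- | Undo the escape for a single Char (hypothetical helper).
-- Code points 0xDC80..0xDCFF stand for the raw bytes 0x80..0xFF
-- that could not be decoded; anything else is not an escape.
unescapeByte :: Char -> Maybe Word8
unescapeByte c
  | 0xDC80 <= n && n <= 0xDCFF = Just (fromIntegral (n - 0xDC00))
  | otherwise                  = Nothing
  where
    n = ord c

-- | Replace every escaped byte by the character with the same number,
-- i.e. interpret the byte as ISO-8859-1 (an assumption about the input).
-- Characters that were decoded normally pass through unchanged.
latin1FromEscapes :: String -> String
latin1FromEscapes = map fix
  where
    fix c = maybe c (chr . fromIntegral) (unescapeByte c)
```

With this, latin1FromEscapes applied to "\xDCFC" gives "\xFC", the lower
case umlaut u, which Data.Text pack and encodeUtf8 then handle as expected.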

Cheers
Ben
-- 
"Make it so they have to reboot after every typo." -- Scott Adams



