[Haskell-cafe] How to reverse ghc encoding of command line arguments

Donn Cave donn at avvanta.com
Mon Nov 17 21:11:39 UTC 2014


[... I said earlier ...]
> I may be confused here - trying this out, I seem to be getting
> garbage I don't understand from System.Environment getArgs.

So I returned to this out of curiosity.  Specifically,
System.Environment getArgs converts common accented characters
in ISO-8859-1 command line arguments into values in the
high 0xDC00s.  Lower-case u with umlaut, for example, comes out
as 0xDCFC.  These values, fed into Data.Text pack and encodeUtf8,
seem to produce garbage: I get 3-byte UTF-8 that I highly doubt
has anything to do with accented Latin characters.  In fact it is
the same "\239\191\189" even for different chars.

But the low bytes look like Unicode values (0xFC is u with umlaut),
and if the upper 0xDC00 is cleared, Data.Text pack and encodeUtf8 work.
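
A minimal sketch of the two observations above, assuming the original
argument bytes were ISO-8859-1.  The names utf8Bytes and unescape are
made up for illustration, and the 0xDC80..0xDCFF range is my reading of
the values getArgs produced, not something taken from the GHC
documentation.  The repeated "\239\191\189" is the UTF-8 encoding of
U+FFFD, the replacement character that Data.Text pack substitutes for
surrogate code points, which would explain why every accented
character comes out the same.

```haskell
import Data.Bits ((.&.))
import Data.Char (chr, ord)
import Data.Word (Word8)
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper: UTF-8 bytes of a String, going through Data.Text.
utf8Bytes :: String -> [Word8]
utf8Bytes = B.unpack . TE.encodeUtf8 . T.pack

-- Hypothetical helper: clear the upper 0xDC00 bits of an escaped Char,
-- leaving the low byte.  Treating that byte as a code point is only
-- right if the original input was ISO-8859-1 (an assumption here).
unescape :: Char -> Char
unescape c
  | n >= 0xDC80 && n <= 0xDCFF = chr (n .&. 0xFF)
  | otherwise                  = c
  where
    n = ord c

main :: IO ()
main = do
  -- What getArgs appeared to return for the Latin-1 bytes of "für".
  let arg = ['f', chr 0xDCFC, 'r']
  print (utf8Bytes arg)                 -- [102,239,191,189,114]
  print (utf8Bytes (map unescape arg))  -- [102,195,188,114]
```

The first line shows the 0xDCFC value collapsing to "\239\191\189"; the
second shows the expected two-byte UTF-8 for u with umlaut once the
upper bits are cleared.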

I'm no Unicode whiz, maybe this all makes sense?  I'm not
inconvenienced by this myself, my interest is only academic,
just wondering what the extra 0xDC00 bits are for.  And I
should note that as far as I can make out, this doesn't match
the remark at the beginning of this thread:  "... does *not*
contain the Unicode code points of the characters the user has
entered.  Instead the input bytes are mapped one-to-one to Char."
I have GHC 7.8.3.

thanks,
	Donn

