[Haskell-cafe] Has character changed in GHC 6.8?

Wed Jan 23 05:58:54 EST 2008

Peter Verswyvelen <bf3 at telenet.be> writes:

> No I just used wrong terminology. When I said unicode, I actually meant UCS-x,

You might as well say UCS-4; nobody uses UCS-2 anymore.  It's been
replaced by UTF-16, which gives you the complexity of UTF-8 without
being compact (for 99% of existing data), endianness-indifferent, or backwards
compatible with ASCII.
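
To make the comparison concrete, here is a rough sketch of both
encodings written out by hand.  The names utf16Units and utf8Bytes are
made up for illustration and are not from any library, and the sketch
assumes its input is not itself a surrogate code point.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8, Word16)

-- UTF-16 code units for one code point (hypothetical helper).
-- Code points above U+FFFF need a surrogate pair, so UTF-16 is
-- variable-length just like UTF-8.
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
  where
    n = ord c
    m = n - 0x10000

-- UTF-8 bytes for one code point (hypothetical helper).
-- ASCII maps to itself in a single byte.
utf8Bytes :: Char -> [Word8]
utf8Bytes c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- lead byte: marker bits plus the top bits of the code point
    lead :: Int -> Int -> Word8
    lead marker s = fromIntegral (marker .|. (n `shiftR` s))
    -- continuation byte: 10xxxxxx carrying six bits
    cont :: Int -> Word8
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

main :: IO ()
main = do
  -- 'a', U+00E9, U+20AC, U+1D11E
  let s = "a\233\8364\119070"
  print (map (length . utf16Units) s)  -- [1,1,1,2] units of 2 bytes each
  print (map (length . utf8Bytes) s)   -- [1,2,3,4] bytes
```

Both encodings end up variable-length, but only UTF-8 leaves plain
ASCII untouched and has no byte-order question.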

> and with multi-byte-string-thing I meant VARIABLE-length, sorry about that. I
> find variable length chars so much harder to use and reason about than the
> fixed length characters. UTF-x is a form of compression, which is
> understandable, but it is IMHO a burden (since it does not allow random access
> to the n-th character)

Do you really need that, though?  Most formats I know with enough structure
that you can pick up records by offset either encode the offsets
somewhere, or are restricted to ASCII, or both.

> Now I'm getting a bit confused here. To summarize, what encoding does GHC 6.8.2
> use for [Char]? UCS-32?

Internally, a Haskell Char is Unicode: it stores a code point as a
32-bit (well, actually 21-bit or something) value.  One Char, one code
point.
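
A small sketch of what that means in practice, using only Data.Char
from base (the names sample and codePoints are made up for this
example):

```haskell
import Data.Char (chr, ord)

-- A String is [Char]; every element is exactly one code point,
-- whether or not it fits in 8 or 16 bits.
sample :: String
sample = ['a', '\955', '\8364', '\119070']  -- U+0061, U+03BB, U+20AC, U+1D11E

-- Hypothetical helper: the numeric code point of each Char.
codePoints :: String -> [Int]
codePoints = map ord

main :: IO ()
main = do
  print (length sample)              -- 4
  print (codePoints sample)          -- [97,955,8364,119070]
  print (ord (maxBound :: Char))     -- 1114111, i.e. 0x10FFFF, which fits in 21 bits
  print (sample !! 3 == chr 0x1D11E) -- True: indexing counts code points
```

Note that (!!) on a list is still a linear walk, so this gives
indexing by code point but not constant-time random access.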

ByteString stores 8-bit "char"s, and the Char8 interface chops off the
top bits, essentially projecting codepoints down to the ISO-8859-1
(latin1) subset.
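
A minimal sketch of that truncation, assuming the bytestring package
that ships with GHC (the name roundTrip is made up for this example):

```haskell
import qualified Data.ByteString.Char8 as B8
import Data.Char (ord)

-- Hypothetical helper: pack a String through the Char8 interface,
-- unpack it again, and return the resulting code points.
roundTrip :: String -> [Int]
roundTrip = map ord . B8.unpack . B8.pack

main :: IO ()
main = do
  -- U+00E9 is within latin1 and survives unchanged.
  print (roundTrip ['x', '\233'])  -- [120,233]
  -- U+03BB is 0x3BB; only the low byte 0xBB (187) is kept.
  print (roundTrip ['\955'])       -- [187]
```

Anything above U+00FF comes back as a different character, so Char8
is only safe for data known to be latin1 or plain bytes.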

Externally, it depends on what IO library you use.

As for the command line, Ian's post links to:
  http://www.haskell.org/ghc/docs/6.8.2/html/users_guide/release-6-8-2.html

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants