Unicode again
Wolfgang Jeltsch
wolfgang@jeltsch.net
Sat, 5 Jan 2002 13:04:08 +0100
Hello,
there was some discussion about Unicode and the Char type some time ago. At
the moment I'm writing some Haskell code dealing with XML. The problem is
that there seems to be no consensus concerning Char, which makes it
difficult for me to handle the Unicode issues in XML appropriately. Is
there any option that is likely to make it into the Haskell report? As far
as I remember, the following were more or less proposed:
(1) range '\x0000' to '\xFFFF'; UTF-16 is used
(2) range '\x000000' to '\x10FFFF'; chars denote codepoints
(3) range '\x00000000' to '\x7FFFFFFF'; chars denote codepoints
GHC 5 seems to implement the second variant; Hugs still uses the poor range
of '\x00' to '\xFF'. What does nhc98 do?
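A quick way to see which variant an implementation uses is to inspect the
bounds of Char. The following is just a small probe, assuming the
implementation provides the standard Bounded instance for Char:

  import Data.Char (ord)

  main :: IO ()
  main = putStrLn ("Char goes up to codepoint "
                   ++ show (ord (maxBound :: Char)))
  -- 255 indicates the old Latin-1 range, 1114111 (= '\x10FFFF') variant (2),
  -- 2147483647 (= '\x7FFFFFFF') variant (3).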
My opinion is that using (1) is very, very bad. The name "Char" suggests
that values of the type are characters (glyphs). Admittedly, even with (2)
or (3) a Char denotes a codepoint rather than a character, but that is
still closer to denoting characters than (1) is. And memory usage shouldn't
be an issue - a much cleaner concept is IMO worth the higher memory use. I
prefer (3) over (2) because the Unicode character set might be expanded in
the future. Another solution would be to specify that the upper bound of
Char is initially '\x10FFFF' and shall be adapted as Unicode evolves.
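To make the codepoint/character distinction concrete: with (2) or (3) a
glyph the user perceives as one character may still consist of several
Chars. A small sketch (assuming the implementation accepts codepoints
beyond '\xFF'):

  -- The same visible letter, precomposed vs. base letter plus combining accent.
  precomposed, decomposed :: String
  precomposed = "\x00E9"     -- one Char: LATIN SMALL LETTER E WITH ACUTE
  decomposed  = "e\x0301"    -- two Chars: 'e' plus COMBINING ACUTE ACCENT

  main :: IO ()
  main = print (length precomposed, length decomposed)   -- prints (1,2)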
Also, as someone else already pointed out, we should introduce a package
for encoding and decoding character strings to/from octet streams. The
encoding used for character I/O must be specified as well.
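As an illustration of what such a package would have to cover, here is a
minimal sketch of a UTF-8 encoder from codepoints to octets (the name
utf8Encode is made up; a real package would also need decoding, error
handling and further encodings such as UTF-16 and Latin-1):

  import Data.Bits ((.&.), (.|.), shiftR)
  import Data.Char (ord)
  import Data.Word (Word8)

  -- Encode a string of Unicode codepoints as a UTF-8 octet stream.
  utf8Encode :: String -> [Word8]
  utf8Encode = concatMap (encode . ord)
    where
      encode c
        | c < 0x80    = [byte c]
        | c < 0x800   = [byte (0xC0 .|. shiftR c 6), cont c]
        | c < 0x10000 = [byte (0xE0 .|. shiftR c 12), cont (shiftR c 6), cont c]
        | otherwise   = [ byte (0xF0 .|. shiftR c 18)
                        , cont (shiftR c 12), cont (shiftR c 6), cont c ]
      cont c = byte (0x80 .|. (c .&. 0x3F))   -- continuation byte 10xxxxxx
      byte   = fromIntegral                   -- Int -> Word8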
Any comments?
Wolfgang