Unicode again

Kent Karlsson kentk@md.chalmers.se
Tue, 15 Jan 2002 21:57:54 +0100


> -----Original Message-----
> From: haskell-admin@haskell.org [mailto:haskell-admin@haskell.org]On
> Behalf Of Wolfgang Jeltsch
> Sent: 5 January 2002 13:04
> To: The Haskell Mailing List
> Subject: Unicode again
> 
> 
> Hello,
> there was some discussion about Unicode and the Char type some time ago. At
> the moment I'm writing some Haskell code dealing with XML. The problem is
> that there seems to be no consensus concerning Char so that it is difficult
> for me to deal with the XML Unicode issues appropriately. Is there any
> option that is very likely to get into the Haskell report?

I don't have any opinion on what is likely for Haskell here, but...

> According to my
> memory the following were more or less propagated:
>     (1) range '\x0000' to '\xFFFF'; UTF-16 is used
>     (2) range '\x000000' to '\x10FFFF'; chars denote codepoints
>     (3) range '\x00000000' to '\x7FFFFFFF'; chars denote codepoints


My suggestion here is (2a): range '\x0' to '\xD7FF' union '\xE000' to '\x10FFFF';
chars denote codepoints.  The excluded subrange is for "surrogate" codes, which
are excluded from UTF-8 and UTF-32 and must occur in proper pairs in UTF-16.
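
To make (2a) concrete, here is a minimal sketch of the range test and of how
a codepoint above '\xFFFF' becomes a surrogate pair in UTF-16.  The names
isScalarValue and encodeUtf16 are made up for illustration; they are not
taken from any Haskell implementation or library.

```haskell
import Data.Word (Word16)

-- | True for codepoints in the proposed range (2a):
-- 0x0..0xD7FF union 0xE000..0x10FFFF (surrogate codes excluded).
-- Hypothetical helper, named for this sketch only.
isScalarValue :: Int -> Bool
isScalarValue n = (n >= 0x0 && n <= 0xD7FF) || (n >= 0xE000 && n <= 0x10FFFF)

-- | Encode one codepoint as UTF-16 code units (hypothetical helper).
-- Codepoints outside range (2a) are rejected with Nothing.
encodeUtf16 :: Int -> Maybe [Word16]
encodeUtf16 n
  | not (isScalarValue n) = Nothing
  | n < 0x10000           = Just [fromIntegral n]
  | otherwise             =
      let v  = n - 0x10000                -- 20-bit offset above the BMP
          hi = 0xD800 + (v `div` 0x400)   -- high surrogate: top 10 bits
          lo = 0xDC00 + (v `mod` 0x400)   -- low surrogate: bottom 10 bits
      in  Just [fromIntegral hi, fromIntegral lo]
```

For example, encodeUtf16 0x1D11E gives Just [0xD834, 0xDD1E], while
encodeUtf16 0xD800 gives Nothing, since a lone surrogate code is not in
the range.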



> GHC 5 seems to implement the second variant; Hugs still uses the poor range
> of '\x00' to '\xFF'. What does nhc98 do?
> My opinion is that using (1) is very, very bad. The name 

Not really.  Most processes on text must take context into account.
Even such a seemingly simple thing as counting what most users think
of as characters (or splitting strings at legitimate points) must take
context into account.  E.g. <a, combining ring above> is one character
in the view of the user (though not in the view of the character processing
programmer).  Everyone who is serious about Unicode, and for whom efficiency
is also of concern(!), targets UTF-16 (MacOS, Windows, Epoc, Java, Oracle, ...).
That does not necessarily mean that Haskell should follow suit.  
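
A small sketch of the <a, combining ring above> case, assuming a Char that
holds one codepoint: length counts codepoints, not what the user sees.  The
function countBaseChars is a made-up name and only a crude approximation
(it skips combining marks); real segmentation needs more context than this.

```haskell
import Data.Char (isMark)

-- | <a, combining ring above> as a String of two codepoints.
aRing :: String
aRing = ['a', '\x030A']

-- | Crude count of user-perceived characters: count every codepoint that
-- is not a combining mark.  Hypothetical helper, an approximation only.
countBaseChars :: String -> Int
countBaseChars = length . filter (not . isMark)
```

Here length aRing is 2 but countBaseChars aRing is 1, and that holds
whichever of (1), (2) or (3) is chosen for Char.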

> "Char" suggests
> that values of the type are characters (glyphs). Well, even 

Characters and glyphs are very different concepts.  I will not go into
detail here, just note that there is not a 1-1 relationship between
characters and glyphs even in a single font for many scripts.

> when using (2)
> or (3) Char does not denote characters but codepoints, but 
> this is closer
> to denoting chars than (1).

Very marginally.  But when looking at individual characters
(in the Unicode/10646 sense) UTF-32 is better.

> And memory usage shouldn't be an issue - a

For Haskell, I agree.

> concept being much 

s/much/marginally (for strings)/ 

> better is IMO worth the higher memory use. I prefer (3)
> over (2) because there is the possibility of expansion of the Unicode
> character set in the future. 

Well, it is going in the other direction.  With amendment 1 to 10646-1:2000
the limit at 10FFFF is strengthened also in 10646, even though it is not
yet as absolute as in Unicode.

	/kent k


> Another solution would be to specify that the
> upper bound of Char is initially '\x10FFFF' and shall be adapted as Unicode
> evolves.
> Also, as already said by another person, we should introduce a package
> dealing with encoding and decoding of character strings to/from octet
> streams. The encoding used with character I/O must be specified.
> Any comments?
> 
> Wolfgang
> 