[Haskell-i18n] Surrogate pairs?
Ashley Yakeley
ashley@semantic.org
Tue, 20 Aug 2002 17:07:37 -0700
At 2002-08-20 04:34, Sven Moritz Hallberg wrote:
>I see. I find it pretty inconvenient to read the incremental changes in
>the different Unicode revisions.
Me too. I have the big blue book (Unicode Standard 3.0), but I have to
look at the updates for 3.1 and 3.2.
>I've not been able to find the exact
>place where they clarify the situation with surrogate pairs. I suppose
>what they were is now only a facet of UTF-16, is that correct?
I believe so.
>Anyway, as you put it, I take it that there should never be a character
>composed of two Chars.
That's not quite correct. Every code point is exactly one Char, but some
characters may be composed of more than one code point. For instance, 'á'
might be represented as
#00E1 [LATIN SMALL LETTER A WITH ACUTE]
or
#0061 [LATIN SMALL LETTER A] + #0301 [COMBINING ACUTE ACCENT]
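To make the distinction concrete, here is a minimal sketch of the two representations as Haskell Strings. The names precomposed, decomposed and codePoints are invented for illustration and do not come from any library.

```haskell
import Data.Char (ord)

-- Precomposed form: the single code point 0x00E1.
precomposed :: String
precomposed = "\x00E1"

-- Decomposed form: 0x0061 followed by the combining accent 0x0301.
decomposed :: String
decomposed = "\x0061\x0301"

-- List the code points of a String, one Int per Char.
codePoints :: String -> [Int]
codePoints = map ord
```

With these definitions, length precomposed is 1 and length decomposed is 2, and precomposed == decomposed is False, because (==) on String compares Char by Char and performs no normalization. Both strings nonetheless stand for the same character.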
> The wording in the report about 16 bits will go,
>and the Int representation of Char uses Unicode scalar values.
Currently GHC restricts Chars to the range [0, 0x10FFFF]. For instance:
Prelude> toEnum 0x0061 :: Char
'a'
Prelude> toEnum 0x10FFFF :: Char
'\1114111'
Prelude> toEnum 0x110000 :: Char
*** Exception: Prelude.chr: bad argument
Prelude>
I think this is correct behaviour.
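Code that converts untrusted Ints might prefer to avoid the exception. A minimal sketch follows, assuming a helper called safeChr (a hypothetical name, not a Prelude or Data.Char function). It checks only the range shown above and does not treat the surrogate range specially.

```haskell
import Data.Char (chr, ord)

-- Hypothetical helper: convert an Int to a Char only when it lies in
-- the range that Char accepts, returning Nothing otherwise.
safeChr :: Int -> Maybe Char
safeChr n
  | n >= 0 && n <= ord (maxBound :: Char) = Just (chr n)
  | otherwise                             = Nothing
```

Here safeChr 0x0061 gives Just 'a', while safeChr 0x110000 and safeChr (-1) both give Nothing rather than raising an exception.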
--
Ashley Yakeley, Seattle WA