[Haskell-i18n] Surrogate pairs?

Ashley Yakeley ashley@semantic.org
Tue, 20 Aug 2002 17:07:37 -0700


At 2002-08-20 04:34, Sven Moritz Hallberg wrote:

>I see. I find it pretty inconvenient to read the incremental changes in
>the different Unicode revisions. 

Me too. I have the big blue book (Unicode Standard 3.0), but I have to 
look at the updates for 3.1 and 3.2.

>I've not been able to find the exact
>place where they clarify the situation with surrogate pairs. I suppose
>what they were is now only a facet of UTF-16, is that correct?

I believe so.
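
As a rough illustration of surrogates being purely a UTF-16 matter, here 
is a minimal sketch of how a single Char would be turned into UTF-16 
code units. The name toUtf16 is made up for this example and is not part 
of any library; it also assumes Char holds a full code point, as in GHC.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Hypothetical helper: encode one Char as UTF-16 code units.
-- Code points below 0x10000 fit in one unit; anything above is
-- split into a high (0xD800..) and a low (0xDC00..) surrogate.
toUtf16 :: Char -> [Word16]
toUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
  where
    n = ord c
    m = n - 0x10000   -- 20-bit offset into the supplementary planes
```

The pair only exists in the encoded output; on the Haskell side there is 
still just one Char.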

>Anyway, as you put it, I take it that there should never be a character
>composed of two Chars.

That's not quite correct. Every code point is exactly one Char, but some 
characters may be composed of more than one code point. For instance, 'á' 
might be represented as

  \#00E1 [LATIN SMALL LETTER A WITH ACUTE]

or

  \#0061 [LATIN SMALL LETTER A] + \#0301 [COMBINING ACUTE ACCENT]
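
In Haskell terms, assuming a Char is one code point (as GHC does), the 
two forms are different Strings of different lengths. A small sketch, 
with names invented for this example:

```haskell
-- Two representations of the same character, a with acute.
precomposed :: String
precomposed = "\x00E1"         -- LATIN SMALL LETTER A WITH ACUTE

decomposed :: String
decomposed = "\x0061\x0301"    -- LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

-- Number of Chars (code points) in each representation.
codePointCounts :: (Int, Int)
codePointCounts = (length precomposed, length decomposed)

-- String equality compares code points, so it does not treat the
-- two representations as the same text.
sameCodePoints :: Bool
sameCodePoints = precomposed == decomposed
```

Treating the two as equal would need a normalisation step, which this 
sketch does not attempt.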

> The wording in the report about 16 bits will go,
>and the Int representation of Char uses Unicode scalar values.

Currently GHC restricts Chars to [0,0x10FFFF], for instance:

  Prelude> toEnum 0x0061 :: Char
  'a'
  Prelude> toEnum 0x10FFFF :: Char
  '\1114111'
  Prelude> toEnum 0x110000 :: Char
  *** Exception: Prelude.chr: bad argument
  Prelude> 

I think this is correct behaviour.
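
For code that would rather not hit the exception, a range check before 
the conversion is enough. This is only a sketch; safeChr and 
isScalarValue are hypothetical names, not Prelude or library functions. 
The second one also excludes the surrogate range, on the assumption that 
"Unicode scalar value" is meant strictly.

```haskell
import Data.Char (chr)

-- Hypothetical helper: convert an Int to a Char only when it lies in
-- the range [0, 0x10FFFF], returning Nothing instead of throwing.
safeChr :: Int -> Maybe Char
safeChr n
  | n >= 0 && n <= 0x10FFFF = Just (chr n)
  | otherwise               = Nothing

-- Hypothetical stricter check: in range and not a surrogate
-- code point (0xD800..0xDFFF).
isScalarValue :: Int -> Bool
isScalarValue n =
  n >= 0 && n <= 0x10FFFF && not (n >= 0xD800 && n <= 0xDFFF)
```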

-- 
Ashley Yakeley, Seattle WA