[Haskell-i18n] Surrogate pairs?

21 Aug 2002 12:04:27 +0200

On Wed, 2002-08-21 at 02:07, Ashley Yakeley wrote:
> >Anyway, as you put it, I take it that there should never be a character
> >composed of two Chars.
>
> That's not quite correct. Every code point is exactly one Char, but some
> characters may be composed of more than one code point. For instance, 'á'
> might be represented as
>
>   \#00E1 [LATIN SMALL LETTER A WITH ACUTE]
>
> or
>
>   \#0061 [LATIN SMALL LETTER A] + \#0301 [COMBINING ACUTE ACCENT]

Oh yes, my wording was inaccurate. I agree with what you say in your
later message: these would be two different strings, and separate external
functions should be used to compose and decompose characters.
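
As a minimal sketch of the point above (the names precomposed, decomposed
and codePoints are made up for illustration and come from no library), the
two representations are simply different lists of Chars, and plain String
equality does no normalization:

```haskell
import Data.Char (ord)

-- Precomposed form: the single code point U+00E1.
precomposed :: String
precomposed = "\x00E1"

-- Decomposed form: U+0061 followed by the combining accent U+0301.
decomposed :: String
decomposed = "a\x0301"

-- The code points of a String, one Int per Char.
codePoints :: String -> [Int]
codePoints = map ord
```

In GHCi, precomposed == decomposed evaluates to False, and length gives 1
and 2 respectively. Making the two compare equal would be the job of a
separate normalization function, which this sketch deliberately leaves out.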

> > The wording in the report about 16 bits will go,
> >and the Int representation of Char uses Unicode scalar values.
>
> Currently GHC restricts Chars to [0,0x10FFFF], for instance:

Oh, right, I hadn't even tried that. I had just noticed that Hugs
rejects anything above \255.

>   Prelude> toEnum 0x0061 :: Char
>   'a'
>   Prelude> toEnum 0x10FFFF :: Char
>   '\1114111'
>   Prelude> toEnum 0x110000 :: Char
>   *** Exception: Prelude.chr: bad argument
>   Prelude>
>
> I think this is correct behaviour.

I agree. This reminds me that we have to spend some time thinking about
what guarantees the report should make about which values a Char may
hold (think surrogates, noncharacters...).
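
To make the two categories concrete, here is a rough sketch of how they
could be tested for. The predicate names isSurrogateCP and isNoncharacterCP
are hypothetical and not part of the report or of any library. The ranges
are the ones Unicode defines: surrogates are U+D800..U+DFFF, and
noncharacters are U+FDD0..U+FDEF plus the last two code points of every
plane. Whether the report should exclude either set is exactly the open
question.

```haskell
import Data.Char (ord)

-- Surrogate code points: U+D800 .. U+DFFF.
isSurrogateCP :: Char -> Bool
isSurrogateCP c = n >= 0xD800 && n <= 0xDFFF
  where
    n = ord c

-- Noncharacters: U+FDD0 .. U+FDEF, plus U+xxFFFE and U+xxFFFF
-- in each of the 17 planes.
isNoncharacterCP :: Char -> Bool
isNoncharacterCP c =
  (n >= 0xFDD0 && n <= 0xFDEF) || (n `mod` 0x10000 >= 0xFFFE)
  where
    n = ord c
```

For example, isNoncharacterCP '\xFFFE' is True and isSurrogateCP 'a' is
False.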

Regards,
Sven Moritz