[Haskell-i18n] Surrogate pairs?

20 Aug 2002 13:34:11 +0200

On Tue, 2002-08-20 at 02:53, Ashley Yakeley wrote:
> At 2002-08-19 17:06, Sven Moritz Hallberg wrote:
> 
> >I just implemented a UTF-8 coder and decoder in Haskell. While reading
> >the Unicode standard I realized what someone had pointed out earlier
> >with respect to code values versus code points: Unicode, while "usually"
> >using 16-bit words, supports "surrogate pairs" to handle all 31 bits of
> >UCS-4.
> >
> >The report says, Char is a 16-bit Unicode value.
> 
> Right, sec. 6.1.2. But this should change. A Char should allow (and only 
> allow) values in the range [0,0x110000 - 1]. These are _Unicode scalar 
> values_ as defined in the standard sec. 3.7, D28. Unicode scalar values 
> are also known as "code positions" or "code points".

Oh, alright then, that's wonderful.

> >What's the stance on surrogate pairs? How are we going to support
> >those? My code currently just errors "unsupported" when encountering
> >a surrogate.
> 
> I think we should be working to the latest version of Unicode, 3.2.0. 
> <http://www.unicode.org/unicode/reports/tr28/>
> 
> If your UTF-8 decoder comes across a sequence apparently representing a 
> codepoint in the range [0xD800,0xDFFF], it should consider it 
> "ill-formed". This is a new thing in 3.2.

I see. I find it pretty inconvenient to follow the incremental changes
across the different Unicode revisions, and I've not been able to find
the exact place where they clarify the situation with surrogate pairs.
I suppose surrogate pairs are now purely a facet of UTF-16; is that
correct?
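
To make sure I understand the UTF-16 side, here is a minimal sketch of
how a code point above 0xFFFF would be split into a surrogate pair on
encoding. The name encodeUtf16 is my own, not anything from the report
or an existing library, and the sketch assumes Char already holds a
Unicode scalar value (so it never sees a value in [0xD800,0xDFFF]).

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- | Encode one Char as a list of UTF-16 code units.
-- Hypothetical helper; assumes the Char is a Unicode scalar value,
-- i.e. it is not itself in the surrogate range 0xD800..0xDFFF.
encodeUtf16 :: Char -> [Word16]
encodeUtf16 c
  | n < 0x10000 = [fromIntegral n]   -- fits in a single code unit
  | otherwise   = [hi, lo]           -- needs a surrogate pair
  where
    n  = ord c
    n' = n - 0x10000                 -- 20-bit offset into the upper planes
    hi = fromIntegral (0xD800 + (n' `shiftR` 10))  -- high surrogate: top 10 bits
    lo = fromIntegral (0xDC00 + (n' .&. 0x3FF))    -- low surrogate: bottom 10 bits
```

With this, U+1D11E comes out as the two code units 0xD834 0xDD1E, and
the pair only ever exists in the encoded form, never as two Chars.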

Anyway, as you put it, I take it that there should never be a character
composed of two Chars. The wording in the report about 16 bits will go,
and the Int representation of Char uses Unicode scalar values.

> If Chars are code points rather than 16-bit code values, then when your 
> UTF-8 decoder comes across a sequence representing a codepoint in the 
> range [0x10000,0x10FFFF], it should represent it as a single Char, not as 
> a surrogate pair of Chars. Surrogate pairs are for UTF-16; AFAIK they're 
> not supposed to exist as code points.
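
For the record, this is roughly what I take the decoder rule to be. It
is only a sketch under my own assumptions, not my actual code:
decodeOne is a hypothetical name, it works on a plain list of bytes,
and it returns Nothing for anything ill-formed (truncated input, bad
continuation bytes, overlong forms, values above 0x10FFFF, and encoded
surrogates) instead of raising an error.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Decode the first code point of a UTF-8 byte sequence.
-- Hypothetical helper: returns the decoded Char and the remaining
-- bytes, or Nothing if the leading sequence is ill-formed.
decodeOne :: [Word8] -> Maybe (Char, [Word8])
decodeOne [] = Nothing
decodeOne (b:bs)
  | b < 0x80           = Just (chr (fromIntegral b), bs)
  | b .&. 0xE0 == 0xC0 = multi 1 (fromIntegral (b .&. 0x1F)) 0x80
  | b .&. 0xF0 == 0xE0 = multi 2 (fromIntegral (b .&. 0x0F)) 0x800
  | b .&. 0xF8 == 0xF0 = multi 3 (fromIntegral (b .&. 0x07)) 0x10000
  | otherwise          = Nothing   -- stray continuation or invalid lead byte
  where
    -- k continuation bytes, payload bits of the lead byte, and the
    -- smallest value this sequence length may legally encode.
    multi :: Int -> Int -> Int -> Maybe (Char, [Word8])
    multi k lead minVal =
      case splitAt k bs of
        (conts, rest)
          | length conts == k, all isCont conts ->
              let n = foldl step lead conts
              in if n < minVal || n > 0x10FFFF || isSurrogate n
                   then Nothing           -- overlong, out of range, or surrogate
                   else Just (chr n, rest) -- one Char, even above 0xFFFF
          | otherwise -> Nothing          -- truncated or bad continuation byte
    isCont x      = x .&. 0xC0 == 0x80
    step acc x    = (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F)
    isSurrogate n = n >= 0xD800 && n <= 0xDFFF
```

So the bytes F0 9D 84 9E decode to the single Char U+1D11E, while
ED A0 80 (which would be 0xD800) is rejected as ill-formed.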

OK.

Thanks for the clarification,
Sven Moritz