[Haskell-i18n] Surrogate pairs?

Ashley Yakeley ashley@semantic.org
Mon, 19 Aug 2002 17:53:43 -0700


At 2002-08-19 17:06, Sven Moritz Hallberg wrote:

>I just implemented a UTF-8 coder and decoder in Haskell. While reading
>the Unicode standard I realized what someone had pointed out earlier
>with respect to code values versus code points: Unicode, while "usually"
>using 16-bit words, supports "surrogate pairs" to handle all 31 bits of
>UCS-4.
>
>The report says, Char is a 16-bit Unicode value.

Right, sec. 6.1.2. But this should change. A Char should allow (and only
allow) values in the range [0,0x10FFFF]. These are _Unicode scalar
values_ as defined in the standard sec. 3.7, D28. Unicode scalar values
are also known as "code positions" or "code points".

The current version of GHC does precisely this. I don't think UCS-4 is
used anymore; all character assignments are to code points from 0 to
0x10FFFF.
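
As a rough illustration of that range, here is a minimal sketch of a
range-checked conversion. The names maxCodePoint, charRange and safeChr
are hypothetical (they are not from the Report or from GHC's libraries),
and the sketch assumes a Char that covers the full code point range:

```haskell
import Data.Char (chr, ord)

-- Largest code point, matching the upper bound quoted above.
maxCodePoint :: Int
maxCodePoint = 0x10FFFF

-- The numeric range the implementation's Char actually covers.
charRange :: (Int, Int)
charRange = (ord minBound, ord maxBound)

-- Hypothetical helper: convert an Int to a Char only if it lies in
-- [0,0x10FFFF]. Plain chr raises an error outside the Char range.
safeChr :: Int -> Maybe Char
safeChr n
  | n >= 0 && n <= maxCodePoint = Just (chr n)
  | otherwise                   = Nothing
```

On an implementation whose Char is a code point rather than a 16-bit
value, charRange should come out as (0, 0x10FFFF).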

> What's the stance on surrogate pairs?
> How are we going to support those? My code currently
>just errors "unsupported" when encountering a surrogate.

I think we should be working to the latest version of Unicode, 3.2.0. 
<http://www.unicode.org/unicode/reports/tr28/>

If your UTF-8 decoder comes across a sequence apparently representing a
code point in the range [0xD800,0xDFFF], it should treat that sequence as
"ill-formed". This requirement is new in 3.2.
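
A minimal sketch of that check, assuming the decoder has already
collected the three bytes of one sequence. The names isSurrogate, isCont
and decode3 are made up for illustration and are not taken from your
code:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- True for values in the surrogate range [0xD800,0xDFFF].
isSurrogate :: Int -> Bool
isSurrogate n = n >= 0xD800 && n <= 0xDFFF

-- True for a UTF-8 continuation byte (bit pattern 10xxxxxx).
isCont :: Word8 -> Bool
isCont b = b .&. 0xC0 == 0x80

-- Hypothetical helper: decode one three-byte sequence (lead byte
-- 1110xxxx), rejecting overlong forms and surrogate code points.
decode3 :: Word8 -> Word8 -> Word8 -> Either String Char
decode3 b0 b1 b2
  | b0 .&. 0xF0 /= 0xE0          = Left "not a three-byte lead byte"
  | not (isCont b1 && isCont b2) = Left "bad continuation byte"
  | n < 0x800                    = Left "ill-formed: overlong encoding"
  | isSurrogate n                = Left "ill-formed: surrogate code point"
  | otherwise                    = Right (chr n)
  where
    -- Payload bits: 4 from the lead byte, 6 from each continuation byte.
    n = (fromIntegral (b0 .&. 0x0F) `shiftL` 12)
        .|. (fromIntegral (b1 .&. 0x3F) `shiftL` 6)
        .|. fromIntegral (b2 .&. 0x3F)
```

For example, the bytes ED A0 80 would decode to 0xD800, so decode3
rejects them instead of producing a Char.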

If Chars are code points rather than 16-bit code values, then when your
UTF-8 decoder comes across a sequence representing a code point in the
range [0x10000,0x10FFFF], it should represent it as a single Char, not as
a surrogate pair of Chars. Surrogate pairs are for UTF-16; AFAIK they're
not supposed to exist as code points.
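
To make the distinction concrete, here is a rough sketch, again assuming
a Char that covers [0,0x10FFFF]. decode4 turns one four-byte UTF-8
sequence into a single Char, and fromSurrogatePair shows the UTF-16
arithmetic that maps a surrogate pair to the same kind of value. Both
names are hypothetical:

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word16, Word8)

-- Hypothetical helper: decode one four-byte UTF-8 sequence (lead byte
-- 11110xxx) into a single Char in [0x10000,0x10FFFF].
decode4 :: Word8 -> Word8 -> Word8 -> Word8 -> Maybe Char
decode4 b0 b1 b2 b3
  | b0 .&. 0xF8 /= 0xF0                      = Nothing  -- wrong lead byte
  | any (\b -> b .&. 0xC0 /= 0x80) [b1, b2, b3] = Nothing  -- bad continuation
  | n < 0x10000 || n > 0x10FFFF              = Nothing  -- overlong or too large
  | otherwise                                = Just (chr n)
  where
    -- Payload bits: 3 from the lead byte, 6 from each continuation byte.
    n = (fromIntegral (b0 .&. 0x07) `shiftL` 18)
        .|. (fromIntegral (b1 .&. 0x3F) `shiftL` 12)
        .|. (fromIntegral (b2 .&. 0x3F) `shiftL` 6)
        .|. fromIntegral (b3 .&. 0x3F)

-- The UTF-16 side: a high/low pair of 16-bit code values names one
-- code point, so it also becomes a single Char.
fromSurrogatePair :: Word16 -> Word16 -> Maybe Char
fromSurrogatePair hi lo
  | hi >= 0xD800 && hi <= 0xDBFF && lo >= 0xDC00 && lo <= 0xDFFF =
      Just (chr (0x10000 + ((fromIntegral hi - 0xD800) `shiftL` 10)
                         + (fromIntegral lo - 0xDC00)))
  | otherwise = Nothing
```

The UTF-8 bytes F0 9D 84 9E and the UTF-16 pair D834 DD1E both work out
to the one code point 0x1D11E, which is what the Char should hold.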

-- 
Ashley Yakeley, Seattle WA