[Haskell-i18n] Surrogate pairs?
Ashley Yakeley
ashley@semantic.org
Mon, 19 Aug 2002 17:53:43 -0700
At 2002-08-19 17:06, Sven Moritz Hallberg wrote:
>I just implemented a UTF-8 coder and decoder in Haskell. While reading
>the Unicode standard I realized what someone had pointed out earlier
>with respect to code values versus code points: Unicode, while "usually"
>using 16-bit words, supports "surrogate pairs" to handle all 31 bits of
>UCS-4.
>
>The report says, Char is a 16-bit Unicode value.
Right, sec. 6.1.2. But this should change. A Char should allow (and only
allow) values in the range [0, 0x10FFFF]. These are _Unicode scalar
values_ as defined in the standard sec. 3.7, D28. Unicode scalar values
are also known as "code positions" or "code points".
The current version of GHC does precisely this. I don't think UCS-4 is
used anymore; all character assignments are to code points from 0 to
0x10FFFF.
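As a quick check of that range, here is a minimal sketch, assuming a GHC
whose Char covers 0 to 0x10FFFF. The names maxCodePoint and toCharMaybe
are hypothetical helpers made up for illustration, not part of any library.

```haskell
import Data.Char (chr, ord)

-- Largest code point a Char can hold (expected to be 0x10FFFF in GHC).
maxCodePoint :: Int
maxCodePoint = ord (maxBound :: Char)

-- Hypothetical helper: convert an Int to a Char only when it is in range,
-- instead of letting chr raise an error on out-of-range input.
toCharMaybe :: Int -> Maybe Char
toCharMaybe n
  | n >= 0 && n <= maxCodePoint = Just (chr n)
  | otherwise                   = Nothing
```

Something like toCharMaybe 0x110000 should then give Nothing, while
toCharMaybe 0x1D11E gives a single Char.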
>What's the stance on surrogate pairs? How are we going to support
>those? My code currently just errors "unsupported" when encountering a
>surrogate.
I think we should be working to the latest version of Unicode, 3.2.0.
<http://www.unicode.org/unicode/reports/tr28/>
If your UTF-8 decoder comes across a sequence apparently representing a
code point in the range [0xD800,0xDFFF], it should consider it
"ill-formed". This is a new thing in 3.2.
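A rough sketch of that check, covering only the three-byte form (the only
UTF-8 form whose bit layout can reach 0xD800 to 0xDFFF). The names
isSurrogate and decode3 are hypothetical, and returning Nothing is just one
possible way to report an ill-formed sequence.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word8)

-- True for code points reserved for UTF-16 surrogates.
isSurrogate :: Int -> Bool
isSurrogate n = n >= 0xD800 && n <= 0xDFFF

-- Hypothetical helper: decode one three-byte UTF-8 sequence to a code
-- point. Returns Nothing for ill-formed input: a bad lead byte, a bad
-- continuation byte, an overlong form, or an encoded surrogate.
decode3 :: Word8 -> Word8 -> Word8 -> Maybe Int
decode3 b0 b1 b2
  | b0 .&. 0xF0 /= 0xE0          = Nothing  -- lead byte must be 1110xxxx
  | not (isCont b1 && isCont b2) = Nothing  -- continuation is 10xxxxxx
  | n < 0x800                    = Nothing  -- overlong encoding
  | isSurrogate n                = Nothing  -- ill-formed as of 3.2
  | otherwise                    = Just n
  where
    isCont b = b .&. 0xC0 == 0x80
    n :: Int
    n = (fromIntegral (b0 .&. 0x0F) `shiftL` 12)
        .|. (fromIntegral (b1 .&. 0x3F) `shiftL` 6)
        .|. fromIntegral (b2 .&. 0x3F)
```

With this, the bytes ED A0 80 (which would otherwise decode to 0xD800) are
rejected, while E2 82 AC decodes to 0x20AC.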
If Chars are code points rather than 16-bit code values, then when your
UTF-8 decoder comes across a sequence representing a code point in the
range [0x10000,0x10FFFF], it should represent it as a single Char, not as
a surrogate pair of Chars. Surrogate pairs are for UTF-16; AFAIK they're
not supposed to exist as code points.
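To illustrate the difference, a minimal sketch, assuming Char holds full
code points as in GHC. decode4 and toUtf16 are hypothetical names: the
first turns a four-byte UTF-8 sequence into one Char, the second shows
that surrogates only appear when producing UTF-16 code units.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8, Word16)

-- Hypothetical helper: decode one four-byte UTF-8 sequence to a single
-- Char. Returns Nothing for a bad lead or continuation byte, an overlong
-- form, or a value above 0x10FFFF.
decode4 :: Word8 -> Word8 -> Word8 -> Word8 -> Maybe Char
decode4 b0 b1 b2 b3
  | b0 .&. 0xF8 /= 0xF0           = Nothing  -- lead byte must be 11110xxx
  | not (all isCont [b1, b2, b3]) = Nothing  -- continuation is 10xxxxxx
  | n < 0x10000 || n > 0x10FFFF   = Nothing  -- overlong or out of range
  | otherwise                     = Just (chr n)
  where
    isCont b = b .&. 0xC0 == 0x80
    n :: Int
    n = (fromIntegral (b0 .&. 0x07) `shiftL` 18)
        .|. (fromIntegral (b1 .&. 0x3F) `shiftL` 12)
        .|. (fromIntegral (b2 .&. 0x3F) `shiftL` 6)
        .|. fromIntegral (b3 .&. 0x3F)

-- Hypothetical helper: encode one Char as UTF-16 code units. Only here,
-- at the UTF-16 level, does a surrogate pair come into existence.
toUtf16 :: Char -> [Word16]
toUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
  where
    n = ord c
    m = n - 0x10000
```

So the bytes F0 9D 84 9E become the single Char '\x1D11E', and only a
UTF-16 encoder would turn that Char into the pair D834 DD1E.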
--
Ashley Yakeley, Seattle WA