[Haskell-cafe] surrogate code points in a Char
ekmett at gmail.com
Wed Nov 18 12:39:23 EST 2009
Enforcing a gap in the middle of the range of Char would be exceedingly
awkward to propagate through all of the libraries. Off the top of my head:
1.) Functions like succ and pred which currently work on Char as an
enumeration would have to jump over the gap, to be truly anal retentive
about the mapping
2.) toEnum and fromEnum would need to make the gap vanish as well,
ruining the ability to treat toEnum/fromEnum as chr/ord
3.) Every application would take a performance hit, presumably from the
extra range check on every conversion to Char
4.) What to do in the presence of an encoding error is even more uncertain.
All you can do is throw an exception that can only be caught in IO.
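
To make points 1 and 2 concrete, here is a rough sketch of what gap-aware
versions might look like. The names isSurrogate, succSkippingGap and
toEnumNoGap are made up for illustration and are not part of base; today's
succ and toEnum on Char do none of this.

```haskell
import Data.Char (chr)

-- Hypothetical helper: does this Char sit in the surrogate range?
isSurrogate :: Char -> Bool
isSurrogate c = c >= '\xD800' && c <= '\xDFFF'

-- Hypothetical successor that jumps over the surrogate gap.
-- The real succ '\xD7FF' simply yields '\xD800'.
succSkippingGap :: Char -> Char
succSkippingGap '\xD7FF' = '\xE000'
succSkippingGap c        = succ c

-- Hypothetical toEnum that makes the gap vanish: indices at or above
-- 0xD800 are shifted past the 0x800 surrogate code points, so the
-- index no longer agrees with chr/ord.
toEnumNoGap :: Int -> Char
toEnumNoGap n
  | n >= 0xD800 = chr (n + 0x800)
  | otherwise   = chr n
```

With these definitions, toEnumNoGap 0xD800 is '\xE000' while chr 0xD800 is
still the surrogate '\xD800', which is exactly the chr/ord mismatch
mentioned in point 2.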
A couple of less defensible considerations:
5.) It would break alternative encodings like utf-8b which use the invalid
code points in the surrogate pair range to encode ill-formed bytes in the
input stream to allow 'cut and paste'-safe round tripping of
utf-8b->Char->utf-8b even in the presence of invalid binary data/codepoints.
6.) Not all data is properly encoded. Consider Unicode data you get back
from Oracle, which isn't really encoded in UTF-8 but in CESU-8. CESU-8
represents each code point above U+FFFF as a surrogate pair, then
UTF-8-encodes each surrogate of the pair separately.
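
A small sketch of the two schemes in points 5 and 6, in case the mechanics
are unclear. All of the function names here are invented for illustration.
The utf-8b mapping shown (an invalid byte b becomes the code point
0xDC00 + b) is my understanding of that scheme, so treat it as an assumption.
The CESU-8 part only handles code points above U+FFFF.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- utf-8b style escape (assumed mapping): an undecodable byte in
-- 0x80..0xFF is carried through Char as a lone low surrogate.
escapeInvalidByte :: Word8 -> Char
escapeInvalidByte b = chr (0xDC00 + fromIntegral b)

-- Split a code point above U+FFFF into its UTF-16 surrogate pair.
surrogatePair :: Int -> (Int, Int)
surrogatePair cp = (0xD800 + (v `shiftR` 10), 0xDC00 + (v .&. 0x3FF))
  where v = cp - 0x10000

-- Three-byte UTF-8 style encoding of a value in 0x800..0xFFFF.
threeBytes :: Int -> [Word8]
threeBytes x = map fromIntegral
  [ 0xE0 .|. (x `shiftR` 12)
  , 0x80 .|. ((x `shiftR` 6) .&. 0x3F)
  , 0x80 .|. (x .&. 0x3F)
  ]

-- CESU-8 for a code point above U+FFFF: each surrogate gets its own
-- three-byte sequence, six bytes in total.
cesu8Supplementary :: Char -> [Word8]
cesu8Supplementary c = threeBytes hi ++ threeBytes lo
  where (hi, lo) = surrogatePair (ord c)
```

For example, cesu8Supplementary '\x10400' gives the bytes
ED A0 81 ED B0 80. A decoder that reads those bytes as plain three-byte
UTF-8 sequences ends up with the two Chars '\xD801' and '\xDC00', which only
works because Char admits surrogates.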
So, I suppose the answer would be it is functioning as designed, because the
current behavior is the least bad option. =)
On Wed, Nov 18, 2009 at 10:28 AM, Manlio Perillo
<manlio_perillo at libero.it> wrote:
> The Unicode Standard (version 4.0, section 3.9, D31 - page 76) says:
> """Because surrogate code points are not included in the set of Unicode
> scalar values, UTF-32 code units in the range 0000D800 .. 0000DFFF are
> ill-formed."""
> However GHC does not reject these code units:
> Prelude> print '\x0000D800'
> Is this a correct behaviour?
> Note that Python, too (2.5.4, UCS4 build, Linux Debian), accepts these
> code units.
> Thanks Manlio