[Haskell-cafe] Re: String vs ByteString

Ketil Malde ketil at malde.org
Tue Aug 17 08:28:09 EDT 2010


Colin Paul Adams <colin at colina.demon.co.uk> writes:

>> Char is not an encoding, right?
>
>     Ivan> No, but in GHC at least it corresponds to a Unicode codepoint.
>
> I don't think this is right, or shouldn't be right, anyway. Surely it
> stands for a character. Unicode codepoints include non-characters such
> as the surrogate codepoints used by UTF-16 to map non-BMP codepoints to
> pairs of 16-bit codepoints. 

  Prelude> (toEnum 0xD800) :: Char
  '\55296'

> I don't think you ought to be able to see a surrogate codepoint as a Char.

This is a bit confusing.  From the Unicode glossary:

- Character. (1) The smallest component of written language that has
semantic value; refers to the abstract meaning and/or shape, rather than
a specific shape (see also glyph), though in code tables some form of
visual representation is essential for the reader’s understanding. (2)
Synonym for abstract character. (3) The basic unit of encoding for the
Unicode character encoding. (4) The English name for the ideographic
written elements of Chinese origin. [See  ideograph (2).] 

- Code Point. (1) Any value in the Unicode codespace; that is, the range
of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4,
Characters and Encoding.) (2) A value, or position, for a character, in
any coded character set.

From Wikipedia on UTF-16:

Unicode and ISO/IEC 10646 do not, and will never, assign characters to
any of the code points in the U+D800–U+DFFF range, so an individual code
unit from a surrogate pair does not ever represent a character. 

So:

A Char holds a code point, that is, a value from 0 to 0x10FFFF.  Some
of these values do not correspond to Unicode characters.
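
To make this concrete, here is a minimal sketch that asks Data.Char
which Char values are surrogates.  The names isSurrogateChar and
surrogateCodePoints are made up for illustration; only generalCategory
and the Surrogate constructor come from base.

```haskell
import Data.Char (GeneralCategory (Surrogate), generalCategory)

-- Hypothetical helper: True when the code point held by the Char lies
-- in the surrogate range U+D800..U+DFFF (general category Cs).
isSurrogateChar :: Char -> Bool
isSurrogateChar c = generalCategory c == Surrogate

-- Hypothetical helper: every Char value that is a surrogate code point.
-- These are valid Char values even though no character is assigned to them.
surrogateCodePoints :: [Char]
surrogateCodePoints = filter isSurrogateChar [minBound .. maxBound]
```

Under this sketch, isSurrogateChar (toEnum 0xD800) is True, which
matches the GHCi session above: the value is a perfectly good Char,
just not a character.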

As far as I can tell, a surrogate pair in UTF-16 can be viewed both as
two (surrogate) code points of two bytes each and as a single code
point encoded in four bytes.  Implementations seem to differ on how
they count the length of a string containing surrogate pairs.
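
The two ways of counting can be sketched as follows.  This is only an
illustration, assuming the standard UTF-16 encoding rule; utf16Units,
codePointLength and utf16Length are hypothetical names, not functions
from any library.

```haskell
import Data.Char (ord)
import Data.Word (Word16)

-- Hypothetical helper: encode one code point as UTF-16 code units.
-- Code points below 0x10000 take one unit; the rest take a surrogate pair.
-- A lone surrogate Char is passed through unchanged as a single unit.
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + m `div` 0x400)  -- high surrogate
                  , fromIntegral (0xDC00 + m `mod` 0x400)  -- low surrogate
                  ]
  where
    n = ord c
    m = n - 0x10000

-- Length counted in code points, which is what length gives for a String.
codePointLength :: String -> Int
codePointLength = length

-- Length counted in UTF-16 code units, as a UTF-16 based implementation
-- might report it.
utf16Length :: String -> Int
utf16Length = length . concatMap utf16Units
```

For the string "a\x1D11E" (the letter a followed by U+1D11E),
codePointLength gives 2 and utf16Length gives 3, since U+1D11E becomes
the pair D834 DD1E.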

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants

