String != [Char]

Gabriel Dos Reis gdr at integrable-solutions.net
Mon Mar 26 15:22:23 CEST 2012


On Mon, Mar 26, 2012 at 7:29 AM, Christian Siefkes
<christian at siefkes.net> wrote:
> On 03/26/2012 01:26 PM, Gabriel Dos Reis wrote:
>> It is not the precision of Char or char that is the issue here.
>> It has been clarified at several points that Char is not a Unicode character,
>> but a Unicode code point.  Not every Unicode code point represents a
>> Unicode character, and not every sequence of Unicode code points
>> represents a character or a sequence of Unicode characters.
>
> What do you mean? Every Unicode character corresponds to one code point,

Yes, but this correspondence is not a bijection -- a great source of
confusion that permeates a lot of discussions about Unicode characters
and texts, including this one (and a previous one regarding the Haskell
Report).  Very much heartbreaking :-(
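
To make that concrete: with GHC at least, Char happily ranges over code
points that are not abstract characters at all.  A small sketch, using
only Data.Char (both values below are legal Chars today):

    import Data.Char (generalCategory)

    surrogate, noncharacter :: Char
    surrogate    = '\xD800'   -- a lone surrogate code point, not a character
    noncharacter = '\xFFFF'   -- a designated noncharacter code point

    main :: IO ()
    main = do
      print (generalCategory surrogate)      -- Surrogate
      print (generalCategory noncharacter)   -- NotAssigned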

> and
> every code point in the range 0 to 0x10FFFF (excluding the range 0xD800 to
> 0xDFFF which is reserved for surrogate pairs in UTF-16, and a handful of
> "noncharacters", see
> http://en.wikipedia.org/wiki/Mapping_of_Unicode_characters#Special_code_points
> ) corresponds to one character.
>
> Maybe your criticism is that Char does not explicitly prevent these special
> code points from being assigned? While true, that seems a relatively minor
> matter. Moreover, a future revision of the Haskell standard could easily
> declare that assigning a "forbidden" character results in an error/bottom
> if that is so desired.
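
For what it is worth, here is a sketch of what such a check might look
like -- the names are purely illustrative, not a proposal -- before I get
to why I do not think it addresses the real problem:

    import Data.Char (ord)

    -- Accept Unicode scalar values that are not noncharacters;
    -- reject surrogates and the designated noncharacter code points.
    isAllowed :: Char -> Bool
    isAllowed c = not (isSurrogate || isNoncharacter)
      where
        n              = ord c
        isSurrogate    = n >= 0xD800 && n <= 0xDFFF
        isNoncharacter = (n >= 0xFDD0 && n <= 0xFDEF)
                      || n `mod` 0x10000 >= 0xFFFE   -- U+xxFFFE / U+xxFFFF

    -- A checked constructor: Nothing (or bottom) for forbidden code points.
    checkedChar :: Char -> Maybe Char
    checkedChar c = if isAllowed c then Just c else Nothing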

It is not just a matter of clarifying that certain things are forbidden.
I believe it would be a great mistake to qualify it as minor.  How do you
handle normalization if you expose texts as a sequence of unrelated code
points that can be freely taken apart and combined?
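
To make the problem concrete, here is a minimal sketch using plain [Char]
and no extra libraries: the two strings below are canonically equivalent
-- they denote the same user-perceived text -- yet as lists of code points
they compare unequal and do not even have the same length:

    precomposed, decomposed :: String
    precomposed = "\x00E9"     -- U+00E9 LATIN SMALL LETTER E WITH ACUTE
    decomposed  = "e\x0301"    -- 'e' followed by U+0301 COMBINING ACUTE ACCENT

    main :: IO ()
    main = do
      print (precomposed == decomposed)              -- False
      print (length precomposed, length decomposed)  -- (1,2)

Any code that takes such a list apart, or splices pieces together, has to
know about normalization to get text equality right; nothing in the [Char]
representation carries that information.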

- Gaby


