String != [Char]

Gabriel Dos Reis gdr at integrable-solutions.net
Sun Mar 25 01:54:00 CET 2012


On Sat, Mar 24, 2012 at 7:16 PM, Johan Tibell <johan.tibell at gmail.com> wrote:
> On Sat, Mar 24, 2012 at 4:42 PM, Gabriel Dos Reis
> <gdr at integrable-solutions.net> wrote:
>> Hmm, std::u16string, std::u32string, and std::wstring are C++ standard
>> types to process Unicode texts.
>
> Note that at least u16string is too small to encode all of Unicode, and
> wstring might be too, since 16 bits is not enough to encode all of Unicode.
>

I think there is a confusion here.  A Unicode character is an abstract
entity.  For it to exist in some concrete form in a program, you need
an encoding.  The fact that char16_t is 16-bit wide is irrelevant to
whether it can be used in a representation of a Unicode text, just as
uint8_t (e.g. 'unsigned char') can be used to encode Unicode strings
despite being only 8-bit wide.  You do not need to make the character
type exactly equal to the type of the individual elements in the text
representation.
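
To make that concrete, here is a minimal C++11 sketch (assuming a
conforming compiler; note that in C++20 the u8 literal's element type
changes to char8_t, so compile this as C++11/14/17):

    #include <iostream>
    #include <string>

    int main() {
        // U+1F600 lies outside the Basic Multilingual Plane, yet a 16-bit
        // code unit type still represents it: UTF-16 uses a surrogate pair.
        std::u16string s16 = u"\U0001F600";
        std::cout << s16.size() << '\n';   // prints 2: two char16_t code units

        // The same single character encoded as UTF-8 in 8-bit code units.
        std::string s8 = u8"\U0001F600";   // C++11 u8 literal, element type char
        std::cout << s8.size() << '\n';    // prints 4: four char code units
    }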

Now, if you want to make a one-to-one correspondence between
individual elements in a std::basic_string and a Unicode character,
you would of course go for char32_t, which might be wasteful
depending on the circumstances.  Text processing languages like Perl
have long decided to de-emphasize one-character-at-a-time processing.
For most common cases, it is just inefficient.  But, I also understand
that the efficiency argument may not be strong in the context of Haskell.
However, I believe particular attention must be paid to the correctness
of the semantics.
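
A small sketch of the trade-off, under the same assumptions as above:

    #include <iostream>
    #include <string>

    int main() {
        // With char32_t, element count and code point count coincide.
        std::u32string s32 = U"a\U0001F600b";
        std::cout << s32.size() << '\n';   // prints 3: one element per code point

        // The UTF-16 form of the same text needs a surrogate pair, so the
        // element count (4) no longer equals the number of code points (3).
        std::u16string s16 = u"a\U0001F600b";
        std::cout << s16.size() << '\n';   // prints 4
    }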

Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is insufficient
as far as text processing goes; you also need a locale at the minimum.  It is
the combination of the two that gives meaning to text representation and
operations.

I have been following the discussion, but I don't see anything said
about locales.
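
For instance, in the following sketch (the locale name "de_DE.UTF-8" is
an assumption about what happens to be installed on the system; the
constructor throws if it is not), the outcome of the comparison is a
property of the locale's collation rules, not of the encoding alone:

    #include <iostream>
    #include <locale>
    #include <string>

    int main() {
        std::locale de("de_DE.UTF-8");
        const std::collate<wchar_t>& coll =
            std::use_facet<std::collate<wchar_t>>(de);

        // Whether "Ärger" sorts before or after "Arm" depends on the locale.
        std::wstring a = L"\u00C4rger";   // "Ärger"
        std::wstring b = L"Arm";
        int r = coll.compare(a.data(), a.data() + a.size(),
                             b.data(), b.data() + b.size());
        std::cout << r << '\n';           // <0, 0, or >0, per the locale
    }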

-- Gaby
