String != [Char]

Sun Mar 25 03:51:50 CEST 2012

On Sat, Mar 24, 2012 at 5:54 PM, Gabriel Dos Reis
<gdr at integrable-solutions.net> wrote:
> I think there is a confusion here.  A Unicode character is an abstract
> entity.  For it to exist in some concrete form in a program, you need
> an encoding.  The fact that char16_t is 16-bit wide is irrelevant to
> whether it can be used in a representation of a Unicode text, just like
> uint8_t (e.g. 'unsigned char') can be used to encode Unicode string
> despite it being only 8-bit wide.   You do not need to make the
> character type exactly equal to the type of the individual element
> in the text representation.

Well, if you have a >21-bit type you can declare its value to be a
Unicode code point (which are numbered.) Using a char* that you claim
contain utf-8 encoded data is bad for safety, as there is no guarantee
that that's indeed the case.

> Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is insufficient
> as far as text processing goes; you also need a localization at the
> minimum.  It is the
> combination of the two that gives some meaning to text representation
> and operations.

text does that via ICU. Some operations would be possible without
using the locale, if it wasn't for those Turkish i:s. :/

-- Johan