String != [Char]

Sun Mar 25 04:19:08 CEST 2012

On Sat, Mar 24, 2012 at 8:51 PM, Johan Tibell <johan.tibell at gmail.com> wrote:
> On Sat, Mar 24, 2012 at 5:54 PM, Gabriel Dos Reis
> <gdr at integrable-solutions.net> wrote:
>> I think there is a confusion here.  A Unicode character is an abstract
>> entity.  For it to exist in some concrete form in a program, you need
>> an encoding.  The fact that char16_t is 16 bits wide is irrelevant to
>> whether it can be used in a representation of a Unicode text, just like
>> uint8_t (e.g. 'unsigned char') can be used to encode a Unicode string
>> despite being only 8 bits wide.   You do not need to make the
>> character type exactly equal to the type of the individual element
>> in the text representation.
>
> Well, if you have a type of at least 21 bits you can declare its values
> to be Unicode code points (which are numbered).

That is correct.  But not all Unicode code points represent characters,
and not all sequences of code points represent valid characters.  So even
if you have such a type T (21 bits or wider), the list type [T] would
still not be a good string type.
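
To make that concrete, here is a minimal sketch using only Data.Char from
base.  The names (precomposed, decomposed, loneSurrogate, notACharacter)
are mine, chosen for illustration.  It shows two [Char] values that a
reader would call the same text but that differ as lists, and a Char that
is a code point without being a character.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- "é" as a single precomposed code point (U+00E9).
precomposed :: String
precomposed = "\x00E9"

-- "é" as 'e' followed by COMBINING ACUTE ACCENT (U+0301).
decomposed :: String
decomposed = "e\x0301"

-- A lone high surrogate: a valid code point, hence a valid Char in GHC,
-- but not a character.
loneSurrogate :: Char
loneSurrogate = '\xD800'

-- True for code points that cannot stand for a character on their own:
-- surrogates and unassigned code points.
notACharacter :: Char -> Bool
notACharacter c = generalCategory c `elem` [Surrogate, NotAssigned]
```

In GHCi, length precomposed is 1 and length decomposed is 2, the two
lists compare as unequal, and reverse decomposed moves the accent in
front of the 'e'.  None of the list operations knows that the two code
points belong together.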

> Using a char* that you claim
> contains UTF-8 encoded data is bad for safety, as there is no guarantee
> that that's indeed the case.

Indeed, and that is why a Text should be an abstract datatype, hiding
the concrete implementation away from the user.
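
A rough sketch of what I mean, using only base.  Everything here is
hypothetical (the type, fromUtf8, toCodePoints); it is not the API of the
text package, and the [Char] representation is just a placeholder.  The
point is that the constructor stays private, so the only way to obtain a
value is through a function that validates its input.

```haskell
-- In a real library this would live in its own module with an export
-- list that omits the constructor, e.g.
--   module AbstractText (Text, fromUtf8, toCodePoints) where

import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Representation is a placeholder; users never see it.
newtype Text = Text [Char]

-- Read-only view of the contents.
toCodePoints :: Text -> [Char]
toCodePoints (Text cs) = cs

-- Smart constructor: decode UTF-8, rejecting malformed input
-- (bad lead or continuation bytes, truncation, overlong forms,
-- surrogates, values above U+10FFFF).
fromUtf8 :: [Word8] -> Maybe Text
fromUtf8 = fmap Text . go
  where
    go :: [Word8] -> Maybe [Char]
    go [] = Just []
    go (b : bs)
      | b < 0x80 = (chr (fromIntegral b) :) <$> go bs
      | b .&. 0xE0 == 0xC0 = multi 1 (b .&. 0x1F) 0x80 bs
      | b .&. 0xF0 == 0xE0 = multi 2 (b .&. 0x0F) 0x800 bs
      | b .&. 0xF8 == 0xF0 = multi 3 (b .&. 0x07) 0x10000 bs
      | otherwise = Nothing

    -- n continuation bytes, payload bits of the lead byte, and the
    -- smallest code point allowed for this sequence length.
    multi :: Int -> Word8 -> Int -> [Word8] -> Maybe [Char]
    multi n lead minCp bs =
      let (conts, rest) = splitAt n bs
          step acc c = (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F)
          cp = foldl step (fromIntegral lead) conts :: Int
          ok = length conts == n
            && all isCont conts
            && cp >= minCp
            && cp <= 0x10FFFF
            && not (cp >= 0xD800 && cp <= 0xDFFF)
       in if ok then (chr cp :) <$> go rest else Nothing

    isCont :: Word8 -> Bool
    isCont c = c .&. 0xC0 == 0x80
```

With the constructor hidden, "this Text is well-formed" is something the
type guarantees, rather than something the programmer claims about a
char*.  The implementation could later switch to a packed UTF-8 or
UTF-16 buffer without any client code noticing.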

>> Note also that an encoding itself (whether UTF-8, UTF-16, etc.) is
>> insufficient as far as text processing goes; you also need a
>> localization at the minimum.  It is the combination of the two that
>> gives some meaning to text representation and operations.
>
> text does that via ICU. Some operations would be possible without
> using the locale, if it weren't for those Turkish i's. :/

Yeah, 7 bits should be enough for every character ;-)
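
More seriously, the Turkish i is a good illustration of why case mapping
needs a locale.  A small sketch, assuming a made-up Locale type and
made-up functions upperIn and lowerIn; this is not how ICU exposes it,
just the shape of the problem.  Data.Char.toUpper and toLower take no
locale, so they always map 'i' to 'I' and back, whereas Turkish pairs
'i' with dotted capital U+0130 and dotless U+0131 with 'I'.

```haskell
import Data.Char (toLower, toUpper)

-- Hypothetical, deliberately tiny locale type.
data Locale = English | Turkish
  deriving (Eq, Show)

-- Locale-aware uppercase for a single code point.
upperIn :: Locale -> Char -> Char
upperIn Turkish 'i' = '\x0130' -- LATIN CAPITAL LETTER I WITH DOT ABOVE
upperIn _ c = toUpper c

-- Locale-aware lowercase for a single code point.
lowerIn :: Locale -> Char -> Char
lowerIn Turkish 'I' = '\x0131' -- LATIN SMALL LETTER DOTLESS I
lowerIn _ c = toLower c
```

So map (upperIn Turkish) "istanbul" and map (upperIn English) "istanbul"
give different results from the same input, and neither is wrong.  A
Char -> Char function is itself a simplification: some case mappings
change the number of code points, which is one more reason to define
such operations on an abstract text type and not on list elements.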

-- Gaby