Why are strings linked lists?

Glynn Clements glynn.clements at virgin.net
Sat Nov 29 10:42:39 EST 2003


ketil+haskell at ii.uib.no wrote:

> >> What Unicode support?
> 
> >> Simply claiming that values of type Char are Unicode characters
> >> doesn't make it so.
> 
> > Just because some implementations lack toUpper etc. doesn't mean
> > they all do.  
> 
> I think the point is that for toUpper etc to be properly Unicoded,
> they can't simply look at a single character.  IIRC, there are some
> characters that expand to two characters when the case is changed, and
> then there's titlecase and so on.

If that were the extent of the problems, I wouldn't be describing
Unicode support as "non-existent".

Note that ANSI C9X doesn't handle the first problem either:

       7.25.3.1.1  The towlower function

               #include <wctype.h>
               wint_t towlower(wint_t wc);

       7.25.3.1.2  The towupper function

               #include <wctype.h>
               wint_t towupper(wint_t wc);

And it only handles the second problem (title case) insofar as it
provides a generic transformation mechanism:

       7.25.3.2  Extensible wide-character case mapping functions

       [#1] The functions wctrans and towctrans provide  extensible
       wide-character mapping as well as case mapping equivalent to
       that performed by the functions described  in  the  previous
       subclause (7.25.3.1).

       7.25.3.2.1  The towctrans function

               #include <wctype.h>
               wint_t towctrans(wint_t wc, wctrans_t desc);

       7.25.3.2.2  The wctrans function

               #include <wctype.h>
               wctrans_t wctrans(const char *property);

Whilst a title-case transformer is the most obvious application of
this mechanism, nothing in the standard requires an implementation to
provide one.
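
To make the limitation concrete, here is a minimal sketch built only on
the functions quoted above. The helper names upcase_in_place and
map_in_place are made up for illustration; they are not part of any
standard. Since towupper and towctrans each take one wint_t and return
one wint_t, a string converted this way always keeps its length, so a
mapping that expands one character into two cannot be expressed.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: upper-case a wide string in place, one
   character at a time. towupper maps one wint_t to one wint_t, so
   the length of the string can never change. */
static void upcase_in_place(wchar_t *s)
{
    for (; *s != L'\0'; ++s)
        *s = (wchar_t)towupper((wint_t)*s);
}

/* Hypothetical helper: apply a named mapping via wctrans/towctrans.
   Returns 0 if the current locale does not know the property name,
   1 otherwise. The standard only guarantees "tolower" and "toupper";
   any other name is up to the implementation. */
static int map_in_place(wchar_t *s, const char *property)
{
    wctrans_t desc = wctrans(property);

    if (desc == (wctrans_t)0)
        return 0;               /* unsupported mapping; s is untouched */
    for (; *s != L'\0'; ++s)
        *s = (wchar_t)towctrans((wint_t)*s, desc);
    return 1;
}
```

Calling map_in_place(buf, "toupper") behaves like upcase_in_place(buf).
Whether a call such as map_in_place(buf, "totitle") succeeds depends
entirely on the implementation and the locale, and even where it does,
the result is still exactly one character per input character.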

> toUpper etc. are AFAIK only implemented correctly for a small (but
> IMHO probably the useful) subset of characters.

Yes; so it may as well have just defined Char as an 8-bit ISO Latin-1
character.

Actually, US-ASCII (i.e. the same behaviour as ANSI C with the C/POSIX
locale) would arguably have been a better choice. At least that won't
fail quite so badly if you use e.g. toUpper on a string which is
actually in e.g. ISO Latin-2; the case may be wrong, but at least it
will be the correct letter.
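
A small sketch of the US-ASCII behaviour described above. The function
names are hypothetical, and ascii_toupper is a hand-written stand-in
for what toupper does in the C/POSIX locale, written out explicitly so
that the example does not depend on any particular libc. It changes
'a'..'z' and leaves every other byte alone, so bytes with the high bit
set pass through unmodified whatever encoding they are really in.

```c
#include <stddef.h>

/* Hypothetical helper: ASCII-only upper-casing. Only 'a'..'z' are
   changed; every other byte value is returned as-is. */
static unsigned char ascii_toupper(unsigned char c)
{
    return (c >= 'a' && c <= 'z') ? (unsigned char)(c - 'a' + 'A') : c;
}

/* Hypothetical helper: upper-case a byte buffer in place using the
   ASCII-only rule above. */
static void ascii_upcase(unsigned char *buf, size_t len)
{
    for (size_t i = 0; i < len; ++i)
        buf[i] = ascii_toupper(buf[i]);
}
```

For example, the Polish word "lodz" with its diacritics is the byte
sequence B3 F3 64 BC in ISO Latin-2. After ascii_upcase it becomes
B3 F3 44 BC: only the plain "d" has been upper-cased, so the case of
the other three letters is wrong, but each of them is still the same
letter.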

-- 
Glynn Clements <glynn.clements at virgin.net>

