CWString

Glynn Clements glynn.clements at virgin.net
Thu Aug 28 06:24:31 EDT 2003


John Meacham wrote:

> > > > Sure, but as I've been saying, the implementation of glibc doesn't do
> > > > this.  In the C or POSIX locale, the ctype macros only recognise ASCII.
> > >  
> > > > Should this be considered a bug in glibc?
> > > 
> > > hmm.. how odd. I would consider it a bug, I think. I don't have a copy
> > > of the ISO spec handy but will be sure to look up whether that is
> > > conforming... It is certainly a malfeature if it is not a bug...
> > 
> > It certainly isn't a violation of ANSI/ISO C; that simply states that
> > "The behavior of these functions is affected by the LC_CTYPE category
> > of the current locale". It's perfectly legal for the implementation to
> > use different wide encodings depending upon the locale.
> 
> no, glibc #defines __STDC_ISO_10646__ so wchar_t's are guaranteed to
> hold UCS4 values always independent of locale.

OK; although the draft which I have only says:

       __STDC_ISO_10646__ A  decimal  constant  of the form yyyymmL |
                          (for  example,  199712L),   intended   to |
                          indicate  that values of type wchar_t are |
                          the   coded   representations   of    the |
                          characters   defined  by  ISO/IEC  10646, |
                          along with all amendments  and  technical |
                          corrigenda  as  of the specified year and |
                          month.

That's the only reference to that macro in the entire document. It
doesn't explicitly contradict (or even reference) the comments about
the semantics of the <wctype.h> functions.
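
For reference, a program can test for the macro at compile time. The
following is only a minimal sketch; the function names
iso10646_version() and report_wchar_encoding() are made up for
illustration and are not part of any library.

```c
#include <stdio.h>

/* Returns the yyyymm value of __STDC_ISO_10646__, or 0 if the
   implementation does not define the macro. */
long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}

/* Prints what the implementation states about wchar_t values. */
void report_wchar_encoding(void)
{
    long v = iso10646_version();

    if (v != 0L)
        printf("wchar_t holds ISO/IEC 10646 values (as of %ld)\n", v);
    else
        printf("__STDC_ISO_10646__ is not defined; "
               "the wchar_t encoding may depend on the locale\n");
}
```

Note that this only reports the encoding of wchar_t values; it says
nothing about how the <wctype.h> functions classify them.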

> the LC_CTYPE only affects
> what multibyte encoding is used. What was curious was that the character
> classification routines changed behavior based on LC_CTYPE (despite the
> encoding still being UCS4)
> 
> this might make sense for the classification routines dealing with upper
> and lower case actually, since I believe that that might depend on the
> language you are expressing.  however, other character classification
> routines (such as wcwidth) should not depend on the current locale. 

There are some variations between wcwidth() implementations; e.g. 
the XFree86 version of xterm includes two implementations, and the
comment:

 * The following functions are the same as mk_wcwidth() and
 * mk_wcwidth_cjk(), except that spacing characters in the East Asian
 * Ambiguous (A) category as defined in Unicode Technical Report #11
 * have a column width of 2. This variant might be useful for users of
 * CJK legacy encodings who want to migrate to UCS without changing
 * the traditional terminal character-width behaviour. It is not
 * otherwise recommended for general use.

I suppose that it's possible that some systems might wish to make the
behaviour locale-dependent.
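
To make the difference between the two variants concrete, here is a toy
sketch. It is not the xterm code: sample_width() and
is_ambiguous_sample() are hypothetical names, and the "table" covers
only three code points from the East Asian Ambiguous category (a real
table is far longer).

```c
#include <wchar.h>

/* Illustrative excerpt only: three code points from the East Asian
   Ambiguous (A) category (U+00A1, U+00A7, U+00B1). */
static int is_ambiguous_sample(wchar_t ucs)
{
    return ucs == 0x00A1 || ucs == 0x00A7 || ucs == 0x00B1;
}

/* Hypothetical helper: column width for the few characters this
   sketch covers.  A non-zero cjk_legacy selects width 2 for the
   ambiguous characters.  Returns -1 for anything outside the sketch. */
int sample_width(wchar_t ucs, int cjk_legacy)
{
    if (ucs >= 0x20 && ucs < 0x7F)      /* printable ASCII */
        return 1;
    if (is_ambiguous_sample(ucs))
        return cjk_legacy ? 2 : 1;
    return -1;                          /* not handled here */
}
```

A locale-dependent wcwidth() would amount to choosing the cjk_legacy
flag from the current locale rather than from an explicit argument.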

However, this is all a long way from the glibc behaviour, i.e. that
for the C/POSIX locale, and for locales without an LC_CTYPE data file,
everything outside of the ASCII range is undefined (not a member of
any category, not translated by towupper() etc).
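
A small probe along the following lines can be used to observe this.
It is a sketch only: probe() and probe_in_locale() are names invented
here, the "U+" labels assume that wchar_t holds UCS values, and what
gets printed for the non-ASCII character depends on the C library and
on which locales are installed.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Prints how the current LC_CTYPE locale classifies one wide
   character, and what towupper() maps it to. */
void probe(wint_t wc)
{
    printf("U+%04lX: alpha=%d upper=%d towupper=U+%04lX\n",
           (unsigned long)wc,
           iswalpha(wc) != 0,
           iswupper(wc) != 0,
           (unsigned long)towupper(wc));
}

/* Probes an ASCII letter and U+00E9 under the locale called `name`.
   Returns 0 if that locale could not be selected, 1 otherwise. */
int probe_in_locale(const char *name)
{
    if (setlocale(LC_CTYPE, name) == NULL)
        return 0;
    printf("LC_CTYPE=%s\n", name);
    probe(L'a');
    probe(0x00E9);      /* LATIN SMALL LETTER E WITH ACUTE */
    return 1;
}
```

Calling probe_in_locale("C") and then probe_in_locale() with the name
of some installed UTF-8 locale allows the two sets of results to be
compared side by side.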

> it is unclear what the correct thing for a Haskell implementation to
> do is. possibilities are:
> 1) determine some locale independent semantics for the classification
> functions and implement that
> 2) guarantee the validity of character classification routines only when
> the character is representable in the current locale
> 3) link against another library such as libunicode which provides its
> own classification routines (this could be done optionally at compile
> time...)
> 
> split the classification routines into locale dependent and independent
> ones, guarantee the locale independent ones will always work and one of
> the two above solutions for the rest...
> 
> In any case, solution 2 seems to be what we have now, which is probably
> an okay interim solution as long as we add an isRepresentable to
> determine if a Char can be expressed in the current locale and whether
> we can trust the classification functions... I have an implementation
> of one in the CWString library I posted earlier...
> 
> in any case, anything is better than the current 'ignore the locale'
> situation :)

Not necessarily. E.g. there are reasons why most programs don't just
call setlocale(LC_ALL, "") to make everything behave according to the
locale settings.

I18N complicates many things sufficiently that I would favour forcing
the programmer to explicitly ask for it (i.e. don't change the
semantics of any existing functions, but provide new functions for use
in internationalised code).
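
In C terms, the "explicitly ask for it" approach might look roughly
like the sketch below. The names enable_user_locale() and
is_representable() are hypothetical; the latter is only a C analogue of
the isRepresentable idea from the quoted text, not the CWString
implementation. A C program starts in the "C" locale, so nothing
changes until the first function is called.

```c
#include <limits.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Explicit opt-in: switches LC_CTYPE to the locale named by the
   environment.  Until this is called, the program stays in the "C"
   locale.  Returns 1 on success, 0 if the locale is unavailable. */
int enable_user_locale(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}

/* Returns 1 if wc can be converted to the multibyte encoding of the
   current LC_CTYPE locale, 0 otherwise. */
int is_representable(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    mbstate_t st;

    memset(&st, 0, sizeof st);          /* initial conversion state */
    return wcrtomb(buf, wc, &st) != (size_t)-1;
}
```

Existing functions keep their semantics; only code which calls
enable_user_locale() sees locale-dependent behaviour, and it can use
is_representable() to decide whether to trust the classification of a
given character.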

-- 
Glynn Clements <glynn.clements at virgin.net>


