CWString
Glynn Clements
glynn.clements at virgin.net
Thu Aug 28 06:24:31 EDT 2003
John Meacham wrote:
> > > > Sure, but as I've been saying, the implementation of glibc doesn't do
> > > > this. In the C or POSIX locale, the ctype macros only recognise ASCII.
> > >
> > > > Should this be considered a bug in glibc?
> > >
> > > hmm.. how odd. I would consider it a bug, I think. I don't have a copy
> > > of the ISO spec handy but will be sure to look up whether that is
> > > conforming... It is certainly a malfeature if it is not a bug...
> >
> > It certainly isn't a violation of ANSI/ISO C; that simply states that
> > "The behavior of these functions is affected by the LC_CTYPE category
> > of the current locale". It's perfectly legal for the implementation to
> > use different wide encodings depending upon the locale.
>
> no, glibc #defines __STDC_ISO_10646__ so wchar_t's are guaranteed to
> hold UCS4 values always, independent of locale.
OK; although the draft which I have only says:
    __STDC_ISO_10646__  A decimal constant of the form yyyymmL
                        (for example, 199712L), intended to
                        indicate that values of type wchar_t are
                        the coded representations of the
                        characters defined by ISO/IEC 10646,
                        along with all amendments and technical
                        corrigenda as of the specified year and
                        month.
That's the only reference to that macro in the entire document. It
doesn't explicitly contradict (or even reference) the comments about
the semantics of the <wctype.h> functions.
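For reference, a program can test for the macro at compile time. The following is a minimal sketch; the function name iso10646_version() is made up for illustration and is not part of any library.

```c
#include <wchar.h>

/* Hypothetical helper: returns the yyyymm value of __STDC_ISO_10646__,
   or 0 if the implementation does not define the macro. */
long iso10646_version(void)
{
#ifdef __STDC_ISO_10646__
    return __STDC_ISO_10646__;
#else
    return 0L;
#endif
}
```

A non-zero result only says that wchar_t values are ISO 10646 code points; as noted above, it says nothing about how the <wctype.h> functions classify them.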
> the LC_CTYPE only affects
> what multibyte encoding is used. What was curious was that the character
> classification routines changed behavior based on LC_CTYPE (despite the
> encoding still being UCS4)
>
> this might make sense for the classification routines dealing with upper
> and lower case actually, since I believe that that might depend on the
> language you are expressing. however, other character classification
> routines (such as wcwidth) should not depend on the current locale.
There are some variations between wcwidth() implementations; e.g.
the XFree86 version of xterm includes two implementations, and the
comment:
* The following functions are the same as mk_wcwidth() and
* mk_wcwidth_cjk(), except that spacing characters in the East Asian
* Ambiguous (A) category as defined in Unicode Technical Report #11
* have a column width of 2. This variant might be useful for users of
* CJK legacy encodings who want to migrate to UCS without changing
* the traditional terminal character-width behaviour. It is not
* otherwise recommended for general use.
I suppose that it's possible that some systems might wish to make the
behaviour locale-dependent.
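The difference between the two variants described in that comment can be sketched roughly as follows. This is only an illustration: the table is a small hand-picked subset of the East Asian Ambiguous ranges, not the real table, and the names width_cjk() and ambiguous_subset are hypothetical.

```c
#include <stddef.h>

struct interval { unsigned long first, last; };

/* Illustrative SUBSET of East Asian Ambiguous (A) code points;
   the real table in Unicode Technical Report #11 is much longer. */
static const struct interval ambiguous_subset[] = {
    { 0x00A1, 0x00A1 },   /* INVERTED EXCLAMATION MARK */
    { 0x00A4, 0x00A4 },   /* CURRENCY SIGN */
    { 0x00A7, 0x00A8 },   /* SECTION SIGN, DIAERESIS */
    { 0x0391, 0x03A1 },   /* GREEK CAPITAL LETTER ALPHA..RHO */
};

static int in_table(unsigned long ucs, const struct interval *t, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (ucs >= t[i].first && ucs <= t[i].last)
            return 1;
    return 0;
}

/* base_width is the width the ordinary wcwidth() reports for ucs.
   In CJK mode, ambiguous spacing characters are widened to 2 columns;
   everything else keeps its ordinary width. */
int width_cjk(unsigned long ucs, int base_width, int cjk_mode)
{
    size_t n = sizeof ambiguous_subset / sizeof ambiguous_subset[0];
    if (cjk_mode && base_width == 1 && in_table(ucs, ambiguous_subset, n))
        return 2;
    return base_width;
}
```

A locale-dependent system could, for example, choose cjk_mode from the current locale, which is the kind of variation suggested above.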
However, this is all a long way from the glibc behaviour, i.e. that
for the C/POSIX locale, and for locales without an LC_CTYPE data file,
everything outside of the ASCII range is undefined (not a member of
any category, not translated by towupper() etc).
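The behaviour is easy to probe. Below is a minimal sketch; upper_in_locale() is a hypothetical helper, and what it returns for non-ASCII input depends on the C library and on which locales are installed.

```c
#include <locale.h>
#include <stdio.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: apply towupper() to c under the named LC_CTYPE
   locale, then restore the previous locale. Returns WEOF if the
   requested locale cannot be selected. */
wint_t upper_in_locale(const char *name, wint_t c)
{
    char saved[256];
    const char *cur = setlocale(LC_CTYPE, NULL);

    /* Copy the name: the string returned by setlocale() may be
       overwritten by the next call. */
    snprintf(saved, sizeof saved, "%s", cur ? cur : "C");

    if (setlocale(LC_CTYPE, name) == NULL)
        return WEOF;

    wint_t up = towupper(c);
    setlocale(LC_CTYPE, saved);
    return up;
}
```

Comparing upper_in_locale("C", 0xE9) with the result under a UTF-8 locale (assuming one such as "en_US.UTF-8" is installed) shows whether a given implementation translates characters outside the ASCII range in the C locale.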
> it is unclear what the correct thing for a Haskell implementation to
> do is. possibilities are:
> 1) determine some locale independent semantics for the classification
> functions and implement that
> 2) guarantee the validity of character classification routines only when
> the character is representable in the current locale
> 3) link against another library such as libunicode which provides its
> own classification routines (this could be done optionally at compile
> time...)
>
> split the classification routines into locale dependent and independent
> ones, guarantee the locale independent ones will always work and use one of
> the two above solutions for the rest...
>
> In any case, solution 2 seems to be what we have now, which is probably
> an okay interim solution as long as we add an isRepresentable to
> determine if a Char can be expressed in the current locale and whether
> we can trust the classification functions... I have an implementation
> of one in the CWString library I posted earlier...
>
> in any case, anything is better than the current 'ignore the locale'
> situation :)
Not necessarily. E.g. there are reasons why most programs don't just
call setlocale(LC_ALL, "") to make everything behave according to the
locale settings.
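As an illustration (and an assumption about which reasons are meant): LC_ALL also switches LC_NUMERIC, which changes the decimal separator used by printf() and strtod() and can break code that reads or writes data files. A program that only wants locale-aware character handling can request just that category. The helper name below is hypothetical.

```c
#include <locale.h>

/* Hypothetical helper: adopt the user's locale for character handling
   only. LC_NUMERIC, LC_TIME and the other categories stay in the "C"
   locale. Returns nonzero on success. */
int enable_ctype_locale_only(void)
{
    return setlocale(LC_CTYPE, "") != NULL;
}
```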
I18N complicates many things sufficiently that I would favour forcing
the programmer to explicitly ask for it (i.e. don't change the
semantics of any existing functions, but provide new functions for use
in internationalised code).
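An explicit, opt-in function of that kind might look like the following on the C side. This is a sketch of the isRepresentable idea from the quoted text, not the CWString implementation; is_representable() is a hypothetical name.

```c
#include <limits.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical C analogue of isRepresentable: nonzero if wc can be
   converted to the multibyte encoding of the current LC_CTYPE locale. */
int is_representable(wchar_t wc)
{
    char buf[MB_LEN_MAX];
    mbstate_t st;

    memset(&st, 0, sizeof st);           /* start in the initial shift state */
    return wcrtomb(buf, wc, &st) != (size_t)-1;
}
```

Existing functions keep their semantics; internationalised code calls is_representable() first and only then relies on the locale-dependent classification of that character.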
--
Glynn Clements <glynn.clements at virgin.net>