More Unicode nit-picking
Colin Paul Adams
colin@colina.demon.co.uk
19 Oct 2001 06:09:09 +0100
I have accidentally noticed another problem in the revised Haskell
report (actually the library report, this time).
In module Char, toLower and toUpper appear to be under-specified (and
indeed, the whole module looks a little suspect).
I find it hard to say exactly what is wrong with them, and even harder
to say what the cure should be. Nevertheless:
Firstly:
"This module offers only a limited view of the
full Unicode character set; the full set of Unicode character
attributes is not accessible in this library."
I take it this implies that operations in this module apply to the
ENTIRE Unicode character set, so:
"Function toUpper converts a letter to the corresponding
upper-case letter, leaving any other character unchanged. Any
Unicode letter which has an upper-case equivalent is
transformed. Similarly, toLower converts a letter to the
corresponding lower-case letter, leaving any other character
unchanged."
But this seems to assume there is a one-to-one mapping between each
upper-case letter and its lower-case equivalent. Apparently this is
not so. (I'm only going on hearsay - something I saw on the xml-dev
list yesterday - I haven't checked any of this.)
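One way to check the one-to-one assumption is a round-trip test. The
following is only a minimal sketch: it assumes a modern GHC, whose
Data.Char module (the successor to the report's Char module) follows
the Unicode simple case mappings. The helper roundTrips and the
character names are mine, not the report's.

```haskell
import Data.Char (toUpper, toLower)

-- True when upper-casing and then lower-casing gives back the original.
-- (Hypothetical helper, not part of the Char module.)
roundTrips :: Char -> Bool
roundTrips c = toLower (toUpper c) == c

-- Characters written as escapes so the source stays ASCII-safe.
finalSigma, smallSigma, capitalSigma, sharpS :: Char
finalSigma   = '\x03C2'  -- GREEK SMALL LETTER FINAL SIGMA
smallSigma   = '\x03C3'  -- GREEK SMALL LETTER SIGMA
capitalSigma = '\x03A3'  -- GREEK CAPITAL LETTER SIGMA
sharpS       = '\x00DF'  -- LATIN SMALL LETTER SHARP S

-- Both lower-case sigmas share one upper-case letter, so the mapping
-- cannot be inverted: only one of them survives the round trip.
sigmaCollapses :: Bool
sigmaCollapses =
  toUpper finalSigma == capitalSigma
    && toUpper smallSigma == capitalSigma
    && not (roundTrips finalSigma)
```

Under those assumptions sigmaCollapses is True. Sharp s shows a
different gap: its usual upper-case form is the two letters "SS",
which a Char -> Char function cannot return, so toUpper sharpS simply
gives sharpS back.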
It seems that whilst the Unicode database's definitions of whether or
not a character is upper/lower/title case are normative, the mappings
between upper and lower case are only suggestive.
This is because how the mapping is done depends upon language
conventions. In Turkish, for instance, I is not the upper-case
equivalent of i, nor is i the lower-case equivalent of I (apparently
there are both a dotted and a non-dotted i, and likewise for I).
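To make the Turkish case concrete, here is a small sketch of what the
locale-independent mappings do with the four letters involved. It
assumes a modern GHC's Data.Char, which follows the Unicode simple
case mappings. The names dottedUpper, dotlessLower, upperResults and
lowerResults are my own.

```haskell
import Data.Char (toUpper, toLower)

-- The two Turkish letters that ASCII lacks, written as escapes.
dottedUpper, dotlessLower :: Char
dottedUpper  = '\x0130'  -- LATIN CAPITAL LETTER I WITH DOT ABOVE
dotlessLower = '\x0131'  -- LATIN SMALL LETTER DOTLESS I

-- (input, result) pairs under the locale-independent mappings.
upperResults :: [(Char, Char)]
upperResults = [ (c, toUpper c) | c <- ['i', dotlessLower] ]

lowerResults :: [(Char, Char)]
lowerResults = [ (c, toLower c) | c <- ['I', dottedUpper] ]
```

Under that assumption both small letters map to plain I, and both
capital letters map to plain i. Turkish convention instead pairs i
with the dotted capital and the dotless small letter with I, so a
single locale-independent table cannot be right for every language.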
So it would seem that the definitions:
"-- Case-changing operations
toUpper :: Char -> Char
toUpper = primUnicodeToUpper
toLower :: Char -> Char
toLower = primUnicodeToLower"
just beg the question - what should the primUnicodeToLower and
primUnicodeToUpper operations actually do?
Should they be locale sensitive?
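For the sake of discussion, a locale-sensitive variant might look
something like the sketch below. Everything in it is hypothetical:
the CaseLocale type and the functions toUpperIn and toLowerIn are
invented here for illustration and appear in neither the report nor
any library I know of. It is written against a modern GHC's Data.Char
and falls back on its locale-independent mappings.

```haskell
import Data.Char (toUpper, toLower)

-- Hypothetical: a tiny stand-in for a real locale description.
data CaseLocale = DefaultLocale | TurkishLocale
  deriving (Eq, Show)

-- Upper-case a character according to the given locale's conventions.
toUpperIn :: CaseLocale -> Char -> Char
toUpperIn TurkishLocale 'i' = '\x0130'  -- dotted capital I
toUpperIn _             c   = toUpper c

-- Lower-case a character according to the given locale's conventions.
toLowerIn :: CaseLocale -> Char -> Char
toLowerIn TurkishLocale 'I' = '\x0131'  -- dotless small i
toLowerIn _             c   = toLower c
```

With this, toUpperIn TurkishLocale 'i' gives the dotted capital, while
toUpperIn DefaultLocale 'i' still gives plain I. The cost is that the
type of the functions changes, which is presumably why the report
would have to take a position one way or the other.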
--
Colin Paul Adams
Preston Lancashire