Rewrite of Data.Char library?

Sun Oct 25 15:04:03 EDT 2009

On Thu, Oct 22, 2009 at 07:56:56PM -0700, Ahn, Ki Yung wrote:
> Ahn, Ki Yung wrote:
> > In the #haskell IRC channel, we just had a discussion on Data.Char
> > predicates such as isAlpha, isUpper, isLower.  The implementation of
> > Data.Char is not Haskell 98 since Char specification in Haskell 98 only
> > covers latin1.

Char in Haskell98 covers Unicode too;
http://haskell.org/onlinereport/char.html says:

    Function toUpper converts a letter to the corresponding upper-case
    letter, leaving any other character unchanged. Any Unicode letter
    which has an upper-case equivalent is transformed. Similarly,
    toLower converts a letter to the corresponding lower-case letter,
    leaving any other character unchanged.

> > However, the current predicates are confusing, and intuitive
> > properties do not hold.  One example is this:
> > 
> > [17:53:32] <newsham> > let cs = [minBound..maxBound]; us = filter
> > isUpper cs; ls = filter isLower cs in take 5 $ (map toUpper ls) \\ us
> > [17:53:33] <lambdabot>   "\170\186\223I\312"
> > 
> > isLower '\170' == True, but you can't turn that into an uppercase
> > letter:  toUpper '\170' == '\170'.

What behaviour would you expect?
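
For reference, the check from the IRC log can be restated without the
list difference. This is only a sketch: the name stuckLower is made up
here, and the exact characters it returns depend on the Unicode tables
behind your GHC's Data.Char.

```haskell
import Data.Char (isLower, isUpper, toUpper)

-- Hypothetical helper (not part of Data.Char): every character that
-- isLower accepts but whose toUpper image is not accepted by isUpper.
stuckLower :: [Char]
stuckLower =
  [ c | c <- [minBound .. maxBound], isLower c, not (isUpper (toUpper c)) ]
```

In GHCi, "take 5 stuckLower" lists the first few such characters.
'\223' is one of them, since toUpper leaves it unchanged.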

> Another problem is that, in the Haskell 98 Report, isAlpha is defined as
> isLower or isUpper.  This is different from the current implementation.
> What isAlpha actually matches is every "Letter" category.

Right, we have:

isLower = "Letter, Lowercase"

isUpper = "Letter, Uppercase" or "Letter, Titlecase"

isAlpha = "Letter, Lowercase" or
          "Letter, Uppercase" or "Letter, Titlecase" or
          "Letter, Modifier" or "Letter, Other"

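The mapping above can be written out with generalCategory from
Data.Char. The primed names are illustrative only; this restates the
table and is not claimed to be the library's actual source.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- Illustrative restatement of the three predicates in terms of
-- Unicode general categories; the primed names are hypothetical.
isLower', isUpper', isAlpha' :: Char -> Bool
isLower' c = generalCategory c == LowercaseLetter
isUpper' c = generalCategory c `elem` [UppercaseLetter, TitlecaseLetter]
isAlpha' c =
  generalCategory c
    `elem` [ LowercaseLetter, UppercaseLetter, TitlecaseLetter
           , ModifierLetter, OtherLetter ]
```

Written this way, it is easy to see that isAlpha' accepts two
categories that neither isUpper' nor isLower' covers.
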
The report says:
    any alphabetic character which is not lower case is treated as upper
    case (Unicode actually has three cases: upper, lower, and title)
and defines:
    isAlpha c =  isUpper c || isLower c
so the implementation is not consistent with the language definition. I
wouldn't like to say which is "wrong", though I would guess "both" :-)

I think it would be great if someone were to design a new interface
that provided something closer to the Unicode spec, perhaps in
Data.Char.Unicode; we could make the current interface a layer on top.
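
To see how far the implementation departs from the Report's equation,
one can enumerate the characters where it fails. A minimal sketch; the
name reportMismatches is hypothetical and the result depends on the
installed Unicode tables.

```haskell
import Data.Char (isAlpha, isLower, isUpper)

-- Hypothetical helper: characters for which the Report's equation
--   isAlpha c = isUpper c || isLower c
-- does not hold in the running implementation.
reportMismatches :: [Char]
reportMismatches =
  [ c | c <- [minBound .. maxBound], isAlpha c /= (isUpper c || isLower c) ]
```

Any "Letter, Modifier" or "Letter, Other" character should show up in
this list, for example '\1488' (Hebrew alef).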

> So, wouldn't it be better to keep isAlpha to follow the definition of
> the Haskell 98 report, and just define a new predicate called isLetter
> if needed?

If your idea is to improve the handling of '\170' then this won't help.
'\170' is "Letter, Lowercase".
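
The category of a given character can be checked directly. A small
sketch, with caseFacts as a made-up name; the category reported for a
code point depends on the Unicode version behind your GHC's tables, so
no particular output is assumed here.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isLower, toUpper)

-- Hypothetical helper: the facts about a character that matter in
-- this thread, namely its general category, whether isLower accepts
-- it, and what toUpper maps it to.
caseFacts :: Char -> (GeneralCategory, Bool, Char)
caseFacts c = (generalCategory c, isLower c, toUpper c)
```

In GHCi, "caseFacts '\170'" shows all three at once.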

Thanks
Ian