Unicode and is*

Marcin 'Qrczak' Kowalczyk qrczak@knm.org.pl
4 May 2001 15:50:37 GMT


Fri, 4 May 2001 15:20:02 +0100, Ian Lynagh <igloo@earth.li> wrote:

> Is there a reason why isUpper and isLower include all unicode
> characters of the appropriate class but isDigit is only 0..9?

There are also other weirdnesses, e.g. isSpace is specified to work
only on ISO-8859-1 spacing characters, and caseless letters are
considered uppercase.
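
A rough illustration of the gap, not taken from qforeign: it assumes
a present-day GHC whose Data.Char exports generalCategory (the
Haskell 98 Char module discussed here did not). The helper names
isUnicodeDigit, isNonAsciiDigit and isCaselessLetter are hypothetical.

```haskell
import Data.Char (GeneralCategory (..), generalCategory, isDigit)

-- | Hypothetical helper: True for any Unicode decimal digit
-- (general category Nd), not only ASCII '0'..'9'.
isUnicodeDigit :: Char -> Bool
isUnicodeDigit c = generalCategory c == DecimalNumber

-- | Hypothetical helper: digits that the ASCII-only isDigit rejects
-- although Unicode classifies them as decimal digits.
isNonAsciiDigit :: Char -> Bool
isNonAsciiDigit c = isUnicodeDigit c && not (isDigit c)

-- | Hypothetical helper: letters with no case distinction
-- (general categories Lm and Lo), which a classifier can report
-- separately instead of lumping them in with uppercase.
isCaselessLetter :: Char -> Bool
isCaselessLetter c = generalCategory c `elem` [ModifierLetter, OtherLetter]
```

For instance, isNonAsciiDigit '\x0663' (ARABIC-INDIC DIGIT THREE) is
True, while isCaselessLetter '\x05D0' (HEBREW LETTER ALEF) is True and
isCaselessLetter 'A' is False.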

> Are there any Haskell unicode libraries around?

http://www.sourceforge.net/projects/qforeign/

It provides a replacement for module Char with fuller character
properties, a charset conversion framework, and an inefficient IO
wrapper with transparent conversion of Handle data.

It also provides FFI support, but it's almost the same as in ghc-5.00
now, except that string conversion between C and Haskell takes
character encoding into account.
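
To make "takes character encoding into account" concrete, here is a
minimal sketch. It does not use the qforeign API; it assumes a
present-day GHC, whose base library has GHC.Foreign and
GHC.IO.Encoding. The names encodedLength and roundTrip are
hypothetical.

```haskell
import qualified GHC.Foreign as GF
import GHC.IO.Encoding (TextEncoding, utf8)

-- | Number of bytes a Haskell String occupies once marshalled to a
-- C character array in the given encoding (no NUL terminator counted).
encodedLength :: TextEncoding -> String -> IO Int
encodedLength enc s = GF.withCStringLen enc s (\(_, n) -> pure n)

-- | Marshal a String to a NUL-terminated C string in the given
-- encoding and decode it back. The input must not contain '\NUL'.
roundTrip :: TextEncoding -> String -> IO String
roundTrip enc s = GF.withCString enc s (GF.peekCString enc)
```

With utf8, encodedLength reports 2 for the one-character string
"\x0142", because that character needs two bytes in UTF-8; a
conversion that ignored the encoding would have to truncate it to one.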

It's undocumented except for some comments, sorry.

QForeign works with ghc >= 4.08 and nhc98 >= 1.01, but depends on
Unicode support in the compiler: some bits are disabled when the
compiler doesn't support character literals above '\255' (i.e. nhc98
compiled with ghc < 5.00 and probably with itself); the whole Unicode
story is disabled when Char has 8 bits (i.e. ghc < 5.00).
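
The Char-width condition can also be probed from Haskell itself. This
is only a sketch of the idea; how qforeign actually detects it is not
stated here, and the name charIsWide is hypothetical.

```haskell
-- | True when Char can hold code points beyond ISO-8859-1, i.e. when
-- it is wider than 8 bits.
charIsWide :: Bool
charIsWide = fromEnum (maxBound :: Char) > 255
```

A library could use such a flag to choose between a full Unicode code
path and an 8-bit fallback.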

> And is the implementation of unicode support for GHC being
> discussed/developed anywhere?

I'm afraid it is not being discussed. I received no feedback on the
Unicode features in qforeign. They are experimental; nobody has told
me whether they are good or bad, or how to design them better. They
are changing very slowly now.

I think they will find their way into ghc when they are finished
(with a total reimplementation of IO, integrating the conversion with
buffering, and also using the new FFI), but it would be great if they
represented some consensus rather than my thoughts alone.

For example, I don't like the HsAndC variant of the implementation of
a conversion (ConvBase), but some tests suggest that it's really best
to use a Haskell implementation of UTF-32 when called from Haskell
and a C implementation when called to work on character arrays,
so I don't know how to do it more clearly.

-- 
 __("<  Marcin Kowalczyk * qrczak@knm.org.pl http://qrczak.ids.net.pl/
 \__/
  ^^                      SUBSTITUTE SIGNATURE
QRCZAK