Let's get this finished

Marcin 'Qrczak' Kowalczyk mk167280 at students.mimuw.edu.pl
Mon Jan 8 11:11:04 EST 2001


On Tue, 9 Jan 2001, Manuel M. T. Chakravarty wrote:

> Where are these variants?  In QForeign?

Yes.

QForeign is being continuously updated in CVS:
cvs -d:pserver:anonymous@cvs.qforeign.sourceforge.net:/cvsroot/qforeign \
    login
(empty password)
cvs -d:pserver:anonymous@cvs.qforeign.sourceforge.net:/cvsroot/qforeign \
    -z3 get qforeign

> The interface for CString that was under discussion here didn't say
> anything about conversions?

Right, it didn't.

> Fair enough.  So, basically what I am saying, then, is that
> as long as it is not clear how the conversion interface
> looks in detail, let's talk about two different conversions
> that we definitely know we are going to need: the
> toLatin1/fromLatin1 conversion and the localOut/localIn
> conversion.  I want the standard FFI CString interface to
> support these two.  The rest we can add later, but I don't
> want to be restricted to the std conversion only.

OK, I agree that they should be introduced early. I didn't propose them
because even the type of conversions was not completely clear.

As long as conversions are not used throughout the libraries, and I/O
and filenames default to ISO-8859-1, toLocal and fromLocal would
probably be equivalent to toLatin1 and fromLatin1. Once everything is
ready to accept conversions, they will switch to the system-dependent
default local byte encoding; on Unices, based on the locale.
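
In that interim state the definitions would be trivial. A sketch,
assuming toLatin1 and fromLatin1 exist with the types that appear
below (and writing Byte as Word8):

    -- Interim definitions while everything defaults to ISO-8859-1;
    -- once the libraries accept conversions, these would instead
    -- select the system-dependent local encoding (locale-based on
    -- Unices).
    fromLocal :: [Word8] -> String
    fromLocal = fromLatin1

    toLocal :: String -> ([Word8], Bool)
    toLocal = toLatin1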

> BTW, how efficient is the code for toLatin1/fromLatin1?

fromLatin1's inner loop is:
    map (chr . fromIntegral) :: [Byte] -> String

toLatin1's inner loop is worse: it must check for valid characters. My
interface allows users of a conversion to see whether there was an
error, and also to get the converted string even if there were errors
(with problematic places marked appropriately for the target encoding):

    import Data.Char (ord)
    import Data.Word (Word8)

    -- The Bool is True iff every character fit into Latin-1;
    -- unrepresentable characters are marked with '?' in the output.
    conv :: String -> ([Word8], Bool)
    conv [] = ([], True)
    conv (ch:s)
        | ch <= '\xFF' = (fromIntegral (ord ch)  : rest, good)
        | otherwise    = (fromIntegral (ord '?') : rest, False)
        where
        (rest, good) = conv s

If it's measured to be too slow, there could be
    unsafeToLatin1
which works correctly as long as all characters are valid...
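
A minimal sketch of that function (using the same imports as the conv
sketch above; the name is only the proposal just made):

    -- No validity check: characters above '\xFF' silently yield
    -- wrong (truncated) bytes, so this is only correct when the
    -- caller guarantees pure Latin-1 input.
    unsafeToLatin1 :: String -> [Word8]
    unsafeToLatin1 = map (fromIntegral . ord)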

> [Why not call it toLocal/fromLocal?].

OK, these names are better.

> If we have the distinction anyway, making it a little more flexible
> and adding CustomCConv seems sensible.

Conv is intentionally abstract. There is
    iconv :: (Storable from, Storable to)
          => String -> String -> [to] -> IO (Conv from to)

on systems which have iconv (it's specified in the Single UNIX
Specification and implemented in glibc on Linux; I guess it is not
available natively on Windows, and while there are portable
implementations for Unices, I don't know whether they work on Windows).
The parameters are: the name of the input encoding, the name of the
output encoding, and a string used to mark errors.
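
For illustration, constructing such a conversion might look like this
(a hypothetical usage sketch: the encoding names are examples, and
Conv is QForeign's abstract type as quoted above):

    import Data.Char (ord)
    import Data.Word (Word8)

    -- Byte-to-byte conversion from ISO-8859-2 to UTF-8; the third
    -- argument marks unconvertible input with "?", expressed in the
    -- target encoding as the [to] parameter requires.
    mkLatin2ToUtf8 :: IO (Conv Word8 Word8)
    mkLatin2ToUtf8 =
        iconv "ISO-8859-2" "UTF-8" (map (fromIntegral . ord) "?")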

A problem with using iconv directly is that there is no portable way to
know either the name of a supported Unicode flavour ("UTF-8" is a good
guess, but it's inefficient, and not all iconv implementations provide
UCS-4 in native endianness: glibc-2.1.3 does not, glibc-2.2 does) or
the name of the default local encoding (there is nl_langinfo (CODESET),
but AFAIK it's not available on BSD). In practice iconv often must be
composed with an appropriate conversion which bridges one of iconv's
Unicode flavours with an array of 32-bit Chars. Nevertheless it does
provide its own database of charsets identified by names, and it is
already implemented.
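
To make the composition concrete, here is a sketch of the Haskell half
of such a bridge: encoding one Char as UTF-8 bytes (restricted to code
points below 0x10000 for brevity; this is not QForeign's actual code):

    import Data.Bits ((.&.), (.|.), shiftR)
    import Data.Char (ord)
    import Data.Word (Word8)

    -- Encode a single Char as UTF-8, to be matched against iconv's
    -- "UTF-8" side; a real implementation also covers 4-byte forms.
    encodeUtf8Char :: Char -> [Word8]
    encodeUtf8Char c
        | n < 0x80   = [fromIntegral n]
        | n < 0x800  = [0xC0 .|. fromIntegral (n `shiftR` 6),
                        0x80 .|. fromIntegral (n .&. 0x3F)]
        | otherwise  = [0xE0 .|. fromIntegral (n `shiftR` 12),
                        0x80 .|. fromIntegral ((n `shiftR` 6) .&. 0x3F),
                        0x80 .|. fromIntegral (n .&. 0x3F)]
        where n = ord c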

Currently I have only iconv to implement the default local encoding,
with appropriate autoconf magic to find a common language with it. I
also had wcsrtombs and mbsrtowcs, but people said they are less
portable because wchar_t need not be Unicode. Windows needs its own
implementation, which I could take e.g. from Python's sources, but I
don't have a place to test it.

> Maybe your conversion library could, then, have a function like
> 
>   registerConv :: String -> IO (Conv Char Byte) -> IO ()
> 
> which allows me to give symbolic names to conversions.

A central database can be built around conversions available as values
in the program. But it would be a bad idea to make names the primary
identification and require registering a conversion before use. Since
there can be various schemes of encoding names (MIME, iconv's names)
and not every encoding has a name in every scheme (I doubt MIME talks
about "UCS-4 in native endianness", which corresponds to the array of
Chars), I would not encourage working in terms of names.
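
If such a registry is wanted, it can be a thin layer over conversion
values. A minimal sketch, assuming registerConv's type from the quote
(convRegistry and lookupConv are hypothetical names, and Conv and Byte
are QForeign's types, assumed in scope):

    import Data.IORef
    import System.IO.Unsafe (unsafePerformIO)

    {-# NOINLINE convRegistry #-}
    convRegistry :: IORef [(String, IO (Conv Char Byte))]
    convRegistry = unsafePerformIO (newIORef [])

    -- Give a conversion an optional symbolic name.
    registerConv :: String -> IO (Conv Char Byte) -> IO ()
    registerConv name mk = modifyIORef convRegistry ((name, mk) :)

    -- Unregistered conversions remain usable directly as values;
    -- the registry only adds name-based lookup on top.
    lookupConv :: String -> IO (Maybe (IO (Conv Char Byte)))
    lookupConv name = fmap (lookup name) (readIORef convRegistry)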

I've implemented conversions for the simple ISO-8859-x, CP-xxxx,
CP-xxx, KOI8-R and Mac encodings (excluding those which need special
bidi or multibyte processing); these are implemented in Haskell, while
iconv is implemented in C. Both kinds of implementations are handled
during String <-> CString conversions in a way which I believe is
roughly as efficient as possible.

OTOH QForeign's experimental converting IO replacement, IOConv, is not
efficient at all for conversions implemented in C.

> > We already have mallocArray0 and pokeArray0. You only have to cast
> > characters to [CChar].
> 
> Sure - but why not have this as predefined functions in
> CString?  That's all I am proposing.

I would not encourage people to skip the conversion and produce code which
works only for ISO-8859-1. Latin1 is just one of many encodings.

Since string handling in Haskell is already inefficient, I hope that 
adding conversions would not make a big relative difference. It would
be a different story if strings could be passed to C functions without
marshalling.

> BTW, do you know Pango <http://www.pango.org/>? 

Not yet.

-- 
Marcin 'Qrczak' Kowalczyk




