Let's get this finished

Sun Jan 7 22:55:31 EST 2001

qrczak at knm.org.pl (Marcin 'Qrczak' Kowalczyk) wrote,

> Sun, 07 Jan 2001 13:15:21 +1100, Manuel M. T. Chakravarty <chak at cse.unsw.edu.au> wrote:
> 
> > > When someone really wants to use mallocCString and pokeCString now
> > > (knowing that there is a little point of doing that in the case of
> > > conversions), he can use mallocArray0 and pokeArray0, after casting
> > > characters of the string to [CChar].
> > 
> > To be honest, I don't like this.  It is nice having the interface
> > such that we can switch to using conversions at some point, but
> > I still want to be able to conveniently deal with 8bit characters
> > (because this is what many C libraries use).  So, I want a fast and
> > convenient interface to 8bit strings *in addition* to the interface
> > that can deal with conversions.  In particular this means that
> > I don't want to deal with CChar in the Haskell interface only to
> > circumvent conversion.
> 
> I understand everything except the last sentence. Why it is bad to
> deal with CChar in Haskell?
>
> It could be confusing if some String values represented texts in
> Unicode and others - in the C's encoding. (Especially if the programmer
> uses ISO-8859-1 for C encoding and does not care about the difference,
> and then somebody using ISO-8859-7 tries to run his code!)
> 
> IMHO most strings on which C functions work (those ending with
> '\0') are either in the default local encoding (if they are texts
> in a natural language or filenames) or more rarely ASCII (if they
> are e.g. names of mail headers, identifiers in a C program, or
> commandline switches of some program). Sometimes the encoding is
> specified explicitly by the protocol or is stored in data itself.
> 
> For ASCII the default local encoding can be used too, with a speed
> penalty; practically used encodings are ASCII-compatible. You can
> explicitly specify fromLatin1 or toLatin1 if you really want C
> characters to map to Haskell's '\0'..'\255' - it should be faster
> (does not call iconv or the like). You can also use CChar.

The speed penalty is exactly what I am worried about.  What
you are proposing - if I understand you correctly - is to
use Unicode (in whatever encoding) exclusively on the
Haskell side, so that every String passed between Haskell
and C has to go through a conversion.  Then, as you say,
poke and some other routines make no sense on Strings,
because the length of a string varies between encodings.
Ok, I got that.
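
To illustrate the point about varying length, here is a minimal
sketch.  It assumes UTF-8 as the C-side encoding, which is only an
assumption for the example (no particular encoding is named in this
thread), and `utf8Length' and `encodedLength' are hypothetical helper
names.

```haskell
import Data.Char (ord)

-- Number of bytes a single Char occupies when encoded as UTF-8
-- (UTF-8 is an assumption made for this example only).
utf8Length :: Char -> Int
utf8Length c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Size of the buffer (without the terminating NUL) that the
-- encoded form of a String would need on the C side.
encodedLength :: String -> Int
encodedLength = sum . map utf8Length
```

For the string "na\239ve", `length' gives 5 but `encodedLength' gives
6, so a poke into a buffer sized by the Haskell-side length would
overrun it.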

Now what I am thinking is that this will make String
marshalling, which is slow already, even slower.  So, an
all-Unicode Haskell will be slower than what we have now.

Strings are used for two purposes in programs: (1) To
represent natural language and (2) to represent unstructured
program data.  For the first case, we have to take the
performance penalty if we want the benefit of handling
non-ASCII languages.  For the second case, however, I think
we don't need it.

Take for example (and it is not a very good example) the Tk
binding for Haskell.  It accesses the Tk widget set by
constructing Tcl commands at runtime and sending them to the
Tcl interpreter via a pipe.  That's already pretty
inefficient.  Now when each of these commands has to go
through a Unicode conversion, things will get even worse.

Another example is configuration management in libraries
like the Gnome library.  A program can dump its session data
into an ASCII file using these libraries, so that it doesn't
have to maintain its own preferences and resource files.  Do
we really want all this stuff to go through the converter?

Furthermore, to be honest, I am not really sure why we have
to do the conversion anyway.  When I am having a Haskell
program like [1]

  main = putStrLn "今日は"

then, there are two possibilities.  Either I have a system
configured with the locale ja_JP and I happen to run this
Haskell program in kterm or a Mule/(X)Emacs subshell, or I
will get mojibake[2] anyway.  No amount of conversion is
going to change that.  So, what exactly do I get for the
performance penalty that conversion incurs?

How about having an interface where the String marshalling
functions take an additional argument

  data CConv = NoCConv             -- handle as 8bit chars
             | StdCConv            -- standard conversion
             | CustomCConv String  -- special conversions

Then, it is up to the programmer to decide whether to use
conversion.  The idea of the last variant would be that in
your conversion library, I can give conversions a name and
identify them by that name.  This way the CString wouldn't
depend on the exact conversion interface, but still would be
open to the addition of new conversions.  Routines like
mallocCString and pokeCString would only make sense for
`NoCConv', then.
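
A minimal sketch of how such an interface could look, with the second
constructor spelled StdCConv.  It is written against a current GHC
base library, whose functions (newCAString, newCString,
GHC.Foreign.newCString, mkTextEncoding) postdate this discussion and
are only used here as stand-ins for the conversion library.  The names
`newCStringWith', `peekCStringWith' and `roundTrip' are hypothetical.

```haskell
import Foreign.C.String
  (CString, newCAString, newCString, peekCAString, peekCString)
import Foreign.Marshal.Alloc (free)
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (mkTextEncoding)

-- Conversion selector passed to every String marshalling function.
data CConv
  = NoCConv             -- handle as 8bit chars
  | StdCConv            -- standard conversion (here: locale encoding)
  | CustomCConv String  -- conversion identified by name

-- Hypothetical: allocate a NUL-terminated C string from a String.
newCStringWith :: CConv -> String -> IO CString
newCStringWith NoCConv            s = newCAString s  -- low 8 bits of each Char
newCStringWith StdCConv           s = newCString s
newCStringWith (CustomCConv name) s = do
  enc <- mkTextEncoding name  -- accepted names depend on the platform
  GHC.newCString enc s

-- Hypothetical: read a NUL-terminated C string back into a String.
peekCStringWith :: CConv -> CString -> IO String
peekCStringWith NoCConv            cs = peekCAString cs
peekCStringWith StdCConv           cs = peekCString cs
peekCStringWith (CustomCConv name) cs = do
  enc <- mkTextEncoding name
  GHC.peekCString enc cs

-- Marshal a String out and back again, freeing the C buffer.
roundTrip :: CConv -> String -> IO String
roundTrip conv s = do
  cs <- newCStringWith conv s
  r  <- peekCStringWith conv cs
  free cs
  return r
```

With this shape, `roundTrip NoCConv' skips conversion entirely for
8bit data, while `roundTrip (CustomCConv "UTF-8")' selects a
conversion by name without the marshalling code knowing anything else
about it.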

Cheers,
Manuel

[1] I hope your mail reader can handle iso-2022-jp :-)

[2] Mojibake is the Japanese term for Japanese text
    displayed through software that cannot handle it.
    Mojibake is written as "文字化け" in Japanese and if
    your mail reader can't handle Japanese, you'll see just
    that ;-)