[Haskell-cafe] Hugs vs GHC (again) was: Re: Some random newbie questions

Sun Jan 9 14:03:44 EST 2005

"Simon Marlow" <simonmar at microsoft.com> writes:

>  - Do the character class functions (isUpper, isAlpha etc.) work
>    correctly on the full range of Unicode characters?  This is true in
>    Hugs.  It's true with GHC on some systems (basically we were lazy
>    and used the underlying C library's support here, which is patchy).

It's not obvious what the predicates should really mean. For example,
should isDigit and isHexDigit include non-ASCII digits, and should
isSpace include non-breaking space characters? The Haskell 98 report
gives some guidelines, which don't necessarily coincide with C
practice or with the expectations of Unicode people.
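
To make the two readings concrete, here is a minimal sketch. It
assumes a Data.Char that exposes generalCategory (as later versions of
GHC's base library do); the four predicate names are made up for
illustration and are not part of any report or library.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- "C practice" reading: only the ASCII digits count.
isAsciiDigit :: Char -> Bool
isAsciiDigit c = c >= '0' && c <= '9'

-- "Unicode" reading: any character of general category Nd.
isUnicodeDecimalDigit :: Char -> Bool
isUnicodeDecimalDigit c = generalCategory c == DecimalNumber

-- ASCII whitespace only.
isAsciiSpace :: Char -> Bool
isAsciiSpace c = c `elem` " \t\n\r\f\v"

-- Any character of general category Zs, which includes U+00A0
-- (no-break space) but not the ASCII control characters.
isUnicodeSpaceSeparator :: Char -> Bool
isUnicodeSpaceSeparator c = generalCategory c == Space
```

The two digit predicates disagree on U+0663 (ARABIC-INDIC DIGIT
THREE), and the two space predicates disagree on U+00A0, which is
exactly the kind of decision that has to be agreed on first.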

Once this is agreed, it would be easy to write scripts which generate
C code from the UnicodeData.txt tables published by Unicode. I think
table-driven predicates and toUpper/toLower would be better implemented
in C; Haskell is not good at static constant tables of numbers.

Another issue is that the set of predicates provided by the Haskell 98
library report is not enough to implement e.g. a Haskell 98 lexer,
which needs "any Unicode symbol or punctuation". Case mapping would
be better done string -> string rather than character -> character,
but this would break a long-established Haskell interface. Case mapping
is also locale-sensitive (in very minor ways). Haskell doesn't provide
algorithms like normalization or collation. In general the Haskell 98
interface is not enough for complex Unicode processing.
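
As an illustration of the missing predicate, here is a sketch of "any
Unicode symbol or punctuation" written against generalCategory. This
assumes a Data.Char richer than the Haskell 98 one (GHC's base library
later gained generalCategory); the name isSymbolOrPunctuation is
hypothetical. It covers only the raw character class, not the rest of
the lexical rule, which also excludes the special characters.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- True for every character whose general category is one of the
-- punctuation (P*) or symbol (S*) categories.
isSymbolOrPunctuation :: Char -> Bool
isSymbolOrPunctuation c =
  generalCategory c `elem`
    [ ConnectorPunctuation, DashPunctuation
    , OpenPunctuation, ClosePunctuation
    , InitialQuote, FinalQuote, OtherPunctuation
    , MathSymbol, CurrencySymbol, ModifierSymbol, OtherSymbol
    ]
```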

>  - Can you do String I/O in some encoding of Unicode?  No Haskell
>    compiler has support for this yet, and there are design decisions
>    to be made.

The problem with designing an API for recoders is that depending on
whether the recoder is implemented in Haskell or interfaced from C, it
needs a different data representation. Pure Haskell recoders prefer lazy
lists of characters or bytes (except that a desire to detect source
errors or characters unavailable in the target encoding breaks this),
while high performance C prefers pointers to buffers with chunks of
text.
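
A small sketch of why error detection breaks the lazy-list style,
using an ISO-8859-1 encoder as the simplest possible recoder. Both
function names are invented for this example. The lossy version can
produce output incrementally; the checking version cannot return any
bytes until it has inspected the whole input.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Lazy: each output byte depends only on one input character, so this
-- works on arbitrarily long (even infinite) input. Characters that do
-- not fit are replaced by '?' (byte 63).
encodeLatin1Lossy :: String -> [Word8]
encodeLatin1Lossy = map enc
  where
    enc c
      | ord c <= 0xFF = fromIntegral (ord c)
      | otherwise     = 63

-- Checking: reports the first unrepresentable character. The caller
-- sees no bytes at all until the entire string has been traversed,
-- because the result might still turn out to be a Left.
encodeLatin1Checked :: String -> Either Char [Word8]
encodeLatin1Checked = mapM enc
  where
    enc c
      | ord c <= 0xFF = Right (fromIntegral (ord c))
      | otherwise     = Left c
```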

Transparent recoding makes some behavior hard to express. Imagine
parsing HTTP headers followed by "\r\n\r\n" and a binary file. If you
read headers line by line and decoding is performed in blocks, then
once you determine where the headers end it's too late to find the
start of the binary file: a part of it has already been decoded into
text. You have to determine the end of the headers while working with
bytes, not characters, and only convert the first part. Not performing
the recoding in blocks is tricky if the decoder is implemented in C.
Giving 1-byte buffers for lots of iconv() calls is not nice.
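
A sketch of the "find the boundary in bytes first" approach, assuming
the bytestring library (Data.ByteString and its Char8 companion). The
function name splitMessage is made up, and decoding the header part as
ISO-8859-1 is only a placeholder for whatever decoding is wanted.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8

-- Split raw input at the first "\r\n\r\n". Only the header part is
-- turned into characters; the body is returned as untouched bytes.
-- Returns Nothing if the separator has not been seen.
splitMessage :: B.ByteString -> Maybe (String, B.ByteString)
splitMessage raw =
  case B.breakSubstring sep raw of
    (hdr, rest)
      | B.null rest -> Nothing
      | otherwise   -> Just (B8.unpack hdr, B.drop (B.length sep) rest)
  where
    sep = B8.pack "\r\n\r\n"
```

This works on a buffer that is already in memory; it does not solve
the harder problem of doing the same on a stream with a block-based
decoder sitting underneath.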

Or imagine parsing an HTML file with the encoding specified inside
it in a <meta> element. Switching the encoding in the middle is
incompatible with buffering. Maybe the best option is to parse the
beginning in ISO-8859-1 just to determine the encoding, and then
reparse everything once the encoding is known.
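
A rough sketch of that two-pass idea, again assuming the bytestring
library. The name sniffCharset, the 1024-byte limit and the crude
search for "charset=" are all assumptions made for the example; a real
implementation would parse the <meta> element properly.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import Data.Char (isAlphaNum, isAscii, toLower)
import Data.List (isPrefixOf, tails)

-- Read the first bytes as ISO-8859-1 (every byte is a character, so
-- this cannot fail) and look for a charset name. The result is
-- lowercased. The caller then decodes the whole input again with it.
sniffCharset :: B.ByteString -> Maybe String
sniffCharset raw =
  case [drop (length key) t | t <- tails prefix, key `isPrefixOf` t] of
    []      -> Nothing
    (t : _) ->
      case takeWhile isNameChar (dropWhile isQuote t) of
        []   -> Nothing
        name -> Just name
  where
    key          = "charset="
    prefix       = map toLower (B8.unpack (B.take 1024 raw))
    isQuote c    = c == '"' || c == '\''
    isNameChar c = isAscii c && (isAlphaNum c || c `elem` "-_.:")
```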

If characters are recoded automatically on I/O, one is tempted to
extend the framework for other conversions like compression, line
ending convention, HTML character escaping etc.

>  - What about Unicode FilePaths?  This was discussed a few months ago
>    on the haskell(-cafe) list, no support yet in any compiler.

Nobody knows what the semantics should be.

I once wrote elsewhere a short report about handling filename
encodings in various languages and environments which use Unicode as
their string representation. Here it is (I was later corrected:
Unicode non-characters are valid in UTF-x):

I describe here languages which exclusively use Unicode strings.
Some languages have both byte strings and Unicode strings (e.g. Python);
byte strings are then generally used for strings exchanged with
the OS, and the programmer is responsible for the conversion if he
wishes to use Unicode.

I consider situations when the encoding is implicit. For I/O of file
contents it's always possible to set the encoding explicitly somehow.

Corrections are welcome. This is mostly based on experimentation.

Java (Sun)
----------

Strings are UTF-16.

Filenames are assumed to be in the locale encoding.

a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.

b) Creating. Characters which cannot be converted are replaced by "?".

Command line arguments and standard I/O are treated in the same way.

Java (GNU)
----------

Strings are UTF-16.

Filenames are assumed to be in Java-modified UTF-8.

a) Interpreting. If a filename cannot be converted, a directory listing
   contains a null instead of a string object.

b) Creating. All Java characters are representable in Java-modified UTF-8.
   Obviously not all potential filenames can be represented.

Command line arguments are interpreted according to the locale.
Bytes which cannot be converted are skipped.

Standard I/O works in ISO-8859-1 by default. Obviously all input is
accepted. On output characters above U+00FF are replaced by "?".

C# (mono)
---------

Strings are UTF-16.

Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS
environment variable, with UTF-8 implicitly added at the end. These
encodings are tried in order.
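
The "try each encoding in order" rule can be sketched as follows. The
Decoder type and the two toy decoders are invented for illustration
and have nothing to do with mono's actual code; they only show the
shape of the fallback chain.

```haskell
import Control.Monad (msum)
import Data.Char (chr)
import Data.Word (Word8)

-- A decoder either accepts a whole byte string or rejects it.
type Decoder = [Word8] -> Maybe String

-- Toy decoder: accepts only 7-bit input.
decodeAscii :: Decoder
decodeAscii bs
  | all (< 0x80) bs = Just (map (chr . fromIntegral) bs)
  | otherwise       = Nothing

-- Toy decoder: ISO-8859-1 accepts every byte string.
decodeLatin1 :: Decoder
decodeLatin1 = Just . map (chr . fromIntegral)

-- Try the decoders in list order and keep the first success.
decodeFirst :: [Decoder] -> Decoder
decodeFirst decoders bs = msum [d bs | d <- decoders]
```

Note that a decoder which accepts everything, like ISO-8859-1, hides
all the entries after it in the list.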

a) Interpreting. If a filename cannot be converted, it's skipped in
   a directory listing.

   The documentation says that if a filename, a command line argument
   etc. looks like valid UTF-8, it is treated as such first, and
   MONO_EXTERNAL_ENCODINGS is consulted only in remaining cases.
   The reality seems to not match this (mono-1.0.5).

b) Creating. If UTF-8 is used, non-characters are converted to
   pseudo-UTF-8, U+0000 throws an exception (System.ArgumentException:
   Path contains invalid chars), paired surrogates are treated
   correctly, and an isolated surrogate causes an internal error:
** ERROR **: file strenc.c: line 161 (mono_unicode_to_external): assertion failed: (utf8!=NULL)
aborting...

Command line arguments are treated in the same way, except that if an
argument cannot be converted, the program dies at start:
[Invalid UTF-8]
Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea).
Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try again.

Console.WriteLine emits UTF-8. Paired surrogates are treated
correctly, non-characters and unpaired surrogates are converted to
pseudo-UTF-8.

Console.ReadLine interprets text as UTF-8. Bytes which cannot be
converted are skipped.

Perl
----

Depending on the convention used by a particular function and on
imported packages, a Perl string is treated either as Perl-modified
Unicode (with character values up to 32 bits or 64 bits depending on
the architecture) or as an unspecified locale encoding. It has two
internal representations: ISO-8859-1 and Perl-modified UTF-8 (with
an extended range).

If every Perl string is assumed to be a Unicode string, then filenames
are effectively ISO-8859-1.

a) Interpreting. Characters up to 0xFF are used.

b) Creating. If the filename has no characters above 0xFF, it is
   converted to ISO-8859-1. Otherwise it is converted to Perl-modified
   UTF-8 (all characters, not just those above 0xFF).
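
   The rule in b) can be modelled like this. It is a sketch of the
   behavior as observed above, not of Perl's implementation; the
   function names are made up, and the encoder handles only standard
   UTF-8 up to U+10FFFF, not Perl's extended range.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Standard UTF-8 encoding of one character (1 to 4 bytes).
utf8 :: Char -> [Word8]
utf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

-- If every character fits in one byte, emit ISO-8859-1; otherwise
-- emit UTF-8 for the whole string, including the characters that
-- would have fit.
perlStyleBytes :: String -> [Word8]
perlStyleBytes s
  | all ((<= 0xFF) . ord) s = map (fromIntegral . ord) s
  | otherwise               = concatMap utf8 s
```

   The same U+00E9 therefore comes out as one byte or as two,
   depending on what else happens to be in the string.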

Command line arguments and standard I/O are treated in the same way,
i.e. ISO-8859-1 on input and a mixture of ISO-8859-1 and UTF-8 on
output, depending on the contents.

This behavior is modifiable by importing various packages and using
interpreter invocation flags. When Perl is told that command line
arguments are UTF-8, the behavior for strings which cannot be
converted is inconsistent: sometimes it's treated as ISO-8859-1,
sometimes an error is signalled.

Haskell
-------

Haskell nominally uses Unicode. There is no conversion framework
standardized or implemented yet, though. Implementations which support
more than 256 characters currently assume ISO-8859-1 for filenames,
command line arguments and all I/O, taking the lowest 8 bits of a
character code on output.
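
The effect of "taking the lowest 8 bits" can be reproduced in a few
lines. The function names are hypothetical; the point is only that a
character above U+00FF silently turns into an unrelated one.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Keep only the lowest 8 bits of the character code.
truncateToByte :: Char -> Word8
truncateToByte = fromIntegral . ord

-- What a reader assuming ISO-8859-1 would see after such output.
asWritten :: String -> String
asWritten = map (chr . fromIntegral . truncateToByte)
```

For example, U+0142 (LATIN SMALL LETTER L WITH STROKE) has code 0x142,
whose low byte is 0x42, so it is written as "B".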

Common Lisp: Clisp
------------------

The Common Lisp standard doesn't say anything about string encoding.
In Clisp strings are UTF-32 (internally optimized as UCS-2 and
ISO-8859-1 when possible). Any character code up to U+10FFFF is
allowed, including non-characters and isolated surrogates.

Filenames are assumed to be in the locale encoding.

a) Interpreting. If a byte cannot be converted, an exception is thrown.

b) Creating. If a character cannot be converted, an exception is thrown.

Kogut (my language; this is the current state - may be changed)
-----

Strings are UTF-32 (internally optimized as ISO-8859-1 when possible).
Currently any character code up to U+10FFFF is allowed, including
non-characters and isolated surrogates.

Filenames are assumed to be in the locale encoding. I plan to add an
environment variable which can override this default. A program can
itself set the encoding to something else, perhaps locally during
execution of some code. It can use a conversion which puts U+FFFD / "?"
instead of throwing an exception on error, or which does something else.

a) Interpreting. If a byte cannot be converted, an exception is thrown.

b) Creating. If a character cannot be converted, an exception is thrown.
   U+0000 terminates the filename (this should be fixed).

Command line arguments and standard I/O are treated in the same way.

GNOME
-----

GNOME uses UTF-8 internally, or sometimes byte strings in other
encodings. I guess filenames are passed as byte strings. AFAIK
sometimes filenames are expressed as URLs, even internally when it's
invisible to the user, and then various unsafe bytes are escaped as
two hex digits preceded by the percent sign. From the programmer's
point of view the original byte strings are generally used. Filename
encoding matters for the display though, so here I describe the user's
point of view.
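
The URL-style escaping mentioned above looks roughly like this. It is
a sketch, not GNOME's code: the function name is invented, and the set
of bytes treated as safe (ASCII letters, digits and "-._~/") is an
assumption made for the example.

```haskell
import Data.Char (chr, isAlphaNum)
import Data.Word (Word8)
import Text.Printf (printf)

-- Escape every byte outside the assumed safe set as '%' followed by
-- two uppercase hex digits; safe bytes are passed through unchanged.
escapeBytes :: [Word8] -> String
escapeBytes = concatMap esc
  where
    esc :: Word8 -> String
    esc b
      | isSafe b  = [chr (fromIntegral b)]
      | otherwise = printf "%%%02X" (fromIntegral b :: Int)
    isSafe b = b < 0x80 && (isAlphaNum c || c `elem` "-._~/")
      where
        c = chr (fromIntegral b)
```

Because the escaping works on bytes, it round-trips any filename
regardless of whether the bytes are valid in some encoding.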

If the environment variable G_FILENAME_ENCODING is present, it
specifies the encoding of filenames, unless it is @locale which means
the encoding of the locale. If it's not present but G_BROKEN_FILENAMES
is present, filenames are assumed to be in the locale encoding.
If neither variable is present, filenames are assumed to be in UTF-8.
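
The decision described in this paragraph, written out as a function.
The type and function names are made up; only the two environment
variable names come from GNOME. lookupEnv is assumed to be available
from System.Environment, as it is in later versions of GHC's base.

```haskell
import Data.Maybe (isJust)
import System.Environment (lookupEnv)

data FilenameEncoding
  = LocaleEncoding
  | Utf8
  | Named String
  deriving (Eq, Show)

-- First argument: value of G_FILENAME_ENCODING, if set.
-- Second argument: whether G_BROKEN_FILENAMES is set.
chooseEncoding :: Maybe String -> Bool -> FilenameEncoding
chooseEncoding (Just "@locale") _     = LocaleEncoding
chooseEncoding (Just name)      _     = Named name
chooseEncoding Nothing          True  = LocaleEncoding
chooseEncoding Nothing          False = Utf8

-- Read both variables from the process environment.
currentEncoding :: IO FilenameEncoding
currentEncoding = do
  enc    <- lookupEnv "G_FILENAME_ENCODING"
  broken <- lookupEnv "G_BROKEN_FILENAMES"
  return (chooseEncoding enc (isJust broken))
```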

a) Interpreting. If a filename cannot be converted from the selected
   encoding, all non-ASCII bytes are shown as octal numbers preceded
   by the backslash, as hex numbers preceded by the percent sign, or
   as question marks, depending on the situation (I can observe all
   three cases in gedit). What is physically stored is the byte string
   and the file is opened successfully.

b) Creating. If a character cannot be represented, the application
   refuses to save the file until a good filename is entered.

Mozilla
-------

I don't know how it handles filenames internally. From the user's
point of view it matters how it presents a local directory listing.

Filenames are assumed to be in the locale encoding.

If a filename cannot be converted, it's skipped. If it can be
converted but contains characters like 0x80-0x9F in ISO-8859-2,
they are displayed as question marks and the file is inaccessible.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/