[Haskell-cafe] invalid character encoding
Marcin 'Qrczak' Kowalczyk
qrczak at knm.org.pl
Sat Mar 19 13:18:36 EST 2005
Wolfgang Thaller <wolfgang.thaller at gmx.net> writes:
> Also, IIRC, Java strings are supposed to be unicode, too -
> how do they deal with the problem?

Java (Sun)
----------

Filenames are assumed to be in the locale encoding.
a) Interpreting. Bytes which cannot be converted are replaced by U+FFFD.
b) Creating. Characters which cannot be converted are replaced by "?".
Command line arguments and standard I/O are treated in the same way.
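
The two replacement rules above match what the standard java.lang.String conversions do, so they can be reproduced without touching the filesystem. The sketch below is only an illustration: the class and method names are made up, and the charsets are fixed to UTF-8 and ISO-8859-1 (an assumption, the real behaviour depends on the locale encoding) so that the result is reproducible.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any library.
class ReplacementDemo {
    // "Interpreting": new String(byte[], Charset) replaces malformed
    // input with the charset's replacement string, U+FFFD for UTF-8.
    static String interpret(byte[] raw, Charset cs) {
        return new String(raw, cs);
    }

    // "Creating": getBytes(Charset) replaces unmappable characters with
    // the charset's replacement bytes, "?" for ISO-8859-1.
    static byte[] create(String name, Charset cs) {
        return name.getBytes(cs);
    }

    public static void main(String[] args) {
        // 0xB1 on its own is not valid UTF-8.
        byte[] raw = { 'a', (byte) 0xB1, 'b' };
        String s = interpret(raw, StandardCharsets.UTF_8);
        System.out.println(s.equals("a\uFFFDb")); // prints true

        // U+0142 (l with stroke) does not exist in ISO-8859-1.
        byte[] out = create("z\u0142", StandardCharsets.ISO_8859_1);
        String back = new String(out, StandardCharsets.ISO_8859_1);
        System.out.println(back.equals("z?")); // prints true
    }
}
```

Both conversions lose information silently, so a name read from a directory listing is not guaranteed to refer to the same file when it is converted back.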

Java (GNU)
----------

Filenames are assumed to be in Java-modified UTF-8.
a) Interpreting. If a filename cannot be converted, a directory listing
contains a null instead of a string object.
b) Creating. All Java characters are representable in Java-modified UTF-8.
Obviously not all potential filenames can be represented.
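
To see why every Java character is representable, it helps to look at the encoding itself. DataOutputStream.writeUTF is documented to produce Java-modified UTF-8, so it can serve as a stand-in here; whether the runtime described above uses the same routine for filenames is an assumption. The class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical demo class; not part of any library.
class ModifiedUtf8Demo {
    // Encode s as Java-modified UTF-8 using DataOutputStream.writeUTF,
    // dropping the 2-byte length prefix that writeUTF emits first.
    static byte[] modifiedUtf8(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            new DataOutputStream(buf).writeUTF(s);
            byte[] all = buf.toByteArray();
            return Arrays.copyOfRange(all, 2, all.length);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Render bytes as lowercase hex separated by spaces.
    static String hex(byte[] bs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bs) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+0000 gets a two-byte form, so no NUL byte appears in the output.
        System.out.println(hex(modifiedUtf8("\0")));     // prints c0 80
        // An isolated surrogate is encoded like any other 16-bit unit.
        System.out.println(hex(modifiedUtf8("\uD800"))); // prints ed a0 80
    }
}
```

The reverse direction is where the problem lies: a filename containing a byte sequence that is not valid Java-modified UTF-8 (a lone 0xB1, say) has no corresponding string.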
Command line arguments are interpreted according to the locale.
Bytes which cannot be converted are skipped.
Standard I/O works in ISO-8859-1 by default. Obviously all input is
accepted. On output characters above U+00FF are replaced by "?".
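
Skipping unconvertible bytes, as described for command line arguments above, corresponds to a decoder whose error action is "ignore". A minimal sketch using java.nio.charset follows; the class name is hypothetical and UTF-8 stands in for the locale encoding (an assumption). It illustrates the policy and is not taken from any runtime's source.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of any library.
class SkipInvalidDemo {
    // Decode raw, silently dropping every byte sequence that is
    // malformed or unmappable in the given charset.
    static String decodeSkipping(byte[] raw, Charset cs) {
        try {
            return cs.newDecoder()
                     .onMalformedInput(CodingErrorAction.IGNORE)
                     .onUnmappableCharacter(CodingErrorAction.IGNORE)
                     .decode(ByteBuffer.wrap(raw))
                     .toString();
        } catch (CharacterCodingException e) {
            // Not expected when both error actions are IGNORE.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // 0xB1 on its own is not valid UTF-8 and is dropped.
        byte[] raw = { 'x', (byte) 0xB1, 'y' };
        System.out.println(decodeSkipping(raw, StandardCharsets.UTF_8)); // prints xy
    }
}
```

With this policy two different byte strings can decode to the same text, so the program cannot tell that anything was lost.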

C# (mono)
---------

Filenames use the list of encodings from the MONO_EXTERNAL_ENCODINGS
environment variable, with UTF-8 implicitly added at the end. These
encodings are tried in order.
a) Interpreting. If a filename cannot be converted, it's skipped in
a directory listing.
The documentation says that if a filename, a command line argument
etc. looks like valid UTF-8, it is treated as such first, and
MONO_EXTERNAL_ENCODINGS is consulted only in remaining cases.
In practice mono-1.0.5 does not seem to behave this way.
b) Creating. If UTF-8 is used, U+0000 throws an exception
(System.ArgumentException: Path contains invalid chars), paired
surrogates are treated correctly, and an isolated surrogate causes
an internal error:
** ERROR **: file strenc.c: line 161 (mono_unicode_to_external): assertion failed: (utf8!=NULL)
Command line arguments are treated in the same way, except that if an
argument cannot be converted, the program dies at start:
  Cannot determine the text encoding for argument 1 (xxx\xb1\xe6\xea).
  Please add the correct encoding to MONO_EXTERNAL_ENCODINGS and try again.
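
The "try each encoding in order, with UTF-8 last" scheme can be modelled as strict decoding against a list of candidates, stopping at the first one that accepts the whole input. The sketch below is written in Java with hypothetical names and only models the behaviour described above; it is not Mono's code. An empty result corresponds to the cases where the filename is skipped or the program dies at start.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical demo class; not part of any library.
class FallbackDecodeDemo {
    // Return the text produced by the first candidate charset that
    // decodes raw without any error, or empty if none does.
    static Optional<String> decodeFirstMatch(byte[] raw, List<Charset> candidates) {
        for (Charset cs : candidates) {
            CharsetDecoder strict = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
            try {
                return Optional.of(strict.decode(ByteBuffer.wrap(raw)).toString());
            } catch (CharacterCodingException e) {
                // This charset rejected the input; try the next one.
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Stand-in for a user-supplied list with UTF-8 appended at the end.
        List<Charset> order =
            Arrays.asList(StandardCharsets.US_ASCII, StandardCharsets.UTF_8);

        // Valid UTF-8 that is not ASCII: rejected by US-ASCII, accepted by UTF-8.
        byte[] good = "\u017c".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeFirstMatch(good, order).isPresent()); // prints true

        // 0xB1 on its own is valid in neither charset.
        byte[] bad = { 'x', (byte) 0xB1 };
        System.out.println(decodeFirstMatch(bad, order).isPresent());  // prints false
    }
}
```

Note that a single-byte encoding such as ISO-8859-1 accepts every byte string, so placing one in the list means nothing after it is ever tried.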
Console.WriteLine emits UTF-8. Paired surrogates are treated
correctly, unpaired surrogates are converted to pseudo-UTF-8.
Console.ReadLine interprets text as UTF-8. Bytes which cannot be
converted are skipped.
__("< Marcin Kowalczyk
\__/ qrczak at knm.org.pl