[Haskell-cafe] invalid character encoding
Einar Karttunen
ekarttun at cs.helsinki.fi
Sat Mar 19 04:36:56 EST 2005
Wolfgang Thaller <wolfgang.thaller at gmx.net> writes:
> In what way is ISO-2022 non-reversible? Is it possible that a ISO-2022
> file name that is converted to Unicode cannot be converted back any
> more (assuming you know for sure that it was ISO-2022 in the first
> place)?
I am no expert on ISO-2022, so the following may contain errors;
please correct me if it is wrong.
ISO-2022 -> Unicode is always possible.
Unicode -> ISO-2022 should also always be possible, but it is a relation,
not a function. This means there are many (infinitely many?) ways of
encoding a particular Unicode string in ISO-2022.
ISO-2022 works by providing escape sequences to switch between different
character sets. One can use these escapes in almost any way one wishes,
so a round trip ISO-2022 -> Unicode -> ISO-2022 need not give back the
original bytes. ISO-2022 also distinguishes between the same character in
Japanese/Chinese/Korean, which Unicode does not do.
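
A rough sketch of the "relation, not a function" point, as a toy Haskell
decoder. It is not a real ISO-2022 implementation: it only understands the
escape sequence ESC ( B (designate ASCII) and treats every other byte as
ASCII. The names decodeToy, withRedundantEscapes, plain and padded are made
up for this illustration.

```haskell
import Data.Word (Word8)

-- The three bytes of "ESC ( B", which designates ASCII.
toAscii :: [Word8]
toAscii = [0x1B, 0x28, 0x42]

-- Toy decoder: drops every "ESC ( B" and maps all other bytes to the
-- character with the same code. Real ISO-2022 has many more character
-- sets and escape sequences; this is only enough to show the problem.
decodeToy :: [Word8] -> String
decodeToy (0x1B : 0x28 : 0x42 : rest) = decodeToy rest
decodeToy (b : rest) = toEnum (fromIntegral b) : decodeToy rest
decodeToy [] = []

-- Prefix a byte string with n redundant "switch to ASCII" escapes.
-- Every n gives a different byte string with the same decoded text.
withRedundantEscapes :: Int -> [Word8] -> [Word8]
withRedundantEscapes n bytes = concat (replicate n toAscii) ++ bytes

plain, padded :: [Word8]
plain  = [0x61, 0x62]                              -- a b
padded = toAscii ++ [0x61] ++ toAscii ++ [0x62]    -- ESC ( B a ESC ( B b
```

Here plain and padded are different byte strings, yet decodeToy maps both
to the same String, so no encoder can recover the original bytes from the
decoded text alone.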
See here for more info on the topic:
http://www.ecma-international.org/publications/files/ecma-st/ECMA-035.pdf
Also, trusting the system locale for everything is problematic and makes
things quite unbearable for I18N. E.g. on my desktop 95% of things run
with ISO-8859-1, 3% of things use UTF-8 and a few apps use EUC-JP...
Using filenames as opaque blobs causes the least problems. If the
program wishes to display them in a graphical environment then they have
to be converted to a string, but very many apps never display the
filenames...
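
A minimal sketch of what "opaque blob" could look like in Haskell. The
type RawName and the functions hasSuffix and displayLatin1 are hypothetical
names for this illustration, not an existing library API, and ISO-8859-1
is only an assumed display encoding.

```haskell
import Data.Word (Word8)

-- Hypothetical opaque file name: exactly the bytes the OS handed us.
newtype RawName = RawName [Word8]
  deriving (Eq, Ord, Show)

-- A byte-wise operation that needs no knowledge of the encoding.
hasSuffix :: [Word8] -> RawName -> Bool
hasSuffix suf (RawName bs) = suf == drop (length bs - length suf) bs

-- Decode only when the name must be shown to a user. This assumes
-- ISO-8859-1, where each byte maps to the code point with the same
-- number; another encoding would need a real decoder here.
displayLatin1 :: RawName -> String
displayLatin1 (RawName bs) = map (toEnum . fromIntegral) bs
```

Comparing, storing and passing names back to the OS all work on the bytes,
so a wrong guess about the encoding can only affect what is displayed.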
- Einar Karttunen