[Haskell-cafe] File path programme
Glynn Clements
glynn at gclements.plus.com
Sun Jan 30 10:13:52 EST 2005
robert dockins wrote:
> > I don't pretend to fully understand the various unicode standards,
> > but it seems to me that these problems go deeper than a file path
> > library. The equation (decode . encode) /= id seems confusing to me.
> > Can you give me an example of when this happens?
>
> I am pretty sure that ISO 2022 encoded strings can have multiple ways to
> express the same unicode glyphs. This means that any sensible relation
> between ISO 2022 strings and unicode strings maps more than one ISO 2022
> string onto the same unicode string. The inverse is therefore not a
> function. To make it a function one of the possibly several encodings
> of the unicode string will have to be chosen. So you have an ISO 2022
> string A which is decoded to a unicode string U. We reencode U to an
> ISO 2022 string B. It may be that A /= B. That is the problem.
Exactly.
And it isn't a theoretical issue. E.g. in an environment where EUC-JP
is used, filenames may begin with <ESC>$)B (designate JISX0208 to G1),
or they may not (because G1 is assumed to contain JISX0208 initially).
More generally, ISO-2022 strings frequently contain redundant
character-set switching sequences, so conversion to unicode and back
again typically won't yield the original sequence of bytes.
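
To make the round-trip failure concrete, here is a toy sketch in Haskell.
It is not a real ISO-2022 decoder: it assumes the only escape sequence is
ESC $ ) B, treats every other byte as an opaque character, and the names
decode, encode, designateG1, withEscape and withoutEscape are made up for
illustration.

```haskell
import Data.List (stripPrefix)
import Data.Word (Word8)

-- Toy model only: decoded text is kept as opaque payload bytes rather
-- than being mapped to real Unicode code points.
type Decoded = [Word8]

-- ESC $ ) B: designate JIS X 0208 to G1.
designateG1 :: [Word8]
designateG1 = [0x1B, 0x24, 0x29, 0x42]

-- Decoding drops the designation wherever it appears, because G1 is
-- assumed to hold JIS X 0208 already, so the sequence carries no text.
decode :: [Word8] -> Decoded
decode bs = case stripPrefix designateG1 bs of
  Just rest -> decode rest
  Nothing   -> case bs of
    []         -> []
    (b : rest) -> b : decode rest

-- Encoding has to pick one representation. This one never emits the
-- redundant designation.
encode :: Decoded -> [Word8]
encode = id

withoutEscape, withEscape :: [Word8]
withoutEscape = [0xA4, 0xA2]                  -- EUC-JP bytes for HIRAGANA LETTER A
withEscape    = designateG1 ++ withoutEscape  -- same text, redundant prefix

main :: IO ()
main = do
  -- Both byte strings decode to the same text ...
  print (decode withEscape == decode withoutEscape)
  -- ... but re-encoding does not give back the original filename bytes.
  print (encode (decode withEscape) == withEscape)
```

Whichever form encode picks, one of the two filenames cannot be recovered
from its decoded text, so a lookup by the re-encoded name would fail for
that file.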
> The various UTF encodings do not have this particular problem; if a UTF
> string is valid, then it is a unique representation of a unicode string.
Except that there are some ad-hoc extensions, e.g. the UTF-8 variant
used by both Java and Tcl permits NUL characters to be embedded in
NUL-terminated UTF-8 strings by encoding them as the two-byte sequence
0xC0 0x80 (an overlong form, which is invalid in UTF-8 proper).
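
A rough Haskell sketch of the difference, covering only the NUL rule.
Java's variant also treats supplementary characters differently, which
is not modelled here. The function names are hypothetical, and the
overlong NUL form is assumed to be 0xC0 0x80.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Standard UTF-8 encoding of one code point.
-- (Surrogate code points are not rejected, to keep the sketch short.)
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    hi s   = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- The variant: identical, except that U+0000 becomes the overlong
-- two-byte form, so the output never contains a 0x00 byte.
encodeModifiedChar :: Char -> [Word8]
encodeModifiedChar '\0' = [0xC0, 0x80]
encodeModifiedChar c    = encodeUtf8Char c

-- 0xC0 and 0xC1 can only begin overlong two-byte forms, so a strict
-- UTF-8 decoder never accepts them as lead bytes.
isValidLeadByte :: Word8 -> Bool
isValidLeadByte b = b < 0x80 || (b >= 0xC2 && b <= 0xF4)

main :: IO ()
main = do
  print (concatMap encodeUtf8Char "a\0b")      -- contains a raw 0x00 byte
  print (concatMap encodeModifiedChar "a\0b")  -- NUL written as 0xC0 0x80
  print (isValidLeadByte 0xC0)                 -- rejected by strict UTF-8
```

So the same string has two byte representations once the variant is
admitted, and a strict UTF-8 decoder will refuse the second one.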
--
Glynn Clements <glynn at gclements.plus.com>