[Haskell-cafe] File path programme

Glynn Clements glynn at gclements.plus.com
Sun Jan 30 10:13:52 EST 2005


robert dockins wrote:

> > I don't pretend to fully understand the various unicode standards, but
> > it seems to me that these problems are deeper than a file path library.
> > The inequality (decode . encode) /= id seems confusing to me. Can you
> > give me an example of when this happens?
> 
> I am pretty sure that ISO 2022 encoded strings can have multiple ways to
> express the same unicode glyphs.  This means that any sensible relation
> between ISO 2022 strings and unicode strings maps more than one ISO 2022
> string onto the same unicode string.  The inverse is therefore not a
> function.  To make it a function, one of the possibly several encodings
> of the unicode string will have to be chosen.  So you have an ISO 2022
> string A which is decoded to a unicode string U.  We reencode U to an
> ISO 2022 string B.  It may be that A /= B.  That is the problem.

Exactly.

And it isn't a theoretical issue. E.g. in an environment where EUC-JP
is used, filenames may begin with <ESC>$)B (designate JISX0208 to G1),
or they may not (because G1 is assumed to contain JISX0208 initially).

More generally, ISO-2022 strings frequently contain redundant
character-set switching sequences, so conversion to unicode and back
again typically won't yield the original sequence of bytes.
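
To make the round-trip failure concrete, here is a minimal sketch in
Haskell. It is a toy model, not a real ISO-2022 codec: decodeToy,
encodeToy and roundTrips are names made up for illustration, and the
"decoding" simply maps one byte to one Char instead of combining byte
pairs into JISX0208 characters. The only thing it models is that the
decoder discards a redundant designation sequence which the encoder
never puts back.

```haskell
import Data.Char (chr, ord)
import Data.List (stripPrefix)
import Data.Word (Word8)

-- ESC $ ) B: designate JISX0208 to G1 (the sequence named above).
designateG1 :: [Word8]
designateG1 = [0x1B, 0x24, 0x29, 0x42]

-- Toy decoder (hypothetical): drop every designation sequence, since it
-- changes nothing when G1 already holds JISX0208, and map each remaining
-- byte to one Char. A real decoder would combine byte pairs instead.
decodeToy :: [Word8] -> String
decodeToy [] = []
decodeToy bs@(b : rest) =
  case stripPrefix designateG1 bs of
    Just rest' -> decodeToy rest'
    Nothing    -> chr (fromIntegral b) : decodeToy rest

-- Toy encoder (hypothetical): one byte per Char, never emitting a
-- designation sequence.
encodeToy :: String -> [Word8]
encodeToy = map (fromIntegral . ord)

-- True when decoding and re-encoding reproduces the original bytes.
roundTrips :: [Word8] -> Bool
roundTrips bs = encodeToy (decodeToy bs) == bs
```

With plain = [0xA4, 0xA2] and prefixed = designateG1 ++ plain, both
inputs decode to the same String, yet roundTrips plain holds while
roundTrips prefixed does not: the four escape bytes are lost.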

> The various UTF encodings do not have this particular problem; if a UTF 
> string is valid, then it is a unique representation of a unicode string.

Except that there are some ad-hoc extensions, e.g. the UTF-8 variant
used by both Java and Tcl permits NUL characters to be embedded in
NUL-terminated UTF-8 strings by encoding them as the two-byte sequence
0xC0 0x80 (an overlong form, which is invalid in UTF-8 proper).
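
As a rough illustration, the difference can be sketched in Haskell as
below. The function names are made up for this example, and the sketch
covers only the NUL case (the two-byte form being 0xC0 0x80); it makes
no attempt to reproduce anything else the Java or Tcl variants may do
differently, and it does not reject surrogate code points as a strict
UTF-8 encoder would.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Standard UTF-8 encoding of a single Char (shortest form only).
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading-byte payload: the bits above the given shift.
    hi s = fromIntegral (n `shiftR` s)
    -- Continuation byte: 10xxxxxx carrying six bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

-- Variant (hypothetical name): U+0000 becomes the overlong pair C0 80,
-- so the output never contains a zero byte.
encodeModifiedChar :: Char -> [Word8]
encodeModifiedChar '\0' = [0xC0, 0x80]
encodeModifiedChar c    = encodeUtf8Char c

encodeModified :: String -> [Word8]
encodeModified = concatMap encodeModifiedChar
```

Here encodeUtf8Char '\0' gives the single byte 0x00, whereas
encodeModified "a\0b" gives 0x61 0xC0 0x80 0x62. A decoder that accepts
both forms then has two byte sequences for the same character, which is
exactly the kind of non-uniqueness that UTF-8 proper rules out.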

-- 
Glynn Clements <glynn at gclements.plus.com>

