[Haskell] System.FilePath survey
Wolfgang Thaller
wolfgang.thaller at gmx.net
Wed Feb 8 14:03:01 EST 2006
Ben Rudiak-Gould wrote:
> The point is that different things are natively handled in
> different formats under different OSes, e.g.
>
>                  Posix    NT       Win9x
>
>   pathnames      bytes    UTF-16   locale
>   command line   bytes    UTF-16   locale
>   file contents  bytes    bytes    bytes
>   pipes/sockets  bytes    bytes    bytes
Add to that:
                 Mac OS X

  pathnames      UTF-8
  command line   UTF-8
It's POSIX (or mostly POSIX), but path names are always guaranteed to
be encoded in UTF-8. For the default file system type, HFS+, names
are actually stored on disk as UTF-16. Arbitrary strings of bytes are
not allowed.
For POSIX systems, I'd also like to observe the following:
1) Widely used languages and libraries like Java and GTK+ assume that
all file names and command lines are encoded in the system locale, or
at least that they can all be converted to Unicode strings.
2) Command lines are usually entered as TEXT on a terminal and are
therefore encoded in whatever encoding the terminal uses.
3) None of the recent Linux distributions I have installed set up
anything but a UTF-8 based system.
So I think we should try hard to avoid introducing any additional
complexity, like filename ADTs used for program arguments, just to
deal with the small minority of systems where file names cannot be
converted to Unicode. Maybe it's possible to use some user-defined
(private-use) Unicode code points to achieve a lossless conversion of
arbitrary byte strings to Unicode? I mean, byte strings that are valid
in the system encoding would get transcoded correctly, and invalid
bytes would get mapped to some extra code points so that they can be
converted back if necessary.
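
A minimal sketch of such a scheme, assuming the system encoding is
UTF-8. The names decodeLossless, encodeLossless and escapeBase are
hypothetical, and the private-use range U+EF80..U+EFFF is an arbitrary
choice made for illustration only. To keep the conversion reversible,
a well-formed sequence that happens to encode one of the escape code
points is itself treated as invalid and escaped byte by byte.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Assumption: an undecodable byte b (always >= 0x80) is escaped as the
-- private-use code point U+EF00 + b, i.e. somewhere in U+EF80..U+EFFF.
escapeBase :: Int
escapeBase = 0xEF00

isEscape :: Int -> Bool
isEscape c = c >= escapeBase + 0x80 && c <= escapeBase + 0xFF

-- Decode UTF-8; every byte that is not part of a well-formed sequence
-- is mapped to its escape code point instead of being rejected.
decodeLossless :: [Word8] -> String
decodeLossless [] = []
decodeLossless (b : bs) =
  case sequenceAt b bs of
    Just (c, rest) -> chr c : decodeLossless rest
    Nothing        -> chr (escapeBase + fromIntegral b) : decodeLossless bs

-- Try to read one well-formed UTF-8 sequence whose lead byte is b.
sequenceAt :: Word8 -> [Word8] -> Maybe (Int, [Word8])
sequenceAt b bs
  | b < 0x80              = Just (fromIntegral b, bs)
  | b >= 0xC2 && b < 0xE0 = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b < 0xF0 = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b < 0xF5 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise             = Nothing
  where
    multi n lead minCp =
      let (cont, rest) = splitAt n bs
          cp = foldl (\acc x -> (acc `shiftL` 6) .|. (fromIntegral x .&. 0x3F))
                     lead cont
          ok = length cont == n
               && all (\x -> x .&. 0xC0 == 0x80) cont     -- continuation bytes
               && cp >= minCp && cp <= 0x10FFFF           -- no overlong forms
               && not (cp >= 0xD800 && cp <= 0xDFFF)      -- no surrogates
               && not (isEscape cp)                       -- keep escapes unambiguous
      in if ok then Just (cp, rest) else Nothing

-- Encode back to bytes; escape code points become the original raw byte.
encodeLossless :: String -> [Word8]
encodeLossless = concatMap (enc . ord)
  where
    enc c
      | isEscape c  = [fromIntegral (c - escapeBase)]
      | c < 0x80    = [fromIntegral c]
      | c < 0x800   = [0xC0 .|. hi 6, cont 0]
      | c < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        hi n   = fromIntegral (c `shiftR` n)
        cont n = 0x80 .|. (fromIntegral (c `shiftR` n) .&. 0x3F)
```

With these definitions, encodeLossless (decodeLossless bs) should give
back bs for any byte string, while valid UTF-8 names still decode to
the expected text. The price is that the chosen private-use code points
can no longer carry their own meaning in file names.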
Cheers,
Wolfgang