[Haskell] System.FilePath survey

Wolfgang Thaller wolfgang.thaller at gmx.net
Wed Feb 8 14:03:01 EST 2006


Ben Rudiak-Gould wrote:

> The point is that different things are natively handled in  
> different formats under different OSes, e.g.
>
>                  Posix       NT             Win9x
>
> pathnames        bytes       UTF-16         locale
> command line     bytes       UTF-16         locale
> file contents    bytes       bytes          bytes
> pipes/sockets    bytes       bytes          bytes

Add to that:

                 Mac OS X

pathnames        UTF-8
command line     UTF-8

It's POSIX (or mostly POSIX), but path names are always guaranteed to
be encoded in UTF-8. On the default file system type, HFS+, they are
actually stored on disk as UTF-16. Arbitrary strings of bytes are not
allowed.

For POSIX systems, I'd also like to observe the following:

1) Widely used languages and libraries like Java and GTK+ assume that
all file names and command lines are encoded in the system locale, or
at least that they can all be converted to Unicode strings.

2) Command lines are usually entered as TEXT on a terminal and are  
therefore encoded in whatever encoding the terminal uses.

3) None of the recent Linux distributions I have installed set up
anything but a UTF-8 based system.

So I think we should try hard to avoid introducing any additional
complexity, such as filename ADTs for program arguments, just to deal
with the small minority of systems where file names cannot be
converted to Unicode. Maybe it's possible to use some user-defined
(private-use) Unicode code points to achieve a lossless conversion of
arbitrary byte strings to Unicode? I mean, byte sequences that are
valid in the system encoding would get transcoded correctly, and
invalid bytes would get mapped to some extra code points so that they
can be converted back if necessary.
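
To make the idea concrete, here is a minimal sketch of such a
conversion. It assumes the system encoding is UTF-8 and that the
private-use range U+EF80..U+EFFF is reserved for escaped bytes; that
range, and the names decodeLossless and encodeLossless, are made up
for illustration and are not part of any existing library. To keep the
round trip exact, input that is already a valid UTF-8 encoding of a
code point in the reserved range is treated as invalid and escaped
byte by byte.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Assumption: bytes that cannot be decoded are mapped to
-- U+EF80..U+EFFF (escapeBase + byte). The choice of range is arbitrary.
escapeBase :: Int
escapeBase = 0xEF00

isEscape :: Int -> Bool
isEscape c = c >= escapeBase + 0x80 && c <= escapeBase + 0xFF

escapeByte :: Word8 -> Char
escapeByte b = chr (escapeBase + fromIntegral b)

isCont :: Word8 -> Bool
isCont b = b .&. 0xC0 == 0x80

-- Decode UTF-8; every byte that is not part of a valid sequence
-- becomes one escape code point.
decodeLossless :: [Word8] -> String
decodeLossless [] = []
decodeLossless (b : rest)
  | b < 0x80 = chr (fromIntegral b) : decodeLossless rest
  | otherwise =
      case multiByte b rest of
        Just (c, rest') -> chr c : decodeLossless rest'
        Nothing         -> escapeByte b : decodeLossless rest

-- Try to read one multi-byte sequence starting with lead byte b.
-- Rejects truncated and overlong forms, surrogates, values above
-- U+10FFFF, and code points inside the reserved escape range.
multiByte :: Word8 -> [Word8] -> Maybe (Int, [Word8])
multiByte b rest
  | b .&. 0xE0 == 0xC0 = go 1 0x80    (fromIntegral (b .&. 0x1F))
  | b .&. 0xF0 == 0xE0 = go 2 0x800   (fromIntegral (b .&. 0x0F))
  | b .&. 0xF8 == 0xF0 = go 3 0x10000 (fromIntegral (b .&. 0x07))
  | otherwise          = Nothing
  where
    go n minCp lead =
      let (conts, rest') = splitAt n rest
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. fromIntegral (c .&. 0x3F))
                     lead conts
          ok = length conts == n && all isCont conts
               && cp >= minCp && cp <= 0x10FFFF
               && not (cp >= 0xD800 && cp <= 0xDFFF)
               && not (isEscape cp)
      in if ok then Just (cp, rest') else Nothing

-- Encode back to bytes; escape code points turn into the single
-- byte they stand for, everything else is encoded as UTF-8.
encodeLossless :: String -> [Word8]
encodeLossless = concatMap enc
  where
    enc ch
      | isEscape c  = [fromIntegral (c - escapeBase)]
      | c < 0x80    = [fromIntegral c]
      | c < 0x800   = [0xC0 .|. hi 6, cont 0]
      | c < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
      | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
      where
        c      = ord ch
        hi s   = fromIntegral (c `shiftR` s)
        cont s = 0x80 .|. (fromIntegral (c `shiftR` s) .&. 0x3F)
```

With this, encodeLossless (decodeLossless bs) gives back bs for any
byte string bs, while ordinary UTF-8 file names decode to the strings
one would expect. The price is that a String containing an escape code
point no longer encodes to well-formed UTF-8, which is exactly what is
needed to name such a file again.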

Cheers,

Wolfgang

