Proposal #3456: Add FilePath -> String decoder

Yitzchak Gale gale at sefer.org
Wed Aug 26 09:14:54 EDT 2009


Johan Tibell wrote:
> Perhaps the only solution is to have
> System.FilePath.Posix.toString and System.FilePath.Windows.toString
> with different type signatures.

I'm not sure there's any point. As Duncan pointed out,
we are not just talking about the file system, we are
talking about interaction between the file system and
a user interface - how file paths should appear to
users. So it also depends on what UI you are using.

For example, GTK2 on Unix always uses UTF-8
to display file paths no matter what the current locale -
unless you've set a certain environment variable.

Most X terminals display file paths using the current
locale.

I'm not sure what the current situation is in Qt.

On Mac OS X, HFS+ stores file names as UTF-16, and
file paths in POSIX calls are interpreted as UTF-8. But
names are stored in a canonical Unicode form, so the
actual file path might not be the same as what you
provided if it includes combining characters.
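
To illustrate the combining-character issue, here is a minimal
sketch. It only shows that two spellings of the same visible name
are different Haskell Strings; which form a given file system
actually stores is an assumption to be checked per platform. The
names `precomposed`, `decomposed` and `sameAsStrings` are made up
for this example.

```haskell
-- Two spellings of the same visible name "café".
-- precomposed uses the single code point U+00E9.
-- decomposed uses 'e' followed by U+0301 (combining acute accent).
precomposed, decomposed :: String
precomposed = "caf\xE9"
decomposed  = "cafe\x0301"

-- Plain String equality compares code points, so the two differ
-- even though a UI would render them identically.
sameAsStrings :: Bool
sameAsStrings = precomposed == decomposed
```

If a file system canonicalizes names, a path created from one
spelling may be read back in the other, and a naive (==) on the
decoded Strings would then report a mismatch.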

I think that Windows also converts the file path
to (some kind of) canonical Unicode in the presence of
combining characters.

So we should probably add stringToFilePath as well -
encode on vanilla POSIX, canonicalize and
encode on Mac OS X, canonicalize on Windows.
We need to research exactly which canonical form
is used on each platform. Unfortunately, that may
depend upon the file system. Also, based on past
experience, I fear that on Windows "canonical"
may mean something different than anything
published.
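
As a rough sketch of the "encode" half of stringToFilePath on
vanilla POSIX, the following uses only base. It assumes a FilePath
is a String whose Chars each carry one raw byte, and it deliberately
does no canonicalization, since the right form per platform is still
an open question. The signature is a guess, not something taken from
Judah's patch.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- | Sketch only: UTF-8 encode Unicode text into a byte-per-Char
-- FilePath. No Unicode canonicalization is performed here.
stringToFilePath :: String -> FilePath
stringToFilePath = concatMap (map chr . encodeChar . sanitize . ord)
  where
    -- Surrogate code points are not valid in UTF-8; substitute U+FFFD.
    sanitize cp
      | cp >= 0xD800 && cp <= 0xDFFF = 0xFFFD
      | otherwise                    = cp

    -- Emit 1 to 4 bytes depending on the size of the code point.
    encodeChar cp
      | cp < 0x80    = [cp]
      | cp < 0x800   = [0xC0 .|. (cp `shiftR` 6), cont cp]
      | cp < 0x10000 = [ 0xE0 .|. (cp `shiftR` 12)
                       , cont (cp `shiftR` 6), cont cp ]
      | otherwise    = [ 0xF0 .|. (cp `shiftR` 18)
                       , cont (cp `shiftR` 12)
                       , cont (cp `shiftR` 6), cont cp ]

    -- Continuation byte: 10xxxxxx carrying the low six bits.
    cont x = 0x80 .|. (x .&. 0x3F)
```

On Mac OS X and Windows a canonicalization step would have to run
before (or instead of) this encoding; that step is left out here
because it needs a Unicode normalization library and a decision on
which form to use.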

I am now beginning to lean towards Ketil's suggestion
that on POSIX platforms we should always use
UTF-8. We then need a prominent warning in the
documentation that if you need something else,
like the current locale, decode it yourself.
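
For concreteness, an always-UTF-8 decoder might look roughly like
the sketch below. It uses only base and assumes a FilePath is a
String whose Chars each carry one raw byte (0..255). The name
`decodeFilePathUtf8` is hypothetical, and replacing malformed input
with U+FFFD is just one possible policy, not something the patch
prescribes.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)

-- | Sketch only: decode a byte-per-Char FilePath as UTF-8.
-- Malformed sequences (and Chars above 255) become U+FFFD.
decodeFilePathUtf8 :: FilePath -> String
decodeFilePathUtf8 = go . map ord
  where
    go [] = []
    go (b : bs)
      | b < 0x80               = chr b : go bs
      | b >= 0xC2 && b <= 0xDF = multi 1 (b .&. 0x1F) 0x80 bs
      | b >= 0xE0 && b <= 0xEF = multi 2 (b .&. 0x0F) 0x800 bs
      | b >= 0xF0 && b <= 0xF4 = multi 3 (b .&. 0x07) 0x10000 bs
      | otherwise              = replacement : go bs

    -- Read n continuation bytes after a lead byte. Reject truncated,
    -- overlong, surrogate and out-of-range sequences.
    multi n lead minCp bs =
      let (conts, rest) = splitAt n bs
          cp = foldl (\acc c -> (acc `shiftL` 6) .|. (c .&. 0x3F)) lead conts
          ok = length conts == n
               && all isCont conts
               && cp >= minCp
               && cp <= 0x10FFFF
               && not (cp >= 0xD800 && cp <= 0xDFFF)
      in if ok
           then chr cp : go rest
           else replacement : go bs  -- resume right after the lead byte

    isCont c = c >= 0x80 && c <= 0xBF
    replacement = '\xFFFD'
```

Anyone who needs the current locale instead would skip this function
and decode the raw bytes themselves, which is the point of the
documentation warning.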

Note that it is becoming increasingly rare for people
to use non-UTF-8 locales anywhere in the world,
and even then the locale is likely ignored by many UIs.
So I'm inclined against cluttering the API with
convenience functions for other encodings, as Johan
is suggesting.

As a way forward - I propose:

1. Accept Judah's patch, modified to always use UTF-8.

2. Add strident warnings in the documentation that:

   o If you need a different encoding on POSIX, do it
     yourself.

   o If a FilePath does not come from the file
     system, it may not match the actual file path used
     in the file system due to Unicode canonicalization.

3. Open a feature request for stringToFilePath as
   described above.

Regards,
Yitz

