[Haskell-cafe] Encoding-aware System.Directory functions
John Millikin
jmillikin at gmail.com
Thu Mar 31 05:51:32 CEST 2011
On Wednesday, March 30, 2011 9:07:45 AM UTC-7, Michael Snoyman wrote:
>
> Thanks to you (and everyone else) for the informative responses. For
> now, I've simply hard-coded in UTF-8 encoding for all non-Windows
> systems. I'm not sure how this will play with OSes besides Windows and
> Linux (especially Mac), but it's a good stop-gap measure.
>
Linux, OSX, and (probably?) FreeBSD use UTF-8. It's *possible* for a Linux
file path to contain arbitrary bytes, but every application I've ever seen
just gives up and writes [[invalid character]] symbols when confronted with
such.
OSX's chief weirdness is that its GUI programs swap ':' and '/' when
displaying filenames. So the file "hello:world.txt" will show up as
"hello/world.txt" in Finder. It also performs Unicode normalization on your
filenames, which is mostly harmless but can have unexpected results on
unicode-naïve applications like rsync. I don't know how its normalization
interacts with invalid file paths, or whether it even allows such paths to
be written.
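
A minimal sketch of why normalization can confuse byte-oriented tools, assuming the "text" and "bytestring" packages. The names `composed`, `decomposed`, and `utf8Bytes` are hypothetical and chosen only for illustration. The same visible name "café" is spelled with a precomposed character (NFC) and with a combining accent (NFD); the two spellings encode to different UTF-8 bytes, so a program comparing raw bytes treats them as different files.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- "café" with a single precomposed U+00E9 (NFC form).
composed :: T.Text
composed = T.pack "caf\x00E9"

-- "café" as 'e' followed by combining acute accent U+0301 (NFD form).
decomposed :: T.Text
decomposed = T.pack "cafe\x0301"

-- The bytes a UTF-8 filesystem would actually store for a name.
utf8Bytes :: T.Text -> [Word8]
utf8Bytes = B.unpack . TE.encodeUtf8
```

Comparing `utf8Bytes composed` with `utf8Bytes decomposed` shows the mismatch; making the two equal would require a real normalization library, which this sketch deliberately leaves out.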
Windows's weirdness is its multi-root filesystem, and also that it
distinguishes between absolute and non-relative paths. The Windows path
"/foo.txt" is *not* absolute and *not* relative. I've never been able to
figure out how Windows does Unicode; it seems to have a half-dozen APIs for
it, all subtly different, and not a single damn one displays anything but
"???????.txt" when I download anything east-Asian.
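
A rough sketch of the path categories described above, using only "base". The `PathKind` type and the `classify` function are hypothetical names and are not part of "filepath" or system-filepath. It assumes the usual Windows conventions: a drive letter plus separator, or a UNC prefix, makes a path fully qualified, while a lone leading separator only roots the path on whatever the current drive happens to be.

```haskell
import Data.Char (isAsciiLower, isAsciiUpper)

-- Hypothetical classification of Windows path strings.
data PathKind
  = FullyQualified      -- "C:\foo.txt" or "\\server\share\foo.txt"
  | CurrentDriveRooted  -- "/foo.txt": rooted, but the drive is implicit
  | DriveRelative       -- "C:foo.txt": relative to the cwd of drive C
  | Relative            -- "foo\bar.txt"
  deriving (Eq, Show)

-- Windows accepts both separators.
isSep :: Char -> Bool
isSep c = c == '\\' || c == '/'

isDriveLetter :: Char -> Bool
isDriveLetter c = isAsciiUpper c || isAsciiLower c

classify :: String -> PathKind
classify path = case path of
  (a : b : _) | isSep a && isSep b                -> FullyQualified
  (d : ':' : s : _) | isDriveLetter d && isSep s  -> FullyQualified
  (d : ':' : _) | isDriveLetter d                 -> DriveRelative
  (s : _) | isSep s                               -> CurrentDriveRooted
  _                                               -> Relative
```

Under these assumptions `classify "/foo.txt"` yields `CurrentDriveRooted`, which is the "neither absolute nor relative" case: the path does not depend on the current directory, but it does depend on the current drive.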
> I *do* think it would be incredibly useful to provide alternatives to
> all the standard operations on FilePath which used opaque datatypes
> and properly handles filename encoding. I noticed John Millikin's
> system-filepath package[1]. Do people have experience with it? It
> seems that adding a few functions like getDirectoryContents, plus
> adding a version of toString which performs some character decoding,
> would get us pretty far.
>
system-filepath grew out of my frustration with the somewhat bizarre behavior of
some functions in "filepath"; I designed it to match the Python os.path API
pretty closely. I don't think it has any client code outside of my ~/bin ,
so changing its API radically shouldn't cause any drama.
I'd prefer filesystem manipulation functions be put in a separate library
(perhaps "system-directory"?), to match the current filepath/directory
split.
If it's to contain encoding-aware functions, I think they should be
Text-only. The existing String-based functions are just to interact with legacy
functions in System.IO, and should be either renamed to "toChar8/fromChar8"
or removed entirely. My vote goes to the second -- if someone needs Char8
strings, they can convert from the ByteString version explicitly.
--------------------------------------
-- | Try to decode a FilePath to Text, using the current locale encoding. If
-- the filepath is invalid in the current locale, it is decoded as ASCII and
-- any non-ASCII bytes are replaced with a placeholder.
--
-- The returned text is useful only for display to the user. It might not be
-- possible to convert back to the same or any 'FilePath'.
toText :: FilePath -> Text
-- | Try to encode Text to a FilePath, using the current locale encoding. If
-- the text cannot be represented in the current locale, returns 'Nothing'.
fromText :: Text -> Maybe FilePath
--------------------------------------
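
One way the two signatures above could be filled in is sketched below. This is not the system-filepath implementation. It assumes the "text" and "bytestring" packages, uses a hypothetical `RawPath` newtype over raw bytes in place of the library's `FilePath`, and hard-codes UTF-8 instead of consulting the current locale. Because every Text value can be encoded as UTF-8, the only rejection rule shown for `fromText` is an embedded NUL, which POSIX paths cannot contain; a locale-aware version would have more failure cases.

```haskell
import qualified Data.ByteString as B
import           Data.Text (Text)
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import           Data.Word (Word8)

-- Hypothetical stand-in for an opaque, byte-based file path type.
newtype RawPath = RawPath B.ByteString
  deriving (Eq, Show)

-- Decode for display only. Assumption: UTF-8 rather than the locale.
-- If the bytes are not valid UTF-8, fall back to ASCII and replace
-- every non-ASCII byte with U+FFFD, so the result may not round-trip.
toText :: RawPath -> Text
toText (RawPath bytes) =
  case TE.decodeUtf8' bytes of
    Right t -> t
    Left _  -> T.pack (map asciiOrPlaceholder (B.unpack bytes))
  where
    asciiOrPlaceholder :: Word8 -> Char
    asciiOrPlaceholder w
      | w < 0x80  = toEnum (fromIntegral w)
      | otherwise = '\xFFFD'

-- Encode Text to a path. Assumption: UTF-8 target, so the only
-- unrepresentable input in this sketch is text containing NUL.
fromText :: Text -> Maybe RawPath
fromText t
  | T.any (== '\NUL') t = Nothing
  | otherwise           = Just (RawPath (TE.encodeUtf8 t))
```

Keeping the raw bytes inside the path type and decoding only at the display boundary is what lets invalid names survive a round trip through the program even when `toText` has to substitute placeholders.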