[Haskell-cafe] invalid character encoding
wolfgang.thaller at gmx.net
Sat Mar 19 18:56:17 EST 2005
>> Also, IIRC, Java strings are supposed to be unicode, too -
>> how do they deal with the problem?
> Files are represented by instances of the File class:
> The documentation for the File class doesn't mention encoding issues
> at all.
... which led me to conclude that they don't deal with the problem
>> I think that if we wait long enough, the filename encoding problems
>> will become irrelevant and we will live in an ideal world where
>> actually works. Maybe next year, maybe only in ten years.
> Maybe not even then. If Unicode really solved encoding problems, you'd
> expect the CJK world to be the first adopters, but they're actually
> the least eager; you are more likely to find UTF-8 in an
> English-language HTML page or email message than a Japanese one.
Hmm, that's possibly because English-language users can get away with
just marking their ASCII files as UTF-8. But I'm not arguing files or
HTML pages here, I'm only concerned with filenames. I prefer unicode
nowadays because I was born within a hundred kilometers of the "border"
between ISO-8859-1 and ISO-8859-2. I need 8859-1 for German-language
texts, but as soon as I write about where I went for vacation, I need a
few 8859-2 characters. So 8-bit encodings didn't cut it, and nobody
ever tried to sell ISO-2022 to me, so unicode was the only alternative.
So you've now convinced me that there is a considerable number of
computers using ISO-2022, where there's more than one way to encode the
same text (how do people use this from the command line??). There are
also multi-user systems where the users don't agree on a single
encoding. I still reserve the right to call those systems messed-up,
but that's just my personal opinion and "reality" couldn't care less
about what I think.
So, as I don't want to stick with the status quo forever (lists of
bytes that pretend to be lists of unicode chars, even on platforms
where unicode is used anyway), how about we get to work - what do we
I don't think we want a type class here, a plain (abstract) data type
> data File
Obviously, we'll need conversion from and to C strings. On Mac OS X,
they'd be guaranteed to be in UTF-8.
> withFilePathCString :: String -> (CString -> IO a) -> IO a
> fileFromCString :: CString -> IO File
We will need functions for converting to and from unicode strings. I'm
pretty sure that we want to keep those functions pure, otherwise
they'll be very annoying to use.
> fileFromPath :: String -> File
Any impure operations that might be needed to decide how to encode the
file name will have to be delayed until the File is actually used.
> fileToPath :: File -> String
Same here: any impure operation necessary to convert the File to a
unicode string needs to be done when the File value is created.
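
To make the idea above a bit more concrete, here is a rough sketch of how
such an abstract type could look. Everything in it is an assumption for
illustration only: the two-constructor representation, the name
withFileCString (a variant of the C string function that takes a File
rather than a String), and the choice of UTF-8 as the system encoding (as
on Mac OS X). It only uses modules from base.

```haskell
import Foreign.C.String (CString)
import Foreign.C.Types (CChar)
import Foreign.Marshal.Array (peekArray0, withArray0)
import qualified GHC.Foreign as GF
import System.IO (utf8)

-- Hypothetical representation; the constructors would not be exported.
data File
  = FromUnicode String         -- built by fileFromPath, not yet encoded
  | FromSystem [CChar] String  -- raw bytes from the OS plus their decoding

-- Pure: any encoding work is delayed until the File is actually used.
fileFromPath :: String -> File
fileFromPath = FromUnicode

-- Pure: the decoding was already done when the File was created.
fileToPath :: File -> String
fileToPath (FromUnicode s)  = s
fileToPath (FromSystem _ s) = s

-- Impure: keeps the original bytes and decodes them right away.
-- Assumes UTF-8; invalid input raises an IO exception here.
fileFromCString :: CString -> IO File
fileFromCString cstr = do
  raw <- peekArray0 0 cstr
  str <- GF.peekCString utf8 cstr
  pure (FromSystem raw str)

-- Impure: encodes only now, or hands back the untouched original bytes.
withFileCString :: File -> (CString -> IO a) -> IO a
withFileCString (FromUnicode s)    act = GF.withCString utf8 s act
withFileCString (FromSystem raw _) act = withArray0 0 raw act
```

Keeping the raw bytes for names that came from the system means that such a
File can be passed back to the OS unchanged, whatever happened during
decoding.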
What about failure? If you go from String to File, errors should be
reported when you actually access the file. At an earlier time, you
can't know whether the file name is valid (e.g. if you mount a
"classic" HFS volume on Mac OS X, you can only create files there whose
names can be represented in the volume's file name encoding - but you
only find that out once you try to create a file).
For going from File to String, I'm not so sure, but I would be very
annoyed if I had to deal with a Maybe String return type on platforms
where it will always succeed. Maybe there should be separate functions
for different purposes - i.e. for display, you'd use a File -> String
function that will silently use '?'s when things can't be decoded, but
in other situations you might use a File -> Maybe String function and
check for Nothing.
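
The two flavours of decoding could be sketched roughly like this. The
function names fileToDisplayString and fileToPathMaybe are made up for this
example, File is simplified to a wrapper around the raw bytes, UTF-8 is
assumed as the file name encoding, and the bytestring and text packages
that ship with GHC are assumed to be available.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8', decodeUtf8With)
import Data.Text.Encoding.Error (replace)

-- Simplified, hypothetical File: just the raw bytes of the name.
newtype File = File B.ByteString

-- For display: never fails, undecodable bytes become '?'.
fileToDisplayString :: File -> String
fileToDisplayString (File raw) =
  T.unpack (decodeUtf8With (replace '?') raw)

-- For everything else: Nothing if the bytes are not valid UTF-8.
fileToPathMaybe :: File -> Maybe String
fileToPathMaybe (File raw) =
  either (const Nothing) (Just . T.unpack) (decodeUtf8' raw)
```

On a platform where decoding always succeeds, the Maybe variant would
always return Just, so code that only wants something to show to the user
can stick with the display function.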
If people want to implement more sophisticated ways of decoding file
names than can be provided by the library, they'd get the C string and
do the decoding themselves.
Of course, there should also be lots of other useful functions that
make it more or less unnecessary to deal with path names directly in