[Haskell-cafe] Re: Unicode strings and runCommand / runProcess

S. Astanin s.astanin at gmail.com
Mon Apr 26 07:32:51 EDT 2010


> > >> Actually, the behavior of openFile when given a String with
> > >> characters > 0xFF is also completely undocumented.  I am not sure what
> > >> it does with that.  It should probably be the same as runCommand,
> > >> whatever it is.

Actually, the behaviour of openFile is known to be platform-dependent.
According to Simon Marlow,

> Be careful with FilePaths. On Windows they are interpreted as Unicode,
> on Unix they are interpreted as [Word8], by taking the low 8 bits of
> each Char. So if you always encode FilePaths to UTF-8, that will break
> on Windows. Fixing FilePaths is a high priority.

The last sentence gives me some hope.

http://ghcmutterings.wordpress.com/2009/09/30/heads-up-what-you-need-to-know-about-unicode-io-in-ghc-6-12-1/#comment-61
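
A minimal sketch of the difference between the two interpretations,
assuming nothing beyond base. The names truncateToBytes and encodeUtf8
are made up for illustration and are not library functions; the encoder
is hand-written and does not reject surrogate code points.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Keep only the low 8 bits of each code point, which is the Unix
-- behaviour described in the quote above.
truncateToBytes :: String -> [Word8]
truncateToBytes = map (fromIntegral . ord)

-- Hand-written UTF-8 encoder, for comparison only.
encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeChar

encodeChar :: Char -> [Word8]
encodeChar c
  | n < 0x80    = [byte n]
  | n < 0x800   = [0xC0 .|. byte (n `shiftR` 6), cont n]
  | n < 0x10000 = [0xE0 .|. byte (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = [0xF0 .|. byte (n `shiftR` 18), cont (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  where
    n = ord c

-- Narrow an Int that is already known to fit into one byte.
byte :: Int -> Word8
byte = fromIntegral

-- Build a UTF-8 continuation byte from the low 6 bits of x.
cont :: Int -> Word8
cont x = 0x80 .|. byte (x .&. 0x3F)
```

For the Cyrillic letter U+043F, truncateToBytes yields the single byte
0x3F (an ASCII question mark), while encodeUtf8 yields 0xD0 0xBF.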

> But truncation makes it impossible to pass non-ASCII strings portably. They
> should be encoded, but there is no easy way to do so.
>
> Actually, the problem is the use of strings. A String is a sequence of
> _characters_, and a program talks to the outside world using sequences of
> bytes. I think that the right (but impossible) way to solve this problem is
> to use separate data types for file paths and command line arguments.

I think that Strings should be used _only_ for characters (code points).
Using the same data type for encoded or truncated data is dangerous. Most
of the current problems with Unicode are due to the fact that a String
could turn out to be anything (Strings are not strictly typed from this
point of view). Hence the runtime checks and hacks like
isUTF8Encoded :: String -> Bool, encodeString :: String -> String and
decodeString :: String -> String...
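
To illustrate what a stricter typing could look like, here is a small
sketch. RawPath and encodeLatin1 are hypothetical names, not part of any
library, and Latin-1 is chosen only to keep the example short.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical wrapper for bytes that are ready to be handed to the
-- operating system. A String can never be mistaken for one of these.
newtype RawPath = RawPath [Word8]
  deriving (Eq, Show)

-- Encode as Latin-1, refusing (instead of truncating) any character
-- that does not fit into a single byte.
encodeLatin1 :: String -> Maybe RawPath
encodeLatin1 = fmap RawPath . traverse toByte
  where
    toByte c
      | ord c <= 0xFF = Just (fromIntegral (ord c))
      | otherwise     = Nothing
```

With such a type, encoding twice is a type error rather than something
to detect at run time, because encodeLatin1 does not accept a RawPath.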

So I absolutely agree that truncating is wrong. Expecting encoded data in
Strings is wrong too. So the only option (apart from changing the standard
library and introducing a new type for FilePath) is to do all necessary
conversions inside openFile and similar functions.

> I think there are two alternatives. One is to encode/decode strings using the
> current locale and provide [Word8]-based variants. The main problem is that
> seemingly innocent actions like getting directory contents could crash the
> program (with an exception).

Actually, any IO action is unpredictable. Trying to get directory contents
can produce an error for various reasons (e.g. permission denied). If it
reports an error when there are filenames that cannot be represented in the
current locale (e.g. they contain invalid UTF-8 sequences in a UTF-8
locale), the problem is likely to be wrong locale settings. What's the
problem with an exception?

I think [Word8] variants, for those who want to deal with such cases
(guessing the file system encoding etc.), are enough.
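
As a sketch of how a caller could cope with such an exception, whatever
its cause, using only Control.Exception and System.Directory. The name
tryListDirectory is made up for this example, and whether a particular
GHC version actually raises an IOException for undecodable file names is
an assumption here, not something this code demonstrates.

```haskell
import Control.Exception (IOException, try)
import System.Directory (getDirectoryContents)

-- Turn any IOException raised while listing a directory into a Left
-- value, so the caller has to decide what to do about it.
tryListDirectory :: FilePath -> IO (Either IOException [FilePath])
tryListDirectory = try . getDirectoryContents
```

The caller can then pattern match on the result and report a Left to the
user, or fall back to a byte-oriented variant if one is available.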

> Another option is to provide functions to encode/decode strings. This is ugly,
> mixes strings which hold characters with strings which hold bytes, and is
> completely unhaskellish, but it seems there is no good solution.

This is ugly because it's impossible to know whether a String is already
encoded or not. It is also ugly because application code will be polluted
with conditional compilation to be cross-platform (or worse, people will
forget to write cross-platform code in _some_ cases).
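
A tiny sketch of the first point. The names genuineText, encodedText and
indistinguishable are invented for this example.

```haskell
import Data.Word (Word8)

-- A genuine two-character text: U+00C3 followed by U+00A9.
genuineText :: String
genuineText = "\195\169"

-- The UTF-8 encoding of the single character U+00E9, stored one byte
-- per Char, as an encodeString-style function would return it.
encodedText :: String
encodedText = map (toEnum . fromIntegral) ([0xC3, 0xA9] :: [Word8])

-- Both values are the same String, so no run-time check can tell
-- which of the two was meant.
indistinguishable :: Bool
indistinguishable = genuineText == encodedText
```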

> Also, truncation could have security implications. It makes it almost
> impossible to escape dangerous characters robustly. Consider the following
> code. This is more a matter of speculation than a real threat, but
> nevertheless:

Nice example. It shows that escaping should be the last step.
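
The quoted code is not reproduced in this message, so the following is a
separate, hypothetical sketch of the same idea. truncateChars models the
low-8-bits truncation discussed above, and shellQuote is a made-up
single-quote escaper for a POSIX shell; neither is a library function.

```haskell
import Data.Char (chr, ord)

-- Keep only the low 8 bits of each code point.
truncateChars :: String -> String
truncateChars = map (chr . (`mod` 256) . ord)

-- Hypothetical escaper: wrap the argument in single quotes and
-- replace each embedded ' with '\''.
shellQuote :: String -> String
shellQuote s = "'" ++ concatMap esc s ++ "'"
  where
    esc '\'' = "'\\''"
    esc c    = [c]

-- Escaping first: a quote created later by truncation is not escaped.
escapeThenTruncate :: String -> String
escapeThenTruncate = truncateChars . shellQuote

-- Escaping last: the escaper sees exactly what will be passed on.
truncateThenEscape :: String -> String
truncateThenEscape = shellQuote . truncateChars
```

U+0127 has low byte 0x27, which is the apostrophe. For the input
"a\295b", escapeThenTruncate therefore leaves a bare quote in the middle
of the quoted argument, while truncateThenEscape escapes it.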

S.

