[Haskell-cafe] File path programme

Krasimir Angelov kr.angelov at gmail.com
Fri Jan 28 04:48:57 EST 2005


On Thu, 27 Jan 2005 16:31:21 -0500, robert dockins
<robdockins at fastmail.fm> wrote:
> > I don't pretend to fully understand the various Unicode standards, but
> > it seems to me that these problems go deeper than a file path library.
> > The inequality (decode . encode) /= id confuses me. Can you give me an
> > example of when this happens?
> 
> I am pretty sure that ISO 2022 encoded strings can have multiple ways to
> express the same unicode glyphs.  This means that any sensible relation
> between ISO 2022 strings and unicode strings maps more than one ISO 2022
> string onto the same unicode string.  The inverse is therefore not a
> function.  To make it a function one of the possibly several encodings
> of the unicode string will have to be chosen.  So you have an ISO 2022
> string A which is decoded to a unicode string U.  We reencode U to an
> ISO 2022 string B.  It may be that A /= B.  That is the problem.
> 
> The various UTF encodings do not have this particular problem; if a UTF
> string is valid, then it is a unique representation of a unicode string.
> However, decoding is still a partial function and can fail.
> 
> A discussion about this problem floated around on this list several
> months ago.
> 
> > What can we do when the file name is passed as command line
> > argument to program? We need to convert String to FilePath after all.
> 
> Then we can parse the unicode and hope that nothing bad happens; the
> majority of the time, we will be OK.  Or we can make the RTS allow
> access to the raw bytes of the program arguments, env variables, etc,
> and actually do the right thing.

This means that all the Unicode-based languages I have used before
(Java, C#) are broken too. In that case I agree that a special data type
might be better. The new FilePath should be developed together with the
new Unicode-aware I/O library.
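
To make the A /= B case from the quoted text concrete, here is a minimal
sketch. It uses a toy stateful encoding, not real ISO 2022: the shift
byte and the names decodeToy and encodeToy are invented for
illustration only. The point is just that a redundant shift byte
decodes to nothing, so re-encoding cannot reproduce it.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Toy stateful encoding (NOT real ISO 2022): byte 0x0F means
-- "switch to the default character set" and produces no character.
shiftIn :: Word8
shiftIn = 0x0F

-- Decode: drop every shift byte, map the remaining bytes to Chars.
decodeToy :: [Word8] -> String
decodeToy = map (chr . fromIntegral) . filter (/= shiftIn)

-- Encode: always pick the canonical form, with no shift bytes.
-- Only characters below U+0100 are handled in this sketch.
encodeToy :: String -> [Word8]
encodeToy = map (fromIntegral . ord)

-- A carries a redundant shift byte in front of "foo".
a :: [Word8]
a = [shiftIn, 0x66, 0x6F, 0x6F]

-- B is what we get after decoding A and encoding it again.
b :: [Word8]
b = encodeToy (decodeToy a)

-- Both byte strings decode to the same String, yet they differ as
-- bytes, so a file named by A cannot be found again through B.
sameText, sameBytes :: Bool
sameText  = decodeToy a == decodeToy b
sameBytes = a == b
```

Here sameText holds while sameBytes does not, which is exactly why
going through String loses the original file name.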

I agree with David Roundy that the internal representation of FilePath
should be as compact as possible. PackedString uses UArray Int Char to
store strings, and we can use UArray Int Word8 or even ByteArray#.
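
A rough sketch of what such a compact representation might look like
with UArray Int Word8 is below. The type name RawPath and the helpers
fromBytes and toBytes are hypothetical; nothing like this exists in the
libraries yet, and a real design would need many more operations.

```haskell
import Data.Array.Unboxed (UArray, elems, listArray)
import Data.Word (Word8)

-- Hypothetical opaque path type: the bytes are stored unboxed and
-- are never interpreted as characters.
newtype RawPath = RawPath (UArray Int Word8)
  deriving (Eq)

-- Build a path from raw bytes, e.g. as handed over by the OS.
fromBytes :: [Word8] -> RawPath
fromBytes bs = RawPath (listArray (0, length bs - 1) bs)

-- Get the bytes back unchanged, ready to pass to a system call.
toBytes :: RawPath -> [Word8]
toBytes (RawPath arr) = elems arr
```

Because no decoding step is involved, toBytes (fromBytes bs) gives back
bs for any byte list, including bytes that are not valid in any
particular encoding.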

Under Windows nearly all API functions have two versions: ANSI and
Unicode (16-bit). Under WinNT+ each ANSI function is just a wrapper
around its Unicode counterpart, and the wrapper simply converts the
strings passed to it. It was said that paths under Windows are [Word16]
while under Posix they are [Word8]. This is true, but in order to take
advantage of it we need to use the native Windows API in the new I/O
library. Another advantage is that this way we can use the native
non-blocking I/O under Windows.

Cheers,
  Krasimir

