FilePath as ADT

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Sat Feb 4 10:05:57 EST 2006


Axel Simon <A.Simon at kent.ac.uk> writes:

> The solution of representing a file name abstractly is also used by
> the Java libraries.

I don't think it is. Besides using plain Java (UTF-16) strings for
filenames, there is the File class, but it is also built on Java
strings. The documentation of listFiles() says that each resulting
File is made using the File(File, String) constructor. The GNU Java
implementation stores a single Java string inside it.

On Windows the OS uses UTF-16 strings natively rather than byte
sequences. UTF-16 and Unicode are almost interconvertible (modulo
illegal sequences of surrogates), while converting between UTF-16
and byte sequences is messy. This means that unconditionally using
Word8 as the representation of filenames would be bad.
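
To illustrate the "modulo illegal sequences of surrogates" caveat, here
is a small sketch in Python (chosen only for illustration; the helper
names are hypothetical and not part of any library). It converts between
lists of 16-bit code units and text using the standard strict
"utf-16-le" codec, which rejects unpaired surrogates.

```python
def utf16_units_to_text(units):
    """Strictly decode a list of 16-bit code units.

    Raises UnicodeDecodeError if the units contain an unpaired surrogate.
    """
    raw = b"".join(u.to_bytes(2, "little") for u in units)
    return raw.decode("utf-16-le")


def text_to_utf16_units(text):
    """Encode text as a list of 16-bit UTF-16 code units."""
    raw = text.encode("utf-16-le")
    return [int.from_bytes(raw[i:i + 2], "little")
            for i in range(0, len(raw), 2)]
```

A well-formed sequence such as [0x0061, 0xD83D, 0xDE00] converts in both
directions, while [0xD800, 0x0041] (a high surrogate not followed by a
low one) is rejected by the strict decoder.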

I don't know a good solution.

                          *       *       *

Encouraged by Mono, for my language Kogut I adopted a hack that
Unicode people hate: the possibility to use a modified UTF-8 variant
where byte sequences which are illegal in UTF-8 are decoded into
U+0000 followed by another character. This encoding is used as the
default encoding instead of the true UTF-8 if the locale says that
UTF-8 should be used and a particular environment variable is set
(KO_UTF8_ESCAPED_BYTES=1).

The encoding has the following properties:

- Any byte sequence is decodable to a character sequence, which
  encodes back to the original byte sequence.

- Different character sequences encode to different byte sequences
  (the U+0000 escape is valid only when it would be necessary).

- It coincides with UTF-8 for valid UTF-8 byte sequences not
  containing 0x00, and character sequences not containing U+0000.
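
The three properties above can be sketched in Python. This is not
Kogut's implementation, only a minimal model under stated assumptions:
the message says just "U+0000 followed by another character", so the
sketch assumes that the second character is the code point equal to the
offending byte (U+0080..U+00FF), and that a 0x00 byte becomes
U+0000 U+0000. The function names are hypothetical.

```python
def _seq_len(lead):
    """Length of the UTF-8 sequence a lead byte announces, or 0 if none."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode_escaped(data):
    """Decode any byte string; bytes that are not valid UTF-8 are escaped."""
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if 0x01 <= b <= 0x7F:
            out.append(chr(b))
            i += 1
            continue
        length = _seq_len(b)
        chunk = data[i:i + length]
        if length and len(chunk) == length:
            try:
                # The strict decoder rejects overlong forms and surrogates.
                out.append(chunk.decode("utf-8"))
                i += length
                continue
            except UnicodeDecodeError:
                pass
        # Assumption: escape as U+0000 plus the code point of the byte.
        out.append("\x00" + chr(b))
        i += 1
    return "".join(out)


def encode_escaped(text):
    """Inverse of decode_escaped; rejects escapes that were not necessary."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text[i] == "\x00":
            if i + 1 >= len(text):
                raise ValueError("dangling U+0000 escape")
            code = ord(text[i + 1])
            if code != 0 and not 0x80 <= code <= 0xFF:
                raise ValueError("invalid U+0000 escape")
            out.append(code)
            i += 2
        else:
            out += text[i].encode("utf-8")
            i += 1
    result = bytes(out)
    # An escape is valid only where it is needed: the result must decode
    # back to exactly the same text, otherwise two strings would collide.
    if decode_escaped(result) != text:
        raise ValueError("unnecessary U+0000 escape")
    return result
```

With these definitions the Latin-1 filename b"caf\xe9.txt" decodes to
"caf", U+0000, U+00E9, ".txt" and encodes back to the same bytes, while
the string U+0000 U+00C3 U+0000 U+00A9 is rejected, because the bytes
0xC3 0xA9 already form valid UTF-8 and must be written as U+00E9.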

It's a hack, and doesn't address encodings other than UTF-8, but it
was good enough for me; it lets me maintain the illusion that OS
strings are character strings. Alternatives were:

* Use byte strings and character strings in different places,
  sometimes using a different type depending on the OS (Windows
  filenames would be character strings).

  Disadvantages: It's hard to write a filename to a text file.
  The API is more complex. The programmer must care too often
  about which kind of string is in use.

* Fail when encountering byte strings which can't be decoded.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

