FilePath as ADT

Marcin 'Qrczak' Kowalczyk qrczak at knm.org.pl
Mon Feb 6 08:46:10 EST 2006


Ben Rudiak-Gould <Benjamin.Rudiak-Gould at cl.cam.ac.uk> writes:

> I don't like the idea of using U+0000, because it looks like an ASCII
> control character, and in any case has a long tradition of being used
> for something else. Why not use a code point that can't result from
> decoding a valid UTF-8 string? U+FFFF (EF BF BF) is illegal in UTF-8,

It is legal. It's meaningless for data exchange, but OSes don't
prevent creating a file with UTF-8-encoded U+FFFF in its name,
and a true UTF-8 decoder interprets that byte sequence as U+FFFF.

U+0000 and the surrogates (U+D800..U+DFFF) are the only code points
which can't appear in true UTF-8-encoded filenames, so the escape for
undecodable bytes has to use one of them to be fully compatible with
true UTF-8.
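
To make the U+FFFF point concrete, here is a minimal sketch of the
three-byte case of a strict UTF-8 decoder. The name decode3 is made up
for this example and is not taken from any library; the other sequence
lengths are left out to keep it short.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Word (Word8)

-- | Decode one three-byte UTF-8 sequence to a code point.
-- Hypothetical helper: rejects overlong forms and surrogates,
-- as a strict decoder is assumed to do.
decode3 :: Word8 -> Word8 -> Word8 -> Maybe Int
decode3 b1 b2 b3
  | b1 .&. 0xF0 /= 0xE0            = Nothing  -- not a three-byte lead byte
  | not (isCont b2 && isCont b3)   = Nothing  -- bad continuation byte
  | cp < 0x800                     = Nothing  -- overlong encoding
  | cp >= 0xD800 && cp <= 0xDFFF   = Nothing  -- surrogate, never valid
  | otherwise                      = Just cp
  where
    isCont b = b .&. 0xC0 == 0x80
    cp = (fromIntegral (b1 .&. 0x0F) `shiftL` 12)
     .|. (fromIntegral (b2 .&. 0x3F) `shiftL` 6)
     .|.  fromIntegral (b3 .&. 0x3F)
```

With this sketch, decode3 0xEF 0xBF 0xBF gives Just 65535, that is
U+FFFF, while decode3 0xED 0xA0 0x80 (which would be U+D800) gives
Nothing. So a surrogate is free to serve as an escape, and U+FFFF
is not.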

> Or you could use values from U+DC00 to U+DFFF,

Right, but somehow I like U+0000 more.

> A much cleaner solution would be to reserve part of the private use
> area, say U+109780 through U+1097FF (DBE5 DF80 through DBE5 DFFF).

This would not be fully compatible with true UTF-8, because these
characters already have a representation in UTF-8.
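
A small sketch of why the private use range collides: the encoder
below (encodeUtf8Char is a name invented for this example, and it does
no validation of surrogates) shows that U+109780 already has an
ordinary four-byte UTF-8 form.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode a single character as UTF-8 bytes.
-- Hypothetical helper for illustration; it does not reject surrogates.
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    -- Payload bits that go into the lead byte.
    top s  = fromIntegral (n `shiftR` s)
    -- Continuation byte carrying six payload bits.
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)
```

Here encodeUtf8Char '\x109780' is [0xF4, 0x89, 0x9E, 0x80]. A filename
containing exactly those bytes is valid UTF-8 and decodes to U+109780,
so it could not be told apart from an escaped invalid byte.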

> There's a lot to be said for any encoding, however nasty, that at
> least takes ASCII to ASCII.

Right, but '\0' can't appear in filenames.

My conversion routines for strings exchanged with C assume that
the default encoding leaves ASCII except NUL unchanged. NUL has
to be special-cased anyway, because in most cases it's disallowed
in a C string. So the fast path checks whether all characters
are in U+0001..U+007F, and if so, the string is used directly by C
(my representation of strings uses one byte per character, with '\0'
at the end, if the string has no characters above U+00FF). Otherwise
it's encoded using the dynamically specified default encoding,
and there is an additional check that the *resulting* string
contains no '\0'; if it does, that is an error.
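
The logic above can be sketched roughly as follows. This is only an
illustration in Haskell, not the actual routines: Encoder, toCBytes
and latin1 are names invented here, the encoder is passed as an
argument to stand in for the dynamically specified default encoding,
and the fast path copies bytes where the real one would reuse the
string's storage.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- | Hypothetical stand-in for the dynamically chosen default encoding.
type Encoder = String -> [Word8]

-- | Bytes to hand to C (without the terminating NUL), or an error.
toCBytes :: Encoder -> String -> Either String [Word8]
toCBytes encode s
  -- Fast path: only U+0001..U+007F, assumed unchanged by the encoding.
  | all plain s    = Right (map (fromIntegral . ord) s)
  -- Slow path: encode, then reject a NUL in the *result*.
  | 0 `elem` bytes = Left "NUL byte in encoded string"
  | otherwise      = Right bytes
  where
    plain c = c >= '\x01' && c <= '\x7F'
    bytes   = encode s

-- | Toy encoder for the example: keeps the low eight bits of each
-- code point, so it is only meaningful up to U+00FF.
latin1 :: Encoder
latin1 = map (fromIntegral . ord)
```

For example, toCBytes latin1 "abc" is Right [97,98,99] via the fast
path, toCBytes latin1 "caf\xE9" goes through the encoder and gives
Right [99,97,102,233], and toCBytes latin1 "a\NULb" is a Left.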

Conversion of file contents takes no shortcuts and assumes nothing
about ASCII compatibility. It always works on buffers containing
4-byte characters.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/


More information about the Haskell-prime mailing list