FilePath as ADT
Ben Rudiak-Gould
Benjamin.Rudiak-Gould at cl.cam.ac.uk
Sun Feb 5 23:02:03 EST 2006
Marcin 'Qrczak' Kowalczyk wrote:
> Encouraged by Mono, for my language Kogut I adopted a hack that
> Unicode people hate: the possibility to use a modified UTF-8 variant
> where byte sequences which are illegal in UTF-8 are decoded into
> U+0000 followed by another character.
I don't like the idea of using U+0000, because it looks like an ASCII
control character, and in any case has a long tradition of being used for
something else. Why not use a code point that can't result from decoding a
valid UTF-8 string? U+FFFF (EF BF BF) is illegal in UTF-8, for example, and
I don't think it's legal UTF-16 either. This would give you round-tripping
for all legal UTF-8 and UTF-16 strings.
Or you could use values from U+DC00 to U+DFFF, which definitely aren't legal
UTF-8 or UTF-16. There's plenty of room there to encode each invalid UTF-8
byte in a single word, instead of a sequence of two words.
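For what it's worth, this U+DC00-range trick is exactly the scheme Python later standardized as the `surrogateescape` error handler (PEP 383, 2009): each undecodable byte b becomes the lone surrogate U+DC00+b, and encoding reverses the mapping, so arbitrary byte strings round-trip losslessly. A minimal illustration:

```python
# Each invalid byte is smuggled through as one lone surrogate in
# U+DC80..U+DCFF, one word per byte, and round-trips back out.
raw = b"--output-file=caf\xe9.txt"        # Latin-1 e-acute: not valid UTF-8
text = raw.decode("utf-8", "surrogateescape")
assert text == "--output-file=caf\udce9.txt"            # 0xE9 -> U+DCE9
assert text.encode("utf-8", "surrogateescape") == raw   # lossless round trip
```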
A much cleaner solution would be to reserve part of the private use area,
say U+109780 through U+1097FF (DBE5 DF80 through DBE5 DFFF). There's a
pretty good chance you won't collide with anyone. It's too bad Unicode
hasn't set aside 128 code points for this purpose. Maybe we should grab some
unassigned code points, document them, and hope it catches on.
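A sketch of that private-use scheme, assuming the U+109780 base suggested above (the helper names are illustrative, not anything standardized): map each byte that cannot be decoded as UTF-8 to U+109780 + (b - 0x80), and invert the mapping when encoding.

```python
BASE = 0x109780  # suggested private-use base; 128 slots cover bytes 0x80..0xFF

def decode_escaping(raw: bytes) -> str:
    """Decode UTF-8, escaping each offending byte b to chr(BASE + b - 0x80)."""
    out, i = [], 0
    while i < len(raw):
        # UTF-8 is self-synchronizing, so try the shortest decodable prefix
        for n in (1, 2, 3, 4):
            try:
                out.append(raw[i:i + n].decode("utf-8"))
                i += n
                break
            except UnicodeDecodeError:
                pass
        else:  # no prefix decodes: escape this one byte into the PUA range
            out.append(chr(BASE + (raw[i] - 0x80)))
            i += 1
    return "".join(out)

def encode_escaping(text: str) -> bytes:
    """Inverse: PUA escapes become their original bytes, the rest is UTF-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if BASE <= cp < BASE + 0x80:
            out.append(0x80 + cp - BASE)
        else:
            out.extend(ch.encode("utf-8"))
    return bytes(out)
```

As the post concedes, a name that genuinely contains characters in U+109780..U+1097FF would collide with the escapes; the scheme only bets that such names are vanishingly rare.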
There's a lot to be said for any encoding, however nasty, that at least
takes ASCII to ASCII. Often people just want to inspect the ASCII portions
of a string while leaving the rest untouched (e.g. when parsing
"--output-file=¡£ª±ïñ¹!.txt"), and any encoding that permits this is good
enough.
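Concretely: since UTF-8 and all the escape schemes above never map a non-ASCII character onto an ASCII byte, you can split on ASCII delimiters without decoding anything. A small illustration of parsing an option at the byte level:

```python
# The value after "=" may be in any ASCII-transparent encoding;
# partitioning on the ASCII delimiter never needs to decode it.
arg = b"--output-file=" + bytes(range(0x80, 0x90)) + b".txt"
key, sep, value = arg.partition(b"=")
assert key == b"--output-file"
assert value == bytes(range(0x80, 0x90)) + b".txt"
```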
> Alternatives were:
>
> * Use byte strings and character strings in different places,
> sometimes using a different type depending on the OS (Windows
> filenames would be character strings).
>
> * Fail when encountering byte strings which can't be decoded.
Another alternative is to simulate the existence of a UTF-8 locale on Win32.
Represent filenames as byte strings on both platforms; on NT convert between
UTF-8 and UTF-16 when interfacing with the outside; on 9x either use the
ANSI/OEM encoding internally or convert between UTF-8 and the ANSI/OEM
encoding. I suppose NT probably doesn't check that the filenames you pass to
the kernel are valid UTF-16, so there's some possibility that files with
illegal names might be accessible to other applications but not to Haskell
applications. But I imagine such files are much rarer than Unix filenames
that aren't legal in the current locale. And for the files that do slip
through, you could still fall back on the private-encoding trick.
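The "illegal UTF-16" concern can be made concrete with Python's `surrogatepass` error handler (an assumption of this sketch, not something the post proposes): a kernel-supplied name containing an unpaired surrogate is rejected by strict UTF-8 at the conversion boundary, but a lenient handler lets it round-trip, so such files need not become unreachable.

```python
# An NT filename may contain an unpaired surrogate (the kernel does not
# validate UTF-16).  Strict UTF-8 refuses to encode it...
bad_nt_name = "oops\ud800.txt"            # lone high surrogate
try:
    bad_nt_name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
assert not strict_ok

# ...but "surrogatepass" carries it through the byte representation and back.
blob = bad_nt_name.encode("utf-8", "surrogatepass")
assert blob == b"oops\xed\xa0\x80.txt"
assert blob.decode("utf-8", "surrogatepass") == bad_nt_name
```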
-- Ben