[Haskell-cafe] Writing binary files?

Sat Sep 18 04:58:21 EDT 2004

Glynn Clements <glynn.clements at virgin.net> writes:

>> Ok, but let it be in addition to, not instead treating them as
>> character strings.
>
> Provided that you know the encoding, nothing stops you converting
> them to strings, should you have a need to do so.

There are already APIs which use Strings for filenames. I meant to
keep them, let them use a program-settable encoding which defaults to
the locale encoding - this is the only sane interpretation of this
interface on Unix I can imagine. And in addition to them we may have
APIs which use byte strings, for those who prefer the ability to
handle all filenames to using a uniform string representation inside
the program.
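
As a rough sketch of what "program-settable, defaulting to the locale"
can look like: present-day GHC (much later than this post) exposes such
a setting in GHC.IO.Encoding as getFileSystemEncoding and
setFileSystemEncoding. The helper names withFileSystemEncoding and
withUtf8Filenames below are made up for illustration, and the setting is
process-global, so this is only a sketch, not a thread-safe design.

```haskell
import Control.Exception (bracket)
import GHC.IO.Encoding
  ( TextEncoding, getFileSystemEncoding, setFileSystemEncoding, utf8 )

-- Hypothetical helper: run an action with String filenames interpreted
-- in the given encoding, then restore whatever was set before (by
-- default an encoding derived from the locale).
withFileSystemEncoding :: TextEncoding -> IO a -> IO a
withFileSystemEncoding enc action =
  bracket getFileSystemEncoding setFileSystemEncoding $ \_ ->
    setFileSystemEncoding enc >> action

-- Hypothetical convenience wrapper for programs that want UTF-8
-- filenames regardless of the locale.
withUtf8Filenames :: IO a -> IO a
withUtf8Filenames = withFileSystemEncoding utf8
```

The byte-string side of the proposal would be a separate set of
functions taking raw bytes, which this sketch does not cover.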

>> Such encodings are not suitable for filenames.
>
> Regardless of whether they are "suitable", they are used.

Usage of ISO-2022 as a filename encoding is a bad and unsupported idea.
The '/' byte does not necessarily mean that the '/' character is
there, so a seemingly random subset of characters (those whose
encoding happens to contain that byte) is excluded. Statefulness
means that the same filename may be interpreted as different
characters depending on context.
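
To make the '/' problem concrete, here is a small sketch assuming the
ISO-2022-JP subset of ISO-2022, where ESC $ F switches to a two-byte
character set and ESC ( F switches back to a one-byte one. The function
names are invented for this example, and the scanner ignores everything
else ISO-2022 allows.

```haskell
import Data.Word (Word8)

data Mode = SingleByte | DoubleByte

-- Offsets of every 0x2F byte: what a kernel or an encoding-unaware
-- tool treats as a path separator.
naiveSlashes :: [Word8] -> [Int]
naiveSlashes bs = [i | (i, b) <- zip [0 ..] bs, b == 0x2F]

-- Offsets of the 0x2F bytes that really encode '/', found by tracking
-- the shift state set by the escape sequences.
realSlashes :: [Word8] -> [Int]
realSlashes = go SingleByte 0
  where
    go _ _ [] = []
    go _ i (0x1B : 0x28 : _ : rest) = go SingleByte (i + 3) rest -- ESC ( F
    go _ i (0x1B : 0x24 : _ : rest) = go DoubleByte (i + 3) rest -- ESC $ F
    go SingleByte i (b : rest)
      | b == 0x2F = i : go SingleByte (i + 1) rest
      | otherwise = go SingleByte (i + 1) rest
    go DoubleByte i (_ : _ : rest) = go DoubleByte (i + 2) rest
    go DoubleByte _ [_] = [] -- truncated two-byte character

-- 'a', ESC $ B, one two-byte character (0x21 0x2F), ESC ( B, '/', 'b'
sample :: [Word8]
sample = [0x61, 0x1B, 0x24, 0x42, 0x21, 0x2F, 0x1B, 0x28, 0x42, 0x2F, 0x62]
```

For sample, naiveSlashes reports offsets 5 and 9, while realSlashes
reports only 9: the byte at offset 5 is the second half of a two-byte
character, not a separator.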

There is no need to support ISO-2022 as a filename encoding in languages
and tools. The fact that some tool doesn't support ISO-2022 in
filenames is not a flaw in the tool, so there is no need to check
what happens when filenames are represented in ISO-2022. If they are,
someone should fix his system.

> I haven't addressed any of the other stuff about ISO-2022, as it isn't
> really relevant. Whether ISO-2022 is good or bad doesn't matter; what
> matters is that it is likely to remain in use for the foreseeable
> future.

For transport, not for the locale encoding nor for filenames.
There are no ISO-2022 locales. A program may support it when the data
it operates on requires recoding between explicit encodings, e.g. if
it's found in an email, but there is no need to support it as the
default encoding of a program (which e.g. the withCString function
should use).
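
The split between a default encoding and explicit recoding can be
sketched with present-day GHC, which postdates this post: there
Foreign.C.String.withCString uses a default encoding taken from the
locale, while GHC.Foreign offers variants that take a TextEncoding
explicitly. The names encodedLength and transcode are made up for this
example.

```haskell
import qualified GHC.Foreign as GHC
import GHC.IO.Encoding (TextEncoding, latin1, utf8)

-- Number of bytes the string occupies when marshalled to C in the
-- given encoding (no terminating NUL is counted).
encodedLength :: TextEncoding -> String -> IO Int
encodedLength enc s = GHC.withCStringLen enc s (return . snd)

-- Marshal with one explicit encoding and read the same bytes back
-- with another, as a program recoding data between two known
-- encodings would.
transcode :: TextEncoding -> TextEncoding -> String -> IO String
transcode encOut encIn s =
  GHC.withCStringLen encOut s (GHC.peekCStringLen encIn)
```

For the one-character string "\xE9", encodedLength gives 2 with utf8 and
1 with latin1, and transcode utf8 latin1 returns the two characters
"\xC3\xA9", which is what happens when bytes are read back under the
wrong assumption.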

>> IMHO it's more important to make them compatible with the
>> representation of strings used in other parts of the program.
>
> Why?

To limit conversion hassle to I/O, instead of scattering it through
the program wherever filenames and other strings meet.

>> But otherwise programs would continuously have bugs in handling text
>> which is not ISO-8859-1, especially with multibyte encoding where
>> pretending that ISO-8859-2 is ISO-8859-1 too often doesn't work.
>
> Why?

Because some channels talk in terms of characters, or bytes in a known
encoding, instead of bytes in an implicit encoding. E.g. most display
channels, apart from raw stdin/stdout and narrow character ncurses;
many Internet protocols, apart from irc; .NET and Java; file formats
like XML; some databases.
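
A minimal illustration of what goes wrong when such a channel is fed
bytes under the wrong assumption: the same bytes decoded as ISO-8859-2
and as ISO-8859-1 give different characters. The ISO-8859-2 decoder
below is deliberately partial (three Polish letters only) and all names
are invented for the example.

```haskell
import Data.Char (chr)
import Data.Word (Word8)

-- ISO-8859-1 maps every byte to the code point with the same number.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Partial ISO-8859-2 decoder: only three letters are listed. Every
-- other byte falls through to the ISO-8859-1 mapping, which is wrong
-- for many bytes in the range 0xA1..0xFF.
decodeLatin2Partial :: [Word8] -> String
decodeLatin2Partial = map decodeByte
  where
    decodeByte 0xB1 = '\x0105' -- a with ogonek
    decodeByte 0xB3 = '\x0142' -- l with stroke
    decodeByte 0xBF = '\x017C' -- z with dot above
    decodeByte b    = chr (fromIntegral b)

-- The word "zloty" with l-stroke, as ISO-8859-2 bytes.
zloty :: [Word8]
zloty = [0x7A, 0xB3, 0x6F, 0x74, 0x79]
```

decodeLatin2Partial zloty yields the intended word with U+0142 in
second position, while decodeLatin1 zloty yields U+00B3 (superscript
three) there. A channel that expects characters passes on whichever one
the program handed it.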

And the world is slowly shifting to have more such channels, which
replace byte streams in an implicit encoding, because after reaching
a critical mass (where encodingless channels no longer sit in the
middle, losing information about the encoding or losing some
characters) they make multilingual handling more robust.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak at knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/