[Haskell-cafe] Writing binary files?

Glynn Clements glynn.clements at virgin.net
Fri Sep 17 20:34:36 EDT 2004


Marcin 'Qrczak' Kowalczyk wrote:

> > What I'm suggesting in the above is to sidestep the encoding issue
> > by keeping filenames as byte strings wherever possible.
> 
> Ok, but let it be in addition to, not instead of, treating them as
> character strings.

Provided that you know the encoding, nothing stops you converting them
to strings, should you have a need to do so.

> >> Processing data in their original byte encodings makes supporting
> >> multiple languages harder. Filenames which are inexpressible as
> >> character strings get in the way of clean APIs. When considering only
> >> filenames, using bytes would be sufficient, but overall it's more
> >> convenient to Unicodize them like other strings.
> >
> > It also harms reliability. Depending upon the encoding, two distinct
> > byte strings may have the same Unicode representation.
> 
> Such encodings are not suitable for filenames.

Regardless of whether they are "suitable", they are used.
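
To make the reliability point concrete, here is a minimal sketch, in
Python purely for illustration. It assumes the standard "iso2022_jp"
codec; the filename is made up. ESC ( B selects ASCII, which is already
the initial state, so the second byte string is a legal but redundant
spelling of the first.

```python
# Two distinct byte strings that decode to the same text under ISO-2022-JP.
plain = b"report.txt"
escaped = b"\x1b(Breport.txt"   # redundant "switch to ASCII" escape in front

# To the OS these are two different filenames.
distinct_bytes = plain != escaped

# After conversion to Unicode the difference is gone.
text_a = plain.decode("iso2022_jp")
text_b = escaped.decode("iso2022_jp")
same_text = text_a == text_b

# Re-encoding cannot tell us which byte string we started from.
round_trip = text_b.encode("iso2022_jp")
```

A program that converts such a name to a string and back would end up
asking the OS for a different file from the one it was given.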

> For me ISO-2022 is a brain-damaged concept and should die.

Well, it isn't likely to.

I haven't addressed any of the other stuff about ISO-2022, as it isn't
really relevant. Whether ISO-2022 is good or bad doesn't matter; what
matters is that it is likely to remain in use for the foreseeable
future.

> >> Such tarballs are not portable across systems using different encodings.
> >
> > Well, programs which treat filenames as byte strings to be read from
> > argv[] and passed directly to open() won't have any problems with this.
> 
> The OS itself may have problems with this; only some filesystems
> accept arbitrary bytes apart from '\0' and '/' (and with the special
> meaning for '.'). Exotic characters in filenames are not very
> portable.

No, but most Unix programs manage to handle them without problems.

> >> A Haskell program in my world can do that too. Just set the encoding
> >> to Latin1.
> >
> > But programs should handle this by default, IMHO.
> 
> IMHO it's more important to make them compatible with the
> representation of strings used in other parts of the program.

Why?

> > Filenames are, for the most part, just "tokens" to be passed around.
> 
> Filenames are often stored in text files,

True.

> whose bytes are interpreted as characters.

Sometimes true, sometimes not.

Where filenames occur in data files, e.g. configuration files, the
program which reads the configuration file typically passes the bytes
directly to the OS without interpretation.
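
As a rough sketch of what that looks like, again in Python for
illustration only: the key=value format, the read_setting helper and
the path are all hypothetical, not taken from any particular program.
The file is parsed as bytes, and the value is never decoded.

```python
def read_setting(config: bytes, key: bytes) -> bytes:
    """Return the raw value for `key` from `key=value` lines, undecoded."""
    for line in config.splitlines():
        name, sep, value = line.partition(b"=")
        if sep and name.strip() == key:
            return value.strip()
    raise KeyError(key)

# Hypothetical config contents. The path contains the single byte 0xE9
# (e-acute in ISO-8859-1), which is not valid UTF-8 on its own.
config = b"logfile=/var/log/caf\xe9.log\nverbose=1\n"
path = read_setting(config, b"logfile")
# `path` is still bytes; it could be handed to open(path, "ab") as-is,
# without the program ever needing to know the encoding.
```

Only the ASCII "=" and the line terminators are interpreted; everything
else passes through untouched.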

> Applying QP to non-ASCII parts of filenames is suitable
> only if humans won't edit these files by hand.

Who said anything about QP?

> >> > My specific point is that the Haskell98 API has a very big problem due
> >> > to the assumption that the encoding is always known. Existing
> >> > implementations work around the problem by assuming that the encoding
> >> > is always ISO-8859-1.
> >> 
> >> The API is incomplete and needs to be enhanced. Programs written using
> >> the current API will be limited to using the locale encoding.
> >
> > That just adds unnecessary failure modes.
> 
> But otherwise programs would continuously have bugs in handling text
> which is not ISO-8859-1, especially with multibyte encoding where
> pretending that ISO-8859-2 is ISO-8859-1 too often doesn't work.

Why?

> I can't switch my environment to UTF-8 yet precisely because too many
> programs were written with the attitude you are promoting: they don't
> care about the encoding, they just pass bytes around.

That's all that many programs should be doing.
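
A small sketch of the failure-mode difference, in Python for
illustration only. The function names are made up, and the filename is
an arbitrary ISO-8859-2 example. Passing bytes through cannot fail. A
round trip through ISO-8859-1 cannot fail either, since every byte maps
to a character. A round trip through a guessed encoding such as UTF-8
can.

```python
def copy_bytes(data: bytes) -> bytes:
    """Pass data through untouched; there is no input this can fail on."""
    return data

def copy_via_text(data: bytes, encoding: str) -> bytes:
    """Decode to characters and encode again; may raise UnicodeDecodeError."""
    return data.decode(encoding).encode(encoding)

# An ISO-8859-2 filename; 0xBF cannot start a UTF-8 sequence.
latin2_name = b"\xbf\xf3\xb3w.txt"

bytes_ok = copy_bytes(latin2_name) == latin2_name
latin1_ok = copy_via_text(latin2_name, "latin-1") == latin2_name

try:
    copy_via_text(latin2_name, "utf-8")
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True
```

The ISO-8859-1 round trip preserves the bytes even though the
characters it produces in between are the wrong ones, which is fine for
a program that only passes the name along.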

> Bugs range from small annoyances like tabular output which doesn't
> line up, through mangled characters on a graphical display, to
> full-screen interactive programs being unusable on a UTF-8 terminal.

IOW:

1. display doesn't work correctly,
2. display doesn't work correctly, and
3. display doesn't work correctly.

You keep citing cases involving graphical display as a reason why all
programs should be working with characters all of the time.

I haven't suggested that programs should never deal with characters,
yet you keep insinuating that is my argument, then proceed to attack
it.

-- 
Glynn Clements <glynn.clements at virgin.net>

