[Haskell-cafe] Writing binary files?

Mon Sep 13 07:01:58 EDT 2004

Marcin 'Qrczak' Kowalczyk wrote:

> >> 1. API for manipulating byte sequences in I/O (without representing
> >>    them in String type).
> >
> > Note that this needs to include all of the core I/O functions, not
> > just reading/writing streams. E.g. FilePath is currently an alias for
> > String, but (on Unix, at least) filenames are strings of bytes, not
> > characters. Ditto for argv, environment variables, possibly other
> > cases which I've overlooked.
> 
> They don't hold binary data; they hold data intended to be interpreted
> as text.

No. They frequently hold data intended to be passed to system
functions which interpret them simply as bytes, without regard to
encoding.

> If the encoding of the text doesn't agree with the locale,
> the environment setup is broken and 'ls' and 'env' misbehave on an
> UTF-8 terminal.

ls and env just write bytes to stdout (which may or may not refer to
the terminal). A particular terminal may not display them correctly,
but that's a separate issue.

Unless you are the sole user of a system, you have no control over
what filenames may occur on it (and even if you are the sole user, you
may wish to use packages which don't conform to your rules). As
environment variables frequently contain pathnames, this fact may get
propagated to the environment (however, system directories are usually
restricted to ASCII, so this aspect is less likely to be an issue).

> A program can explicitly set the default encoding to ISO-8859-1 if it
> wishes to do something in a broken environment.
> 
> >> 4. Libraries are reviewed to ensure that they work with various
> >>    encoding settings.
> >
> > There are limits to the extent to which this can be achieved. E.g. 
> > what happens if you set the encoding to UTF-8, then call
> > getDirectoryContents for a directory which contains filenames which
> > aren't valid UTF-8 strings?
> 
> The library fails. Don't do that. This environment is internally
> inconsistent.

Call it what you like, it's a reality, and one which programs need to
deal with.

Most programs don't care whether any filenames which they deal with
are valid in the locale's encoding (or any other encoding). They just
receive lists (i.e. NUL-terminated arrays) of bytes and pass them
directly to the OS or to libraries.

> > I feel that the default encoding should be one whose decoder cannot
> > fail, e.g. ISO-8859-1.
> 
> But filenames on my filesystem and most file contents are *not*
> encoded in ISO-8859-1. Assuming that they are ISO-8859-1 is plainly
> wrong.

For the most part, assuming that they are encoded in *any* coding
system is wrong.

However, if you treat them as ISO-8859-* (it doesn't matter which one,
so long as you're consistent), the Haskell I/O functions will at least
pass them through unmodified. Consider a trivial "cp" program:

	import System.Environment (getArgs)

	main = do
		[src, dst] <- getArgs
		text <- readFile src
		writeFile dst text

If the assumed encoding is ISO-8859-*, this program will work
regardless of the filenames which it is passed or the contents of the
file (modulo the EOL translation on Windows). OTOH, if it were to use
UTF-8 (e.g. because that was the locale's encoding), it wouldn't work
correctly if either filename or the file's contents weren't valid
UTF-8.
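
To make the "pass them through unmodified" claim concrete, here is a
minimal sketch of what treating bytes as ISO-8859-1 amounts to. The
function names (decodeLatin1, encodeLatin1, roundTrip) are mine, chosen
for illustration; they are not part of any existing library interface.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Decode bytes as ISO-8859-1: byte n becomes code point n.
-- Every byte value has an image, so this cannot fail.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Encode back to bytes. This is only meaningful for Chars below 256,
-- which is all that decodeLatin1 ever produces.
encodeLatin1 :: String -> [Word8]
encodeLatin1 = map (fromIntegral . ord)

-- Decoding and then encoding again should give back the input bytes.
roundTrip :: [Word8] -> [Word8]
roundTrip = encodeLatin1 . decodeLatin1
```

Because the decoder is total and the encoder inverts it, any byte
sequence survives the trip through String, whatever encoding (if any)
the bytes were really in.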

> > You should have to explicitly request the use of the locale's
> > encoding (analogous to calling setlocale(LC_CTYPE, "") at the start
> > of a C program; there's a good reason why C doesn't do this without
> > being explicitly told to).
> 
> C usually uses the paradigm of representing text in their original
> 8-bit encodings. This is why getting C programs to work in a UTF-8
> locale is such a pain. Only some programs use wchar_t internally.

Many C programs don't care about encodings. It's only if you actually
have to interpret the bytes (e.g. ctype.h, strcoll) that encodings
start to matter. At which point, you have to know the encoding.

> Java and C# uses the paradigm of representing text in Unicode
> internally, recoding it on boundaries with the external world.
> 
> The second paradigm has a cost that you must be aware what encodings
> are used in texts you manipulate.

And that cost can be pretty high; e.g. gratuitously failing in the
case where you have no idea which encoding is used but where you
shouldn't actually need to know.

> Locale gives a reasonable default
> for simple programs which aren't supposed to work with multiple
> encodings, and it specifies the encoding of texts which don't have an
> encoding specified elsewhere (terminal I/O, filenames, environment
> variables).

More accurately, it specifies which encoding to assume when you *need*
to know the encoding (i.e. ctype.h etc), but you can't obtain that
information from a more reliable source.

My central point is that the existing API forces the encoding to be an
issue when it shouldn't be.
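
For file contents, at least, a byte-level interface makes the question
go away. The following is only a sketch, and it assumes a packed
byte-string module with its own readFile and writeFile, such as
Data.ByteString from the bytestring package; that module is not part
of Haskell98. The name copyBytes is hypothetical.

```haskell
import qualified Data.ByteString as B

-- Copy a file's contents as raw bytes; nothing is decoded or encoded.
-- The file names are still FilePath (i.e. String) here, so this only
-- addresses the "contents" half of the problem.
copyBytes :: FilePath -> FilePath -> IO ()
copyBytes src dst = B.readFile src >>= B.writeFile dst
```

Note what it doesn't fix: the names passed in are still Strings, so
the filename, argv and environment cases remain.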

> ncurses wide character API is still broken. I reported bugs, the
> author acknowledged them, but hasn't fixed them. (Attributes are
> ignored on add_wch; get_wch is wrong for non-ASCII keys pressed if
> the locale is different from ISO-8859-1 and UTF-8.) It seems people
> don't use that API yet, because C traditionally uses the model of
> representing texts in byte sequences. But the narrow character API
> of ncurses is unusable with UTF-8 - this is not an implementation
> limitation but inherent limitation of the interface.

Well, to an extent it is an implementation issue. Historically, curses
never cared about encodings. A character is a byte, you draw bytes on
the screen, curses sends them directly to the terminal.

The terminal's encoding has to match that used by the program or it
displays incorrectly. The curses library neither knows nor cares about
the encoding of either the terminal or the application.

Furthermore, the curses model relies upon monospaced fonts, and falls
down once you encounter CJK text (where a "monospaced" font means one
whose glyphs are an integer multiple of the cell size, not necessarily
a single cell).

Extending something like curses to handle encoding issues is far from
trivial, which is probably why it hasn't been finished yet.

> > I.e. Char, String, string literals, and the I/O functions in Prelude,
> > IO etc should all be using bytes, with a distinct wide-character API
> > available for people who want to make the (substantial) effort
> > involved in writing (genuinely) internationalised programs.
> 
> This would cause excessive duplication of APIs. Look, Java and C#
> don't do that. Only file contents handling needs a byte API, because
> many files don't contain text.

Interfacing to (byte-oriented) OS functions also needs a byte API if
you want them to work reliably.

> This would imply isAlpha :: Char -> IO Bool.

Yes. You cannot determine whether a *byte* is alphabetical, only
characters.

That said, if you're going to have implicit String -> [Word8]
converters, there's no reason why you can't do the reverse, and have
isAlpha :: Word8 -> IO Bool. Like ctype.h, though, this will only
work for single-byte encodings.
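
A rough sketch of why the answer depends on the encoding. Instead of
consulting the locale in IO, this version takes the assumed encoding
as an explicit argument. The names Decoder, latin1, cyrillicSketch and
isAlphaByte are hypothetical, and cyrillicSketch deliberately covers
only part of ISO-8859-5.

```haskell
import Data.Char (chr, isAlpha)
import Data.Word (Word8)

-- A single-byte encoding, modelled as a total map from bytes to Chars.
type Decoder = Word8 -> Char

-- ISO-8859-1: byte n is code point n.
latin1 :: Decoder
latin1 = chr . fromIntegral

-- Partial sketch of ISO-8859-5: only the range 0xB0..0xEF (the basic
-- Cyrillic letters U+0410..U+044F) is mapped. Every other byte is
-- passed through as if it were Latin-1, which is NOT what the real
-- encoding does.
cyrillicSketch :: Decoder
cyrillicSketch b
  | b >= 0xB0 && b <= 0xEF = chr (0x0410 + fromIntegral (b - 0xB0))
  | otherwise              = latin1 b

-- Whether a byte is alphabetic depends on the encoding assumed.
isAlphaByte :: Decoder -> Word8 -> Bool
isAlphaByte decode = isAlpha . decode
```

Byte 0xD7 is the multiplication sign under latin1 but a Cyrillic
letter under cyrillicSketch, so the same byte gets different answers.
None of this extends to multi-byte encodings, where a lone byte need
not be a character at all.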

> > Right now, the attempt at providing I18N "for free", by defining Char
> > to mean Unicode, has essentially backfired, IMHO.
> 
> Because it needs to be accompanied with character recoders, both
> invoked explicitly (also lazily) and attached to file handles, and
> with a way to obtain recoders for various encodings.
> 
> Assuming that the same encoding is used everywhere and programs can
> just copy bytes without interpreting them no longer works today.

It works in plenty of situations.

And it doesn't assume that the same encoding is used everywhere; it
just assumes that the encoding is irrelevant except when there's a
reason to the contrary.

> A mail client is expected to respect the encoding set in headers.

A client typically needs to know the encoding in order to display the
text.

As a counter-example, a mail *server* can do its job without paying
any attention to the encodings used. It can also handle non-MIME email
(which doesn't specify any encoding) regardless of the encoding.

> > Oh, and because bytes are being stored in Chars, the type system won't
> > help if you neglect to decode a string, or if you decode it twice.
> 
> This is why I said "1. API for manipulating byte sequences in I/O
> (without representing them in String type)".

Yes. But that API also needs to include functions such as those in the
Directory and System modules. It isn't just about reading and writing
streams. Most of the Unix API (kernel, libc, and many standard
libraries) is byte-oriented rather than character-oriented.
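
As an illustration of what the type system could then catch, here is a
sketch in which undecoded data has its own type. Bytes, decodeLatin1
and encodeLatin1 are hypothetical names, and ISO-8859-1 merely stands
in for whichever encoding is chosen.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Undecoded data as received from the OS (hypothetical type).
newtype Bytes = Bytes [Word8]
  deriving (Eq, Show)

-- Decoding is the only way to get from Bytes to String.
decodeLatin1 :: Bytes -> String
decodeLatin1 (Bytes ws) = map (chr . fromIntegral) ws

-- Encoding can fail: not every Char fits in a single byte.
encodeLatin1 :: String -> Maybe Bytes
encodeLatin1 s
  | all ((< 256) . ord) s = Just (Bytes (map (fromIntegral . ord) s))
  | otherwise             = Nothing
```

With this split, decodeLatin1 (decodeLatin1 b) is rejected by the type
checker, and so is passing a Bytes value where a String is expected.
Byte-oriented versions of the Directory and System functions would
traffic in something like Bytes rather than String.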

> > 2. If you assume ISO-8859-1, you can always convert back to Word8 then
> > re-decode as UTF-8. If you assume UTF-8, anything which is neither
> > UTF-8 nor ASCII will fail far more severely than just getting the
> > collation order wrong.
> 
> If I know the encoding, I should set the I/O handle to that encoding
> in the first place instead of reinterpreting characters which have
> been read using the default.

And if you don't know the encoding?
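
In that case the ISO-8859-1 assumption loses nothing, because the
original bytes can be recovered from the String and a stricter decode
attempted later. A sketch, with a hand-written strict UTF-8 decoder so
that it is self-contained; decodeUtf8 and reinterpretAsUtf8 are
illustrative names, not existing library functions.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Strict UTF-8 decoder: Nothing on any malformed, truncated, overlong,
-- surrogate or out-of-range sequence.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 (b : bs)
  | b < 0x80               = (chr (fromIntegral b) :) <$> decodeUtf8 bs
  | b >= 0xC2 && b <= 0xDF = multi 1 (fromIntegral b .&. 0x1F) 0x80
  | b >= 0xE0 && b <= 0xEF = multi 2 (fromIntegral b .&. 0x0F) 0x800
  | b >= 0xF0 && b <= 0xF4 = multi 3 (fromIntegral b .&. 0x07) 0x10000
  | otherwise              = Nothing
  where
    -- n continuation bytes expected; lead holds the payload bits of the
    -- first byte; lowest is the smallest code point that legitimately
    -- needs a sequence of this length.
    multi :: Int -> Int -> Int -> Maybe String
    multi n lead lowest =
      let (cont, rest) = splitAt n bs
          isCont c = c .&. 0xC0 == 0x80
          step acc c = (acc `shiftL` 6) .|. (fromIntegral c .&. 0x3F)
          cp = foldl step lead cont
      in if length cont == n && all isCont cont
              && cp >= lowest && cp <= 0x10FFFF
              && not (cp >= 0xD800 && cp <= 0xDFFF)
           then (chr cp :) <$> decodeUtf8 rest
           else Nothing

-- Take a String that was read under the ISO-8859-1 assumption,
-- recover the original bytes, and try them as UTF-8.
reinterpretAsUtf8 :: String -> Maybe String
reinterpretAsUtf8 = decodeUtf8 . map (fromIntegral . ord)
```

The point is the asymmetry: the Latin-1 reading always succeeds and
can be undone, whereas the UTF-8 reading may return Nothing, and had
it been applied at the I/O boundary the data would already be lost.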

> > Personally, I would take the C approach: redefine Char to mean a byte
> > (i.e. CChar), treat string literals as bytes, keep the existing type
> > signatures on all of the existing Haskell98 functions, and provide a
> > completely new wide-character API for those who wish to use it.
> 
> Well, this is the paradigm which has problems in different areas.
> It will often break in UTF-8 locale, it needs isAlpha :: Char -> IO Bool,
> and it's painful to support multiple encodings.

Agreed. But writing programs which support I18N, multi-byte encodings,
wide character sets (>256 codepoints) and the like on an OS whose core
API is byte-oriented involves work.

And it can't all be hidden within a library. Some of the work falls on
the application programmers, who have to deal with determining the
correct encoding in each situation, converting between encodings,
handling encoding and decoding failures (e.g. when you encounter a
Unicode filename but the terminal only has Latin1), and so on.
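
To give an idea of the decisions involved, here is a sketch of two
policies for the Latin1 terminal case. Both function names are
hypothetical, and which policy is appropriate is exactly the kind of
thing a library cannot decide on the application's behalf.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Strict policy: report the first character that Latin-1 cannot
-- represent, and leave it to the caller to decide what to do.
encodeLatin1Strict :: String -> Either Char [Word8]
encodeLatin1Strict = mapM enc
  where
    enc c | ord c < 256 = Right (fromIntegral (ord c))
          | otherwise   = Left c

-- Lossy policy: substitute '?' (0x3F) for anything unrepresentable.
-- Acceptable for display, but the result no longer names the file.
encodeLatin1Lossy :: String -> [Word8]
encodeLatin1Lossy = map enc
  where
    enc c | ord c < 256 = fromIntegral (ord c)
          | otherwise   = 0x3F
```

The lossy form is fine for showing a name to the user and useless for
passing it back to open(), which is why the choice has to be made by
the application.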

> Char is *the* new API. What is missing is byte API in areas which work
> with arbitrary binary data (mostly file contents).

And the ability to actually use any encoding except ISO-8859-1 in any
meaningful way. I.e. encoders/decoders for other encodings, along with
the means to specify which encoding to use for functions which need to
perform encoding or decoding.

> > My main concern is that someone will get sick of waiting and make the
> > wrong "fix", i.e. keep the existing API but default to the locale's
> > encoding, so that every simple program then has to explicitly set it
> > back to ISO-8859-1 to get reasonable worst-case behaviour.
> 
> Supporting byte I/O and supporting character recoding needs to be done
> before this.

My view is that, right now, we have the worst of both worlds, and
taking a short step backwards (i.e. narrow the Char type and leave the
rest alone) is a lot simpler (and more feasible) than the long journey
towards real I18N.

More generally, this is the most intrusive example of a common problem
with too many Haskell libraries, i.e. exporting an interface which is
too high-level and glosses over too many details. But this isn't some
obscure third-party library. This is the Haskell98 standard library;
some of it's in the Prelude.

-- 
Glynn Clements <glynn.clements at virgin.net>