[Haskell-cafe] Maybe bytes *are* text (was Re: Writing binary files?)

Ben Rudiak-Gould benrg at dark.darkweb.com
Thu Sep 16 09:53:57 EDT 2004


On Thu, 16 Sep 2004, Udo Stenzel wrote:

> Having a separate byte-based API is far better.  If you don't know the
> encoding, all you have is bytes, no text.

Okay, after reading large chunks of this discussion, I'm going to rock the
boat a bit by suggesting that bytes *are* text, and *do* belong in the
Char type, and hence that the current binary file API is actually correct,
after a fashion. In fact, I think that we can resolve many of the problems
of this thread by abandoning the conceptual distinction between characters
and bytes.

Suppose I invoke

    gcc -o XXX YYY.c

where XXX and YYY are strings of Japanese characters. It has been pointed
out that if GCC treats its filename arguments as opaque byte strings to be
passed back to the appropriate file opening functions, then it will work
even if the current locale isn't Japanese. But that's only true on Posix-
like systems. On NT, filenames are made of Unicode code points, and argv
is encoded according to the current locale. If GCC uses argv, it will fail
on the example above. I've run into this problem many times on my desktop
XP box, which uses a US-English locale but contains some filenames with
Japanese characters in them.

But in any case GCC's arguments aren't really opaque: it needs to check
each argument to see if it's an option, and it needs to look at the
extensions of files like YYY.c to figure out which subprogram to invoke.
Nevertheless, the opaque-filename approach does work on Posix, because --
this is the important bit -- the characters GCC cares about (like '-',
'o', '.', and 'c') have the same representation in every encoding. In
other words, the character encoding is neither transparent nor opaque to
GCC, but sort of "band-limited": it can understand the values from 0 to
127, but the higher values are mysterious to it. They could be Latin-1
code points; they could be EUC half-characters; they could be Unicode code
points. It doesn't know, and it doesn't *need* to know. It will fail if
given an encoding which doesn't follow this rule (e.g. EBCDIC).
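
For illustration, a minimal sketch of that "band-limited" handling in
Haskell (the names Arg, isOption and endsInDotC are mine, purely for
illustration): the program only ever encodes and compares the ASCII
characters it cares about, and carries every other byte through
untouched.

    import Data.Word (Word8)
    import Data.Char (ord)
    import Data.List (isSuffixOf)

    -- An argument is a raw byte string; we never decode it.
    type Arg = [Word8]

    -- Encode the ASCII characters the program does care about.
    ascii :: Char -> Word8
    ascii = fromIntegral . ord

    -- Is this argument an option?  Only the first byte is inspected.
    isOption :: Arg -> Bool
    isOption (b:_) = b == ascii '-'
    isOption _     = False

    -- Does this filename end in ".c"?  Only ASCII bytes are compared;
    -- every other byte is opaque, whatever encoding it happens to use.
    endsInDotC :: Arg -> Bool
    endsInDotC arg = map ascii ".c" `isSuffixOf` arg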

We can make GCC (were it implemented in Haskell) work with all filenames
on both major platforms, without platform-specific code, by representing
command-line arguments and pathnames as Strings = [Char]s, where Char
ranges over the byte values 0-255 on Posix, but over UTF-16 code units on
Win32.
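
Concretely, the shared representation amounts to nothing more than the
following (a sketch; PChar and PString are names I'm making up, not an
existing library type):

    import Data.Word (Word32)

    -- One character type for both platforms: a value is a byte on
    -- Posix and a UTF-16 code unit on Win32, and nothing in the type
    -- says which.
    newtype PChar = PChr Word32
    type PString = [PChar]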

Clearly this is very fragile, but the type system provides a solution:

    newtype {- TransASCIIEncoding a => -} Char a = Chr Word32

    type String a = [Char a]

    class TransASCIIEncoding a where
      maxValueUsedByEncoding :: Word32

    instance TransASCIIEncoding Unicode where ...
    instance TransASCIIEncoding UTF16 where ...
    instance TransASCIIEncoding UTF8 where ...
    instance TransASCIIEncoding GenericByte where ...

    'x' :: Char a
    '\u1234' :: Char Unicode
    '\q789' :: Char WeirdCompilerSupportedEncoding

    instance (TransASCIIEncoding a) => Bounded (Char a) where
      minBound = Chr 0
      maxBound = Chr maxValueUsedByEncoding

    class CharTranscoding a b where
      transcode :: String a -> String b

    ord :: Char a -> Maybe Int  -- Nothing if arg isn't ASCII
    ordUnicode :: Char Unicode -> Int
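
To make the intent concrete, here is a self-contained toy rendering of
the same idea (my own simplified definitions, not the sketch above
verbatim; EChar, ordAscii, fromAscii and hasDotC are illustrative names
only): the phantom tag records the encoding, and code that sticks to
the ASCII band works at any tag.

    import Data.Word (Word32)
    import qualified Data.Char as C

    newtype EChar enc = EChr Word32 deriving (Eq, Show)
    type EString enc = [EChar enc]

    data Latin1 = Latin1     -- phantom tags for a few encodings
    data Utf16  = Utf16

    -- Nothing outside the ASCII band that all trans-ASCII encodings
    -- agree on.
    ordAscii :: EChar enc -> Maybe Int
    ordAscii (EChr w) | w < 128   = Just (fromIntegral w)
                      | otherwise = Nothing

    -- Inject an ASCII literal; anything else is an error.
    fromAscii :: Char -> EChar enc
    fromAscii c | C.ord c < 128 = EChr (fromIntegral (C.ord c))
                | otherwise     = error "fromAscii: not ASCII"

    -- Usable at any encoding tag, because it only compares ASCII
    -- values.
    hasDotC :: EString enc -> Bool
    hasDotC s = drop (length s - 2) s == map fromAscii ".c"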

Obvious problems: backward compatibility, and encodings like ISO 2022 and
Shift-JIS which break the fundamental trans-ASCII assumption. I don't
think either problem is fatal. A more flexible subtyping mechanism would
also be nice, so that (e.g.) byte-writing functions could take any Char
type with a sufficiently small maxValue.
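
One way that wished-for constraint could be spelled with an ordinary
class (a sketch; FitsInByte and toBytes are hypothetical names, not a
proposal for a concrete library):

    import Data.Word (Word8)

    -- Witnesses that every value of a character type fits in one
    -- byte; byte-oriented output functions could then accept any
    -- such type, e.g.
    --   hPutBytes :: FitsInByte c => Handle -> [c] -> IO ()
    class FitsInByte c where
      toByte :: c -> Word8

    toBytes :: FitsInByte c => [c] -> [Word8]
    toBytes = map toByte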

-- Ben


