[Haskell-cafe] Re: getting crazy with character encoding

Aaron Denney wnoise at ofb.net
Thu Sep 13 06:10:22 EDT 2007


On Thu, Sep 13, 2007 at 11:07:03AM +0200, Stephane Bortzmeyer wrote:
> On Thu, Sep 13, 2007 at 12:23:33AM +0000,
>  Aaron Denney <wnoise at ofb.net> wrote 
>  a message of 76 lines which said:
> 
> > the characters read and written should correspond to the native
> > environment notions and encodings.  These are, under Unix,
> > determined by the locale system.
> 
> Locales, while fine for things like the language of the error messages
> or the format to use to display the time, are *not* a good solution
> for things like file names and file contents.

I never claimed it was a good system, merely that it was the system.
Yes, serious applications should use byte-oriented I/O and explicitly
manage character sets when necessary.  STDIO in general and terminal
interaction in particular should use the locale selected by the user.
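
A minimal sketch of that split, assuming a GHC recent enough to export
`hSetEncoding` and `localeEncoding` from System.IO, plus the bytestring
package.  The names `greetUser` and `copyBytes` are made up for
illustration: terminal output follows the user's locale, while file data
is moved as raw bytes with no decoding step.

```haskell
import qualified Data.ByteString as B
import System.IO (hSetEncoding, localeEncoding, stdout)

-- Terminal interaction: encode characters the way the user's locale says.
-- (Hypothetical helper; localeEncoding is whatever the environment selects.)
greetUser :: String -> IO ()
greetUser name = do
  hSetEncoding stdout localeEncoding
  putStrLn ("Hello, " ++ name)

-- File data: treat the contents as bytes and never guess a character set.
-- (Hypothetical helper; reads the whole file strictly, then writes it out.)
copyBytes :: FilePath -> FilePath -> IO ()
copyBytes src dst = B.readFile src >>= B.writeFile dst
```

Because `copyBytes` never decodes, it gives the same result whatever
locale the calling user happens to have set.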

> Even on a single Unix machine (without networking), there are
> *several* users. Using the locale to find out the charset used for a
> file name won't work if these users use different locales.
> 
> Same thing for file contents. The charset used must be marked in the
> file (XML...) or in the metadata, somehow.

For file system and network access, the justification is a bit more
clouded, but the interfaces there _should not_ be character interfaces.
Character interfaces are _lies_; Word8s are what actually get passed,
and trying to treat them as Unicode characters with any fixed mapping
breaks.  At best we get an extremely leaky abstraction.
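
To illustrate why a fixed mapping breaks, here is a small sketch,
assuming the bytestring and text packages are available.  The helper
names `asLatin1` and `asUtf8` are hypothetical.  The same bytes decode to
different strings under two mappings, and some byte sequences have no
UTF-8 reading at all.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- The two bytes 0xC3 0xA9: one character (U+00E9) when read as UTF-8,
-- two characters (U+00C3, U+00A9) when read as Latin-1.
sample :: B.ByteString
sample = B.pack [0xC3, 0xA9]

-- Latin-1 maps every byte to the code point with the same number,
-- so this decoding can never fail.
asLatin1 :: B.ByteString -> String
asLatin1 = T.unpack . TE.decodeLatin1

-- UTF-8 decoding can fail; Nothing means the bytes are not valid UTF-8.
asUtf8 :: B.ByteString -> Maybe String
asUtf8 bs = case TE.decodeUtf8' bs of
  Left _  -> Nothing
  Right t -> Just (T.unpack t)
```

Here `asLatin1 sample` has length 2 and `asUtf8 sample` is a
one-character string, while `asUtf8 (B.pack [0xE9])` is `Nothing`
although the same byte is a perfectly good Latin-1 character.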

Filesystems are not uniform across systems, yet Haskell tries to present
a uniform view that manages to capture exactly no existing system.

File contents (almost) everywhere are streams of bytes (ignoring, say,
old record-based OSes, Palm databases, and Mac resource forks).
Almost all file systems use a hierarchical directory system, but with
significant differences.  Under Unixes the names are NUL-terminated
byte strings that can't contain slashes.  New Macs and Windows have
specific character encodings (UTF-8 and UTF-16, respectively).  DOS,
old Macs, and Windows have multiple roots and various directory
separators and forbidden characters.
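
As a rough sketch of the Unix rule only, assuming the bytestring
package: a single path component is any non-empty run of bytes that
contains neither NUL nor a slash.  The function name
`validUnixComponent` is hypothetical, and the check ignores length
limits and the special entries "." and "..".

```haskell
import qualified Data.ByteString as B

-- Hypothetical check for one Unix path component (not a whole path):
-- non-empty, no NUL byte (0x00), no slash (0x2F).  No character
-- encoding is consulted at any point.
validUnixComponent :: B.ByteString -> Bool
validUnixComponent name =
  not (B.null name) && B.notElem 0x00 name && B.notElem 0x2F name
```

Note that `validUnixComponent (B.pack [0xE9, 0x74, 0xE9])` is `True` even
though those bytes are not valid UTF-8, which is exactly the case a
character-based file name API cannot represent faithfully.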

Trying to specify some API that is usable for robust programs that work
on any of these is hard.  I'd actually have preferred that the standard
didn't even try, and instead provided system-specific annexes.
Then an external library that was freer to evolve could try to solve
the problem of providing a uniform interface that would not defy
platform expectations.

-- 
Aaron Denney
-><-

