[Haskell-cafe] Re: Writing binary files?

Wed Sep 15 17:23:34 EDT 2004

Glynn Clements <glynn.clements at virgin.net> writes:

> The RTS doesn't know the encoding. Assuming that the data will use the
> locale's encoding will be wrong too often.

If the program wants to get bytes, it should get bytes explicitly, not
some sort of pseudo-Unicode String.
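
As a rough illustration of "getting bytes explicitly", here is a minimal sketch. It assumes the bytestring package that ships with later GHC releases (it is not part of Haskell98), and the names readBytes and writeBytes are made up for this example.

```haskell
import qualified Data.ByteString as BS
import Data.Word (Word8)

-- Read a file as raw bytes; no decoding is attempted.
-- (readBytes is a hypothetical helper, not a library function.)
readBytes :: FilePath -> IO [Word8]
readBytes path = BS.unpack <$> BS.readFile path

-- Write raw bytes back out unchanged.
-- (writeBytes is likewise hypothetical.)
writeBytes :: FilePath -> [Word8] -> IO ()
writeBytes path = BS.writeFile path . BS.pack
```

The point is only that the type says [Word8], so nobody can mistake the result for decoded text.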

> Like so many other people, you're making an argument based upon
> fiction (specifically, that you have a closed world where everything
> always uses the same encoding) then deeming anyone who is unable to
> maintain the fiction to be "wrong".

Everything's fine here with LANG=de_AT.utf8.  And I can't recall
having any problems with it.  But well, YMMV.

> No. If a program just passes bytes around, everything will work so
> long as the inputs use the encoding which the outputs are assumed to
> use. And if the inputs aren't in the "correct" encoding, then you have
> to deal with encodings manually regardless of the default behaviour.

The only programs that just pass bytes around that come to mind are
basic Unix utilities.  Basically everything else will somehow process
the data.

> Sorting according to codepoints inevitably involves decoding. However,
> getting the order wrong is usually considered less problematic than
> failing outright.

But more difficult to debug.

> Tough. You already have it, and will do for the foreseeable future. 
> Many existing APIs (including the core Unix API), protocols and file
> formats are defined in terms of byte strings with no encoding
> specified or implied.

Guess why I like Haskell (the language; the implementations are not up
to that ideal yet).

> I'd like to. But many of the functions which provide or accept binary
> data (e.g. FilePath) insist on representing it using Strings.

Good point.  Adding functions that accept bytes instead of strings
would be a major undertaking.

> I18N is inherently difficult. Lots of textual data exists in lots of
> different encodings, and the encoding is frequently unspecified.

That's the problem with the current API.  You can easily read/write
neither bytes nor strings in a specified encoding.

> The problem is that we also need to fix them to handle *no encoding*.

That's binary data (assuming you didn't mean 'unknown').

> Also, binary data and text aren't disjoint. Everything is binary; some
> of it is *also* text.

Simon's new-io proposal does this very nicely.  Stdin is by default a
binary stream and you can obtain a TextInputStream for it using either
the locale's encoding or a specified encoding.  That's the way I'd
like it to be.
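
To make the "binary first, text on request" idea concrete, here is a sketch. It is not the new-io proposal's API; it assumes the encoding functions that later GHC versions export from System.IO (openBinaryFile, hSetEncoding, mkTextEncoding, localeEncoding), and openTextWith is a hypothetical name.

```haskell
import System.IO

-- Open a file as a binary stream first, then layer a text decoding on
-- top. Nothing means "fall back to the locale's encoding"; Just name
-- means "the caller knows the encoding and says so".
-- (openTextWith is a hypothetical helper, not a library function.)
openTextWith :: Maybe String -> FilePath -> IO Handle
openTextWith mEncName path = do
  h <- openBinaryFile path ReadMode   -- raw bytes, no translation
  enc <- maybe (return localeEncoding) mkTextEncoding mEncName
  hSetEncoding h enc                  -- reads now yield decoded Chars
  return h
```

For example, openTextWith (Just "UTF-8") "input.txt" would decode as UTF-8 regardless of the locale, while openTextWith Nothing "input.txt" would use whatever the user configured.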

> Or out of getDirectoryContents, getArgs, getEnv etc. Or to pass a list
> of Word8s to a handle, or to openFile, getEnv etc.

That's a real issue.  Adding new functions with a bin- prefix is the
only solution that comes to my mind.

> The point is that, currently, you can't. Nothing in the core Haskell98
> API actually uses Word8, it all uses Char/String.

That's the intent of this thread. :-)

> Because we don't have an "oracle" which will magically determine the
> encoding for you.

That "oracle" is called the locale setting.  If I want to read text and
can't determine the encoding some other way (protocol spec, ...), then
the encoding is whatever the user set the locale to.

> If you want text, well, tough; what comes out of most system calls and
> core library functions (not just read()) is bytes.

Which need to be interpreted by the program depending on where these
bytes come from.

> There isn't any magic wand which will turn them into characters
> without knowing the encoding.

If I know the encoding, I should be able to set it.  If I don't, it's
the locale setting.

>> > completely new wide-character API for those who wish to use it.
>> 
>> Which would make it horrendously difficult to do even basic I18N.
>
> Why?

I assume that by a new wide-character API you mean different types for
single-byte and multi-byte strings, with separate functions to handle
each.  If single-byte strings remain the preferred type (which is why
the other API is separate), then sorting, upper/lower case testing
etc. will not exactly get easier.
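
A small sketch of why case handling wants decoded characters rather than bytes. It uses only Data.Char and Data.Word; the names upperBytes, upperText, eAcuteBytes and eAcuteText are invented for this example, and the byte-level function is deliberately ASCII-only because that is all it can safely do without knowing the encoding.

```haskell
import Data.Char (toUpper)
import Data.Word (Word8)

-- Upper-case raw bytes, touching only ASCII 'a'..'z'.
upperBytes :: [Word8] -> [Word8]
upperBytes = map up
  where
    up b | b >= 0x61 && b <= 0x7A = b - 0x20
         | otherwise              = b

-- Upper-case decoded text, one Unicode character at a time.
upperText :: String -> String
upperText = map toUpper

-- U+00E9 (e with acute accent) as UTF-8 bytes and as a decoded String.
eAcuteBytes :: [Word8]
eAcuteBytes = [0xC3, 0xA9]

eAcuteText :: String
eAcuteText = "\233"
```

Here upperBytes leaves eAcuteBytes unchanged, while upperText maps eAcuteText to U+00C9; any program that has to do this for both string types needs both code paths.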

> I know. That's what I'm saying. The problem is that the broken "code"
> is the Haskell98 API.

No, it's not broken.  It just has some missing features (i.e. I/O /
env functions accepting bytes instead of strings).

>>  Strings are a list of Unicode characters, [Word8] is a
>> list of bytes.
>
> And what comes out of (and goes into) most core library functions is
> the latter.

Strictly speaking, the former comes out with the semantics of the latter. :-)
Maybe bugs should be filed?
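
To spell out what "the former with the semantics of the latter" means in practice, here is a sketch of converting between a String whose Chars are really bytes and an honest [Word8]. It assumes the I/O functions hand back one Char per byte, as discussed above; stringToBytes and bytesToString are hypothetical names.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Recover the bytes from a String whose Chars are really bytes.
-- Returns Nothing if any Char lies outside the range 0..255.
stringToBytes :: String -> Maybe [Word8]
stringToBytes = mapM toByte
  where
    toByte c | ord c < 256 = Just (fromIntegral (ord c))
             | otherwise   = Nothing

-- The reverse direction: present each byte as a Char in 0..255.
bytesToString :: [Word8] -> String
bytesToString = map (chr . fromIntegral)
```

The Maybe in stringToBytes marks exactly the spot where the String type promises more than such an API delivers.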

      Gabriel.