I Hate IO

Simon Marlow simonmar@microsoft.com
Thu, 9 Aug 2001 11:01:07 +0100

> 1. A file is not a stream. It really isn't anything like a stream. Sure,
> you can _make_ a stream based on a file but that's a different thing. A
> file is a list (ignoring for the moment meta-information), accessible at
> any point. By contrast, streams access either incoming or outgoing
> entities, optionally with "end of stream" support. For incoming, one may
> 'skip' but not 'seek'. For outgoing, one may send a series of predefined
> 'zero' values. Call that "seek" if you want, I don't.

I couldn't agree more.  I came across exactly these issues while
rewriting GHC's I/O library recently, so now GHC's Handle type
internally has two constructors:

  - FileHandle (a handle to a file, seekable, with a single file
    pointer and a single buffer that contains either pending read
    or write data but not both).

  - DuplexHandle (a read/write stream, not seekable, with two
    completely independent buffers, and the two ends can be closed
    independently).

Strangely, a FileHandle is also used for a uni-directional stream,
because it only needs a single buffer.
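
As a rough illustration of the two-constructor shape described above, here
is a minimal sketch. It is not GHC's actual definition: the field names,
the Buffer type and the helper functions are all hypothetical, and the
buffers are modelled as plain lists of bytes.

```haskell
import Data.Word (Word8)

-- Hypothetical model, not GHC's real Handle type.

-- Whether a buffer currently holds pending read data or pending write data.
data BufferState = ReadBuffer | WriteBuffer
  deriving (Eq, Show)

data Buffer = Buffer
  { bufState :: BufferState
  , bufData  :: [Word8]
  } deriving (Eq, Show)

data Handle
  = FileHandle
      { filePos :: Integer  -- the single file pointer
      , fileBuf :: Buffer   -- one buffer: pending reads or writes, never both
      }
  | DuplexHandle
      { readBuf   :: Buffer -- buffer for the incoming direction
      , writeBuf  :: Buffer -- buffer for the outgoing direction
      , readOpen  :: Bool   -- each end can be closed on its own
      , writeOpen :: Bool
      }
  deriving (Eq, Show)

-- Number of buffers a handle carries.
bufferCount :: Handle -> Int
bufferCount FileHandle{}   = 1
bufferCount DuplexHandle{} = 2

-- Close only the outgoing end of a duplex handle; other handles are unchanged.
closeWriteEnd :: Handle -> Handle
closeWriteEnd h@DuplexHandle{} = h { writeOpen = False }
closeWriteEnd h                = h
```

In this model a uni-directional stream fits the FileHandle constructor
simply because bufferCount is 1, which matches the observation above.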

For Concurrent Haskell there's a locking issue here: you can't expect
two threads to read and write simultaneously to the same file, but it is
entirely reasonable for two threads to be simultaneously reading and
writing on the same socket.  Hence a FileHandle has a single lock, and a
DuplexHandle has one lock for each channel.  In effect there's one lock
per buffer.
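
The "one lock per buffer" idea can be sketched with MVars, where holding
the MVar's contents plays the role of holding the lock. This is only a
sketch under assumptions: the constructor layout, the list-of-bytes
Buffer, and the helper names (withReadBuffer, withWriteBuffer,
peekBuffers) are hypothetical and are not taken from GHC's library.

```haskell
import Control.Concurrent.MVar (MVar, newMVar, modifyMVar_, readMVar)
import Data.Word (Word8)

-- Hypothetical stand-in for a real I/O buffer.
type Buffer = [Word8]

data Handle
  = FileHandle (MVar Buffer)                  -- one buffer, one lock
  | DuplexHandle (MVar Buffer) (MVar Buffer)  -- read side, write side

newFileHandle :: IO Handle
newFileHandle = FileHandle <$> newMVar []

newDuplexHandle :: IO Handle
newDuplexHandle = DuplexHandle <$> newMVar [] <*> newMVar []

-- Update the buffer used for reading while holding its lock.
withReadBuffer :: Handle -> (Buffer -> IO Buffer) -> IO ()
withReadBuffer (FileHandle m)     f = modifyMVar_ m f
withReadBuffer (DuplexHandle r _) f = modifyMVar_ r f

-- Update the buffer used for writing while holding its lock.
-- For a FileHandle this is the same lock as the read side, so readers
-- and writers exclude each other; for a DuplexHandle it is a separate one.
withWriteBuffer :: Handle -> (Buffer -> IO Buffer) -> IO ()
withWriteBuffer (FileHandle m)     f = modifyMVar_ m f
withWriteBuffer (DuplexHandle _ w) f = modifyMVar_ w f

-- Snapshot the buffers (for inspection only).
peekBuffers :: Handle -> IO [Buffer]
peekBuffers (FileHandle m)     = fmap (: []) (readMVar m)
peekBuffers (DuplexHandle r w) = (\a b -> [a, b]) <$> readMVar r <*> readMVar w
```

With this layout, a thread inside withReadBuffer on a DuplexHandle never
blocks a thread inside withWriteBuffer on the same handle, whereas on a
FileHandle the two calls serialise on the single MVar.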

> 2. A file is not made of "Char"s. A file is made of octets ("bytes"),
> i.e. Word8s. What is a "Char" anyway? Sometimes it's a seven- or
> eight-bit quantity with a _vague_ implication of interpretation as
> textual character; sometimes it's a 16-, 20.087- or 31-bit quantity with
> a much stronger implication of interpretation as textual character
> (strictly, Unicode "codepoint"). Is an ASCII 'r' the same as an EBCDIC
> 'r'? Or is an ASCII code 57 the same as an EBCDIC code 57?
> As for streams, mostly they are streams of octets. But of course streams
> of anything might be useful.

There's an implicit conversion step between whatever the on-disk
encoding of character streams is and Unicode.  GHC currently only supports
a straightforward ISO 8859-1 encoding.

I agree there ought to be a way to get at the raw bytes too.
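
To make the conversion step concrete: in ISO 8859-1 (Latin-1) each byte
value is the Unicode code point of the same number, so decoding is a
direct mapping and encoding fails only for characters above U+00FF. The
following is a minimal sketch over lists of Word8; the function names
decodeLatin1 and encodeLatin1 are hypothetical and this is not how GHC's
I/O library is implemented.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Raw octets to Unicode characters: every byte maps to the code point
-- with the same numeric value.
decodeLatin1 :: [Word8] -> String
decodeLatin1 = map (chr . fromIntegral)

-- Unicode characters to raw octets: Nothing if any character does not
-- fit in a single Latin-1 byte.
encodeLatin1 :: String -> Maybe [Word8]
encodeLatin1 = mapM toByte
  where
    toByte c
      | ord c <= 0xFF = Just (fromIntegral (ord c))
      | otherwise     = Nothing
```

A raw-byte interface would simply expose the [Word8] side of this pair
and leave the choice of decoder to the caller.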

> 3. Output streams ("sinks") are different from input streams ("sources").
> That POSIX entity known as "standard output" is a sink of octets. A
> "bi-directional" stream such as a TCP connection is nothing but a source
> and a sink considered together. Indeed, for TCP it's possible to send
> "end of [outgoing] stream" without affecting the incoming stream. This is
> rather different from a contrived "file-access"-type stream, where
> reading and writing are operations that affect each other.

Yup, see above.

> There is no such thing as "I/O" unless, as in Haskell, one means _all_
> imperative action. There are various entities out in the world accessible
> in a variety of different ways. Sources, sinks, lists, etc. are but some
> of the models useful for accessing them.

Ok, I agree so far.  Are you suggesting the IO library should be
changed?  How?

I considered providing a different API for bidirectional streams, or
perhaps requiring that bidirectional streams use separate Handles for
read and write, but came to the conclusion that the user really doesn't
care whether under the hood a single Handle is using separate buffers
for read and write or just a single buffer, how much locking is going on
or whatever.  The fact that these things are awkward to implement
shouldn't show through in the library interface.

It's definitely more convenient from the programmer's point of view to
be able to use the *same* handle object for both read and write,
otherwise you have to explain to people why they can have a read/write
file handle but not a read/write handle for a TCP socket.