Proposal for a new I/O library design
Sun, 27 Jul 2003 21:35:41 -0700 (PDT)
The other day I was reading the Haskell i18n debate in the list archives,
and started thinking about possible replacements for the existing Haskell
file I/O model.
It occurred to me that the Haskell community has really dropped the ball
on this one. Haskell's design has always emphasized doing the right thing,
not merely doing the thing that everyone else happens to be doing. It's
that philosophy that led to the invention of the monadic I/O model, among
other things. And yet, what do we choose for our I/O primitives? The same
old crocks that everyone else was using. We open and close files (whatever
that's supposed to mean); we expose file handles to the user; we even
maintain a current position in the file, which is an unnecessary global
state variable if I've ever seen one.
The proposal below is the result of a few hours spent thinking about how
the file system would be accessed if it were actually implemented in
Haskell, instead of behind a weird C API. I'm very interested in hearing
comments and criticism. In particular, I want to know if there's enough
interest in this model that I should actually try to implement it.
The most important idea in this design as far as i18n is concerned is the
separation of random-access files from input and output streams. Most of
the ugliness of the usual file I/O interface comes from conflating these
three concepts, which are almost totally unrelated. In particular, there's
no need in this model to worry about the meaning of reading or seeking in
a "text file." Text encoding and decoding apply to streams, not files. To
read text from a file you layer an input stream on it, apply a text parser
to that, and read characters. If you need to "seek" to a new location, you
create a new stream which starts at that location in the underlying file.
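As a toy illustration of the "text parser layered on an octet stream" idea, here is a self-contained decoder for the ASCII subset. (The helper name is hypothetical, and a real parser in this model would consume an InputStream rather than a list; this just shows that decoding is a layer above the octets, not a property of the file.)

```haskell
import Data.Word (Word8)
import Data.Char (chr)

type Octet = Word8

-- Decode a list of octets as ASCII -- a stand-in for the
-- stream-based text parsers this proposal envisions.
asciiToString :: [Octet] -> String
asciiToString = map (chr . fromIntegral)

main :: IO ()
main = putStrLn (asciiToString [72, 105])  -- prints "Hi"
```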
> module System.ProposedNewIOModel (...) where
I assume that all I/O occurs in terms of octets. I think that this holds
true of every platform on which Haskell is implemented or is likely to be
implemented.
> type Octet = Word8
File offsets are 64 bits on all platforms. This model never uses negative
offsets, so there's no need for a signed type. (But perhaps it would be
better to use one anyway?) BlockLength should be something appropriate to
the architecture's address space.
> type FilePos = Word64
> type BlockLength = Int
A value of type File represents a file, which is essentially a resizable
strict array of octets. Two values of type File compare equal if they are
the same file -- that is, if they have the same contents and changes to
one also appear in the other.
("File" is a bad name for this. For one thing, NTFS and HFS can associate
more than one chunk of data with each directory entry, and "file" usually
refers to all the chunks together. "Fork" would be more accurate, but it
doesn't sound much like what it's supposed to represent.)
(Should files be buffered at the application level? I'm not convinced it's
necessary. Streams will be buffered, of course.)
> data File -- abstract
A value of type InputStream or OutputStream represents an input or output
stream: that is, an octet source or sink. Two InputStreams or
OutputStreams compare equal iff reading/writing one also reads/writes the
other.
(Should I call these "ports" instead of "streams"? How about "OctetSource"
and "OctetSink"?)
> data InputStream -- abstract
> data OutputStream -- abstract
Fundamental operations on files. XXX represents some sort of memory buffer
(an IOUArray or a Ptr?). All reads and writes supply absolute offsets.
(Note that there's no such thing as reading fewer bytes than requested --
the read either succeeds or it throws an exception. This may be
problematic because the file size could decrease between a call to
fGetSize and a later call to fRead. One solution is to make fRead treat
bytes beyond EOF as zeroes, which is consistent with the usual behavior
for sparse files. fWrite should probably expand the file automatically if
asked to write beyond its bounds, for similar reasons.)
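To make the calling convention concrete, here is a sketch of reading a whole file into memory. It assumes XXX is instantiated as Ptr Octet (one of the two candidates mentioned above), and readWholeFile is a hypothetical helper, not part of the proposal:

```haskell
import Foreign.Ptr (Ptr)
import Foreign.Marshal.Array (mallocArray)

-- Sketch only: assumes XXX = Ptr Octet and that the file fits in
-- the address space (the fromIntegral narrows FilePos to BlockLength).
readWholeFile :: File -> IO (Ptr Octet, BlockLength)
readWholeFile f = do
  size <- fGetSize f
  let len = fromIntegral size :: BlockLength
  buf <- mallocArray len
  fRead f 0 len buf          -- absolute offset 0; reads exactly len octets
  return (buf, len)
```

Note that because fRead takes an absolute offset, there is no hidden file-position state to get out of sync here.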
(A serious unaddressed problem here is locking. There should be a way to
e.g. prevent someone else from writing to a file while I'm appending to
it, to avoid race conditions. But for whatever reason, the usual OS
interface only allows this kind of lock to be acquired when opening the
file -- and I can't generally reopen the file because its name might have
changed. Meanwhile, another kind of lock, the kind I don't want, can be
acquired and released on an already-opened file. I don't know why it's
done this way, or how to work around it.)
(And what about asynchronous I/O?)
> fGetSize :: File -> IO FilePos
> fSetSize :: File -> FilePos -> IO ()
> fRead :: File -> FilePos -> BlockLength -> XXX -> IO ()
> fWrite :: File -> FilePos -> BlockLength -> XXX -> IO ()
> fCheckRead :: File -> FilePos -> BlockLength -> IO Bool
> fCheckWrite :: File -> FilePos -> BlockLength -> IO Bool
Fundamental operations on streams. "Maybe Octet" is supposed to represent
"Octet or EOS," though I'm not sure this is enough for proper EOS
handling.
isPeek might be useful for text parsers. isUnGet is another possibility --
it's more versatile but harder to implement efficiently. Or this may be a
moot point because text parsers will probably want to use isGetBlock
anyway for efficiency.
> isGet :: InputStream -> IO (Maybe Octet)
> isPeek :: InputStream -> IO (Maybe Octet)
> isGetBlock :: InputStream -> BlockLength -> XXX -> IO BlockLength
> -- efficiency hack
> osPut :: OutputStream -> Octet -> IO ()
> osPuts :: OutputStream -> [Octet] -> IO ()
> osPutBlock :: OutputStream -> BlockLength -> XXX -> IO ()
> osFlush :: OutputStream -> IO ()
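Given only these primitives, an octet-by-octet copy between streams would look like the following (a hypothetical helper; a serious implementation would use isGetBlock and osPutBlock instead):

```haskell
-- Sketch against the proposed API: copy until end-of-stream.
copyStream :: InputStream -> OutputStream -> IO ()
copyStream is os = do
  mo <- isGet is
  case mo of
    Nothing -> osFlush os                     -- EOS: flush buffered output
    Just o  -> osPut os o >> copyStream is os
```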
Standard streams. (How to deal with line buffering?)
> stdin :: InputStream
> stdout, stderr :: OutputStream
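For instance, the traditional greeting would be written directly in octets, since stdout is an octet sink and text encoding is a separate layer:

```haskell
-- "hello\n" in ASCII; a text-encoding layer would normally do this.
hello :: IO ()
hello = osPuts stdout [104, 101, 108, 108, 111, 10]
```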
Streams can be layered on top of files. Each stream keeps track of its own
independent position within the file. The effects of overlapped stream
reading and writing on the same part of a file are unspecified, since
streams will be buffered in practice.
> fileToInputStreamFrom :: File -> FilePos -> (IO?) InputStream
> fileToOutputStreamFrom :: File -> FilePos -> (IO?) OutputStream
> fileToInputStream f = fileToInputStreamFrom f 0
> fileToOutputStreamAppend f =
> fGetSize f >>= fileToOutputStreamFrom f
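So the moral equivalent of hSeek is just discarding one stream and layering a fresh one. A sketch (assuming the "(IO?)" above resolves to IO):

```haskell
-- Read the octet at offset 1024 of f, the way one would "seek" here:
-- build a new stream at the desired position and read from it.
readAt1024 :: File -> IO (Maybe Octet)
readAt1024 f = do
  is <- fileToInputStreamFrom f 1024
  isGet is
```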
InputStreams can be read as lazy lists. You're not allowed to use the
InputStream again after you call this.
> isGetContents :: InputStream -> IO [Octet]
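This mirrors hGetContents. For example, a lazy octet count might look like this (hypothetical helper; remember the stream is consumed afterwards):

```haskell
countOctets :: InputStream -> IO Int
countOctets is = do
  octets <- isGetContents is   -- lazy; the stream must not be reused
  return (length octets)       -- forcing the length drains the stream
```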
A value of type Directory represents a directory, which is essentially a
mutable associative array from names (Unicode strings) to
File/Directory/Stream values plus some metadata.
This is supposed to represent a directory *independently of its name*, so
it's robust against changes at higher levels of the filesystem hierarchy.
I don't know if this can be implemented on Posix or Win32. (The old Mac OS
and the NT API do support it.)
> data Directory -- abstract
Operations on directories. I suggest dGetContentsWithMetadata rather than
dLookupMetadata because using the latter to scan a directory on a FAT32
filesystem takes O(n^2) time.
> dGetContents :: Directory -> IO [String]
> dGetContentsWithMetadata :: Directory -> IO [(String,???)]
> dLookup :: Directory -> String -> IO (Either File Directory)
> -- Also should handle named pipes here
> dCreateFile :: Directory -> String -> (metadata?) -> IO File
> -- etc...
Pathnames don't fit well with this model, but they're not going to go
away, so I provide these functions. If the pathname argument names a
directory, that directory and Nothing are returned. Otherwise, a directory
and string are returned such that a subsequent call to dLookup or
dCreateFile with these values will refer to the same thing as the original
pathname. lookupPathname is dLookupPathname from the current directory.
> dLookupPathname :: Directory -> String -> IO (Directory, Maybe String)
> lookupPathname :: String -> IO (Directory, Maybe String)
> getCurrentDirectory :: IO Directory -- no corresponding set!
Convenient shortcuts for common cases.
> lookupFileByPathname :: String -> IO File
> lookupInputStreamByPathname :: String -> IO InputStream
> -- at least as likely to succeed as lookupFileByPathname
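For what it's worth, lookupFileByPathname could plausibly be built from the primitives above; a sketch (the error handling is, of course, debatable):

```haskell
-- Hypothetical definition in terms of lookupPathname and dLookup.
lookupFileByPathname :: String -> IO File
lookupFileByPathname path = do
  (dir, mname) <- lookupPathname path
  case mname of
    Nothing   -> ioError (userError (path ++ ": is a directory"))
    Just name -> do
      entry <- dLookup dir name
      case entry of
        Left f  -> return f
        Right _ -> ioError (userError (path ++ ": is a directory"))
```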