Proposal for a new I/O library design

Tim Sweeney tim@epicgames.com
Mon, 28 Jul 2003 12:56:04 -0500


Ben,

I live in a different universe, but over here I prefer to represent files
purely as memory-mapped objects.  In this view, there is no difference
between a read-only file and an immutable array of bytes (a byte being a
natural number between 0 and 255).  A read-write file is then equivalent to a
mutable array (or a reference to a mutable array on a heap) of the same.
Treating these all as heap references tends to be cleaner, because you can
compare the references for equality, which is significant even for read-only
files, because two files which contain the same exact data are not
necessarily the same file, whereas opening the same file in two different
places should result in equal references.

This approach greatly simplifies lots of things now that all modern
operating systems can perform file-mapping efficiently with the virtual
memory subsystem paging pieces in and out as necessary.  It gets rid of
pieces of information which are redundant from the low-level file system
point of view (the file handle itself, the current file pointer, etc).
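
A minimal sketch of this view, assuming an in-memory stand-in rather than a
real OS file mapping.  MappedFile, newMappedFile, readByte, writeByte and
identityDemo are hypothetical names, not part of any existing library.  The
point it shows is that equality is identity of the reference, not equality of
the contents.

```haskell
import Data.IORef (IORef, newIORef, readIORef, modifyIORef')
import Data.Word (Word8)

-- Hypothetical model: a read-write file is a heap reference to its bytes.
-- A real implementation would wrap an OS file mapping instead of a list.
newtype MappedFile = MappedFile (IORef [Word8])

-- Equality is identity of the reference, not equality of contents.
instance Eq MappedFile where
  MappedFile a == MappedFile b = a == b

-- Create a fresh "file" holding the given bytes.
newMappedFile :: [Word8] -> IO MappedFile
newMappedFile bytes = MappedFile <$> newIORef bytes

-- Read the byte at an index, if the index is within the file.
readByte :: MappedFile -> Int -> IO (Maybe Word8)
readByte (MappedFile ref) i = do
  bytes <- readIORef ref
  pure $ case drop i bytes of
    (b : _) | i >= 0 -> Just b
    _                -> Nothing

-- Overwrite the byte at an index; an out-of-range index leaves the file as is.
writeByte :: MappedFile -> Int -> Word8 -> IO ()
writeByte (MappedFile ref) i b = modifyIORef' ref update
  where
    update bytes
      | i < 0 || i >= length bytes = bytes
      | otherwise = take i bytes ++ [b] ++ drop (i + 1) bytes

-- Two files with identical contents are still different files.
identityDemo :: IO (Bool, Bool)
identityDemo = do
  f <- newMappedFile [1, 2, 3]
  g <- newMappedFile [1, 2, 3]
  pure (f == f, f == g)
```

There is no handle and no current position anywhere in this interface; every
access names the index it wants.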

The typical C/Unix approach is to deal with network (TCP or UDP) connections
as streams, too.  Obviously, memory-mapped files aren't a good way of
exposing them -- doing so would require buffering all past data, as well as
blocking when waiting on yet-unreceived data (when you really want to be
able to query whether there is incoming data available).  Instead, I prefer
to conceptualize network connections as a socket / packet-based interface,
with functions to open/close sockets, send a complete packet (being an array
of bytes) to a socket, receive a packet from a socket, and query packet
availability.  With this approach, there is no redundancy or missing
information; everything that is observable from the protocol point of view
is an observable in the language interface, and nothing more.
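
The shape of such an interface can be sketched as follows, assuming an
in-memory loopback queue in place of a real network.  Socket, openSocket,
closeSocket, sendPacket, receivePacket and packetAvailable are hypothetical
names chosen for illustration, not an existing API.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef, modifyIORef')
import Data.Word (Word8)

-- A packet is a complete array of bytes (modelled here as a list).
type Packet = [Word8]

-- Hypothetical loopback socket: a queue of packets not yet received.
newtype Socket = Socket (IORef [Packet])

openSocket :: IO Socket
openSocket = Socket <$> newIORef []

-- Closing discards anything still queued.
closeSocket :: Socket -> IO ()
closeSocket (Socket q) = writeIORef q []

-- Send one complete packet; packet boundaries are preserved.
sendPacket :: Socket -> Packet -> IO ()
sendPacket (Socket q) p = modifyIORef' q (++ [p])

-- Query availability without blocking.
packetAvailable :: Socket -> IO Bool
packetAvailable (Socket q) = not . null <$> readIORef q

-- Receive the oldest queued packet, or Nothing if none has arrived.
receivePacket :: Socket -> IO (Maybe Packet)
receivePacket (Socket q) = do
  pending <- readIORef q
  case pending of
    []       -> pure Nothing
    (p : ps) -> writeIORef q ps >> pure (Just p)
```

Nothing is buffered beyond the packets themselves, and receivePacket never has
to block, because availability is an explicit query.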

In this manner, it's possible to get rid of all remnants of Unix-like
streams from a language's IO interface.

-Tim

----- Original Message -----
From: "Ben Rudiak-Gould" <benrg@dark.darkweb.com>
To: <haskell@haskell.org>
Sent: Sunday, July 27, 2003 11:35 PM
Subject: Proposal for a new I/O library design


> The other day I was reading the Haskell i18n debate in the list archives,
> and started thinking about possible replacements for the existing Haskell
> file I/O model.
>
> It occurred to me that the Haskell community has really dropped the ball
> on this one. Haskell's design has always emphasized doing the right thing,
> not merely doing the thing that everyone else happens to be doing. It's
> that philosophy that led to the invention of the monadic I/O model, among
> other things. And yet, what do we choose for our I/O primitives? The same
> old crocks that everyone else was using. We open and close files (whatever
> that's supposed to mean); we expose file handles to the user; we even
> maintain a current position in the file, which is an unnecessary global
> state variable if I've ever seen one.
>
> The proposal below is the result of a few hours spent thinking about how
> the file system would be accessed if it were actually implemented in
> Haskell, instead of behind a weird C API. I'm very interested in hearing
> comments and criticism. In particular, I want to know if there's enough
> interest in this model that I should actually try to implement it.
>
> The most important idea in this design as far as i18n is concerned is the
> separation of random-access files from input and output streams. Most of
> the ugliness of the usual file I/O interface comes from conflating these
> three concepts, which are almost totally unrelated. In particular, there's
> no need in this model to worry about the meaning of reading or seeking in
> a "text file." Text encoding and decoding apply to streams, not files. To
> read text from a file you layer an input stream on it, apply a text parser
> to that, and read characters. If you need to "seek" to a new location, you
> create a new stream which starts at that location in the underlying file.
>
> > module System.ProposedNewIOModel (...) where
>
> I assume that all I/O occurs in terms of octets. I think that this holds
> true of every platform on which Haskell is implemented or is likely to be
> implemented.
>
> > type Octet = Word8
>
> File offsets are 64 bits on all platforms. This model never uses negative
> offsets, so there's no need for a signed type. (But perhaps it would be
> better to use one anyway?) BlockLength should be something appropriate to
> the architecture's address space.
>
> > type FilePos = Word64
> > type BlockLength = Int
>
> A value of type File represents a file, which is essentially a resizable
> strict array of octets. Two values of type File compare equal if they are
> the same file -- that is, if they have the same contents and changes to
> one also appear in the other.
>
> ("File" is a bad name for this. For one thing, NTFS and HFS can associate
> more than one chunk of data with each directory entry, and "file" usually
> refers to all the chunks together. "Fork" would be more accurate, but it
> doesn't sound much like what it's supposed to represent.)
>
> (Should files be buffered at the application level? I'm not convinced it's
> necessary. Streams will be buffered, of course.)
>
> > data File -- abstract
>
> A value of type InputStream or OutputStream represents an input or output
> stream: that is, an octet source or sink. Two InputStreams or
> OutputStreams compare equal iff reading/writing one also reads/writes the
> other.
>
> (Should I call these "ports" instead of "streams"? How about "OctetSource"
> and "OctetSink"?)
>
> > data InputStream -- abstract
> > data OutputStream -- abstract
>
> Fundamental operations on files. XXX represents some sort of memory buffer
> (an IOUArray or a Ptr?). All reads and writes supply absolute offsets.
>
> (Note that there's no such thing as reading fewer bytes than requested --
> the read either succeeds or it throws an exception. This may be
> problematic because the file size could decrease between a call to
> fGetSize and a later call to fRead. One solution is to make fRead treat
> bytes beyond EOF as zeroes, which is consistent with the usual behavior
> for sparse files. fWrite should probably expand the file automatically if
> asked to write beyond its bounds, for similar reasons.)
>
> (A serious unaddressed problem here is locking. There should be a way to
> e.g. prevent someone else from writing to a file while I'm appending to
> it, to avoid race conditions. But for whatever reason, the usual OS
> interface only allows this kind of lock to be acquired when opening the
> file -- and I can't generally reopen the file because its name might have
> changed. Meanwhile, another kind of lock, the kind I don't want, can be
> acquired and released on an already-opened file. I don't know why it's
> done this way, or how to work around it.)
>
> (And what about asynchronous I/O?)
>
> > fGetSize    :: File -> IO FilePos
> > fSetSize    :: File -> FilePos -> IO ()
> > fRead       :: File -> FilePos -> BlockLength -> XXX -> IO ()
> > fWrite      :: File -> FilePos -> BlockLength -> XXX -> IO ()
> > fCheckRead  :: File -> FilePos -> BlockLength -> IO Bool
> > fCheckWrite :: File -> FilePos -> BlockLength -> IO Bool
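
The two suggestions above (reads beyond EOF yield zeroes, writes beyond EOF
grow the file) can be sketched with an in-memory stand-in for File.  This is
an illustration under stated assumptions, not the proposal's implementation:
newFile is a hypothetical helper, and fRead and fWrite here pass the block as
a list of octets instead of filling the unspecified XXX buffer.

```haskell
import Data.IORef (IORef, newIORef, readIORef, modifyIORef')
import Data.Word (Word8, Word64)

type Octet       = Word8
type FilePos     = Word64
type BlockLength = Int

-- Hypothetical in-memory stand-in for the abstract File type.
newtype File = File (IORef [Octet])

-- Two File values are equal only if they are the same file.
instance Eq File where
  File a == File b = a == b

-- Hypothetical helper: create a file with the given contents.
newFile :: [Octet] -> IO File
newFile = fmap File . newIORef

fGetSize :: File -> IO FilePos
fGetSize (File ref) = fromIntegral . length <$> readIORef ref

-- Assumption: returns the block as a list instead of filling an XXX buffer.
-- Octets beyond end-of-file read as zero, so a read never comes up short,
-- even if the file shrank after an earlier fGetSize.
fRead :: File -> FilePos -> BlockLength -> IO [Octet]
fRead (File ref) pos len = do
  octets <- readIORef ref
  pure (take len (drop (fromIntegral pos) octets ++ repeat 0))

-- Assumption: takes the block as a list. Writing past the end grows the
-- file, zero-filling any gap between the old end and the write position.
fWrite :: File -> FilePos -> [Octet] -> IO ()
fWrite (File ref) pos block = modifyIORef' ref update
  where
    start = fromIntegral pos
    update octets =
      let padded = octets ++ replicate (start - length octets) 0
      in  take start padded ++ block ++ drop (start + length block) padded
```

With these semantics a caller never needs to check the size before a read or
write, which sidesteps the race between fGetSize and fRead.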
>
> Fundamental operations on streams. "Maybe Octet" is supposed to represent
> "Octet or EOS," though I'm not sure this is enough for proper EOS
> handling.
>
> isPeek might be useful for text parsers. isUnGet is another possibility --
> it's more versatile but harder to implement efficiently. Or this may be a
> moot point because text parsers will probably want to use isGetBlock
> anyway for efficiency.
>
> > isGet      :: InputStream -> IO (Maybe Octet)
> > isPeek     :: InputStream -> IO (Maybe Octet)
> > isGetBlock :: InputStream -> BlockLength -> XXX -> IO BlockLength
> > -- efficiency hack
> >
> > osPut      :: OutputStream -> Octet -> IO ()
> > osPuts     :: OutputStream -> [Octet] -> IO ()
> > osPutBlock :: OutputStream -> BlockLength -> XXX -> IO ()
> > osFlush    :: OutputStream -> IO ()
>
> Standard streams. (How to deal with line buffering?)
>
> > stdin          :: InputStream
> > stdout, stderr :: OutputStream
>
> Streams can be layered on top of files. Each stream keeps track of its own
> independent position within the file. The effects of overlapped stream
> reading and writing on the same part of a file are unspecified, since
> streams will be buffered in practice.
>
> > fileToInputStreamFrom  :: File -> FilePos -> (IO?) InputStream
> > fileToOutputStreamFrom :: File -> FilePos -> (IO?) OutputStream
> >
> > fileToInputStream f = fileToInputStreamFrom f 0
> > fileToOutputStreamAppend f =
> >   fGetSize f >>= fileToOutputStreamFrom f
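
A minimal unbuffered sketch of this layering, assuming in-memory stand-ins for
the abstract types and assuming the "(IO?)" question is answered with IO.  It
shows that the position lives in each stream rather than in the File, so
"seeking" is just creating another stream.

```haskell
import Data.IORef (IORef, newIORef, readIORef, modifyIORef')
import Data.Word (Word8, Word64)

type Octet   = Word8
type FilePos = Word64

-- Hypothetical in-memory stand-ins for the abstract types.
newtype File = File (IORef [Octet])
data InputStream = InputStream File (IORef FilePos)

-- Each stream owns its position; the File itself has none.
-- Assumption: stream creation is an IO action.
fileToInputStreamFrom :: File -> FilePos -> IO InputStream
fileToInputStreamFrom f pos = InputStream f <$> newIORef pos

fileToInputStream :: File -> IO InputStream
fileToInputStream f = fileToInputStreamFrom f 0

-- Look at the next octet without consuming it; Nothing means end of stream.
isPeek :: InputStream -> IO (Maybe Octet)
isPeek (InputStream (File contents) posRef) = do
  octets <- readIORef contents
  pos    <- readIORef posRef
  pure $ case drop (fromIntegral pos) octets of
    (o : _) -> Just o
    []      -> Nothing

-- Consume the next octet, advancing only this stream's position.
isGet :: InputStream -> IO (Maybe Octet)
isGet s@(InputStream _ posRef) = do
  next <- isPeek s
  case next of
    Just _  -> modifyIORef' posRef (+ 1)
    Nothing -> pure ()
  pure next
```

Because this sketch does no buffering, overlapping streams always see the
current file contents; a buffered implementation would not guarantee that,
which is why the proposal leaves the overlapped case unspecified.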
>
> InputStreams can be read as lazy lists. You're not allowed to use the
> InputStream again after you call this.
>
> > isGetContents :: InputStream -> IO [Octet]
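
One way to get this behaviour is unsafeInterleaveIO from System.IO.Unsafe,
sketched below.  The InputStream here is a hypothetical stand-in (a queue of
unread octets), and this is one possible technique rather than the one the
proposal prescribes.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Word (Word8)
import System.IO.Unsafe (unsafeInterleaveIO)

type Octet = Word8

-- Hypothetical stand-in: a stream is just its queue of unread octets.
newtype InputStream = InputStream (IORef [Octet])

-- Consume one octet; Nothing means end of stream.
isGet :: InputStream -> IO (Maybe Octet)
isGet (InputStream ref) = do
  pending <- readIORef ref
  case pending of
    []       -> pure Nothing
    (o : os) -> writeIORef ref os >> pure (Just o)

-- Each cons cell is produced on demand: isGet runs only when the list is
-- forced that far. Any other read from the stream afterwards would steal
-- octets from the list, which is why the stream must be treated as consumed.
isGetContents :: InputStream -> IO [Octet]
isGetContents s = unsafeInterleaveIO $ do
  next <- isGet s
  case next of
    Nothing -> pure []
    Just o  -> (o :) <$> isGetContents s
```

The restriction on reusing the InputStream follows directly from this: the
list and the stream share one underlying position.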
>
> A value of type Directory represents a directory, which is essentially a
> mutable associative array from names (Unicode strings) to
> File/Directory/Stream values plus some metadata.
>
> This is supposed to represent a directory *independently of its name*, so
> it's robust against changes at higher levels of the filesystem hierarchy.
> I don't know if this can be implemented on Posix or Win32. (The old Mac OS
> and the NT API do support it.)
>
> > data Directory -- abstract
>
> Operations on directories. I suggest dGetContentsWithMetadata rather than
> dLookupMetadata because using the latter to scan a directory on a FAT32
> filesystem takes O(n^2) time.
>
> > dGetContents      :: Directory -> IO [String]
> > dGetContentsWithMetadata :: Directory -> IO [(String,???)]
> > dLookup           :: Directory -> String -> IO (Either File Directory)
> > -- Also should handle named pipes here
> > dCreateFile       :: Directory -> String -> (metadata?) -> IO File
> > -- etc...
>
> Pathnames don't fit well with this model, but they're not going to go
> away, so I provide these functions. If the pathname argument names a
> directory, that directory and Nothing are returned. Otherwise, a directory
> and string are returned such that a subsequent call to dLookup or
> dCreateFile with these values will refer to the same thing as the original
> pathname. lookupPathname is dLookupPathname from the current directory.
>
> > dLookupPathname     :: Directory -> String -> IO (Directory, Maybe String)
> > lookupPathname      :: String -> IO (Directory, Maybe String)
> > getCurrentDirectory :: IO Directory -- no corresponding set!
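
The return convention described above can be illustrated with a pure model of
a directory tree.  Everything here is a simplifying assumption: Entry,
FileEntry, DirEntry and splitPath are hypothetical, paths are relative and
'/'-separated, the result is wrapped in Maybe instead of running in IO, and
the derived Eq compares structure where the proposal's Directory would compare
identity.

```haskell
-- Hypothetical pure model of a directory tree.
data Entry = FileEntry String          -- a file, with stand-in contents
           | DirEntry Directory
  deriving (Eq, Show)

newtype Directory = Directory [(String, Entry)]
  deriving (Eq, Show)

-- Split a '/'-separated relative pathname into its non-empty components.
splitPath :: String -> [String]
splitPath path = case break (== '/') path of
  (name, [])       -> [name | not (null name)]
  (name, _ : rest) -> [name | not (null name)] ++ splitPath rest

-- Walk the components. If the whole path names a directory, return it with
-- Nothing. If only the last component is not a directory (a file, or a name
-- that does not exist yet), return its parent together with that name, ready
-- to pass to dLookup or dCreateFile.
dLookupPathname :: Directory -> String -> Maybe (Directory, Maybe String)
dLookupPathname root = go root . splitPath
  where
    go dir [] = Just (dir, Nothing)
    go dir@(Directory entries) (name : rest) =
      case lookup name entries of
        Just (DirEntry sub) -> go sub rest
        _ | null rest       -> Just (dir, Just name)
          | otherwise       -> Nothing   -- an intermediate component is missing
```

Returning the parent directory plus a leaf name lets the same function serve
both lookup and creation, since the leaf does not have to exist yet.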
>
> Convenient shortcuts for common cases.
>
> > lookupFileByPathname :: String -> IO File
> > lookupInputStreamByPathname :: String -> IO InputStream
> > -- at least as likely to succeed as lookupFileByPathname
>
> -- Ben