[Haskell-cafe] Layered I/O
Marcin 'Qrczak' Kowalczyk
qrczak at knm.org.pl
Wed Sep 15 13:01:27 EDT 2004
oleg at pobox.com writes:
> The discussion of i18n i/o highlighted the need for general overlay
> streams. We should be able to place a processing layer onto a handle
> -- and to peel it off and place another one. The layers can do
> character encoding, subranging (limiting the stream to the specified
> number of basic units), base64 and other decoding, signature
> collecting and verification, etc.
My language Kogut <http://kokogut.sourceforge.net/> uses the following types:
BYTE_INPUT - abstract supertype of a stream from which bytes can be read
CHAR_INPUT, BYTE_OUTPUT, CHAR_OUTPUT - analogously
The above types support i/o in blocks only (an array of bytes / chars
at a time). In particular, resizable byte arrays and character arrays
are themselves input and output streams.
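Rendered in Haskell, the four abstract stream types might look like
block-only classes; the in-memory instance mirrors the remark that
resizable arrays are streams (all names here are illustrative, not
Kogut's actual API):

```haskell
import Data.IORef
import Data.Word (Word8)

-- Hypothetical rendering of the abstract stream types: block
-- transfers only. An empty result from readBlock means end of stream.
class ByteInput s where
  readBlock :: s -> Int -> IO [Word8]

class ByteOutput s where
  writeBlock :: s -> [Word8] -> IO ()

-- A resizable in-memory byte array is both an input and an output
-- stream, as the text notes for Kogut's arrays.
newtype MemBytes = MemBytes (IORef [Word8])

instance ByteInput MemBytes where
  readBlock (MemBytes r) n = do
    bs <- readIORef r
    let (h, t) = splitAt n bs
    writeIORef r t
    return h

instance ByteOutput MemBytes where
  writeBlock (MemBytes r) bs = modifyIORef r (++ bs)
```

CHAR_INPUT and CHAR_OUTPUT would be the analogous classes over Char.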
BYTE_INPUT_BUFFER - transforms a BYTE_INPUT to another BYTE_INPUT,
providing buffering, unlimited lookahead and unlimited "unreading"
CHAR_INPUT_BUFFER - analogously; in addition provides functions which
read a line at a time
BYTE_OUTPUT_BUFFER - transforms a BYTE_OUTPUT to another BYTE_OUTPUT,
providing buffering and explicit flushing
CHAR_OUTPUT_BUFFER - analogously; in addition provides optional
automatic flushing after outputting full lines
The above types provide i/o in blocks and in individual characters,
and in lines for character buffers. They should be used as the last
component of a stack.
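The lookahead/"unreading" machinery of the buffering layer can be
sketched as a pushback list kept in front of an underlying block
reader (the representation is an assumption made for brevity):

```haskell
import Data.IORef
import Data.Word (Word8)

-- BYTE_INPUT_BUFFER sketch: a pushback list in front of an underlying
-- block reader gives unlimited lookahead and "unreading".
data Buffered = Buffered (IORef [Word8]) (Int -> IO [Word8])

readB :: Buffered -> Int -> IO [Word8]
readB (Buffered pbRef u) n = do
  pb <- readIORef pbRef
  if length pb >= n
    then do writeIORef pbRef (drop n pb)
            return (take n pb)
    else do rest <- u (n - length pb)
            writeIORef pbRef []
            return (pb ++ rest)

-- Push bytes back; the next read sees them first.
unread :: Buffered -> [Word8] -> IO ()
unread (Buffered pbRef _) bs = modifyIORef pbRef (bs ++)
```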
BYTE_FILTER - defines how a sequence of bytes is transformed to
  another sequence of bytes, by providing a function which transforms
  a block at a time; it consumes some part of the input, produces some
  part of the output, and tells whether it stopped because it wants
  more input or because it wants more room in the output; it throws an
  exception on invalid data
CHAR_FILTER - analogously, but for characters
ENCODER - analogously, but transforms characters into bytes
DECODER - analogously, but transforms bytes into characters
The above are only auxiliary types which just perform the conversion
on a block; they are not streams themselves.
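The filter contract can be sketched as a block-transforming function
that reports why it stopped; the identity filter below exists only to
illustrate the protocol (a real filter would transcode, base64-decode,
and so on):

```haskell
import Data.Word (Word8)

data Stop = NeedInput | NeedRoom deriving (Eq, Show)

-- One filtering step over a block: given the input and the room
-- available in the output, return (unconsumed input, produced output,
-- why the step stopped). Names are illustrative, not Kogut's.
type ByteFilter = [Word8] -> Int -> ([Word8], [Word8], Stop)

-- The identity filter honors the contract without transforming data.
identityFilter :: ByteFilter
identityFilter inp room
  | length inp <= room = ([], inp, NeedInput)
  | otherwise          = (drop room inp, take room inp, NeedRoom)
```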
BYTE_INPUT_FILTER - a byte input which uses another byte input and
applies a byte filter to each block read
CHAR_INPUT_FILTER - a char input which uses another char input and
applies a char filter to each block read
INPUT_DECODER - a char input which uses a byte input and applies
a decoder to each block read
The above types support i/o in blocks only.
BYTE_OUTPUT_FILTER, CHAR_OUTPUT_FILTER, OUTPUT_ENCODER -
analogously, but for output
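A minimal sketch of the INPUT_DECODER idea, using Latin-1 because its
decoding is stateless (the `Bytes` source and all names are
assumptions for illustration, not Kogut's API):

```haskell
import Data.IORef
import Data.Word (Word8)
import Data.Char (chr)

-- Assumed minimal byte input: an IORef-backed block source.
newtype Bytes = Bytes (IORef [Word8])

readBlock :: Bytes -> Int -> IO [Word8]
readBlock (Bytes r) n = do
  bs <- readIORef r
  let (h, t) = splitAt n bs
  writeIORef r t
  return h

-- INPUT_DECODER sketch: a char input that pulls blocks from a byte
-- input and decodes each block read.
data InputDecoder = InputDecoder Bytes (Word8 -> Char)

readChars :: InputDecoder -> Int -> IO String
readChars (InputDecoder src dec) n = map dec <$> readBlock src n

-- Latin-1: each byte maps directly to the code point of equal value.
latin1 :: Word8 -> Char
latin1 = chr . fromIntegral
```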
ENCODING - a supertype which denotes an encoding in an abstract way.
  STRING is one of its subtypes (it would be an "instance" in Haskell),
  which currently means an iconv-implemented encoding. There are also
  singleton types for important encodings implemented directly.
There is a function which yields a new (stateful) encoder from an
encoding, and another which yields a decoder, but it is the encoding
that is passed as an optional argument to the function which opens a
file or converts between a standalone string and a byte array.
REPLACE_CODING_ERRORS - transforms an encoding to a related encoding
which substitutes U+FFFD on decoding, and '?' on encoding, instead
of throwing an exception on error.
A similar transformer which e.g. produces numeric character references
like "&#12345;" for unencodable characters could be written too (not
implemented yet).
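Per unit of input, the error-replacing transformation amounts to
lifting a partial decoder into a total one; a toy sketch with ASCII
(real decoders are stateful and work on blocks):

```haskell
import Data.Word (Word8)
import Data.Char (chr)

-- REPLACE_CODING_ERRORS sketch: substitute U+FFFD for undecodable
-- input instead of failing; '?' plays the same role when encoding.
replaceErrors :: (Word8 -> Maybe Char) -> (Word8 -> Char)
replaceErrors dec b = maybe '\xFFFD' id (dec b)

-- Example partial decoder: plain ASCII rejects bytes above 127.
asciiDecode :: Word8 -> Maybe Char
asciiDecode b
  | b < 128   = Just (chr (fromIntegral b))
  | otherwise = Nothing
```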
COPYING_FILTER - a filter which dumps the data passed through it to
  another stream
APPEND_INPUT - concatenates several input streams into one
NULL_OUTPUT - /dev/null
The above types come in BYTE and CHAR flavors.
FLUSHING_OTHER - a byte input which reads data from another byte
input, but flushes some specified output stream before each input
operation; it's used on the *bottom* of stdin stack and flushes the
*top* of stdout stack, so alternating input and output on
stdin/stdout comes in the right order even if partial lines are
output and without explicit flushing
RAW_FILE - a byte input and output at the same time, a direct
interface to the OS
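The FLUSHING_OTHER behaviour can be sketched as a wrapper that runs a
flush action before every read (the record representation and names
are assumed for illustration):

```haskell
-- FLUSHING_OTHER sketch: an input that flushes a paired output before
-- each read, so a partial-line prompt appears before the program
-- blocks waiting for input.
data FlushingInput = FlushingInput
  { flushOther :: IO ()             -- flush the paired output stream
  , readUnder  :: Int -> IO String  -- read from the underlying input
  }

readFlushing :: FlushingInput -> Int -> IO String
readFlushing s n = flushOther s >> readUnder s n
```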
Some functions and other values:
TextReader - transforms a byte input to a character input by stacking
a decoder (for the specified or default encoding), a filter for
newlines (not implemented yet), and char input buffer (with the
specified or default buffer size)
TextWriter - analogously, for output
OpenRawFile, CreateRawFile - open a raw file handle; they take various
  options (read, write, create, truncate, exclusive, append, mode).
OpenTextFile - a composition of OpenRawFile and TextReader which
splits optional arguments to both, depending on where they apply
CreateTextFile - a composition of CreateRawFile and TextWriter
BinaryReader, BinaryWriter - only do buffering; they have a slightly
  different interface than ByteInputBuffer and ByteOutputBuffer
OpenBinaryFile, CreateBinaryFile - analogously
RawStdIn, RawStdOut, RawStdErr - raw files
StdOut - RawStdOut transformed by TextWriter, with automatic flushing
  after lines turned on (it is off by default)
StdErr - similar
StdIn - RawStdIn, transformed by FlushingOther on StdOut, transformed
  by TextReader
At program exit StdOut and StdErr are flushed automatically.
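Since every layer maps a stream to a stream, a stack like the ones
above is just function composition; a tiny generic illustration:

```haskell
-- A layer transforms a stream into another stream; stacking layers is
-- then plain composition, applied innermost-last (illustrative only).
type Layer a = a -> a

stack :: [Layer a] -> a -> a
stack = foldr (.) id
```

For example, StdIn corresponds to something like
`stack [inputBuffer, textReader, flushingOther stdOutTop] rawStdIn`,
where the bracketed names are hypothetical.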
Some of these types would correspond to classes in Haskell, together
with a type with an existential quantifier. Representing streams as
records of functions is not sufficient, because a given type of streams
may offer additional operations not provided by the generic interface.
Byte and char versions would often be parametrized instead of using
separate types. I didn't perform the on-the-fly translation of this
description to Haskell idioms to avoid errors.
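The "class plus existentially quantified wrapper" pattern mentioned
here might look as follows in Haskell (names illustrative):

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.Word (Word8)

class ByteInput s where
  readBlock :: s -> Int -> IO [Word8]

-- Any concrete stream can hide behind SomeByteInput; code that holds
-- the concrete type still reaches its extra operations.
data SomeByteInput = forall s. ByteInput s => SomeByteInput s

instance ByteInput SomeByteInput where
  readBlock (SomeByteInput s) = readBlock s

-- A trivial concrete stream: an endless source of zero bytes.
data Zeros = Zeros

instance ByteInput Zeros where
  readBlock _ n = return (replicate n 0)
```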
> HTTP/1.1 200 Have it
> Content-type: text/plain; charset=iso-2022-jp
> Content-length: 12345
> Date: Tuesday, August 13, 2002
> To read the response line and the content headers, our stream must be
> in an ASCII, Latin-1 or UTF-8 encoding (regardless of the current
> locale). The body of the message is encoded in iso-2022-jp.
It's tricky to implement that using my scheme because decoding is
performed before buffering, so if we read it line by line and reach
the end of headers, a part of the data has already been read and
decoded using a wrong encoding.
The simplest way is probably to apply buffering, use lookahead
(an input buffer supports the interface of collections for lookahead)
to locate the end of headers, move headers into a separate array of
bytes leaving the rest in the buffered stream, put a text reader with
the encoding set to Latin1 on the array with headers, parse headers,
and put a text reader with the appropriate encoding on the rest of the
stream.
This causes double buffering of the rest of the stream, but avoiding
it is harder and perhaps not worth the effort (requires peeking into
the array used in buffers, to concatenate it with the rest of the
stream).
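The lookahead-and-split step described above might be sketched like
this, over an already-buffered block of bytes (illustrative; real code
would work on the buffer in place):

```haskell
import Data.Word (Word8)
import Data.List (isPrefixOf)

-- Locate the CRLF CRLF that ends the HTTP headers and split the block
-- there. The header bytes then get a Latin-1 reader; the remainder
-- gets a reader with whatever encoding the headers named.
splitHeaders :: [Word8] -> Maybe ([Word8], [Word8])
splitHeaders bs = go 0 bs
  where
    sep = [13, 10, 13, 10]  -- "\r\n\r\n"
    go i rest
      | sep `isPrefixOf` rest = Just (splitAt (i + 4) bs)
      | null rest             = Nothing
      | otherwise             = go (i + 1) (tail rest)
```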
This leaves the problem with stopping the conversion after 12345
bytes. For that, if data needs to be processed lazily, I would
implement a custom stream type which reads up to a given number of bytes
from an underlying stream and then signals end of data. It would be
put below the decoder. After it finishes, the original stream is left
in the right position and can be read further.
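Such a limiting stream is small; a sketch under the same assumed
block-reader representation as earlier (names illustrative):

```haskell
import Data.IORef
import Data.Word (Word8)

-- Reads at most the given number of bytes from the underlying input,
-- then reports end of data, leaving the underlying stream positioned
-- just past the limited region.
data Limited = Limited
  { left  :: IORef Int           -- bytes still allowed
  , under :: Int -> IO [Word8]   -- read from the underlying stream
  }

readLimited :: Limited -> Int -> IO [Word8]
readLimited s n = do
  l <- readIORef (left s)
  if l <= 0
    then return []               -- simulated end of stream
    else do
      bs <- under s (min n l)
      modifyIORef (left s) (subtract (length bs))
      return bs
```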
If data doesn't need to be processed lazily, it's simpler: one can
read 12345 bytes into an array and convert them off-line.
__("< Marcin Kowalczyk
\__/ qrczak at knm.org.pl