[Haskell-cafe] Layered I/O
Marcin 'Qrczak' Kowalczyk
qrczak at knm.org.pl
Wed Sep 15 13:01:27 EDT 2004
oleg at pobox.com writes:
> The discussion of i18n i/o highlighted the need for general overlay
> streams. We should be able to place a processing layer onto a handle
> -- and to peel it off and place another one. The layers can do
> character encoding, subranging (limiting the stream to the specified
> number of basic units), base64 and other decoding, signature
> collecting and verification, etc.
My language Kogut <http://kokogut.sourceforge.net/> uses the following types:
BYTE_INPUT - abstract supertype of a stream from which bytes can be read
CHAR_INPUT, BYTE_OUTPUT, CHAR_OUTPUT - analogously
The above types support i/o in blocks only (an array of bytes / chars
at a time). In particular, resizable byte arrays and character arrays
are themselves input and output streams.
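Rendered in Haskell, the four abstract stream types might look like
block-only classes; the in-memory instance mirrors the remark that
resizable arrays are streams (all names here are illustrative, not
Kogut's actual API):

```haskell
import Data.IORef
import Data.Word (Word8)

-- Hypothetical rendering of the abstract stream types: block
-- transfers only. An empty result from readBlock means end of stream.
class ByteInput s where
  readBlock :: s -> Int -> IO [Word8]

class ByteOutput s where
  writeBlock :: s -> [Word8] -> IO ()

-- A resizable in-memory byte array is both an input and an output
-- stream, as the text notes for Kogut's arrays.
newtype MemBytes = MemBytes (IORef [Word8])

instance ByteInput MemBytes where
  readBlock (MemBytes r) n = do
    bs <- readIORef r
    let (h, t) = splitAt n bs
    writeIORef r t
    return h

instance ByteOutput MemBytes where
  writeBlock (MemBytes r) bs = modifyIORef r (++ bs)
```

CHAR_INPUT and CHAR_OUTPUT would be the analogous classes over Char.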
BYTE_INPUT_BUFFER - transforms a BYTE_INPUT to another BYTE_INPUT,
providing buffering, unlimited lookahead and unlimited "unreading"
CHAR_INPUT_BUFFER - analogously; in addition provides functions which
read a line at a time
BYTE_OUTPUT_BUFFER - transforms a BYTE_OUTPUT to another BYTE_OUTPUT,
providing buffering and explicit flushing
CHAR_OUTPUT_BUFFER - analogously; in addition provides optional
automatic flushing after outputting full lines
The above types provide i/o in blocks and in individual characters,
and in lines for character buffers. They should be used as the last
component of a stack.
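The lookahead/"unreading" machinery of the buffering layer can be
sketched as a pushback list kept in front of an underlying block
reader (the representation is an assumption made for brevity):

```haskell
import Data.IORef
import Data.Word (Word8)

-- BYTE_INPUT_BUFFER sketch: a pushback list in front of an underlying
-- block reader gives unlimited lookahead and "unreading".
data Buffered = Buffered (IORef [Word8]) (Int -> IO [Word8])

readB :: Buffered -> Int -> IO [Word8]
readB (Buffered pbRef u) n = do
  pb <- readIORef pbRef
  if length pb >= n
    then do writeIORef pbRef (drop n pb)
            return (take n pb)
    else do rest <- u (n - length pb)
            writeIORef pbRef []
            return (pb ++ rest)

-- Push bytes back; the next read sees them first.
unread :: Buffered -> [Word8] -> IO ()
unread (Buffered pbRef _) bs = modifyIORef pbRef (bs ++)
```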
BYTE_FILTER - defines how a sequence of bytes is transformed to
  another sequence of bytes, by providing a function which transforms
  a block at a time; it consumes some part of the input, produces some
  part of the output, and tells whether it stopped because it wants
  more input or because it wants more room in the output; it throws an
  exception on invalid data
CHAR_FILTER - analogously, but for characters
ENCODER - analogously, but transforms characters into bytes
DECODER - analogously, but transforms bytes into characters
The above are only auxiliary types which just perform the conversion
on a block; they are not streams themselves.
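The filter contract can be sketched as a block-transforming function
that reports why it stopped; the identity filter below exists only to
illustrate the protocol (a real filter would transcode, base64-decode,
and so on):

```haskell
import Data.Word (Word8)

data Stop = NeedInput | NeedRoom deriving (Eq, Show)

-- One filtering step over a block: given the input and the room
-- available in the output, return (unconsumed input, produced output,
-- why the step stopped). Names are illustrative, not Kogut's.
type ByteFilter = [Word8] -> Int -> ([Word8], [Word8], Stop)

-- The identity filter honors the contract without transforming data.
identityFilter :: ByteFilter
identityFilter inp room
  | length inp <= room = ([], inp, NeedInput)
  | otherwise          = (drop room inp, take room inp, NeedRoom)
```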
BYTE_INPUT_FILTER - a byte input which uses another byte input and
applies a byte filter to each block read
CHAR_INPUT_FILTER - a char input which uses another char input and
applies a char filter to each block read
INPUT_DECODER - a char input which uses a byte input and applies
a decoder to each block read
The above types support i/o in blocks only.
BYTE_OUTPUT_FILTER, CHAR_OUTPUT_FILTER, OUTPUT_ENCODER -
analogously, but for output
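A minimal sketch of the INPUT_DECODER idea, using Latin-1 because its
decoding is stateless (the `Bytes` source and all names are
assumptions for illustration, not Kogut's API):

```haskell
import Data.IORef
import Data.Word (Word8)
import Data.Char (chr)

-- Assumed minimal byte input: an IORef-backed block source.
newtype Bytes = Bytes (IORef [Word8])

readBlock :: Bytes -> Int -> IO [Word8]
readBlock (Bytes r) n = do
  bs <- readIORef r
  let (h, t) = splitAt n bs
  writeIORef r t
  return h

-- INPUT_DECODER sketch: a char input that pulls blocks from a byte
-- input and decodes each block read.
data InputDecoder = InputDecoder Bytes (Word8 -> Char)

readChars :: InputDecoder -> Int -> IO String
readChars (InputDecoder src dec) n = map dec <$> readBlock src n

-- Latin-1: each byte maps directly to the code point of equal value.
latin1 :: Word8 -> Char
latin1 = chr . fromIntegral
```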
ENCODING - a supertype which denotes an encoding in an abstract way.
  STRING is one of its subtypes (it would be an "instance" in Haskell),
  which currently means an iconv-implemented encoding. There are also
  singleton types for important encodings implemented directly.
There is a function which yields a new (stateful) encoder from an
encoding, and another which yields a decoder, but it is the encoding
that is passed as an optional argument to the function which opens a
file or converts between a standalone string and a byte array.
REPLACE_CODING_ERRORS - transforms an encoding to a related encoding
which substitutes U+FFFD on decoding, and '?' on encoding, instead
of throwing an exception on error.
A similar transformer which e.g. produces numeric character references
like "&#12345;" for unencodable characters could be written too (not
implemented yet).
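Per unit of input, the error-replacing transformation amounts to
lifting a partial decoder into a total one; a toy sketch with ASCII
(real decoders are stateful and work on blocks):

```haskell
import Data.Word (Word8)
import Data.Char (chr)

-- REPLACE_CODING_ERRORS sketch: substitute U+FFFD for undecodable
-- input instead of failing; '?' plays the same role when encoding.
replaceErrors :: (Word8 -> Maybe Char) -> (Word8 -> Char)
replaceErrors dec b = maybe '\xFFFD' id (dec b)

-- Example partial decoder: plain ASCII rejects bytes above 127.
asciiDecode :: Word8 -> Maybe Char
asciiDecode b
  | b < 128   = Just (chr (fromIntegral b))
  | otherwise = Nothing
```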
COPYING_FILTER - a filter which dumps the data passed through it to
  another stream
APPEND_INPUT - concatenates several input streams into one
NULL_OUTPUT - /dev/null
The above types come in BYTE and CHAR flavors.
FLUSHING_OTHER - a byte input which reads data from another byte
input, but flushes some specified output stream before each input
operation; it's used on the *bottom* of stdin stack and flushes the
*top* of stdout stack, so alternating input and output on
stdin/stdout comes in the right order even if partial lines are
output and without explicit flushing
RAW_FILE - a byte input and output at the same time, a direct
interface to the OS
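The FLUSHING_OTHER behaviour can be sketched as a wrapper that runs a
flush action before every read (the record representation and names
are assumed for illustration):

```haskell
-- FLUSHING_OTHER sketch: an input that flushes a paired output before
-- each read, so a partial-line prompt appears before the program
-- blocks waiting for input.
data FlushingInput = FlushingInput
  { flushOther :: IO ()             -- flush the paired output stream
  , readUnder  :: Int -> IO String  -- read from the underlying input
  }

readFlushing :: FlushingInput -> Int -> IO String
readFlushing s n = flushOther s >> readUnder s n
```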
Some functions and other values:
TextReader - transforms a byte input to a character input by stacking
a decoder (for the specified or default encoding), a filter for
newlines (not implemented yet), and char input buffer (with the
specified or default buffer size)
TextWriter - analogously, for output
OpenRawFile, CreateRawFile - open a raw file handle; they take various
  options (read, write, create, truncate, exclusive, append, mode).
OpenTextFile - a composition of OpenRawFile and TextReader which
splits optional arguments to both, depending on where they apply
CreateTextFile - a composition of CreateRawFile and TextWriter
BinaryReader, BinaryWriter - only do buffering; they have a slightly
  different interface than ByteInputBuffer and ByteOutputBuffer
OpenBinaryFile, CreateBinaryFile - analogously
RawStdIn, RawStdOut, RawStdErr - raw files
StdOut - RawStdOut transformed by TextWriter, with automatic flushing
  after lines turned on (it is off by default)
StdErr - similar
StdIn - RawStdIn, transformed by FlushingOther on StdOut, transformed
  by TextReader
At program exit StdOut and StdErr are flushed automatically.
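Since every layer maps a stream to a stream, a stack like the ones
above is just function composition; a tiny generic illustration:

```haskell
-- A layer transforms a stream into another stream; stacking layers is
-- then plain composition, applied innermost-last (illustrative only).
type Layer a = a -> a

stack :: [Layer a] -> a -> a
stack = foldr (.) id
```

For example, StdIn corresponds to something like
`stack [inputBuffer, textReader, flushingOther stdOutTop] rawStdIn`,
where the bracketed names are hypothetical.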
Some of these types would correspond to classes in Haskell, together
with a type with an existential quantifier. Representing streams as
records of functions is not sufficient, because a given type of streams
may offer additional operations not provided by the generic interface.
Byte and char versions would often be parametrized instead of using
separate types. I didn't perform the on-the-fly translation of this
description to Haskell idioms to avoid errors.
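The "class plus existentially quantified wrapper" pattern mentioned
here might look as follows in Haskell (names illustrative):

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.Word (Word8)

class ByteInput s where
  readBlock :: s -> Int -> IO [Word8]

-- Any concrete stream can hide behind SomeByteInput; code that holds
-- the concrete type still reaches its extra operations.
data SomeByteInput = forall s. ByteInput s => SomeByteInput s

instance ByteInput SomeByteInput where
  readBlock (SomeByteInput s) = readBlock s

-- A trivial concrete stream: an endless source of zero bytes.
data Zeros = Zeros

instance ByteInput Zeros where
  readBlock _ n = return (replicate n 0)
```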
> HTTP/1.1 200 Have it
> Content-type: text/plain; charset=iso-2022-jp
> Content-length: 12345
> Date: Tuesday, August 13, 2002
> To read the response line and the content headers, our stream must be
> in an ASCII, Latin-1 or UTF-8 encoding (regardless of the current
> locale). The body of the message is encoded in iso-2022-jp.
It's tricky to implement that using my scheme because decoding is
performed before buffering, so if we read it line by line and reach
the end of headers, a part of the data has already been read and
decoded using a wrong encoding.
The simplest way is probably to apply buffering, use lookahead
(an input buffer supports the interface of collections for lookahead)
to locate the end of headers, move headers into a separate array of
bytes leaving the rest in the buffered stream, put a text reader with
the encoding set to Latin1 on the array with headers, parse headers,
and put a text reader with the appropriate encoding on the rest of the
stream.
This causes double buffering of the rest of the stream, but avoiding
it is harder and perhaps not worth the effort (requires peeking into
the array used in buffers, to concatenate it with the rest of the
stream).
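The lookahead-and-split step described above might be sketched like
this, over an already-buffered block of bytes (illustrative; real code
would work on the buffer in place):

```haskell
import Data.Word (Word8)
import Data.List (isPrefixOf)

-- Locate the CRLF CRLF that ends the HTTP headers and split the block
-- there. The header bytes then get a Latin-1 reader; the remainder
-- gets a reader with whatever encoding the headers named.
splitHeaders :: [Word8] -> Maybe ([Word8], [Word8])
splitHeaders bs = go 0 bs
  where
    sep = [13, 10, 13, 10]  -- "\r\n\r\n"
    go i rest
      | sep `isPrefixOf` rest = Just (splitAt (i + 4) bs)
      | null rest             = Nothing
      | otherwise             = go (i + 1) (tail rest)
```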
This leaves the problem with stopping the conversion after 12345
bytes. For that, if data needs to be processed lazily, I would
implement a custom stream type which reads up to a given number of bytes
from an underlying stream and then signals end of data. It would be
put below the decoder. After it finishes, the original stream is left
in the right position and can be read further.
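Such a limiting stream is small; a sketch under the same assumed
block-reader representation as earlier (names illustrative):

```haskell
import Data.IORef
import Data.Word (Word8)

-- Reads at most the given number of bytes from the underlying input,
-- then reports end of data, leaving the underlying stream positioned
-- just past the limited region.
data Limited = Limited
  { left  :: IORef Int           -- bytes still allowed
  , under :: Int -> IO [Word8]   -- read from the underlying stream
  }

readLimited :: Limited -> Int -> IO [Word8]
readLimited s n = do
  l <- readIORef (left s)
  if l <= 0
    then return []               -- simulated end of stream
    else do
      bs <- under s (min n l)
      modifyIORef (left s) (subtract (length bs))
      return bs
```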
If data doesn't need to be processed lazily, it's simpler: one can
read 12345 bytes into an array and convert them off-line.
__("< Marcin Kowalczyk
\__/ qrczak at knm.org.pl