Streams internal structure
Simon Marlow
simonmarhaskell at gmail.com
Fri Apr 21 11:37:44 EDT 2006
So I think I've extracted the important points here:
1. seek needs to flush the buffer, so it needs to know whether it is on
a buffered handle or not. My plan was to do it like this:
class Seekable s where
seek :: s -> SeekLocation -> IO ()
instance Seekable s => Seekable (BufferedInputStream s) where
seek (BufferedInputStream s buf) loc = do
.. flush the buffer ..
seek s loc
2. performance of putByte, getByte. You have the ByteStream interface,
as far as I can tell, just to improve the performance of these
operations. I didn't have such a class because I believe that
shovelling large numbers of bytes is more important from a performance
perspective than shovelling one byte at a time. For instance, I gave
the byte reader that works on any InputStream earlier:
-- | Reads a single 'Word8' from a stream.
streamGet :: InputStream s => s -> IO Word8
streamGet s =
alloca $ \p -> do
r <- streamReadBufferNonBlocking s 1 p
if r == 0 then ioe_EOF else peek p
Still, you can have full speed single byte operations by just adding
methods to the MemInputStream/MemOutputStream class:
class MemInputStream s where
withStreamInputBuffer :: s -> (Ptr Word8 -> Int -> IO (a,Int)) -> IO a
memStreamGet :: s -> IO Word8
memStreamGet s =
withStreamInputBuffer s $ \p i -> do
x <- peek p
return (x, i+1)
(the default obviously isn't very fast, but individual instances would
provide faster implementations).
3. code explosion.
Firstly, removing ByteStream will reduce the code explosion.
In my design I had separate types for BufferedInputStream and
BufferedOutputStream, and there is hardly any duplication because
buffering input is just different to buffering output. Read/write
buffering is just a combination of these two.
I still don't see why you need another layer of buffering over a memory
stream. It supports direct operations on the memory already. As I said
above, if you want fast(er) single byte read on these, then add it to
the MemInputStream class.
There is some boilerplate: the need to translate Stream, Seekable,
Buffered, InputBuffered etc. through upper layers to lower layers means
some boilerplate wrapper instances. But I don't think that's a big
deal, and splitting up these classes hasn't made it any worse.
Admittedly though, separating input and output has doubled the amount of
boilerplate wrappers. I don't see a way around that, other than using
some preprocessing or TH to generate the instances.
--------------
So here's how I see the layering for a file reading stream using iconv:
file or socket (fd-based) :: FileInputStream
instance of: InputStream, Stream, Seekable
byte buffer :: BufferedInputStream
instance of: MemInputStream, Stream,
Seekable, Buffered, InputBuffered
iconv :: BufferedTextInputStream
instance of: MemTextInputStream, TextInputStream, Stream,
Seekable, Buffered, InputBuffered
lock :: LockedTextInputStream
instance of: MemTextInputStream, TextInputStream, Stream,
Seekable, Buffered, InputBuffered
Two layers of buffering aren't strictly necessary, because the byte
buffer will usually be emptied immediately by the decoder, except that
it might leave a few bytes when there isn't a complete character to
decode. It is simpler to do it this way though, because the decoding
layer then works with any MemInputStream (eg. a memory mapped file or
ByteString, as well as a buffered file), and it doesn't lose any efficiency.
A memory-mapped file becomes a MemInputStream directly, so it only gets
3 layers instead of 4, similarly for a ByteString or UArray Word8.
I haven't said what the MemTextInputStream or TextInputStream classes
look like. I imagine something like this:
class MemTextInputStream s where
withCharInputBuffer :: s -> (Ptr Char -> Int -> IO (a,Int)) -> IO a
class TextInputStream s where
getChar :: s -> IO Char
getContents ...
getLine ...
A StringReader can be made an instance of TextInputStream directly.
So do you see any problem with this design? Performance should be fine:
getContents reads directly from the buffer filled by iconv, and all
other operations are done at buffer sizes. Fast byte and multibyte
operations are supported on MemInputStreams so fast Binary I/O and
serialisation/deserialisation are possible.
Cheers,
Simon
Bulat Ziganshin wrote:
> Hello Simon,
>
> Friday, April 21, 2006, 2:20:26 PM, you wrote:
>
>
>>>well, your last answer shows that you don't understand my problems.
>>>i'm entirely want to have precise classes, but when i run into
>>>IMPLEMENTATION, Haskell restrictions bite me again and again.
>
>
>>Ok, I think you need to describe these problems in more detail. The
>>message you just wrote describes the structure of the library which I
>>think is mostly fine, and corresponds fairly well with what I had in mind.
>
>
>>Could you provide a .zip or .tar.gz instead?
>
>
> yes, the last version is http://freearc.narod.ru/StreamsBeta.zip
>
>
>>vPutBuf should be non-blocking too.
>
>
> so it should return number of bytes written? and it should be named
> vPutBufNonBlocking? you are already said in February that vGetBuf
> should be used in buffering transformer gently - i.e. i should use
> just the returned number of bytes and don't enforce filling of whole
> buffer. how should be the policy of using vPutBufNonBlocking? what
> should be vGEtBuf/vPutBuf (i.e. functions provided to users) - are
> they should be blocking or non-blocking?
>
>
>>which is equivalent to my MemInputStream/MemOutputStream. I believe the
>>with-style interface that I use is better though.
>
>
> main benefit of my structure is that MemoryStream and BlockStream are
> very close so i hope to implement common buffering layer. your variant
> makes this much harder. on the other side, i don't see any benefits in
> using your scheme
>
>
>>>each buffering transformer implements ByteStream via BlockStream or
>>>MemoryStream:
>
>
>>I didn't have an equivalent to this class in my design. Why is it
>>necessary? vGetByte/vPutByte can be implemented for an arbitrary
>>BlockStream or indeed a MemoryStream.
>
>
> :) yes, you never mind about implementation issues! how the naked FD, for
> example, can implement vGetByte? i'm not Copperfield to extract buffer
> from nowhere. Buffering transformer is not just a bunch of functions,
> it's a whole DATA STRUCTURE what joins stream handle and buffer
> pointers together! see my previous letter what described this
>
> the main idea of the whole Streams library is what low-level stream
> types don't carry any data that will be used only in high-level
> routines. FD don't carry buffer or CharEncoding information. it's just
> can read or write block of data - it's all! all other data items are
> added by corresponding transformers. so FD CAN'T implement vGetByte
> with buffering and its implementation without buffering will be imho
> just source of errors - users will constantly asking why it is so
> slow. moreover i can't add vGetByte/vPutByte to the BlockStream or
> MemoryStream class (because both can implement it), so it anyway
> should be in separate class
>
>
>>>instance (MemoryStream IO h) => ByteStream IO (BufferedMemoryStream h)
>
>
>>what's a BufferedMemoryStream for? Isn't a memory stream already
>>buffered by definition?
>
>
> we just use word "buffering" in different meanings. for me buffering
> means that underlying stream give us data in large enough chunks and
> this can be implemented inefficiently. buffering transformer efficiently
> implements byte-oriented or char-oriented operations by working with
> buffer (either provided by MemoryStream or allocated explicitly for
> BlockStream) and make rather infrequent calls to the block i/o
> operations of underlying stream
>
> so, for me MemoryStream is not buffered, it just support block i/o
> operations in slightly different way than BlockStream. going your way,
> as i understand, we should call withStreamInputBuffer on each i/o
> operation what mean that withStreamInputBuffer should be implemented
> very efficiently in each MemoryStream. well, it's also possible but i
> prefer to use the same algorithms (and i hope to join implementations)
> for byte&char i/o over both MemoryStreams and BlockStreams
>
> on the other side, now i think about making BlockStream->MemoryStream
> transformer (it should just alloc it's own buffer) and then your's
> withStreamInputBuffer may be used for any I/O through buffer (i.e.
> both for BlockStreams and MemoryStreams). it will be great for implementing
> functions requiring lookahead such as vGetLine
>
>
>>>class (Stream m h) => TextStream m h | h->m where
>>> vGetChar :: h -> m Char
>>> vGetLine :: h -> m String
>>> vGetContents :: h -> m String
>>> vPutChar :: h -> Char -> m ()
>>> vPutStr :: h -> String -> m ()
>
>
>>Ok, I wouldn't put all those method in the class, I think.
>
>
> it is because you never mind about fast implementation :) i already
> moved vPutStrLn/vPrint out of this class.
>
>
>>Also it is
>>necessary to have buffering at the TextStream level, as Marcin pointed
>>out. I haven't thought through the design carefully here.
>
>
> implementation is rather obvious and therefore boring :) we should
> convert data to the UCS-4 and then work with buffer containing 4-byte
> chars. are you not done this in 6.6 compiler when you read
> UTF-8 sources? we will lose vTell operation (on the other side, we
> anyway lose it on text streams in windows :) )
>
>
>
>>>all these can be named a canonical Streams hierarchy and it already
>>>works.
>
>
>>So what are the problems you were referring to?
>
>
> i said only about working things. as you see, i don't mentioned,
> for example, InTextStream or SeekableStream classes, because i had
> problems with implementing this
>
>
> >> 2) separation of Stream classes make some automatic definitions
> >> impossible. for example, released version contains vGetBuf
> >> implementation that is defined via vGetChar and works ok for streams
> >> that provide only vGetChar as base function.
>
> it's a minor problem, so you can skip it. just for completeness: my
> first lib contained the following definitions:
>
> class Stream m h where
> vGetChar :: ...
> vGetPuf :: ...
>
> -- default implementations
> vGetPuf = ... some code using vGetChar
>
> instance Stream IO StringBuffer
> vGetChar = ...
> instance Stream IO StringReader
> vGetChar = ...
>
> and vGetBuf for StringBuffer and StringReader was defined
> automatically. try to make this effect for current design. it should
> define:
>
> instance (TextStream m h) => BlockStream m h
> vGetPuf = ... some code using vGetChar
>
> but such code seen by Haskell as potential overlapping instance
> definition
>
>
>
>>>i will be glad to add BufferedStream and SeekableStream classes
>>>and split BlockStream..TextStream to the reading and writing ones, but
>>>this is, as i said, limited by implementation issues
>
>
>>which implementation issues?
>
>
> well, let's we have 3 different FDs - ReadFD, WriteFD and ReadWriteFD.
> it's essentially the same Ints, just supporting different sets of I/O
> operations:
>
> instance InBlockStream ReadFD where ... -- implements vGetBuf
> instance InBlockStream ReadWriteFD where ...
> instance OutBlockStream WriteFD where ... -- implements vPutBuf
> instance OutBlockStream ReadWriteFD where ...
>
> now i'm going to implement InByteStream interface (vGetByte) and
> OutByteStream (vPutByte) for buffered FDs. i want to use
> the same data type constructor to buffer all 3 types of FD streams.
> well, we can write something like:
>
> instance (InBlockStream h) => InByteStream (Buffered h) where
> vGetByte = ... something involving vGetBuf at last
>
> instance (OutBlockStream h) => OutByteStream (Buffered h) where
> vPutByte = ... something involving vPutBuf at last
>
> at the first look, it seems fine. but we also need to implement some
> common functionality:
>
> instance (Stream h) => Stream (Buffered h) where
> vSeek = ...
>
> let's examine vSeek implementation. it should flush buffer on output
> streams, i.e. it should call vPutBuf. but that's impossible - vPutBuf
> is not in Streams dictionary! well, let's replicate implementation of
> Stream class:
>
> instance (InBlockStream h) => Stream (Buffered h) where
> ...
> instance (OutBlockStream h) => Stream (Buffered h) where
> ...
>
> but this, besides repeating almost the same code two times, again creates
> overlapped instances! what instance should use compiler for ReadWriteFD?
>
> yes, i can create 3 different data types for buffering input, output
> and input-output streams but this means even more code duplication.
> and multiplying these 3 i/o modes to the 3 buffering transformers i
> already have, i got limitless code growth!
>
>
> the same problem strikes me on each transformation stage:
> BlockStream->ByteStream, ByteStream->TextStream,
> ByteStream->BitsStream. it also bites me when i try to separate
> SeekableStream operations - vSeek operation required to discard input
> buffer, what is required in vSetBuf, for example. implementation of
> different Stream operations contains numerous dependencies from each
> other. may be it is great that i discover these dependencies at the early
> development stage instead of bothering with numerous errors when
> library will be really used in various environments
>
> btw, that you think about specifying buffering mode at the Stream
> creation? so vSetBuf will just gone away. to be exact, it will be
> still possible to use different buffering schemes, by creating new
> buffered Streams over the one raw Stream:
>
> h <- openRawFD "text"
> h1 <- withBuffering (BlockBuffering 512) h
> ... working with h1
> vFlush h1 -- discards buffer and returns `h` pointer to exact the
> -- position of first byte that is not consumed by `h1`
> h2 <- withBuffering LineBuffering h
> ... working with h2
>
>
More information about the Libraries
mailing list