[Haskell-cafe] Re: Streams: the extensible I/O library

Tue Feb 21 07:06:57 EST 2006

Bulat Ziganshin wrote:

> Wednesday, February 08, 2006, 2:58:30 PM, you wrote:
> SM> I would prefer to see more type structure, rather than putting
> SM> everything in the Stream class.  You have classes ByteStream, 
> SM> BlockStream etc, but these are just renamings of the Stream class. There 
> SM> are many compositions that are illegal, but we don't find out until 
> SM> runtime; it would make a lot more sense to me to expose this structure 
> SM> in the type system.
> 
> i initially used normal split classes (vGetBuf was in BlockStream)
> and so on, but came across problems with the type class system and
> decided to simplify the design. now i feel more confident with
> the classes, feel that i know the source of my previous problems and
> am therefore slowly migrating back to the split classes design. the
> library as published is just halfway there. but i know some
> limitations. here is one problem:
> 
> data BinHandle = forall h . (Stream IO h) => BinH h

One possibility is something like this:

data BinHandle = forall h . (Stream IO h, Typeable h) => BinH h

then you can recover the original stream type (by guessing what it is).

Or there are other solutions - adding an extra field to BinH, or 
separating the BinH constructor into two, one with a MemoryStream and 
one without.
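
As a rough sketch of the Typeable option: the Stream class, the two stream types and the helper names below are hypothetical stand-ins (the real class is `Stream IO h` with many more methods), and the code assumes a GHC recent enough (7.10 or later) that Typeable instances exist for every type without a deriving clause.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.Typeable (Typeable, cast)
import Data.Word (Word8)

-- Hypothetical, heavily simplified stand-in for the library's Stream class.
class Stream h where
  streamName :: h -> String

-- Two hypothetical stream types.
newtype MemStream  = MemStream [Word8]
newtype FileStream = FileStream FilePath

instance Stream MemStream  where streamName _ = "memory"
instance Stream FileStream where streamName (FileStream p) = "file " ++ p

-- The existential wrapper, with the extra Typeable constraint.
data BinHandle = forall h . (Stream h, Typeable h) => BinH h

-- "Guess" that the hidden stream is a MemStream; Nothing if the guess is wrong.
asMemStream :: BinHandle -> Maybe MemStream
asMemStream (BinH h) = cast h

-- An operation that only makes sense for memory streams. For any other
-- kind of stream it fails at runtime (here by returning Nothing).
memContents :: BinHandle -> Maybe [Word8]
memContents bh = fmap (\(MemStream bytes) -> bytes) (asMemStream bh)
```

With these definitions, `memContents (BinH (FileStream "x"))` is rejected only when it runs, which is exactly the dynamic-typing behaviour discussed next.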

What's interesting is that you're saying you really want dynamic typing 
here - you don't want to distinguish different types of BinHandle, 
instead you want the saveToFile operation to fail at runtime if the 
wrong kind of stream is used.  It's slightly strange to use dynamic 
typing here when the rest of the library would be using static typing, 
so it might be worthwhile considering static typing solutions instead: 
don't use an existential here, just add h as a parameter of BinHandle. 
This does mean you have to add Stream predicates in a lot of places, 
though.  Someday (soon, I hope), GHC will let you say

   data BinHandle h where BinH :: Stream IO h => h -> BinHandle h

but it doesn't work right now (or at least, it doesn't do what you want).

> moreover, splitting the Streams interface will require from the
> library users to give more classes in defining context for their
> functions, like the:
> 
> process :: (Stream IO h, Seekable IO h, Buffered h) => h -> IO ()
> 
> that is not so good, especially if adding new interfaces means
> slowdown of calls to this function

You can combine multiple classes under a single dummy class that has 
them all as superclasses.  If we get class synonyms (see Haskell' 
proposal) this will get easier.
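
A minimal sketch of that trick, using single-parameter stand-ins for the classes in the quoted signature; the method names, the FullStream class and the Dummy type are all made up for illustration and are not the library's API.

```haskell
-- Hypothetical, simplified versions of the three classes.
class Stream h   where streamEOF     :: h -> IO Bool
class Seekable h where streamTell    :: h -> IO Integer
class Buffered h where streamBufSize :: h -> IO Int

-- The dummy class: no methods of its own, it only bundles the three
-- classes above as superclasses.
class (Stream h, Seekable h, Buffered h) => FullStream h

-- Users now write one constraint instead of three.
process :: FullStream h => h -> IO String
process h = do
  eof <- streamEOF h
  pos <- streamTell h
  n   <- streamBufSize h
  return (show eof ++ " " ++ show pos ++ " " ++ show n)

-- A toy stream type with the three real instances plus the empty one.
data Dummy = Dummy

instance Stream Dummy   where streamEOF _     = return True
instance Seekable Dummy where streamTell _    = return 0
instance Buffered Dummy where streamBufSize _ = return 4096
instance FullStream Dummy
```

The cost is one empty instance declaration per stream type, which is the boilerplate that class synonyms would remove.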

Performance of the example above might actually be better than having a 
single Stream class, depending on how much dictionary *building* needs 
to happen.  In your library, every time an overloaded Stream function is 
called, it must be passed a dictionary for Stream, which is a tuple with 
20+ elements.  These dictionaries will probably be built at runtime, 
because of the superclass structure (the compiler usually won't be able 
to predict what layering of stream transformers will be used, and hence 
what dictionaries will be needed).  You can provide some specialisations 
to help - SPECIALISE INSTANCE should be useful here.

The point is that the performance implications aren't obvious: they 
depend a lot on how much sharing of dictionaries happens.

> SM> Also I'd like to see separate 
> SM> input/output streams for even more type safety, and I believe 
> SM> simplicity,
> 
> it will be great! but it is very uneasy and even seems impossible:
> 
> 1) this will prevent dividing streams into the
> MemoryStream/BlockStream/ByteStream, what i like you consider as more
> important. it is impossible to say what InputStream BlockStream
> implements only vGetBuf, while OutputStream BlockStream implements
> only vPutBuf operation

I don't think so - you just have InputByteStream/OutputByteStream 
classes, and similarly for the others.

> 2) such division will require to implement 2 or 3 (+ReadWrite) times
> more Stream types than now. Say, instead of FD we will get InputFD and
> OutputFD, instead of CharEncoding transformer - two transformers and
> so on. most of the functionality in Input and Output variants will be
> repeated (because this functionality doesn't depend on input/output
> mode) and in addition to the current large lists of passed calls like
> the:
> 
>     vIsEOF        (WithEncoding h _) = vIsEOF        h
>     vMkIOError    (WithEncoding h _) = vMkIOError    h
>     vReady        (WithEncoding h _) = vReady        h
>     vIsReadable   (WithEncoding h _) = vIsReadable   h
> 
> we will get the same lists in 2 or 3 repetitions!!!

The common operations should be members of a separate superclass.  I 
have in mind a structure like this:

class Stream h where
   streamEOF   :: h -> IO Bool
   streamReady :: h -> IO Bool
   streamClose :: h -> IO ()

class Stream h => InputByteStream h where
   streamGet :: h -> IO Word8
   ...

class Stream h => InputBlockStream h where
   streamGetBuf :: h -> Int -> Ptr Word8 -> IO Int
   ...

class Stream h => InputMemoryStream h where
   streamGetMem :: h -> IO (Ptr Word8)
   ...

there's no duplication, just more structure.

By the way, I like your idea of exposing the difference between 
MemoryStream and ByteStream.  I'm not so sure about the difference 
between BlockStream and ByteStream - I think BlockStream should be the 
lowest level, and all the ByteStream operations can be provided on 
BlockStreams (or MemoryStreams) by reading/writing one byte at a time.
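
To illustrate the read side of that layering, here is a small sketch. It reuses the streamGetBuf signature from above but drops the Stream superclass to stay short; streamGetViaBuf, the Maybe result for end of stream, and the list-backed ListStream are assumptions made for the example.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek, pokeElemOff)

class InputBlockStream h where
  -- Read up to n bytes into the buffer; return how many were read (0 = EOF).
  streamGetBuf :: h -> Int -> Ptr Word8 -> IO Int

-- Byte-at-a-time reading for any block stream: a one-byte block read.
-- Returning Maybe (Nothing at end of stream) is a choice made for this sketch.
streamGetViaBuf :: InputBlockStream h => h -> IO (Maybe Word8)
streamGetViaBuf h = alloca $ \p -> do
  n <- streamGetBuf h 1 p
  if n == 1 then fmap Just (peek p) else return Nothing

-- Toy block stream backed by a list of bytes, only for trying this out.
newtype ListStream = ListStream (IORef [Word8])

instance InputBlockStream ListStream where
  streamGetBuf (ListStream ref) n p = do
    xs <- readIORef ref
    let (now, later) = splitAt n xs
    writeIORef ref later
    mapM_ (\(i, b) -> pokeElemOff p i b) (zip [0 ..] now)
    return (length now)
```

A real implementation would want buffering underneath, since one block call per byte is slow; the point is only that nothing beyond streamGetBuf is needed.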

> 3) i don't think that we can completely throw away the r/w streams,
> they can be required for example for database-style access.

This doesn't stop you from having read/write streams, but it means that 
you could implement read-only and write-only buffering without the 
complication that comes with the possibility of read/write.  You have to 
implement read/write buffering separately from read-only and write-only 
buffering.

For example, you could have an InOutBufferedStream transformer that 
layers on top of two underlying buffered streams, and remembers which 
one was used last.  If an operation occurs on the other one, then the 
in-use buffer is flushed first.  This means you only pay the penalty of 
checking & flushing when you need to use this read/write transformer, 
rather than in every buffered stream.
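
A sketch of how such a transformer might track the side last used. Everything here is hypothetical: the Buffered class with a single streamFlush method, the InOut type, and the withInput/withOutput helpers. The Counter type exists only to make flushes observable.

```haskell
import Control.Monad (when)
import Data.IORef (IORef, modifyIORef, newIORef, readIORef, writeIORef)

-- Hypothetical interface: all we need from a buffered stream is a flush.
class Buffered h where
  streamFlush :: h -> IO ()

-- Which side of the read/write pair was used most recently.
data Side = None | Reading | Writing deriving Eq

-- The transformer: an input stream, an output stream, and the last side used.
data InOut i o = InOut i o (IORef Side)

mkInOut :: i -> o -> IO (InOut i o)
mkInOut i o = fmap (InOut i o) (newIORef None)

-- Run an input operation, flushing the output side first if it was in use.
withInput :: Buffered o => InOut i o -> (i -> IO a) -> IO a
withInput (InOut i o ref) act = do
  prev <- readIORef ref
  when (prev == Writing) (streamFlush o)
  writeIORef ref Reading
  act i

-- Run an output operation, flushing the input side first if it was in use.
withOutput :: Buffered i => InOut i o -> (o -> IO a) -> IO a
withOutput (InOut i o ref) act = do
  prev <- readIORef ref
  when (prev == Reading) (streamFlush i)
  writeIORef ref Writing
  act o

-- Toy "stream" that just counts how often it has been flushed.
newtype Counter = Counter (IORef Int)

instance Buffered Counter where
  streamFlush (Counter r) = modifyIORef r (+ 1)
```

Plain read-only or write-only buffered streams never touch the Side check, so only users of InOut pay for it.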

> why ByteStream implements both the byte and text i/o? i think that in
> most cases people are still using latin-1 text i/o - i.e. each char is
> just 8 bits without any encoding. because that type of text i/o doesn't
> need any complex implementation, each time when byte i/o is
> implemented, text i/o springs automatically. on the other side, utf-8
> encoding is rare and therefore separate transformer is used to
> implement it - it transforms each char i/o call into several byte i/o
> calls.
> 
> i definitely against implementing text i/o only through the encoding
> transformer because it will slowdown the i/o while in 90% cases
> encoding will not be used.

I don't think this is the way to go.  We should assume that text 
encoding/decoding is the norm, rather than optimising (heavily) for Latin-1.

By all means have a specialised Latin-1 stream transformer, which can be 
very efficient, but don't build it into the byte stream class.  Byte 
streams should deal in bytes, not Chars.
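
For instance, a Latin-1 transformer could look roughly like the sketch below. The four class names, the Maybe results, the Latin1 wrapper and the ByteQueue test stream are invented for illustration; the byte classes know nothing about Char, and only the wrapper does.

```haskell
import Data.Char (chr, ord)
import Data.IORef (IORef, modifyIORef, newIORef, readIORef, writeIORef)
import Data.Word (Word8)

-- Hypothetical byte-level and text-level interfaces, kept separate.
class InputByteStream h  where streamGet     :: h -> IO (Maybe Word8)
class OutputByteStream h where streamPut     :: h -> Word8 -> IO ()
class InputTextStream h  where streamGetChar :: h -> IO (Maybe Char)
class OutputTextStream h where streamPutChar :: h -> Char -> IO ()

-- Latin-1 maps code points 0..255 directly to single bytes.
decodeLatin1 :: Word8 -> Char
decodeLatin1 = chr . fromIntegral

encodeLatin1 :: Char -> Maybe Word8
encodeLatin1 c
  | ord c < 256 = Just (fromIntegral (ord c))
  | otherwise   = Nothing

-- The transformer: wraps any byte stream and adds text operations.
newtype Latin1 h = Latin1 h

instance InputByteStream h => InputTextStream (Latin1 h) where
  streamGetChar (Latin1 h) = fmap (fmap decodeLatin1) (streamGet h)

instance OutputByteStream h => OutputTextStream (Latin1 h) where
  streamPutChar (Latin1 h) c = case encodeLatin1 c of
    Just b  -> streamPut h b
    Nothing -> ioError (userError "character not representable in Latin-1")

-- Toy in-memory byte stream for trying the transformer out.
newtype ByteQueue = ByteQueue (IORef [Word8])

instance InputByteStream ByteQueue where
  streamGet (ByteQueue ref) = do
    xs <- readIORef ref
    case xs of
      []       -> return Nothing
      (b : bs) -> writeIORef ref bs >> return (Just b)

instance OutputByteStream ByteQueue where
  streamPut (ByteQueue ref) b = modifyIORef ref (++ [b])
```

Other encodings would be further wrappers of the same shape, so choosing Latin-1 shows up in the type rather than being the silent default.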

UTF-8 might be rare at the moment, but it will be the norm soon.  The 
library should provide fast text encoding/decoding, which means doing it 
via buffers, and probably using iconv, since we don't want to 
re-implement the encodings ourselves - for example what happens with 
UTF-8 encoding errors in your implementation?

This gives us a dilemma: encoding and decoding need to operate 
directly on buffers, so they need direct access to the buffer.  So it looks 
like encoding/decoding and buffering must be combined (this is what I 
did in my library).  But, you don't want to add this buffering to a 
memory stream, so perhaps buffered streams should be instances of 
MemoryStreams, and text encoding/decoding should layer on MemoryStreams 
only? (is this what you did?)

> SM> This will improve 
> SM> performance too - your Stream class has dictionaries with 20+ elements.
> 
> here you are king - i don't know whether it's better to have one class
> with 20 methods or 2 classes with 5 methods each in the function
> context?

See above - I'd like to see some measurements here.

> SM> I see that buffering works on vPutChar/vGetChar, and yet you seem to be 
> SM> buffering bytes - which is it?  Are you supposed to buffer before or 
> SM> after doing character encoding?  It seems before, because otherwise 
> SM> buffering will strip out all but the low 8 bits of each character. 
> SM> Using a more explicit type structure would help a lot here.
> 
> buffer contains bytes, which are read/written by the getbyte/putbyte
> operations as well as the all text i/o. this is latin1-only solution,
> of course. if one need utf-8 encoding, he need to apply CharEncoding
> transformer

See above - I think this is wrong.

Cheers,
	Simon