[Haskell-cafe] Re: Streams: the extensible I/O library
simonmarhaskell at gmail.com
Tue Feb 21 07:06:57 EST 2006
Bulat Ziganshin wrote:
> Wednesday, February 08, 2006, 2:58:30 PM, you wrote:
> SM> I would prefer to see more type structure, rather than putting
> SM> everything in the Stream class. You have classes ByteStream,
> SM> BlockStream etc, but these are just renamings of the Stream class. There
> SM> are many compositions that are illegal, but we don't find out until
> SM> runtime; it would make a lot more sense to me to expose this structure
> SM> in the type system.
> i initially used separate classes (vGetBuf was in BlockStream, and so
> on), but came across problems with the type class system and decided
> to simplify the design. now i feel more confident with the classes,
> feel that i know the source of my previous problems, and am therefore
> slowly migrating back to the split-class design. the library as
> published is only halfway there. but i know some limitations. here is
> one problem:
> data BinHandle = forall h . (Stream IO h) => BinH h
One possibility is something like this:
data BinHandle = forall h . (Stream IO h, Typeable h) => BinH h
then you can recover the original stream type (by guessing what it is).
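To make the Typeable idea concrete, here is a minimal sketch. It assumes a simplified one-parameter Stream class rather than the library's Stream IO h, and the names MemBuf, FileStream, streamName and memContents are all made up for illustration. The "guess" is the call to cast, which succeeds only when the hidden type really is the one asked for.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.Typeable (Typeable, cast)
import Data.Word (Word8)

-- Simplified, hypothetical stand-in for the library's Stream class.
class Stream h where
  streamName :: h -> String

-- Two hypothetical stream types.
newtype MemBuf = MemBuf [Word8]
newtype FileStream = FileStream FilePath

instance Stream MemBuf where
  streamName _ = "memory"

instance Stream FileStream where
  streamName (FileStream path) = "file " ++ path

-- GHC 7.10 and later provide Typeable instances automatically.
data BinHandle = forall h . (Stream h, Typeable h) => BinH h

-- Operations from the class work on any handle.
handleName :: BinHandle -> String
handleName (BinH h) = streamName h

-- Recover the contents only if the hidden stream is a MemBuf;
-- any other stream type yields Nothing at runtime.
memContents :: BinHandle -> Maybe [Word8]
memContents (BinH h) = case cast h of
  Just (MemBuf bytes) -> Just bytes
  Nothing             -> Nothing
```

An operation like saveToFile would then pattern match on the result of memContents and fail (or fall back) in the Nothing case.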
Or there are other solutions - adding an extra field to BinH, or
separating the BinH constructor into two, one with a MemoryStream and
What's interesting is that you're saying you really want dynamic typing
here - you don't want to distinguish different types of BinHandle,
instead you want the saveToFile operation to fail at runtime if the
wrong kind of stream is used. It's slightly strange to use dynamic
typing here when the rest of the library would be using static typing,
so it might be worthwhile considering static typing solutions instead:
don't use an existential here, just add h as a parameter of BinHandle.
This does mean you have to add Stream predicates a lot of places,
though. Someday (soon, I hope), GHC will let you say
data BinHandle h where
  BinH :: Stream IO h => h -> BinHandle h
but it doesn't work right now (or at least, it doesn't do what you want).
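For comparison, the plain static version (h as a parameter, predicates written out on each function) might look like the sketch below. Again the classes are simplified one-parameter stand-ins, and MemoryStream, memBytes and snapshot are hypothetical names, not the library's.

```haskell
import Data.Word (Word8)

-- Simplified, hypothetical stand-ins for the library's classes.
class Stream h where
  streamName :: h -> String

class Stream h => MemoryStream h where
  memBytes :: h -> [Word8]

newtype MemBuf = MemBuf [Word8]
newtype FileStream = FileStream FilePath

instance Stream MemBuf where
  streamName _ = "memory"

instance MemoryStream MemBuf where
  memBytes (MemBuf bytes) = bytes

instance Stream FileStream where
  streamName (FileStream path) = "file " ++ path

-- No existential: the stream type stays visible in the handle's type.
newtype BinHandle h = BinH h

-- Every function has to state the predicate it needs.
handleName :: Stream h => BinHandle h -> String
handleName (BinH h) = streamName h

-- Only accepted for memory streams; applying it to a
-- BinHandle FileStream is rejected by the type checker.
snapshot :: MemoryStream h => BinHandle h -> [Word8]
snapshot (BinH h) = memBytes h
```

The cost is visible in the signatures: each function carries its own context, which is the "Stream predicates a lot of places" part.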
> moreover, splitting the Streams interface will require from the
> library users to give more classes in defining context for their
> functions, like the:
> process :: (Stream IO h, Seekable IO h, Buffered h) => h -> IO ()
> that is not so good, especially if adding new interfaces means
> slowdown of calls to this function
You can combine multiple classes with a dummy class that has all of
them as superclasses. If we get class synonyms (see Haskell' proposal)
this will get easier.
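A sketch of the dummy-class trick, using simplified one-parameter classes with made-up methods (the names FullStream, position, bufferSize and Demo are hypothetical). The catch-all instance needs FlexibleInstances and UndecidableInstances in GHC.

```haskell
{-# LANGUAGE FlexibleInstances, UndecidableInstances #-}

-- Simplified, hypothetical stand-ins for the three classes.
class Stream h where
  streamName :: h -> String

class Seekable h where
  position :: h -> Int

class Buffered h where
  bufferSize :: h -> Int

-- The dummy class: no methods of its own, only superclasses.
class (Stream h, Seekable h, Buffered h) => FullStream h

-- One catch-all instance, so anything with all three is a FullStream.
instance (Stream h, Seekable h, Buffered h) => FullStream h

-- Users now write a single predicate instead of three.
describe :: FullStream h => h -> String
describe h =
  streamName h ++ " at " ++ show (position h)
    ++ ", buffer " ++ show (bufferSize h)

-- A hypothetical stream type to try it on.
data Demo = Demo Int Int

instance Stream Demo where
  streamName _ = "demo"

instance Seekable Demo where
  position (Demo pos _) = pos

instance Buffered Demo where
  bufferSize (Demo _ size) = size
```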
Performance of the example above might actually be better than having a
single Stream class, depending on how much dictionary *building* needs
to happen. In your library, every time an overloaded Stream function is
called, it must be passed a dictionary for Stream, which is a tuple with
20+ elements. These dictionaries will probably be built at runtime,
because of the superclass structure (the compiler usually won't be able
to predict what layering of stream transformers will be used, and hence
what dictionaries will be needed). You can provide some specialisations
to help - SPECIALISE INSTANCE should be useful here.
The point is that the performance implications aren't obvious, it
depends a lot on how much sharing of dictionaries happens.
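As an illustration of where the pragma goes, here is a small sketch with a pure toy model of streams. The class, the Counted transformer and the Bytes type are hypothetical, and whether the specialisation actually pays off would have to be measured. FlexibleInstances is switched on only because the pragma names a head that is not a plain type variable application.

```haskell
{-# LANGUAGE FlexibleInstances #-}
import Data.Word (Word8)

-- Toy, pure stand-in for a stream class (hypothetical).
class Stream h where
  getByte :: h -> Maybe (Word8, h)

-- A base stream over a list of bytes.
newtype Bytes = Bytes [Word8]

instance Stream Bytes where
  getByte (Bytes [])       = Nothing
  getByte (Bytes (b : bs)) = Just (b, Bytes bs)

-- A stream transformer that counts how many bytes passed through.
data Counted h = Counted Int h

instance Stream h => Stream (Counted h) where
  -- Ask GHC for a copy of this instance specialised to the common
  -- layering, so its dictionary need not be built from Stream Bytes
  -- at runtime.
  {-# SPECIALISE instance Stream (Counted Bytes) #-}
  getByte (Counted n h) = case getByte h of
    Nothing      -> Nothing
    Just (b, h') -> Just (b, Counted (n + 1) h')

-- An overloaded consumer: each call is passed a Stream dictionary.
drain :: Stream h => h -> [Word8]
drain h = case getByte h of
  Nothing      -> []
  Just (b, h') -> b : drain h'
```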
> SM> Also I'd like to see separate
> SM> input/output streams for even more type safety, and I believe
> SM> simplicity,
> it will be great! but it is very uneasy and even seems impossible:
> 1) this will prevent dividing streams into the
> MemoryStream/BlockStream/ByteStream, what i like you consider as more
> important. it is impossible to say what InputStream BlockStream
> implements only vGetBuf, while OutputStream BlockStream implements
> only vPutBuf operation
I don't think so - you just have InputByteStream/OutputByteStream
classes, and similarly for the others.
> 2) such division will require to implement 2 or 3 (+ReadWrite) times
> more Stream types than now. Say, instead of FD we will get InputFD and
> OutputFD, instead of CharEncoding transformer - two transformers and
> so on. most of the functionality in Input and Output variants will be
> repeated (because this functionality doesn't depend on input/output
> mode) and in addition to the current large lists of passed calls like
> vIsEOF (WithEncoding h _) = vIsEOF h
> vMkIOError (WithEncoding h _) = vMkIOError h
> vReady (WithEncoding h _) = vReady h
> vIsReadable (WithEncoding h _) = vIsReadable h
> we will get the same lists in 2 or 3 repetitions!!!
the common operations should be members of a separate superclass. I
have in mind a structure like this:
class Stream h where
  streamEOF   :: h -> IO Bool
  streamReady :: h -> IO Bool
  streamClose :: h -> IO ()
class Stream h => InputByteStream h where
  streamGet :: h -> IO Word8
class Stream h => InputBlockStream h where
  streamGetBuf :: h -> Int -> Ptr Word8 -> IO Int
class Stream h => InputMemoryStream h where
  streamGetMem :: h -> IO (Ptr Word8)
there's no duplication, just more structure.
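To show what this buys a transformer author, here is a sketch assuming the input and output classes declare Stream as a superclass. The Logged transformer, the Queue stream, OutputByteStream and streamPut are hypothetical names for illustration. The pass-through list for the common operations is written once, and the input and output instances each add only their own method.

```haskell
import Data.IORef (IORef, modifyIORef, newIORef, readIORef, writeIORef)
import Data.Word (Word8)

class Stream h where
  streamEOF   :: h -> IO Bool
  streamClose :: h -> IO ()

class Stream h => InputByteStream h where
  streamGet :: h -> IO Word8

-- Hypothetical output counterpart.
class Stream h => OutputByteStream h where
  streamPut :: h -> Word8 -> IO ()

-- A hypothetical pass-through transformer.
newtype Logged h = Logged h

-- Common operations are forwarded once, here.
instance Stream h => Stream (Logged h) where
  streamEOF   (Logged h) = streamEOF h
  streamClose (Logged h) = streamClose h

-- Input and output instances carry no repeated forwarding lists.
instance InputByteStream h => InputByteStream (Logged h) where
  streamGet (Logged h) = streamGet h

instance OutputByteStream h => OutputByteStream (Logged h) where
  streamPut (Logged h) = streamPut h

-- A hypothetical in-memory stream to layer the transformer on.
newtype Queue = Queue (IORef [Word8])

newQueue :: IO Queue
newQueue = Queue <$> newIORef []

instance Stream Queue where
  streamEOF   (Queue r) = null <$> readIORef r
  streamClose (Queue r) = writeIORef r []

instance InputByteStream Queue where
  streamGet (Queue r) = do
    bytes <- readIORef r
    case bytes of
      (b : bs) -> writeIORef r bs >> return b
      []       -> ioError (userError "end of stream")

instance OutputByteStream Queue where
  streamPut (Queue r) b = modifyIORef r (++ [b])
```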
By the way, I like your idea of exposing the difference between
MemoryStream and ByteStream. I'm not so sure about the difference
between BlockStream and ByteStream - I think BlockStream should be the
lowest level, and all the ByteStream operations can be provided on
BlockStreams (or MemoryStreams) by reading/writing one byte at a time.
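A sketch of the input half of that, assuming streamGetBuf returns the number of bytes actually read (0 at end of stream). The helper getByteViaBlock and the ListSource stream are hypothetical names used only to make the example runnable.

```haskell
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Word (Word8)
import Foreign.Marshal.Alloc (alloca)
import Foreign.Marshal.Array (pokeArray)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek)

class InputBlockStream h where
  -- Read up to n bytes into the buffer; return how many were read.
  streamGetBuf :: h -> Int -> Ptr Word8 -> IO Int

-- A byte-level read on any block stream: a block read of size one
-- into a temporary one-byte buffer. Nothing signals end of stream.
getByteViaBlock :: InputBlockStream h => h -> IO (Maybe Word8)
getByteViaBlock h = alloca $ \p -> do
  n <- streamGetBuf h 1 p
  if n == 1 then Just <$> peek p else return Nothing

-- A hypothetical in-memory block stream for trying it out.
newtype ListSource = ListSource (IORef [Word8])

newListSource :: [Word8] -> IO ListSource
newListSource bytes = ListSource <$> newIORef bytes

instance InputBlockStream ListSource where
  streamGetBuf (ListSource ref) n p = do
    bytes <- readIORef ref
    let (now, later) = splitAt n bytes
    pokeArray p now          -- copy the available bytes to the buffer
    writeIORef ref later
    return (length now)
```

Going through a one-byte block read on every call is slow, of course, which is why you would normally put a buffering layer in between.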
> 3) i don't think that we can completely throw away the r/w streams,
> they can be required for example for database-style access.
This doesn't stop you from having read/write streams, but it means that
you could implement read-only and write-only buffering without the
complication that comes with the possibility of read/write. You have to
implement read/write buffering separately from read-only and write-only
buffering.
For example, you could have an InOutBufferedStream transformer that
layers on top of two underlying buffered streams, and remembers which
one was used last. If an operation occurs on the other one, then the
in-use buffer is flushed first. This means you only pay the penalty of
checking & flushing when you need to use this read/write transformer,
rather than in every buffered stream.
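Roughly, such a transformer could look like the sketch below. The Buffered class with a single flushBuffer method is an assumption (for an input stream, "flush" would mean discarding read-ahead), and withInput, withOutput and FlushCounter are hypothetical names. Only the bookkeeping is shown, not real buffering.

```haskell
import Control.Monad (when)
import Data.IORef (IORef, modifyIORef, newIORef, readIORef, writeIORef)

-- Hypothetical minimal interface: a stream whose buffer can be flushed.
class Buffered h where
  flushBuffer :: h -> IO ()

data Side = InputSide | OutputSide
  deriving Eq

-- Two underlying buffered streams plus a note of which was used last.
data InOutBuffered i o = InOutBuffered
  { inputSide  :: i
  , outputSide :: o
  , lastUsed   :: IORef (Maybe Side)
  }

newInOut :: i -> o -> IO (InOutBuffered i o)
newInOut i o = InOutBuffered i o <$> newIORef Nothing

-- Run an input operation, flushing the output buffer first if the
-- previous operation was on the output side.
withInput :: Buffered o => InOutBuffered i o -> (i -> IO a) -> IO a
withInput s act = do
  prev <- readIORef (lastUsed s)
  when (prev == Just OutputSide) (flushBuffer (outputSide s))
  writeIORef (lastUsed s) (Just InputSide)
  act (inputSide s)

-- The mirror image for output operations.
withOutput :: Buffered i => InOutBuffered i o -> (o -> IO a) -> IO a
withOutput s act = do
  prev <- readIORef (lastUsed s)
  when (prev == Just InputSide) (flushBuffer (inputSide s))
  writeIORef (lastUsed s) (Just OutputSide)
  act (outputSide s)

-- A stand-in "buffered stream" that only counts its flushes.
newtype FlushCounter = FlushCounter (IORef Int)

instance Buffered FlushCounter where
  flushBuffer (FlushCounter r) = modifyIORef r (+ 1)
```

The check-and-flush cost lives entirely in withInput and withOutput, so plain read-only and write-only buffered streams never pay for it.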
> why ByteStream implements both the byte and text i/o? i think that in
> most cases people are still using latin-1 text i/o - i.e. each char is
> just 8 bits without any encoding. because that type of text i/o doesn't
> need any complex implementation, each time byte i/o is
> implemented, text i/o springs up automatically. on the other side, utf-8
> encoding is rare and therefore a separate transformer is used to
> implement it - it transforms each char i/o call into several byte i/o
> calls. i am definitely against implementing text i/o only through the
> encoding transformer because it will slow down the i/o while in 90% of
> cases encoding will not be used.
I don't think this is the way to go. We should assume that text
encoding/decoding is the norm, rather than optimising (heavily) for Latin-1.
By all means have a specialised Latin-1 stream transformer, which can be
very efficient, but don't build it into the byte stream class. Byte
streams should deal in bytes, not Chars.
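For instance, the Latin-1 case could be a thin transformer along these lines. This is only a sketch: InputTextStream, streamGetChar, Latin1 and ByteSource are hypothetical names, and it decodes one byte per call rather than working on buffers.

```haskell
import Data.Char (chr)
import Data.IORef (IORef, newIORef, readIORef, writeIORef)
import Data.Word (Word8)

-- The byte-level class knows nothing about Chars.
class InputByteStream h where
  streamGet :: h -> IO Word8

-- Hypothetical text-level class, kept separate from the byte class.
class InputTextStream h where
  streamGetChar :: h -> IO Char

-- Latin-1 transformer: each byte is exactly one code point.
newtype Latin1 h = Latin1 h

instance InputByteStream h => InputTextStream (Latin1 h) where
  streamGetChar (Latin1 h) = chr . fromIntegral <$> streamGet h

-- A hypothetical in-memory byte stream to layer it on.
newtype ByteSource = ByteSource (IORef [Word8])

newByteSource :: [Word8] -> IO ByteSource
newByteSource bytes = ByteSource <$> newIORef bytes

instance InputByteStream ByteSource where
  streamGet (ByteSource r) = do
    bytes <- readIORef r
    case bytes of
      (b : bs) -> writeIORef r bs >> return b
      []       -> ioError (userError "end of stream")
```

With this split, calling streamGetChar directly on a ByteSource is a type error; you have to say which encoding you mean by picking a transformer.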
UTF-8 might be rare at the moment, but it will be the norm soon. The
library should provide fast text encoding/decoding, which means doing it
via buffers, and probably using iconv, since we don't want to
re-implement the encodings ourselves - for example what happens with
UTF-8 encoding errors in your implementation?
This gives us a dilemma - since encoding and decoding needs to operate
directly on buffers, it needs direct access to the buffer. So it looks
like encoding/decoding and buffering must be combined (this is what I
did in my library). But, you don't want to add this buffering to a
memory stream, so perhaps buffered streams should be instances of
MemoryStreams, and text encoding/decoding should layer on MemoryStreams
only? (is this what you did?)
> SM> This will improve
> SM> performance too - your Stream class has dictionaries with 20+ elements.
> here you are king - i don't know whether it's better to have one class
> with 20 methods or 2 classes with 5 methods each in the function
see above - I'd like to see some measurements here.
> SM> I see that buffering works on vPutChar/vGetChar, and yet you seem to be
> SM> buffering bytes - which is it? Are you supposed to buffer before or
> SM> after doing character encoding? It seems before, because otherwise
> SM> buffering will strip out all but the low 8 bits of each character.
> SM> Using a more explicit type structure would help a lot here.
> buffer contains bytes, which are read/written by the getbyte/putbyte
> operations as well as the all text i/o. this is latin1-only solution,
> of course. if one need utf-8 encoding, he need to apply CharEncoding
see above - I think this is wrong.