[Haskell-cafe] ANNOUNCE: streaming-conduit
Ivan Lazar Miljenovic
ivan.miljenovic at gmail.com
Mon Jun 12 01:05:45 UTC 2017
On 12 June 2017 at 06:55, Olaf Klinke <olf at aatal-apotheke.de> wrote:
> Dear Ivan,
> I'm confused. The package documentation states that the typical problem to avoid is extracting a long list from IO, which is bound to allocate too much memory.
Well, this is the whole point of libraries like streaming, conduit,
pipes, etc., but not specifically of streaming-conduit (which exists
just to help two of these libraries interoperate).
> However, if I do
> preprocess :: IO [a]
> preprocess = fmap (map (parse :: String -> a) . lines) $ readFile filename
> then this may create a huge pile of thunks if the rest of the program is not written in a streaming style. But if done right, lazy IO will make sure only the necessary part of the list is kept in memory.
Lazy I/O works really well if you have just a single input stream that
you can read in and process elements of using an existing lazy I/O
function such as readFile then write out lazily as well. However, it
does require the unsafeInterleaveIO hack in readFile to work properly.
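To make that hack concrete, here is a minimal base-only sketch of how unsafeInterleaveIO defers effects until a value is demanded. The IORef "source" is a hypothetical stand-in for a file handle; GHC's actual readFile gets the same behaviour via hGetContents:

```haskell
import Data.IORef
import System.IO.Unsafe (unsafeInterleaveIO)

-- Produce a lazy list whose elements are only "read" when demanded.
-- The IORef stands in for a file handle here.
lazyReadAll :: IORef [Int] -> IO [Int]
lazyReadAll ref = unsafeInterleaveIO $ do
  xs <- readIORef ref
  case xs of
    []     -> return []
    (y:ys) -> do
      writeIORef ref ys
      rest <- lazyReadAll ref
      return (y : rest)

main :: IO ()
main = do
  ref <- newIORef [1 .. 5]
  xs  <- lazyReadAll ref   -- no reads have happened yet
  print (take 2 xs)        -- forces exactly two reads; prints [1,2]
```

The whole point (and the danger) is that the remaining reads happen whenever someone later forces the rest of the list, which may be far away from any visible IO.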
I have done stuff like this in production software, where great care
was taken to ensure it remained lazy and was able to make it stream
(even though we had existing streaming use in other parts of the
codebase).
In particular, the issue is that you have to be able to somehow get
the entire result of the IO, treat it as a pure value, then do IO with
it again... but do so lazily, which doesn't always work well.
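The classic instance of that "IO in, pure transform, IO out" sandwich is interact, which only stays constant-space because readFile-style laziness is preserved end to end. A small sketch (the upper-casing transform is just an arbitrary pure middle):

```haskell
import Data.Char (toUpper)

-- The pure middle of the sandwich: easy to test, no IO involved.
process :: String -> String
process = unlines . map (map toUpper) . lines

-- interact reads stdin lazily, applies the pure function, and writes
-- stdout lazily; input is consumed on demand, a line at a time.
main :: IO ()
main = interact process
```

If anything downstream forces the whole input (a length, a sort, a strict fold), the laziness silently collapses and the entire input is held in memory at once.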
What these libraries do is make that streaming style explicit, so at a
low-level you specifically request the next chunk. Taking your
preprocess example, the equivalent would be:
1. You read in a file as a ByteString
2. This is converted to String a chunk at a time
3. This implicit stream of Strings (or possibly just Chars, depending
on the implementation) is converted into an explicit Stream of
line-based Strings (though 2 and 3 may be switched around)
4. You map the parsing function on each element of the stream.
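The steps above can be sketched with a hand-rolled toy stream type. This is a deliberate simplification; the real libraries' types (e.g. streaming's `Stream (Of a) IO r`) carry more structure, but the "pull the next element explicitly" shape is the same:

```haskell
-- Each step either finishes or yields one element together with an
-- action that produces the rest of the stream on demand.
data Stream a = Done | Yield a (IO (Stream a))

-- Steps 1-3 collapsed: an explicit stream of lines (simulated here
-- from a list; a real source would read a chunk from a Handle).
sourceLines :: [String] -> Stream String
sourceLines []     = Done
sourceLines (l:ls) = Yield l (return (sourceLines ls))

-- Step 4: map a parsing function over each element as it is pulled.
mapS :: (a -> b) -> Stream a -> Stream b
mapS _ Done        = Done
mapS f (Yield x k) = Yield (f x) (fmap (mapS f) k)

-- A consumer that pulls the whole stream into a list.
toListS :: Stream a -> IO [a]
toListS Done        = return []
toListS (Yield x k) = (x :) <$> (k >>= toListS)

main :: IO ()
main = do
  ns <- toListS (mapS length (sourceLines ["ab", "cde", ""]))
  print ns  -- prints [2,3,0]
```

Note that each element only comes into existence when the consumer asks for the next step, so nothing forces the whole input into memory at once.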
However, at a higher level there are various combinators to take most
of this into account for you. Streaming in particular tries to
emulate a list-based style, so that `IO [a]` is equivalent to `Stream
(Of a) IO r`. The API lets you act like you are operating over a list
(and still doing it with function composition, which makes it
different from most of the other similar libraries) whilst taking care
of the IO for you.
This does mean you can't do as much of the "do IO to get a value,
operate purely on the value, then do IO to put it back out" style, but
you can write combinators that do the streaming without specifying
which monad they operate in (and which are thus unable to do any IO
themselves).
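To see why leaving the monad unspecified rules out IO, here is a sketch with a toy stream type generic in its monad (again a simplification of what the real libraries provide):

```haskell
import Data.Functor.Identity (runIdentity)

-- Toy stream, generic in its monad m.
data Stream m a = Done | Yield a (m (Stream m a))

-- This combinator only assumes Monad m, so its type forbids it from
-- performing IO; the caller picks the monad (IO, Identity, ...).
takeS :: Monad m => Int -> Stream m a -> Stream m a
takeS n _ | n <= 0  = Done
takeS _ Done        = Done
takeS n (Yield x k) = Yield x (fmap (takeS (n - 1)) k)

fromListS :: Monad m => [a] -> Stream m a
fromListS = foldr (\x s -> Yield x (return s)) Done

toListS :: Monad m => Stream m a -> m [a]
toListS Done        = return []
toListS (Yield x k) = (x :) <$> (k >>= toListS)

-- The same combinator run purely, with no IO in sight.
demo :: [Int]
demo = runIdentity (toListS (takeS 2 (fromListS [1, 2, 3])))
```

Here demo evaluates to [1,2] entirely in the Identity monad, which is exactly the guarantee: a combinator polymorphic in m cannot sneak effects in.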
TL;DR: these kinds of libraries allow you to explicitly state how
values are streamed throughout your code rather than relying upon the
lazy I/O hacks (which are fine if they work for you, but which tend to
fail as your code grows more complex).
> For example, a strict function like
> postprocess :: [a] -> IO ()
> postprocess = mapM_ (print . f)
> with some strict f would fit the bill. Did I get the concept of lazy IO wrong? What is the operational semantics of the following fragment?
> x:xs <- someIOaction :: IO [a]
> Control.DeepSeq.force x
> I used to think that xs is now a thunk which, when evaluated further, may trigger more (read) IO actions. Hence, it would behave as if someIOaction was one of your Streams. I do acknowledge, though, that the style above is easy to get wrong [*], whereas explicitly wrapping each list element in its own monadic action makes the intentions more verbose.
> [*] If someIOaction needs to touch every list element in order to ensure the result has the shape _:_ then we will indeed have a huge list of thunks.
Ivan Lazar Miljenovic
Ivan.Miljenovic at gmail.com