[Haskell-cafe] lazy-vs-strict, recursive IO (was: memory needed for SAX parsing XML)

Fri Apr 23 12:12:50 EDT 2010

(changed the thread title hoping to get some others to weigh in on the
recursive I/O scheme presented here)

> From: Daniil Elovkov <daniil.elovkov at googlemail.com>
> John Lato wrote:
>>
>>> Another (additional) approach would be to encapsulate unsafeInterleaveIO
>>> within some routine and not let it go out into the wild.
>>>
>>> lazilyDoWithIO :: IO a -> (a -> b) -> IO b
>>>
>>> It would use unsafeInterleave internally but catch all IO errors within
>>> itself.
>>>
>>> I wonder if this is a reasonable idea? Maybe it's already done?
>>> So the topic is shifting...
>>
>> doWithIO :: NFData b => IO a -> (a -> b) -> IO b
>> doWithIO m f = liftM (\a -> let b = f a in b `deepseq` b) m
>>
>> It works (just stick it in a "try" block for error handling), but you
>> need to write a lot of NFData instances.  You also need to be careful
>> that b is some sort of reduced structure, or you can end up forcing
>> the whole file (or other data) into memory.
>
> I meant a different thing. In your example there is no unsafeInterleave
> at all. I think you mean that 'm' argument is supposed to be an
> unsafeInterleaved io action, like getContents, and deepseq'ing saves us
> from it hanging somewhere for a long time. Ok

Yes, that's what I intended.  I understand now that it isn't what you meant.

>
> But I meant to have a routine that lets us use ordinary io actions in a
> lazy way, and restricting that 'hanging' within bounds of this routine.
>
> But having thought about it a little more I understood that it's impossible.

I think it's possible provided you have unsafeInterleaveIO and
deepseq.  As you discovered, to force evaluation through an
unsafeInterleaveIO you need to fully evaluate data, and deepseq is the
way to do so.

>
> Lazy io works now because unsafeInterleaveIO is sticked into getContents
> itself, and is called repeatedly (via recursion). Or at least I can
> think of this implementation, haven't looked into it.
>
> I realised that calling unsafeInterleaveIO for an ordinary io action
> will not make it run lazily. It will, still, run all-at-once, just not
> now, but later.

Yes, exactly.  unsafeInterleaveIO needs to be used recursively for
getContents to function properly.

>
> So, to cope with it, I can think of exposing a little structure of an io
> action. Normally it's completely opaque. But if we knew where its
> recursion point lies, then we could control its course of execution.
>
> So, if we had something like
>   type RecIO a = IO a -> IO a
> and io actions were like
>   getContents :: RecIO [Char]
>   getContents rec = do
>       c <- readOneChar
>       rest <- rec
>       return (c:rest)
>
> then we could either run them
>   normally :: RecIO a -> IO a
>   normally r = r (normally r)
> or
>   lazily :: RecIO a -> IO a
>   lazily r = unsafeInterleavIO $ r (lazily r)
>
> And
>   lazilyDoWithIO :: RecIO a -> (a -> b) -> IO b
>   lazilyDoWithIO m f = do
>     a <- lazily m
>     return $ f a
>
> Hmm, but then, we would have to take special care to not let it out of
> this function anyway... So here we come to deepSeq'ing you proposed.

I think my understanding matches yours at this point, except I'm not
sure I follow RecIO.  Your getContents, for example, seems to only
read one character.  I think you mean something like this?

recGetContents :: RecIO String
recGetContents rec = do
  c <- getChar
  ise <- isEOF
  if ise then return [c] else liftM (c:) rec

normally :: RecIO a -> IO a
normally = fix

lazily :: RecIO a -> IO a
lazily rec = let f = unsafeInterleaveIO (rec f) in f

This works, but I'm not sure that it accomplishes anything.
(it doesn't work in ghci because EOF isn't sent through the terminal,
but if you test for another character it works fine).

>
> And anyway, instead of re-writing pure functions to become iteratees we
> will have to re-write io functions to adopt continuation passing style.
>
> Initially it looked better to me :)

Re-writing pure functions to be iteratees doesn't have to be hard;
many of the provided iteratees have the same names as standard
functions so a lot of the work can be done by just changing imports.

>
> But with this approach we can run lazily any io action that has the form
> of RecIO. Also, we can interleave normally and lazily based on the time
> of day and other conditions :)
>

Originally I didn't see much use to this scheme, but I'm starting to
change my mind.  It would be nice for users to be able to control the
evaluation of functions like getContents, hGetContents, etc., and this
looks like it could be easier to maintain.

>
>> It also doesn't help with
>> other IO effects, e.g. writing output.  I consider this one of the
>> nicest features of iteratee-based processing.
>
> Can you clarify what's the problem with writing? I think I just haven't
> switched from the topic of gluing code.
>
> Because as for gluing code, type signature for io writing d -> IO () is
> perfectly sufficient.

The problem isn't with writing output per se, but with sequencing I/O
effects with lazy data processing.  If you wanted to write to a log
file after processing each 100kbytes of a file, for example, doing so
with bytestrings would require manually chunking the string, creating
output, etc., which then hampers both modularity and readability.
Adding this sort of monadic side-effect while data is only partially
evaluated is pretty simple with iteratees.  Actually, this recursive
IO combinator idea you've sketched out would be somewhat helpful in
this regard too.

John