[Haskell-cafe] Conduit Best Practices for leftover data

Sun Apr 15 05:30:05 CEST 2012

On Thu, Apr 12, 2012 at 9:25 AM, Myles C. Maxfield
<myles.maxfield at gmail.com> wrote:
> Hello,
> I am interested in the argument to Done, namely, leftover data. More
> specifically, when implementing a conduit/sink, what should the
> conduit specify for the (Maybe i) argument to Done in the following
> scenarios (Please note that these scenarios only make sense if the
> type of 'i' is something in Monoid):
>
> 1) The conduit outputted the last thing that it felt like outputting,
> and exited willfully. There seem to be two options here - a) the
> conduit/sink should greedily gather up all the remaining input in the
> stream and mconcat them, or b) Return the part of the last thing that
> never got represented in any part of anything outputted. Option b
> seems to make the most sense here.

Yes, option (b) is definitely what's intended.
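
A minimal sketch of option (b), using a toy model of the constructors
discussed in this thread. The `Sink` type below is a simplified
stand-in, not conduit's real `Pipe` (output, monadic actions and
finalizers are left out), and `takeChars` and `runSink` are
hypothetical names made up for this example:

```haskell
-- Toy model only: NOT conduit's real Pipe type.
data Sink i r
  = NeedInput (i -> Sink i r) (Sink i r)  -- wants input / upstream is done
  | Done (Maybe i) r                      -- leftover input and result

-- Hypothetical sink: take up to n characters from a stream of String
-- chunks. Per option (b), the only leftover is the unused tail of the
-- chunk that was being processed when the sink finished.
takeChars :: Int -> Sink String String
takeChars = go ""
  where
    go acc n
      | n <= 0    = Done Nothing acc
      -- Second field: upstream ended at an item boundary, so there is
      -- nothing to hand back.
      | otherwise = NeedInput (push acc n) (Done Nothing acc)
    push acc n chunk =
      case splitAt n chunk of
        (used, "")   -> go (acc ++ used) (n - length used)
        (used, rest) -> Done (Just rest) (acc ++ used)

-- Hypothetical driver: feed chunks to a sink and return the result,
-- the leftover reported by the sink, and the chunks never requested.
runSink :: Sink i r -> [i] -> (r, Maybe i, [i])
runSink (Done left r)      xs       = (r, left, xs)
runSink (NeedInput _ end)  []       = runSink end []
runSink (NeedInput more _) (x : xs) = runSink (more x) xs

main :: IO ()
main = print (runSink (takeChars 8) ["hello ", "world", "!"])
-- prints ("hello wo",Just "rld",["!"])
```

The sink never reads the "!" chunk, so that chunk is not part of the
leftover. Only "rld", the part of the last consumed chunk that did
not go into the result, is returned.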

> 2) Something upstream produced Done, so the second argument to
> NeedInput gets run. This is guaranteed to be run at the boundary of an
> item, so should it always return Nothing? Instead, should it remember
> all the input it has consumed for the current (yet-to-be-outputted)
> element, so it can let Data.Conduit know that, even though the conduit
> appeared to consume the past few items, it actually didn't (because it
> needs more input items to make an output)? Remembering this sequence
> could potentially have disastrous memory usage. On the other hand, it
> could also greedily gather everything remaining in the stream.

No, nothing so complicated is intended. Most likely you'll never
return any leftovers from the second field of NeedInput. One other
minor point: it's also possible that the second field will be used if
the *downstream* pipe returns Done.

> 3) The conduit/sink encountered an error mid-item. In general, is
> there a commonly-accepted way to deal with this? If a conduit fails in
> the middle of an item, it might not be clear where it should pick up
> processing, so the conduit probably shouldn't even attempt to
> continue. It would probably be good to return some notion of where it
> was in the input when it failed. It could return (Done (???) (Left
> errcode)) but this requires that everything downstream in the pipeline
> be aware of Errcode, which is not ideal. I could use MonadError along
> with PipeM, but this approach completely abandons the part of the
> stream that has been processed successfully. I'd like to avoid using
> Exceptions if at all possible.

Why avoid Exceptions? It's the right fit for the job. You can still
keep your conduit pure by setting up an `ExceptionT Identity` stack,
which is exactly how you can use the Data.Conduit.Text functions from
pure code. Really, what you need to be asking is "is there any logical
way to recover from an exception here?"

> It doesn't seem that a user application even has any way to access
> leftover data anyway, so perhaps this discussion will only be relevant
> in a future version of Conduit. At any rate, any feedback you could
> give me on this issue would be greatly appreciated.

Leftover data is definitely used:

1. If you compose two `Sink`s with monadic bind, the leftovers from
the first are passed to the second.
2. If you use connect-and-resume ($$+), the leftovers are returned as
part of the resumed `Source` and handed to its next consumer.
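
To illustrate point 1, here is a rough sketch in a toy model. The
`Sink` type and its `Monad` instance are simplified stand-ins written
for this example, not conduit's actual definitions, and `feed`,
`takeChars`, `consumeAll`, `headerThenBody` and `runSink` are
hypothetical names:

```haskell
import Control.Monad (ap, liftM)

-- Toy model only: NOT conduit's real Pipe type.
data Sink i r
  = NeedInput (i -> Sink i r) (Sink i r)  -- wants input / upstream is done
  | Done (Maybe i) r                      -- leftover input and result

instance Functor (Sink i) where
  fmap = liftM

instance Applicative (Sink i) where
  pure  = Done Nothing
  (<*>) = ap

instance Monad (Sink i) where
  Done Nothing r     >>= f = f r
  -- The leftover of the first sink is offered to the second one
  -- before any fresh input.
  Done (Just x) r    >>= f = feed x (f r)
  NeedInput more end >>= f = NeedInput (\i -> more i >>= f) (end >>= f)

-- Offer a leftover to a sink. If the sink is already finished, the
-- value stays unconsumed and becomes the leftover of the whole.
feed :: i -> Sink i r -> Sink i r
feed x (NeedInput more _) = more x
feed x (Done _ r)         = Done (Just x) r

-- Take up to n characters; the unused tail of the last chunk is the
-- leftover.
takeChars :: Int -> Sink String String
takeChars = go ""
  where
    go acc n
      | n <= 0    = Done Nothing acc
      | otherwise = NeedInput (push acc n) (Done Nothing acc)
    push acc n chunk =
      case splitAt n chunk of
        (used, "")   -> go (acc ++ used) (n - length used)
        (used, rest) -> Done (Just rest) (acc ++ used)

-- Consume every remaining chunk and concatenate them.
consumeAll :: Sink String String
consumeAll = go []
  where
    go acc = NeedInput (\c -> go (c : acc))
                       (Done Nothing (concat (reverse acc)))

-- Two sinks composed with monadic bind.
headerThenBody :: Sink String (String, String)
headerThenBody = do
  hdr  <- takeChars 3
  body <- consumeAll
  return (hdr, body)

-- Feed chunks to a sink; returns result, leftover, unrequested chunks.
runSink :: Sink i r -> [i] -> (r, Maybe i, [i])
runSink (Done left r)      xs       = (r, left, xs)
runSink (NeedInput _ end)  []       = runSink end []
runSink (NeedInput more _) (x : xs) = runSink (more x) xs

main :: IO ()
main = print (runSink headerThenBody ["HDRbo", "dy"])
-- prints (("HDR","body"),Nothing,[])
```

The "bo" that `takeChars 3` did not use is not lost: bind passes it
to `consumeAll`, which sees it ahead of the "dy" chunk.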

Michael