[Haskell-cafe] Network.HTTP+ByteStrings Interface--Or: How to shepherd handles and go with the flow at the same time?

Alex Jacobson alex at alexjacobson.com
Thu May 31 13:28:56 EDT 2007


The HAppS HTTP code basically delivers the first 64k and a handle for 
acquiring the rest.  In 99% or more of cases the document fits in 
memory, so the 64k bound is fine.  If you have something bigger, the 
user has to decide how to handle it on a case-by-case basis.
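
For concreteness, here is a minimal sketch of the shape of such an 
interface. The names and the Body record are invented for illustration; 
the actual HAppS API differs in detail:

    import qualified Data.ByteString as B
    import System.IO (Handle)

    -- Hypothetical: the first chunk arrives strictly; anything beyond
    -- the 64k bound stays on a handle for the caller to drain.
    data Body = Body
      { firstChunk :: B.ByteString   -- at most 64k, already in memory
      , theRest    :: Maybe Handle   -- Nothing if the body fit in 64k
      }

    -- The common case reads only firstChunk; callers with larger
    -- documents decide for themselves how to consume theRest.
    wholeBody :: Body -> IO B.ByteString
    wholeBody (Body c Nothing)  = return c
    wholeBody (Body c (Just h)) = B.append c `fmap` B.hGetContents h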

Note: chunked transfer encoding means that there is no theoretical limit 
to how big an HTTP request or response may be.

-Alex-

Jules Bean wrote:
> I've been having something of a discussion on #haskell about this, but
> I had to go off-line; in any case, it's a complicated issue, and I
> may be able to be clearer in an email.
>
> The key point under discussion was what kind of interface the HTTP
> library should expose: synchronous, asynchronous? Lazy, strict?
>
> As someone just pointed out, "You don't like lazy IO, do you?". Well,
> that's a fair characterisation. I think unsafe lazy IO is a very very
> cute hack, and I'm in awe of some of the performance results which
> have been achieved, but I think the disadvantages are underestimated.
>
> Of course, there is a potential ambiguity in the phrase 'lazy IO'. You
> might interpret 'lazy IO' quite reasonably to refer to any programming
> style in which the IO is performed 'as needed' by the rest of the
> program. So, to be clear, I'm not raising a warning flag about that
> practice in general, which is a very important programming
> technique. I'm raising a bit of a warning flag over the particular
> practice of achieving this in a way which conceals IO inside thunks
> which have no IO in their types: i.e. using unsafeInterleaveIO or even
> unsafePerformIO.
>
> Why is this a bad idea? Normally, evaluating a Haskell expression can
> have no side-effects. This is important because, in a lazy language,
> you never quite know[*] when something's going to be evaluated. Or if
> it will. Side-effects, on the other hand, are important things (like
> launching nuclear missiles) and it's rather nice to be precise about
> when they happen. One particular kind of side effect which is slightly
> less cataclysmic (only slightly) is the throwing of an exception. If
> pure code, which is normally guaranteed "at worst" to fail to
> terminate, can suddenly throw an exception from somewhere deep in its
> midst, then it's extremely hard for your program to work out how far
> it has got, and what it has done, and what it hasn't done, and what it
> should do to recover. Alternatively, no failure may occur, but the
> data may never be processed, meaning that the IO is never 'finished'
> and valuable system resources are locked up forever. (Consider a naive
> program which reads only the first 1000 bytes of an XML document
> before getting an unrecoverable parse failure. The socket will never
> be closed, and system resources will be consumed permanently.)
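>
> A toy illustration of that failure mode, assuming lazy hGetContents;
> parseXML here is a stand-in parser invented for the example:
>
>     import System.IO
>
>     -- A stand-in for any pure parser that can fail early in its input.
>     parseXML :: String -> Either String String
>     parseXML s
>       | take 5 s == "<?xml" = Right s
>       | otherwise           = Left "parse failure near start of document"
>
>     main :: IO ()
>     main = do
>       h <- openFile "doc.xml" ReadMode
>       s <- hGetContents h          -- lazy: nothing has been read yet
>       case parseXML s of           -- forcing the result performs the IO
>         Left err  -> putStrLn err  -- bailing out early: h is still open,
>                                    -- and nothing will ever close it
>         Right doc -> putStrLn doc >> hClose h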
>
> Trivial programs may be perfectly content to simply bail out if an
> exception is thrown. That's very sensible behaviour for a small
> 'pluggable' application (most of the standard unix command line
> utilities bail out silently or nearly silently on SIGPIPE, for
> example). However, this is clearly not acceptable behaviour in a
> complex program. There may be resources which need to be released,
> there may be data which needs saving, there may be reconstruction to
> be attempted on whatever it was that 'broke'.
>
> Error handling and recovery is hard. Always has been. One of the
> things that simplifies such issues is knowing "where" exceptions can
> occur; it greatly simplifies them. In Haskell they can only occur in
> the IO monad, and only in rather specific ways: in most cases,
> thrown by particular IO primitives; they can also be thrown at you
> by other threads, but as the programmer, that's your problem!
>
> Ok. Five paragraphs of advocacy is plenty. If anyone is still reading
> now, then they must be either really interested in this problem, or
> really bored. Either way, it's good to have you with me! These issues
> are explained rather more elegantly by Oleg in [1].
>
>
> So, where does that leave the HTTP library? Well, here are the kinds of
> interface I can imagine. I'm deliberately ignoring all the stuff about
> request headers, request content, and imagining that this is all about
> URL -> ByteString. Here are the options that occur to me:
>
> A. Strict, synchronous GET
>    sSynGET :: URL -> IO ByteString
>
>    Quite simply blocks the thread until the whole data has
>    arrived. Throws some kind of exception on failure, presumably. This
>    is a simple primitive, appropriate for relatively small files
>    (files which fit comfortably in your memory) and simple
>    programs. It's also great for programs which want to control the
>    degree of asynchrony themselves; they can just fork as many
>    threads as they choose to GET with.
>
> B. Strict, asynchronous GET
>    sAsynGET :: URL -> IO (MVar ByteString)
>
>    Download the entire data, but do it in a separate thread. Give me
>    an MVar so I can synchronise on the data's arrival in whichever way
>    suits my program best. Suitable for small files which fit
>    conveniently in memory. Very easy to implement in terms of forkIO
>    and sSynGET, so really it's a convenience function.
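>
>    For example, a minimal sketch (URL is a placeholder type here, and
>    sSynGET is left abstract):
>
>        import Control.Concurrent (forkIO)
>        import Control.Concurrent.MVar
>        import qualified Data.ByteString as B
>
>        type URL = String                 -- placeholder for the sketch
>
>        sSynGET :: URL -> IO B.ByteString
>        sSynGET = undefined               -- the primitive from option A
>
>        sAsynGET :: URL -> IO (MVar B.ByteString)
>        sAsynGET url = do
>          v <- newEmptyMVar
>          _ <- forkIO (sSynGET url >>= putMVar v)
>          return v
>
>    (A real version would want to catch exceptions in the forked thread
>    and pass them through the MVar, say as Either SomeException
>    ByteString, so a failed GET doesn't leave the consumer blocked
>    forever.)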
>
> C. Strict, synchronous, GET-to-file
>    sSynFileGET :: URL -> FilePath -> IO ()
> D. Strict, asynchronous, GET-to-file
>    sAsynFileGET :: URL -> FilePath -> IO (MVar ())
>
>    Download the entire data to a local file. This means that it
>    doesn't matter if the data is far bigger than local memory, it can
>    still be done efficiently. [Note that this doesn't mean it must use
>    lazy getContents magic under the hood. It could easily use strict
>    hGet in reasonably sized chunks and write them out straightaway.]
>    The only difference between the two variants is that one keeps
>    control until completion, the other gives you an MVar which you can
>    block on if/when you choose.  This method is appropriate for
>    clients which need extremely large data, and don't mind waiting for
>    it to finish before they start processing. It is also appropriate
>    for clients which want random access to large data requests (using
>    underlying hSeek-based random file IO, once the file has downloaded).
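>
>    The inner loop of such an implementation might look like this
>    sketch (obtaining the connection handle from the URL is elided):
>
>        import qualified Data.ByteString as B
>        import System.IO
>
>        -- Copy an already-open connection to a file in 32k strict
>        -- chunks, in bounded memory however large the body is.
>        copyToFile :: Handle -> FilePath -> IO ()
>        copyToFile src path =
>          withFile path WriteMode $ \dst ->
>            let loop = do
>                  chunk <- B.hGet src 32768
>                  if B.null chunk         -- hGet returns empty at EOF
>                    then return ()
>                    else B.hPut dst chunk >> loop
>            in loop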
>
> E,F. Progressive GET
>    pSynGET :: URL -> ((Bool,ByteString) -> IO ()) -> IO ()
>    pAsynGET :: URL -> ((Bool,ByteString) -> IO ()) -> IO (MVar ())
>
>    (This is a particularly simple case of Oleg's iteratees, I
>    think.) Download the data at whatever speed is convenient. As data
>    arrives, feed it to the 'callback' provided. The ByteString is the
>    new chunk of data, the 'Bool' is just supposed to indicate whether
>    or not this is the final chunk. You can imagine slight
>    variations. Note that the library promises not to retain any
>    references to the ByteString chunks, so if the callback processes
>    them and then discards them they are eligible for garbage
>    collection. If the callback wishes to accumulate them, it can quite
>    easily 'append' the strict chunks into a lazy ByteString, which is
>    an efficient operation. This is suitable for applications which
>    wish to do something like progressive display of a pJPEG, or
>    compute a summary function, or perhaps even display an animation,
>    although that last would normally need a little more structure to
>    try to guarantee the data rate.
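>
>    For instance, assuming pSynGET with the signature above, collecting
>    the chunks into a lazy ByteString is only a few lines (a sketch):
>
>        import Data.IORef
>        import qualified Data.ByteString as B
>        import qualified Data.ByteString.Lazy as L
>
>        type URL = String                 -- placeholder for the sketch
>
>        pSynGET :: URL -> ((Bool, B.ByteString) -> IO ()) -> IO ()
>        pSynGET = undefined               -- the primitive from above
>
>        -- Collect each strict chunk, then assemble them lazily at the
>        -- end; fromChunks does no copying.
>        fetchAll :: URL -> IO L.ByteString
>        fetchAll url = do
>          ref <- newIORef []
>          pSynGET url $ \(_final, chunk) -> modifyIORef' ref (chunk :)
>          (L.fromChunks . reverse) `fmap` readIORef ref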
>
>    Incidentally, there are more complex options than (Bool,ByteString)
>    -> IO ().  A simple and obvious change is to add a return
>    value. Another is a 'state monad by hand', as in (Bool,ByteString)
>    -> s -> s, changing the final return value of the type to IO s,
>    which allows the callback to accumulate summary information and
>    still be written as pure code (there is a sketch of this at the
>    end of this section). Other options allow the 'callback'
>    to request early termination, by layering in an 'Either' type in
>    there. Another more sophisticated option, I think, is the higher
>    rank
>
>    MonadTrans t => URL
>                 -> (forall m. Monad m => (Bool,ByteString) -> t m ())
>                 -> t IO ()
>
>    ...which, unless I've made a mistake, allows you to write in 'any
>    monad which can be expressed as a transformer', by transforming it
>    over IO, but still contains the implicit promise that the
>    'callback' does no IO. For example t = StateT reduces to the
>    earlier s -> s example, in effect, with a slightly different data
>    layout.
>
>    Another couple of refinements to the above are that in practice you
>    normally want to 'guarantee' your callback only runs on a chunk
>    size of at least X (in some cases 'exactly X'), or you want to
>    guarantee it's called at least every Y seconds. Neither of these
>    is hard to add.
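>
>    As a sketch of the 'state monad by hand' variant promised above
>    (pFoldGET is a hypothetical name):
>
>        import qualified Data.ByteString as B
>
>        type URL = String                 -- placeholder for the sketch
>
>        -- The library threads the state; the callback never does IO.
>        pFoldGET :: URL -> ((Bool, B.ByteString) -> s -> s) -> s -> IO s
>        pFoldGET = undefined              -- hypothetical primitive
>
>        -- Example: total response size, with the callback as pure code.
>        totalSize :: URL -> IO Int
>        totalSize url =
>          pFoldGET url (\(_final, chunk) n -> n + B.length chunk) 0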
>
> Given these three pairs of options, what need is there for an unsafe
> lazy GET?  What niche does it fill that is not equally well filled by
> one of these?
>
> Program conciseness, perhaps: the kind of Haskell one-liner whose
> performance makes us so (justly) proud. In isolation, though, I don't
> find that a convincing argument; not with the disadvantages also taken
> into account.
>
> The strongest argument, then, is that you have a 'stream processing'
> function that is written 'naively' on [Word8] or lazy ByteString and
> wants to run as data is available, yet without wasting space. I'm
> inclined to feel that, if you really want to be able to run over 650M
> files, and you want to run as data is available, then in practice you
> want to be able to give feedback to the rest of your application on
> your progress so far; i.e., L.ByteString -> a is actually too simple a
> type anyway.
>
> I'm interested to know what opinions other people have on this,
> whether I've made any serious logic mistakes or just overlooked
> another approach which has advantages. Having spent quite a while this
> evening thinking this over, I just don't see the convincing case for
> the unsafe lazy approach, and I see plenty of problems with it...
>
> Cheers,
>
> Jules
>
>
> [*] Well, OK. If you're smart, then you know. Because you're smart, and
>   you thought about it carefully. But most of the time, you don't need
>   to know. And if you produce data (a pure structure) which is then
>   consumed by a library - or vice versa - then you don't know unless
>   you read their code.
>
> [1] http://www.haskell.org/pipermail/haskell-cafe/2007-March/023073.html
>
>


