[Haskell-cafe] Takusen and strictness, and perils of getContents

oleg at pobox.com
Sat Mar 3 02:13:27 EST 2007


Takusen permits on-demand processing on three different levels. It is
specifically designed for database processing in bounded memory with
predictable resource utilization and no resource leaks.

But first, about getContents. It was suggested a while ago that
getContents should be renamed to unsafeGetContents, and I strongly
support that suggestion. I believe getContents should be used
sparingly (I personally have never used it): it cannot give precise
resource guarantees, and it is the wrong model for database
interfaces.

I will not dwell on the fact that getContents permits I/O to occur
while evaluating pure code -- which is just wrong. That supposedly
theoretical impurity has a practical consequence: error handling. As
the manual states, ``A semi-closed handle becomes closed: ... if an
I/O error occurs when reading an item from the handle; or once the
entire contents of the handle has been read.'' That is, the consumer
cannot tell whether all the data from the channel have been read or
an I/O error cut the stream short, and it cannot find out any details
about that error. That alone disqualifies getContents from any
serious use. Even more egregious is the resource handling: the whole
business with semi-closed handles is a resource leak.
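
To make the problem concrete, here is a minimal sketch (the file
name is hypothetical):

    import System.IO

    main :: IO ()
    main = do
      h <- openFile "data.log" ReadMode
      s <- hGetContents h        -- nothing read yet; h is semi-closed
      print (length (lines s))   -- the actual I/O happens here,
                                 -- during evaluation of pure code
      -- Per the report semantics quoted above, a read error
      -- mid-stream merely ends the string s: the printed count is
      -- silently short, and this code has no way to learn that an
      -- error occurred, let alone its details.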

Interfacing with a database requires managing lots of resources: the
database connection, prepared statement handle, statement handle,
result set, database cursor, transaction, input buffers. Takusen was
specifically designed to be able to tell exactly when a resource is
no longer needed and can be _safely_ disposed of. That guarantee is
not available with getContents: the resources associated with the
handle are disposed of only when the consumer of getContents is
finished with the string. Since the consumer may be pure code, it is
impossible to tell when that evaluation finishes; it may happen in a
totally different part of the program. To get more predictability, we
have to add seq and deepSeq -- thus defeating the laziness we
supposedly gained with getContents, and hoping that two wrongs
somehow make a right.
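
The usual workaround looks like the following sketch, which forces
the whole string before closing the handle -- that is, it ends up
reading everything eagerly after all:

    import System.IO

    countLines :: FilePath -> IO Int
    countLines path = do
      h <- openFile path ReadMode
      s <- hGetContents h
      let n = length (lines s)  -- demands the entire contents
      n `seq` hClose h          -- force n before hClose: only now do
                                -- we know the handle is safe to close
      return n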

Regarding Takusen: it is designed for incremental processing of
database data, on three levels:

	-- unless the programmer has said that the query will yield a
small amount of data, we do not ask the database for the whole result
set at once. We ask it to deliver the data in increments of 10 or 100
rows (the programmer may tune the amount). Each retrieved chunk is
placed into pre-allocated buffers.

	-- the retrieved chunk is given to an iteratee one row at a
time. At each point the iteratee may declare that it has had enough;
processing then stops immediately, no further chunks are retrieved,
and all resources of the query are disposed of (see the sketch after
this list).

	-- alternatively, Takusen offers a cursor-based interface,
with getNext and getCurrent methods. The rows are retrieved on
demand, in chunks. The interface is designed to restrict operations
on a cursor to a region of code. Once the region is exited (normally
or by exception), all associated resources are disposed of, because
the cursor is statically guaranteed to be unavailable outside the
region.
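
For flavour, here is a sketch of the iteratee interface, along the
lines of the examples in Takusen's documentation. The table, columns
and connection string are made up, the Sqlite back end is chosen
arbitrarily, and exact signatures vary across Takusen versions; the
point is the shape of the iteratee, which returns Right seed to
continue and Left seed to stop early:

    import Database.Enumerator
    import Database.Sqlite.Enumerator (connect)
    import Control.Monad.Trans (liftIO)

    -- Receive one row (columns a, b) plus the accumulating seed.
    -- result' is Takusen's strict "continue with this seed".
    collect :: Monad m => Int -> String -> IterAct m [(Int, String)]
    collect a b acc = result' ((a, b) : acc)

    -- Stop after ten rows: returning Left makes Takusen cease
    -- fetching and dispose of all the query's resources.
    firstTen :: Monad m => Int -> String -> IterAct m [(Int, String)]
    firstTen a b acc
      | length acc >= 9 = return (Left ((a, b) : acc))
      | otherwise       = result' ((a, b) : acc)

    main :: IO ()
    main = withSession (connect "test.db") (do
      rs <- doQuery (sql "select a, b from t") firstTen []
      liftIO (print rs))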

Because the moments of resource allocation and deallocation are known
so precisely, Takusen can take care of all of them. The programmer
never has to worry about resource leaks, deallocations, etc.

A bit of experience: I have implemented a web application server in
Haskell, using Takusen as the back end. The server runs as a FastCGI
dynamic server, retrieving a chunk of rows from the database,
formatting the rows (e.g., in XML), sending them up the FastCGI
interface and ultimately to the client, and coming back for the next
chunk. The advantages of this stream-wise processing are low latency,
low memory consumption, and the client's consumption limiting the
database retrieval rate. Typical requests routinely ask for thousands
of database rows; the server runs continuously, serving hundreds of
requests in constant memory. The executable is 2.6 MB in size (GHC
6.4.2); the running process takes a VmSize of 6608 kB, including a
VmRSS of 3596 kB and a VmData of 1412 kB. The code contains not a
single unsafePerformIO, and (aside from some S-expression parsing
code I inherited) not a single strictness annotation. The line count
(including comments) is 7500 lines in 30 files.


