[Haskell] Re: ANN: HDBC (Haskell Database Connectivity)

Wed Jan 4 11:49:17 EST 2006

On 2006-01-04, Krasimir Angelov <kr.angelov at gmail.com> wrote:
> 2006/1/4, John Goerzen <jgoerzen at complete.org>:
>> The final thing that prompted me to do this was that the PostgreSQL --
>> and possibly the Sqlite -- module for HSQL was segfaulting.  I spent
>> quite a bit of time with gdb and the HSQL code, and even with Simon
>> Marlow's assistance, was unable to track down the precise cause.  To
>> make matters worse, the problem was intermittent.  The Haskell program
>> in question was pure Haskell, and switching it to HDBC solved this
>> issue.
>
> Can you send me a short code that causes the segfaulting?

Unfortunately, I was never able to reproduce the problem with some
shorter code.  It seemed to only crop up after things had been running a
few minutes with heavy database activity.  It also didn't crop up from
the very beginning.  There's a thread about it from the ghc list here:

http://thread.gmane.org/gmane.comp.lang.haskell.glasgow.user/8802

At first, I had thought the problem was with GHC, but that turned out to
not be the case.

You can get the code with:

darcs get --partial '--tag=Last rev before switch to HDBC' \
    http://darcs.complete.org/gopherbot

>> I also had extremely high memory usage when dealing with large result
>> sets -- somewhere on the order of 700MB; the same consumes about 12MB
>> with HDBC.  My guess from looking briefly at the code is that the entire
>> result set is being read into memory up front.
>
> I can't understand this. The result set isn't read in memory unless
> you want do to it. If you are using collectRows then yes, you will end
> with the entire result set read in memory, but you can use fetch to
> read the set row by row as well. The forEachRow function is also
> helpfull in this case.

I was not using collectRows.  In fact, I was using fetch only.

>> There were a number of other problems as well:
>>
>>  * No support for prepared queries or for supplying
>>   replacable parameters.  (Supported everywhere in HDBC, which removes
>>   the need to have escaping.)  That's really my #1 complaint
>>   (well, aside from the segfaulting <g>).

This one is really the most critical to me.  While I haven't optimized
any of the HDBC drivers for it yet, it is already helpful simply from a
coding perspective (it's a lot easier for me to pass a list of values
than bother with generting these large query strings.)

>>  * No way to retrieve result data by column index instead of column name
>
> This is already available in CVS.

Very nice.

>>  * HSQL provided no way to see the result set as a lazy list, and
>>   the public API provided no way to implement that.  (There is 'fetch',
>>   but it seems that the entire result set was read into memory in
>>   advance anyway.)
>
> This is intentional. The trouble is that you can close the database
> connection before to evaluate the entire lazy list. This is different

That is true.  Of course, one can close a DB connection and try to call
fetch on an open statement later as well.  I have documented the proper
semantics on the HDBC API docs.

>> So, I'd invite you to take a look at the HDBC API at
>>
>>  http://darcs.complete.org/hdbc/doc/Database-HDBC.html
>
> I saw the API and some of the code. It is quite similar to HSQL in
> some cases but also differs in other. I saw that you are always
> fetching the entire row in list of SqlValue. I decided to provide
> function to get the values one by one because usually I preffer to
> represent each row as application specific data type. In your case you

It's probably the same amount of coding either way:

fetch sth
h1 <- get sth "col1"
h2 <- get sth "col2"
func h1 h2

vs.

l <- fetchRow sth
let h1 = l !! 0
let h2 = l !! 1
func (fromSql h1) (fromSql h2)

Performance will probably be roughly similar too, since HDBC isn't
having to map names to numbers.  Of course, one could also do:

[h1, h2] <- fetchRow sth

with HDBC.

Personally, I like the latter because it feels more "Haskelly" and pulls
more out of the IO monad.  Though of course it would be possible to do
it both ways with both APIs.

-- John