[Haskell-cafe] Property checks and PostgreSQL?

Sun Mar 14 09:41:31 UTC 2021

> So, I have a three layer stack: PostgreSQL data base, Haskell back end,
> PureScript front end. And there is a fault somewhere in this stack: sometimes
> some items present in the data base do not make it to the front end. A small
> fortune in programmer hours has been spent searching for the fault but there is
> no progress. No one can figure out even roughly where the fault is.

Are you running at the standard isolation level?
If yes, your code may be affected by nonrepeatable read, phantom read, 
or serialization anomaly - in a nutshell, this is what happens if you 
expect to get the same results if the query is run twice, but the rows 
that affect this result are somehow not included in the transactions 
(there are edge cases that can do this, the isolation levels 
progressively exclude them, at the expense of performance loss).

More details are on 
https://www.postgresql.org/docs/current/transaction-iso.html .

> Theoretically, we can carefully reset the data base back to the initial state
> after each run of the property check. Given that only a small amount of data is
> modified, it would take imperceptibly little time. But it is hard to know what
> exactly should be reset, and it is easy to get it wrong if done by hand!

Have you tried nested transactions?
I have no PSQL experience myself, but it looks like it should do the 
minimal-effort rollback you're after: Start a transaction, let the unit 
tests run (including their own transactions, just they're nested now), 
roll back the overarching initial transaction.

> One trick we do use is SQL transactions. We take a single connexion, start a
> transaction, perform a check and then roll the transaction back. Unfortunately,
> even this is not completely safe: it looks as though sequences used to draw
> primary keys from are not reset to their previous values! _(Or maybe we are
> doing something wrong — I am not really a PostgreSQL guru so I am not too sure
> of myself.)_

No, this is documented (if surprising) behaviour.
But TBH if your code is affected by the values you get from a sequence, 
that's not what sequences are intended for - they're for being unique, 
not for their numerical properties.

See https://www.postgresql.org/message-id/501B1494.9040502@ringerc.id.au 
for more details.

 >
  But even ignoring this problem _(it is irrelevant to most checks)_,
> there is another, more severe problem: transactions guarantee that uncommitted
> data is only visible inside the same connexion. So, there is no way to request
> it from the HTTP API. This trick is suitable for fuzzing the data base – back
> end layer, but nothing beyond that.

Yes, Postgresql gives you no way to read dirty data.
This is pretty typical of MVCC databases.

But even if there were a way, you'd never get a guarantee that you're 
actually getting the data that another transaction sees. If you wish to 
see what another transaction sees, you'll have to instrument the code 
that does the transaction.
This should be doable in the backend.

> Another thing I heard being done is to selectively drop and re-create exactly
> the tables affected by a run of a property check. This seems challenging to
> automate though. How can I know in a general way if a table has been touched?
> And then, how can I re-create exactly the affected tables given that the data
> base initialization code is an opaque blob?
> 
> I wonder if this problem is already solved. In any case, any advice is welcome!

I doubt it's a useful way forward. It's just not a use case that 
database are built for, and you have a high risk of hitting 
nonfunctional problems (performance, possibly from cold query 
optimization caches and such) or functional problems (sequencing of 
dropping/creating DB objects that depend on each other, for example, and 
you already saw the sequence number issue).
Background is that databases are such complex beasts that the developer 
teams struggle to get the core goals done, nice-to-have goals get 
dropped pretty quickly. PGSQL is exceptional in that it tries to keep as 
many nice-to-have goals intact as possible, but I wouldn't count on that.

Regards,
Jo

P.S.: This is concentrating on the database layer, which is unrelated to 
Haskell.
Which might be because your problem is database-related, or because I 
happen to know more about databases than about Haskell. Or because you 
described your problems in terms of database issue.
If it's really database-related, you may have more concrete advice on a 
PGQSL-related forum than here.