[Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?

Joachim Durchholz jo at durchholz.org
Thu Jul 23 16:25:12 UTC 2020


While I can't contribute any Haskell knowledge, I know that many threads 
updating the same variable is the worst thing you can do: not only do 
you create a single bottleneck, but if your threads run on multiple 
cores you also get CPU pipeline stalls, L1 cache line flushes, and/or 
complicated cache coherency protocols executed between cores. 
None of this is cheap: each of these mechanisms can take hundreds of CPU 
cycles, on a CPU that can otherwise execute multiple instructions per cycle.

Incrementing a global counter is a really hard problem in multithreading...
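
Purely as an illustration (this is my own sketch, not your code), the 
contended pattern in Haskell is something like the following, where every 
transaction conflicts on the same TVar (uses the async package):

import Control.Concurrent.Async (replicateConcurrently_)
import Control.Concurrent.STM
import Control.Monad (replicateM_)

main :: IO ()
main = do
  counter <- newTVarIO (0 :: Int)
  -- 100 threads each bump the same TVar 10,000 times; every transaction
  -- touches the same hot location, so the work effectively serializes.
  replicateConcurrently_ 100 $
    replicateM_ 10000 (atomically (modifyTVar' counter (+ 1)))
  readTVarIO counter >>= print   -- 1000000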

I believe this is the reason why databases typically implement a 
SEQUENCE mechanism, and these sequences are usually implemented as 
"whenever a transaction asks for a sequence number, reserve a block of 
1,000 numbers for it, so it can retrieve 999 additional numbers without 
any synchronization overhead".
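
Sketched in Haskell (the names BlockAllocator / nextId / blockSize are 
mine, purely illustrative), the same trick could look roughly like this: 
the shared TVar is touched once per 1,000 ids, and the rest are served 
from a thread-local IORef.

import Control.Concurrent.STM
import Data.IORef

blockSize :: Int
blockSize = 1000

-- Per-thread allocator: the next id to hand out and the end of the
-- currently reserved block.
newtype BlockAllocator = BlockAllocator (IORef (Int, Int))

newAllocator :: IO BlockAllocator
newAllocator = BlockAllocator <$> newIORef (0, 0)  -- empty, forces a refill

nextId :: TVar Int -> BlockAllocator -> IO Int
nextId global (BlockAllocator ref) = do
  (next, limit) <- readIORef ref
  if next < limit
    then do
      writeIORef ref (next + 1, limit)   -- cheap, no shared state touched
      pure next
    else do
      -- Block exhausted: reserve the next 1,000 ids in one STM transaction.
      start <- atomically $ do
        s <- readTVar global
        writeTVar global (s + blockSize)
        pure s
      writeIORef ref (start + 1, start + blockSize)
      pure start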

This is also why real databases use transactions - these do not just 
isolate processes from each other's updates, they also allow the DB to 
let the transaction work on a snapshot and do all the synchronization 
once, during COMMIT.
And, as you just discovered, it's one of the major optimization areas in 
database engines :-)

TL;DR for the bad news: I suspect your problem is just unavoidable.

However, I see a workaround: delayed index updates.
Keep each index twice: last-known and next-future.
last-known is what was created during the last index update. You need an 
append-only list of records that had an index field update, and all 
searches that use the index will also have to do a linear search in that 
list.
next-future is built in the background. It takes last-known and the 
updates from the append-only list, and generates a new index. Once 
next-future is finished, replace last-known with it.
You still need to take a global lock while replacing indexes, but you 
no longer have to lock the index for every single update, only once per 
rebuild.
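
Sketched with TVars (the type and function names are mine, and a real 
index would be more than a Data.Map), the idea could look roughly like 
this: writers only prepend to the delta list, readers consult the deltas 
before the published index, and the background rebuild swaps the new 
index in with one short transaction.

import Control.Applicative ((<|>))
import Control.Concurrent.STM
import qualified Data.Map.Strict as Map

type Key = Int
type Row = String

data IndexState = IndexState
  { published :: TVar (Map.Map Key Row)  -- last-known index
  , deltas    :: TVar [(Key, Row)]       -- append-only list of pending updates
  }

-- Writers only prepend to the delta list; the published index stays untouched.
recordUpdate :: IndexState -> Key -> Row -> STM ()
recordUpdate st k v = modifyTVar' (deltas st) ((k, v) :)

-- Readers check the pending deltas first (newest wins), then the index.
lookupKey :: IndexState -> Key -> STM (Maybe Row)
lookupKey st k = do
  ds  <- readTVar (deltas st)
  idx <- readTVar (published st)
  pure (lookup k ds <|> Map.lookup k idx)

-- Background step: build next-future outside any transaction, then swap once.
rebuild :: IndexState -> IO ()
rebuild st = do
  (idx, ds) <- atomically $
    (,) <$> readTVar (published st) <*> readTVar (deltas st)
  let next = foldr (uncurry Map.insert) idx ds  -- newest delta wins
  atomically $ do
    writeTVar (published st) next
    -- Keep only the deltas that arrived after our snapshot; writers
    -- prepend, so newer entries sit at the front of the list.
    modifyTVar' (deltas st) (\cur -> take (length cur - length ds) cur)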
You'll have to tune parameters such as "at what point do I start a new 
index build", and you'll have to make sure that the linear list isn't 
yet another bottleneck. There are lock-free data structures that achieve 
this, but they are complicated; alternatively, you can tell application 
programmers to collect as many updates as possible in a transaction, so 
the number of synchronization points stays small. However, overly large 
transactions can overflow the CPU caches if the collected update data 
grows too big, so there's a whole lot of tweaking, studying real 
performance data, and hopefully finding the right set of diagnostic 
information to collect so the DB can automatically choose the right 
point to do its updates, and so on.

TL;DR for the good news: You can coalesce N updates into one and divide 
the CPU core coordination overhead by a factor of N. You'll increase the 
bus pressure, so there's tons of fine tuning you can do (or avoid) after 
getting the first 90% of the speedup. (I'm drawing purely speculative 
numbers out of my hat here.)
Liability: you will want to add transactions and (likely) optimistic 
locking if you don't have them already; transaction boundaries are the 
natural point for coalescing updates.
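
One (purely illustrative) way to coalesce in Haskell: let writers push 
update functions onto a TQueue and have a single flusher thread apply 
whatever has accumulated in one transaction. The names flusher and 
drainQueue are mine.

import Control.Concurrent (threadDelay)
import Control.Concurrent.STM
import Control.Monad (forever)

-- Block until at least one update exists, then take everything else
-- that is already queued.
drainQueue :: TQueue a -> STM [a]
drainQueue q = do
  x    <- readTQueue q
  rest <- flushTQueue q
  pure (x : rest)

-- Writers call `writeTQueue q f` with their update and return at once;
-- only this thread ever touches the hot TVar.
flusher :: TQueue (Int -> Int) -> TVar Int -> IO ()
flusher q hot = forever $ do
  batch <- atomically (drainQueue q)
  -- One synchronization point on the hot TVar per batch, not per update.
  atomically $ modifyTVar' hot (\v -> foldl (flip ($)) v batch)
  threadDelay 1000  -- give the next batch a moment to accumulate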

Regards,
Jo
