[Haskell-cafe] Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?
Joachim Durchholz
jo at durchholz.org
Thu Jul 23 16:25:12 UTC 2020
While I can't contribute any Haskell knowledge, I know that many threads
updating the same variable is the worst thing you can do; not only do
you create a single bottleneck, if you have your threads running on
multiple cores you get CPU pipeline stalls, L1 cache line flushes,
and/or complicated cache coherency protocols executed between cores.
It's not cheap: each of these mechanisms can take hundreds of CPU
cycles, for a CPU that can execute multiple instructions per CPU cycle.
Incrementing a global counter is a really hard problem in multithreading...
I believe this is the reason why databases typically implement a
SEQUENCE mechanism, and these sequences are usually implemented as
"whenever a transaction asks for a sequence number, reserve a block of
1,000 numbers for it so it can retrieve 999 additional numbers without
the synchronization overhead.
This is also why real databases use transactions - these do not just
isolate processes from each other's updates, they also allow the DB to
let the transaction work on a snapshot and do all the synchronization
once, during COMMIT.
And, as you just discovered, it's one of the major optimization areas in
database engines :-)
TL;DR for the bad news: I suspect your problem is just unavoidable
However, I see a workaround: delayed index update.
Have each index twice: last-known and next-future.
last-known is what was created during the last index update. You need an
append-only list of records that had an index field update, and all
searches that use the index will also have to do a linear search in that
list.
next-future is built in the background. It takes last-known and the
updates from the append-only list, and generates a new index. Once
next-future is finished, replace last-known with it.
You still need to to a global lock while replacing indexes, but you
don't have to lock the index for every single update but just once.
You'll have to twiddle with parameters such as "at what point do I start
a new index build", and you'll have to make sure that your linear list
isn't yet another bottleneck (there are lock-free data structures to
achieve such a thing, but these are complicated; or you can tell
application programmers to try and collect as many updates as possible
in a transaction so the number of synchronization points is smaller;
however, too-large transactions can generate CPU cache overflows if the
collected update data becomes too large, so there's a whole lot of
tweaking, studying real performance data, hopefully finding the right
set of diagnostic information to collect that allow the DB to
automatically choose the right point to do its updates, etc. pp.)
TL;DR for the good news: You can coalesce N updates into one and divide
the CPU core coordination overhead by a factor of N. You'll increase the
bus pressure, so there's tons of fine tuning you can do (or avoid) after
getting the first 90% of the speedup. (I'm drawing purely speculative
numbers out of my hat here.)
Liability: You will want to add transactions and (likely) optimistic
locking, if you don't have that already: Transaction boundaries are the
natural point for coalescing updates.
Regards,
Jo
More information about the Haskell-Cafe
mailing list