[Haskell-cafe] STM friendly TreeMap (or similar with range scan api) ? WAS: Best ways to achieve throughput, for large M:N ratio of STM threads, with hot TVar updates?

Thu Aug 6 17:50:17 UTC 2020

Hi Devs & Cafe,

I'd like to report back on my progress. I've reached a rough conclusion; TL;DR:

> For data-intensive workloads on x86_64, the CPU cache is a hardware bottleneck: it is very hard to scale up by adding cores as long as they share the cache of a single chip.

The details:

I developed a minimal script interpreter for diagnostic purposes, depending only on libraries bundled with GHC. The source repository is at: https://github.com/complyue/txs

I benchmarked contention-free read/write performance scaling on my machine, which has a single 6-core Xeon E5 CPU chip. The numbers are at: https://github.com/complyue/txs/blob/master/results/baseline.csv

populate

conc  thread avg tps  scale   eff
   1            1741   1.00  1.00
   2            1285   1.48  0.74
   3            1028   1.77  0.59
   4             843   1.94  0.48
   5             696   2.00  0.40
   6             600   2.07  0.34

scan

conc  thread avg tps  scale   eff
   1            1565   1.00  1.00
   2            1285   1.64  0.82
   3            1018   1.95  0.65
   4             843   2.15  0.54
   5             696   2.22  0.44
   6             586   2.25  0.37
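
The scale and eff columns appear consistent with scale = conc * (thread avg tps) / (single-thread tps) and eff = scale / conc. That reading is inferred from the numbers, not stated in the CSV, so treat the sketch below as an assumption; the function names are made up for illustration. It recomputes both columns for the populate rows:

```haskell
import Text.Printf (printf)

-- | Aggregate speedup over the single-thread run (assumed definition):
-- (threads * per-thread tps) / single-thread tps.
scale :: Double -> Int -> Double -> Double
scale baseTps conc tps = fromIntegral conc * tps / baseTps

-- | Parallel efficiency (assumed definition): speedup per thread.
efficiency :: Double -> Int -> Double -> Double
efficiency baseTps conc tps = scale baseTps conc tps / fromIntegral conc

-- Per-thread average tps of the "populate" runs, copied from the table.
populate :: [(Int, Double)]
populate = [(1, 1741), (2, 1285), (3, 1028), (4, 843), (5, 696), (6, 600)]

-- Single-thread throughput used as the baseline.
populateBase :: Double
populateBase = 1741

main :: IO ()
main = mapM_ row populate
  where
    row (conc, tps) =
      printf "%d  %6.0f  %.2f  %.2f\n"
        conc tps (scale populateBase conc tps) (efficiency populateBase conc tps)
```

Read this way, six threads together deliver only about twice the single-thread throughput, which is the flat scaling the TL;DR refers to.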

The script is at: https://github.com/complyue/txs/blob/master/scripts/scan.txs
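
For readers who don't want to build the interpreter, here is a much-reduced sketch of what a contention-free scaling measurement can look like. It is not the txs workload: each thread only increments its own private TVar, so it exercises far less memory than the real script, and `perThreadTps` is a hypothetical helper. It assumes the `stm` package that ships with GHC, and needs to be built with `-threaded` and run with `+RTS -N` to use more than one core.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Concurrent.STM (atomically, modifyTVar', newTVarIO)
import Control.Monad (forM, forM_, replicateM_)
import GHC.Clock (getMonotonicTime)
import Text.Printf (printf)

-- | Run @conc@ threads, each committing @n@ transactions against its own
-- private TVar (so no two threads ever touch the same TVar), and return
-- the average per-thread throughput in transactions per second.
perThreadTps :: Int -> Int -> IO Double
perThreadTps conc n = do
  dones <- forM [1 .. conc] $ \_ -> do
    done <- newEmptyMVar
    _ <- forkIO $ do
      tv <- newTVarIO (0 :: Int)
      t0 <- getMonotonicTime
      replicateM_ n $ atomically $ modifyTVar' tv (+ 1)
      t1 <- getMonotonicTime
      -- Report this thread's own rate back to the coordinator.
      putMVar done (fromIntegral n / (t1 - t0))
    return done
  rates <- mapM takeMVar dones
  return (sum rates / fromIntegral conc)

main :: IO ()
main = forM_ [1 .. 6] $ \conc -> do
  tps <- perThreadTps conc 200000
  printf "conc=%d  thread avg tps=%.0f\n" conc tps
```

Because the transactions here are tiny, the absolute numbers will not resemble the table above; only the shape of the curve as conc grows is of interest.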

The GHC command line is in: https://github.com/complyue/txs/blob/master/metric.bash

ghc --make -Wall -threaded -rtsopts -prof -o txs -outputdir . -stubdir . -i../src ../src/Main.hs && (

  ./txs +RTS -N10 -A32m -H256m -qg -I0 -M5g -T -s <../scripts/"${SCRIPT}".txs

)
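
Since the command above passes `-T`, the program can also inspect the runtime's statistics from inside the process, which is one way to check the earlier suspicion in this thread that GC dominates execution. The following is a small sketch using `GHC.Stats` from `base`; `gcShare` is a hypothetical helper, and the allocation in `main` is only there to give the collector something to do.

```haskell
import GHC.Stats (RTSStats (..), getRTSStats, getRTSStatsEnabled)
import Text.Printf (printf)

-- | Fraction of elapsed wall-clock time spent in GC so far,
-- computed as gc / (gc + mutator).
gcShare :: RTSStats -> Double
gcShare s
  | total <= 0 = 0
  | otherwise = fromIntegral (gc_elapsed_ns s) / total
  where
    total = fromIntegral (gc_elapsed_ns s + mutator_elapsed_ns s) :: Double

main :: IO ()
main = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then putStrLn "RTS stats are off; rerun with +RTS -T"
    else do
      -- Allocate a little so the statistics are not all zero.
      print (length (show (product [1 .. 3000 :: Integer])))
      s <- getRTSStats
      printf "GCs: %d, max live bytes: %d, GC share of elapsed time: %.1f%%\n"
        (gcs s) (max_live_bytes s) (100 * gcShare s)
```

The `-s` flag in the command prints a similar summary at exit; the in-process version is only useful when the numbers are wanted per phase of a run.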

I intended to use a single Haskell-based process to handle metadata about the many ndarrays being crunched, acting as a centralized graph database. As it turned out, many clients queuing to query/insert metadata against a single database node create such high data throughput that just a few CPU chips can't handle it well. We didn't expect this, but apparently we would have to deploy more machines for such a database instance, with data partitioned and distributed across more nodes for load balancing. (A single machine with many CPU sockets, and thus many NUMA nodes, is not an option for us either.) However, the flexibility a central graph database would provide is not currently a crucial requirement of our business, so we are not interested in developing this database system further.

We currently have CPU-intensive workloads handled by a cluster of machines running Python processes (crunching numbers with NumPy and C++ tensors), while some Haskell-based number-crunching software is still under development. It may turn out some day that some heavier computation is bound up with db access, effectively creating CPU-intensive workloads for the database functionality; then we'll have the opportunity to dive deeper into the database implementation. And in case more flexibility is required in the near future, I think I'll tend to implement embedded database instances in those worker processes, rather than centralized db servers.

I wonder whether data-intensive workloads would scale up more easily on ARM servers, though that's not a feasible option for us in the near term either.

Thanks to everyone who has been helpful!

Best regards,
Compl

> On 2020-07-31, at 22:35, YueCompl via Haskell-Cafe <haskell-cafe at haskell.org> wrote:
> 
> Hi Ben,
> 
> Thanks as always for your great support! At the moment I'm working on a minimal working example to reproduce the symptoms. I intend to work out a program that depends only on libraries bundled with GHC, so it can be easily diagnosed without my complex environment, but so far I have no reproduction yet. I'll come back with some code once it can reproduce something.
> 
> Thanks in advance.
> 
> Sincerely,
> Compl
> 
> 
>> On 2020-07-31, at 21:36, Ben Gamari <ben at well-typed.com> wrote:
>> 
>> Simon Peyton Jones via Haskell-Cafe <haskell-cafe at haskell.org> writes:
>> 
>>>> Compl’s problem is (apparently) that execution becomes dominated by
>>>> GC. That doesn’t sound like a constant-factor overhead from TVars, no
>>>> matter how efficient (or otherwise) they are. It sounds more like a
>>>> space leak to me; perhaps you need some strict evaluation or
>>>> something.
>>> 
>>> My point is only: before re-engineering STM it would make sense to get
>>> a much more detailed insight into what is actually happening, and
>>> where the space and time is going. We have tools to do this (heap
>>> profiling, Threadscope, …) but I know they need some skill and insight
>>> to use well. But we don’t have nearly enough insight to draw
>>> meaningful conclusions yet.
>>> 
>>> Maybe someone with experience of performance debugging might feel able
>>> to help Compl?
>>> 
>> Compl,
>> 
>> If you want to discuss the issue feel free to get in touch on IRC. I
>> would be happy to help.
>> 
>> It would be great if we had something of a decision tree for performance
>> tuning of Haskell code in the users guide or Wiki. We have so many tools
>> yet there isn't a comprehensive overview of
>> 
>> 1. what factors might affect which runtime characteristics of your
>>   program
>> 2. which tools can be used to measure which factors
>> 3. how these factors can be improved
>> 
>> Cheers,
>> 
>> - Ben
>> _______________________________________________
>> Haskell-Cafe mailing list
>> To (un)subscribe, modify options or view archives go to:
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>> Only members subscribed via the mailman list are allowed to post.
> 
