[Haskell-cafe] Optimizing a high-traffic network architecture
simonmar at microsoft.com
Thu Dec 15 09:02:02 EST 2005
On 15 December 2005 10:21, Joel Reymont wrote:
> Here are statistics that I gathered. I'm almost done modifying the
> program to use 1 timer thread instead of 1 per bot as well as writing
> to the socket from the writer thread. This should reduce the number
> of threads from 6k (2k x 3) to 2k plus change.
> It appears that +RTS -k3k does make a difference. As per Simon, 2-4k
> avoids the thread being garbage collected because each thread gets
> its own block in the storage manager. Simon, did I get that right?
> BTW, how does garbage-collecting a thread work in this scenario? My
> threads are very long-running.
> The total is the number of bots launched, lobby is how many bots
> connected to the lobby. Failed is mostly due to connection reset by
> peer errors. The Windows C++ server uses IOCP and running a firewall
> was apparently interfering with that somehow. I hate Windows :-(.
> --- Test #1, +RTS -k3k as per Simon. Keep-alive timeout of 9 minutes.
> Total: 1961, Lobby: 1961, Failed: 0
> Total: 2000, Lobby: 2000, Failed: 1
> This test went smoothly and got to 2k connections very quickly. Maybe
> within 30 minutes or so. I did not gather CPU usage, etc. statistics.
> --- Test #2, No thread stack increase, 1 minute keep-alive timeout,
> more network traffic
> With a 1 minute timeout things run veeery slow. 86 physical and 158Mb
> of VM with 1k bots, CPU 50-60%. Data sent/received is 60-70 packets
> and 6-7kb/sec. Killed after a while.
> The statistics are phys/VM memory in Mb, CPU usage in %, and #packets/transfer speed per second:
> Total: 1345, Lobby: 1326, Failed: 0, 102/184, 50%, 90/8kb
> Total: 1395, Lobby: 1367, Failed: 2
> Total: 1421, Lobby: 1394, Failed: 4
> Total: 1490, Lobby: 1463, Failed: 4, 108/194, 50%, 110/11kb
> Total: 1574, Lobby: 1546, Failed: 4, 113/202, 50%, 116/11kb
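For readers following along, here is a minimal sketch of the restructuring Joel describes: one shared timer thread for all bots, with each bot's writer thread being the only one that touches its socket. It assumes each bot exposes an outbox (a Chan) that its writer thread drains; the names Msg, Bot, timerThread, writerThread and demo are hypothetical, and "sending" just bumps a counter instead of writing to a real socket. With per-bot timers, 2k bots need 2k timer threads; here they need one.

```haskell
import Control.Concurrent
  (Chan, MVar, forkIO, newChan, newEmptyMVar, newMVar, putMVar,
   readChan, readMVar, takeMVar, threadDelay, writeChan)
import Control.Monad (forM, forM_, replicateM)
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)

-- Hypothetical messages a bot's writer thread is asked to send.
data Msg = KeepAlive | Stop

-- A bot is reduced here to the outbox its writer thread drains.
newtype Bot = Bot { outbox :: Chan Msg }

-- The single shared timer thread: every tick it enqueues a KeepAlive on
-- every registered bot's outbox, then signals 'done' after 'ticks' rounds.
-- The registry is an MVar so bots could be added or removed at runtime.
timerThread :: Int -> Int -> MVar [Bot] -> MVar () -> IO ()
timerThread intervalUs ticks registry done = do
  forM_ [1 .. ticks] $ \_ -> do
    threadDelay intervalUs
    bots <- readMVar registry
    forM_ bots $ \b -> writeChan (outbox b) KeepAlive
  putMVar done ()

-- Per-bot writer thread: the only thread that would touch the bot's socket.
-- Here "sending" just bumps a shared counter instead of calling hPut/send.
writerThread :: IORef Int -> Bot -> MVar () -> IO ()
writerThread sent (Bot box) finished = loop
  where
    loop = do
      msg <- readChan box
      case msg of
        KeepAlive -> atomicModifyIORef' sent (\n -> (n + 1, ())) >> loop
        Stop      -> putMVar finished ()

-- Start nBots writer threads plus ONE timer thread, run 'ticks' rounds,
-- shut the writers down and return how many keep-alives were "sent".
demo :: Int -> Int -> IO Int
demo nBots ticks = do
  sent  <- newIORef 0
  bots  <- replicateM nBots (Bot <$> newChan)
  dones <- forM bots $ \b -> do
    finished <- newEmptyMVar
    _ <- forkIO (writerThread sent b finished)
    pure finished
  registry  <- newMVar bots
  timerDone <- newEmptyMVar
  _ <- forkIO (timerThread 1000 ticks registry timerDone)
  takeMVar timerDone
  -- Chan is FIFO, so each writer sees all its KeepAlives before Stop.
  forM_ bots $ \b -> writeChan (outbox b) Stop
  mapM_ takeMVar dones
  readIORef sent

main :: IO ()
main = demo 100 3 >>= print  -- 100 bots x 3 ticks: prints 300
```

The thread count is nBots + 1 rather than 2 x nBots; a real client would add a reader thread per bot, giving roughly the "2k plus change" Joel expects once writes also move into the writer thread.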
Hmm, your machine is spending 50% of its time doing nothing, and the
network traffic is very low. I wouldn't expect 2k connections to pose
any problem at all, so further investigation is definitely required.
With 2k connections, the overhead of select() is going to start to be a
problem; you would notice the system time going up. Linking with -threaded
may help with this, because the threaded RTS calls select() less often.
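Since it is easy to forget the link flag, a small sketch like the following can confirm which runtime a binary actually got. It relies on Control.Concurrent's rtsSupportsBoundThreads, which is True only when the program was linked with -threaded (for example `ghc -threaded Main.hs`); the helper name runtimeKind and its messages are illustrative, and the description of the non-threaded RTS applies to POSIX systems.

```haskell
import Control.Concurrent (getNumCapabilities, rtsSupportsBoundThreads)

-- Describe which runtime this binary was linked against (illustrative text).
runtimeKind :: Bool -> String
runtimeKind threaded
  | threaded  = "threaded RTS (linked with -threaded)"
  | otherwise = "non-threaded RTS: on POSIX the scheduler itself select()s over blocked I/O"

main :: IO ()
main = do
  caps <- getNumCapabilities
  putStrLn (runtimeKind rtsSupportsBoundThreads)
  putStrLn ("capabilities: " ++ show caps)
```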
If that's not the cause, we should find out what your app is doing while
it's idle. If there are runnable threads (e.g. the launcher), then the
app should not be spending any of its time idle.
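As a crude first check for that idle time, one could compare the CPU time the process consumes against the wall-clock time that elapses over a window: a fraction near 0 while the launcher still has bots to start suggests threads are blocked rather than runnable. This is a minimal sketch, assuming System.CPUTime.getCPUTime (process CPU time in picoseconds) and GHC.Clock.getMonotonicTimeNSec, which needs base 4.11 or later, far newer than the GHC of this thread; busyFraction and measureBusy are hypothetical helpers. With several capabilities the fraction can exceed 1.

```haskell
import Control.Concurrent (threadDelay)
import GHC.Clock (getMonotonicTimeNSec)
import System.CPUTime (getCPUTime)

-- Fraction of one CPU used between two samples.
-- cpuPs: CPU time delta in picoseconds; wallNs: wall-clock delta in nanoseconds.
busyFraction :: Integer -> Integer -> Double
busyFraction cpuPs wallNs
  | wallNs <= 0 = 0
  | otherwise   = fromIntegral cpuPs / (fromIntegral wallNs * 1000)

-- Run an action and report how much of the elapsed wall time was spent on CPU.
measureBusy :: IO a -> IO (a, Double)
measureBusy act = do
  c0 <- getCPUTime
  w0 <- getMonotonicTimeNSec
  r  <- act
  c1 <- getCPUTime
  w1 <- getMonotonicTimeNSec
  pure (r, busyFraction (c1 - c0) (fromIntegral (w1 - w0)))

main :: IO ()
main = do
  -- Sleeping should report close to 0; pure computation close to 1.
  (_, idle) <- measureBusy (threadDelay 200000)
  (_, busy) <- measureBusy (pure $! sum [1 .. 5000000 :: Int])
  putStrLn ("sleep: " ++ show idle ++ ", compute: " ++ show busy)
```

In the bot client, measureBusy could wrap a fixed-length window of the launcher's run; getCPUTime's resolution is platform-dependent, so very short windows give noisy numbers.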