[Haskell-cafe] Timeout on pure code
k-bx at k-bx.com
Wed Apr 22 16:14:07 UTC 2015
Our company's main commercial product is a Snap-based web app which we
compile with GHC 7.8.4. It works on four app-servers currently
load-balanced behind Haproxy.
I recently implemented a new piece of functionality, which led to weird
behavior which I have no idea how to debug, so I'm asking here for help and
The new functionality is this: on specific url-handler, we need to query n
external services concurrently with a timeout, gather and render results.
Easy (in Haskell)!
The implementation looks, as you might imagine, something like this (sorry
for almost-real-haskell, I'm sure I forgot tons of imports and other
things, but I hope everything is clear as-is, if not -- I'll be glad to
update gist to make things more specific):
Now, this works wonderful for some time, and in logs I can see both,
successful fetches of external-content, and also lots of timeouts from our
external providers. Life is good.
But! After several days of work (sometimes a day, sometimes couple days),
apps on all 4 servers go crazy. It might take some interval (like 20
minutes) before they're all crazy, so it's not super-synchronous. Now: how
First of all, this endpoint timeouts. Haproxy requests for a response, and
response times out, so they "hang".
Secondly, logs are interesting. If you look at the code from gist once
again, you can see, that some of CandidateProvider's don't actually require
any networking work, so all they do is actually just logging that they're
working (I added this as part of debugging actually) and return pure data.
So what's weird is that they timeout also! Here's how output of our logs
starts to look like after the bug happens:
[2015-04-22 09:56:20] provider: CandidateProvider1
[2015-04-22 09:56:20] provider: CandidateProvider2
[2015-04-22 09:56:21] Got timeout while requesting CandidateProvider1
[2015-04-22 09:56:21] Got timeout while requesting CandidateProvider2
[2015-04-22 09:56:22] provider: CandidateProvider1
[2015-04-22 09:56:22] provider: CandidateProvider2
[2015-04-22 09:56:23] Got timeout while requesting CandidateProvider1
[2015-04-22 09:56:23] Got timeout while requesting CandidateProvider2
... and so on
What's also weird is that, even after timeout is logged, the string ""Got
responses!" never gets logged also! So hanging happens somewhere in-between.
I have to say I'm sorry that I don't have strace output now, I'll have to
wait until this situation happens once again, but I'll get later to you
with this info.
So, how is this possible that almost-pure code gets timed-out? And why does
it hang afterwards?
CPU and other resource usage is quite low, number of open file-descriptors
also (it seems).
Thanks for all the suggestions in advance!
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Haskell-Cafe