[Haskell-cafe] Timeout on pure code
greg at gregorycollins.net
Wed Apr 22 16:56:12 UTC 2015
Given your gist, the timeout on your requests is set to a half-second so
it's conceivable that a highly-loaded server might have GC pause times
approaching that long. Smells to me like a classic Haskell memory leak
(that's why the problem occurs after the server has been up for a while):
run your program with the heap profiler, and audit any shared
tables/IORefs/MVars to make sure you are not building up thunks there.
On Wed, Apr 22, 2015 at 9:14 AM, Kostiantyn Rybnikov <k-bx at k-bx.com> wrote:
> Our company's main commercial product is a Snap-based web app which we
> compile with GHC 7.8.4. It works on four app-servers currently
> load-balanced behind Haproxy.
> I recently implemented a new piece of functionality, which led to weird
> behavior which I have no idea how to debug, so I'm asking here for help and
> The new functionality is this: on specific url-handler, we need to query n
> external services concurrently with a timeout, gather and render results.
> Easy (in Haskell)!
> The implementation looks, as you might imagine, something like this (sorry
> for almost-real-haskell, I'm sure I forgot tons of imports and other
> things, but I hope everything is clear as-is, if not -- I'll be glad to
> update gist to make things more specific):
> Now, this works wonderful for some time, and in logs I can see both,
> successful fetches of external-content, and also lots of timeouts from our
> external providers. Life is good.
> But! After several days of work (sometimes a day, sometimes couple days),
> apps on all 4 servers go crazy. It might take some interval (like 20
> minutes) before they're all crazy, so it's not super-synchronous. Now: how
> crazy, exactly?
> First of all, this endpoint timeouts. Haproxy requests for a response, and
> response times out, so they "hang".
> Secondly, logs are interesting. If you look at the code from gist once
> again, you can see, that some of CandidateProvider's don't actually require
> any networking work, so all they do is actually just logging that they're
> working (I added this as part of debugging actually) and return pure data.
> So what's weird is that they timeout also! Here's how output of our logs
> starts to look like after the bug happens:
> [2015-04-22 09:56:20] provider: CandidateProvider1
> [2015-04-22 09:56:20] provider: CandidateProvider2
> [2015-04-22 09:56:21] Got timeout while requesting CandidateProvider1
> [2015-04-22 09:56:21] Got timeout while requesting CandidateProvider2
> [2015-04-22 09:56:22] provider: CandidateProvider1
> [2015-04-22 09:56:22] provider: CandidateProvider2
> [2015-04-22 09:56:23] Got timeout while requesting CandidateProvider1
> [2015-04-22 09:56:23] Got timeout while requesting CandidateProvider2
> ... and so on
> What's also weird is that, even after timeout is logged, the string ""Got
> responses!" never gets logged also! So hanging happens somewhere in-between.
> I have to say I'm sorry that I don't have strace output now, I'll have to
> wait until this situation happens once again, but I'll get later to you
> with this info.
> So, how is this possible that almost-pure code gets timed-out? And why
> does it hang afterwards?
> CPU and other resource usage is quite low, number of open file-descriptors
> also (it seems).
> Thanks for all the suggestions in advance!
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
Gregory Collins <greg at gregorycollins.net>
-------------- next part --------------
An HTML attachment was scrubbed...
More information about the Haskell-Cafe