[Haskell-cafe] Timeout on pure code

Gregory Collins greg at gregorycollins.net
Wed Apr 22 16:56:12 UTC 2015


Given your gist, the timeout on your requests is set to a half-second so
it's conceivable that a highly-loaded server might have GC pause times
approaching that long. Smells to me like a classic Haskell memory leak
(that's why the problem occurs after the server has been up for a while):
run your program with the heap profiler, and audit any shared
tables/IORefs/MVars to make sure you are not building up thunks there.

Greg

On Wed, Apr 22, 2015 at 9:14 AM, Kostiantyn Rybnikov <k-bx at k-bx.com> wrote:

> Hi!
>
> Our company's main commercial product is a Snap-based web app which we
> compile with GHC 7.8.4. It works on four app-servers currently
> load-balanced behind Haproxy.
>
> I recently implemented a new piece of functionality, which led to weird
> behavior which I have no idea how to debug, so I'm asking here for help and
> ideas!
>
> The new functionality is this: on specific url-handler, we need to query n
> external services concurrently with a timeout, gather and render results.
> Easy (in Haskell)!
>
> The implementation looks, as you might imagine, something like this (sorry
> for almost-real-haskell, I'm sure I forgot tons of imports and other
> things, but I hope everything is clear as-is, if not -- I'll be glad to
> update gist to make things more specific):
>
> https://gist.github.com/k-bx/0cf7035aaf1ad6306e76
>
> Now, this works wonderful for some time, and in logs I can see both,
> successful fetches of external-content, and also lots of timeouts from our
> external providers. Life is good.
>
> But! After several days of work (sometimes a day, sometimes couple days),
> apps on all 4 servers go crazy. It might take some interval (like 20
> minutes) before they're all crazy, so it's not super-synchronous. Now: how
> crazy, exactly?
>
> First of all, this endpoint timeouts. Haproxy requests for a response, and
> response times out, so they "hang".
>
> Secondly, logs are interesting. If you look at the code from gist once
> again, you can see, that some of CandidateProvider's don't actually require
> any networking work, so all they do is actually just logging that they're
> working (I added this as part of debugging actually) and return pure data.
> So what's weird is that they timeout also! Here's how output of our logs
> starts to look like after the bug happens:
>
> ```
> [2015-04-22 09:56:20] provider: CandidateProvider1
> [2015-04-22 09:56:20] provider: CandidateProvider2
> [2015-04-22 09:56:21] Got timeout while requesting CandidateProvider1
> [2015-04-22 09:56:21] Got timeout while requesting CandidateProvider2
> [2015-04-22 09:56:22] provider: CandidateProvider1
> [2015-04-22 09:56:22] provider: CandidateProvider2
> [2015-04-22 09:56:23] Got timeout while requesting CandidateProvider1
> [2015-04-22 09:56:23] Got timeout while requesting CandidateProvider2
> ... and so on
> ```
>
> What's also weird is that, even after timeout is logged, the string ""Got
> responses!" never gets logged also! So hanging happens somewhere in-between.
>
> I have to say I'm sorry that I don't have strace output now, I'll have to
> wait until this situation happens once again, but I'll get later to you
> with this info.
>
> So, how is this possible that almost-pure code gets timed-out? And why
> does it hang afterwards?
>
> CPU and other resource usage is quite low, number of open file-descriptors
> also (it seems).
>
> Thanks for all the suggestions in advance!
>
> _______________________________________________
> Haskell-Cafe mailing list
> Haskell-Cafe at haskell.org
> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>
>


-- 
Gregory Collins <greg at gregorycollins.net>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20150422/292c6792/attachment.html>


More information about the Haskell-Cafe mailing list