[Haskell-cafe] Timeout on pure code

Kostiantyn Rybnikov k-bx at k-bx.com
Wed Apr 22 17:14:09 UTC 2015


Gregory,

Servers are far from being highly-overloaded, since they're currently under
a much less load they used to be. Memory consumption is stable and low, and
there's a lot of free RAM also.

Would you say that given these factors this scenario is unlikely?

On Wed, Apr 22, 2015 at 7:56 PM, Gregory Collins <greg at gregorycollins.net>
wrote:

> Given your gist, the timeout on your requests is set to a half-second so
> it's conceivable that a highly-loaded server might have GC pause times
> approaching that long. Smells to me like a classic Haskell memory leak
> (that's why the problem occurs after the server has been up for a while):
> run your program with the heap profiler, and audit any shared
> tables/IORefs/MVars to make sure you are not building up thunks there.
>
> Greg
>
> On Wed, Apr 22, 2015 at 9:14 AM, Kostiantyn Rybnikov <k-bx at k-bx.com>
> wrote:
>
>> Hi!
>>
>> Our company's main commercial product is a Snap-based web app which we
>> compile with GHC 7.8.4. It works on four app-servers currently
>> load-balanced behind Haproxy.
>>
>> I recently implemented a new piece of functionality, which led to weird
>> behavior which I have no idea how to debug, so I'm asking here for help and
>> ideas!
>>
>> The new functionality is this: on specific url-handler, we need to query
>> n external services concurrently with a timeout, gather and render results.
>> Easy (in Haskell)!
>>
>> The implementation looks, as you might imagine, something like this
>> (sorry for almost-real-haskell, I'm sure I forgot tons of imports and other
>> things, but I hope everything is clear as-is, if not -- I'll be glad to
>> update gist to make things more specific):
>>
>> https://gist.github.com/k-bx/0cf7035aaf1ad6306e76
>>
>> Now, this works wonderful for some time, and in logs I can see both,
>> successful fetches of external-content, and also lots of timeouts from our
>> external providers. Life is good.
>>
>> But! After several days of work (sometimes a day, sometimes couple days),
>> apps on all 4 servers go crazy. It might take some interval (like 20
>> minutes) before they're all crazy, so it's not super-synchronous. Now: how
>> crazy, exactly?
>>
>> First of all, this endpoint timeouts. Haproxy requests for a response,
>> and response times out, so they "hang".
>>
>> Secondly, logs are interesting. If you look at the code from gist once
>> again, you can see, that some of CandidateProvider's don't actually require
>> any networking work, so all they do is actually just logging that they're
>> working (I added this as part of debugging actually) and return pure data.
>> So what's weird is that they timeout also! Here's how output of our logs
>> starts to look like after the bug happens:
>>
>> ```
>> [2015-04-22 09:56:20] provider: CandidateProvider1
>> [2015-04-22 09:56:20] provider: CandidateProvider2
>> [2015-04-22 09:56:21] Got timeout while requesting CandidateProvider1
>> [2015-04-22 09:56:21] Got timeout while requesting CandidateProvider2
>> [2015-04-22 09:56:22] provider: CandidateProvider1
>> [2015-04-22 09:56:22] provider: CandidateProvider2
>> [2015-04-22 09:56:23] Got timeout while requesting CandidateProvider1
>> [2015-04-22 09:56:23] Got timeout while requesting CandidateProvider2
>> ... and so on
>> ```
>>
>> What's also weird is that, even after timeout is logged, the string ""Got
>> responses!" never gets logged also! So hanging happens somewhere in-between.
>>
>> I have to say I'm sorry that I don't have strace output now, I'll have to
>> wait until this situation happens once again, but I'll get later to you
>> with this info.
>>
>> So, how is this possible that almost-pure code gets timed-out? And why
>> does it hang afterwards?
>>
>> CPU and other resource usage is quite low, number of open
>> file-descriptors also (it seems).
>>
>> Thanks for all the suggestions in advance!
>>
>> _______________________________________________
>> Haskell-Cafe mailing list
>> Haskell-Cafe at haskell.org
>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>>
>>
>
>
> --
> Gregory Collins <greg at gregorycollins.net>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20150422/fb1360b9/attachment.html>


More information about the Haskell-Cafe mailing list