[Haskell-cafe] Timeout on pure code

Thu Apr 23 13:08:44 UTC 2015

All right, good news!

After adding ekg, gathering its data via bosun and seeing nothing useful I
actually figured out that I could try harder to reproduce issue by myself
instead of waiting for users to do that. And I succeeded! :)

So, after launching 20 infinite curl loops to that handler's url I was
quickly able to reproduce the issue, so the task seems clear now: keep
reducing the code, reproduce locally, possibly without external services
etc. I'll write up after I get to something.

Thanks.

On Wed, Apr 22, 2015 at 11:09 PM, Gregory Collins <greg at gregorycollins.net>
wrote:

> Maybe but it would be helpful to rule the scenario out. Johan's ekg
> library is also useful, it exports a webserver on a different port that you
> can use to track metrics like gc times, etc.
>
> Other options for further debugging include gathering strace logs from the
> binary. You'll have to do some data gathering to narrow down the cause
> unfortunately -- http client? your code? Snap server? GHC event manager
> (System.timeout is implemented here)? GC? etc
>
> G
>
> On Wed, Apr 22, 2015 at 10:14 AM, Kostiantyn Rybnikov <k-bx at k-bx.com>
> wrote:
>
>> Gregory,
>>
>> Servers are far from being highly-overloaded, since they're currently
>> under a much less load they used to be. Memory consumption is stable and
>> low, and there's a lot of free RAM also.
>>
>> Would you say that given these factors this scenario is unlikely?
>>
>> On Wed, Apr 22, 2015 at 7:56 PM, Gregory Collins <greg at gregorycollins.net
>> > wrote:
>>
>>> Given your gist, the timeout on your requests is set to a half-second so
>>> it's conceivable that a highly-loaded server might have GC pause times
>>> approaching that long. Smells to me like a classic Haskell memory leak
>>> (that's why the problem occurs after the server has been up for a while):
>>> run your program with the heap profiler, and audit any shared
>>> tables/IORefs/MVars to make sure you are not building up thunks there.
>>>
>>> Greg
>>>
>>> On Wed, Apr 22, 2015 at 9:14 AM, Kostiantyn Rybnikov <k-bx at k-bx.com>
>>> wrote:
>>>
>>>> Hi!
>>>>
>>>> Our company's main commercial product is a Snap-based web app which we
>>>> compile with GHC 7.8.4. It works on four app-servers currently
>>>> load-balanced behind Haproxy.
>>>>
>>>> I recently implemented a new piece of functionality, which led to weird
>>>> behavior which I have no idea how to debug, so I'm asking here for help and
>>>> ideas!
>>>>
>>>> The new functionality is this: on specific url-handler, we need to
>>>> query n external services concurrently with a timeout, gather and render
>>>> results. Easy (in Haskell)!
>>>>
>>>> The implementation looks, as you might imagine, something like this
>>>> (sorry for almost-real-haskell, I'm sure I forgot tons of imports and other
>>>> things, but I hope everything is clear as-is, if not -- I'll be glad to
>>>> update gist to make things more specific):
>>>>
>>>> https://gist.github.com/k-bx/0cf7035aaf1ad6306e76
>>>>
>>>> Now, this works wonderful for some time, and in logs I can see both,
>>>> successful fetches of external-content, and also lots of timeouts from our
>>>> external providers. Life is good.
>>>>
>>>> But! After several days of work (sometimes a day, sometimes couple
>>>> days), apps on all 4 servers go crazy. It might take some interval (like 20
>>>> minutes) before they're all crazy, so it's not super-synchronous. Now: how
>>>> crazy, exactly?
>>>>
>>>> First of all, this endpoint timeouts. Haproxy requests for a response,
>>>> and response times out, so they "hang".
>>>>
>>>> Secondly, logs are interesting. If you look at the code from gist once
>>>> again, you can see, that some of CandidateProvider's don't actually require
>>>> any networking work, so all they do is actually just logging that they're
>>>> working (I added this as part of debugging actually) and return pure data.
>>>> So what's weird is that they timeout also! Here's how output of our logs
>>>> starts to look like after the bug happens:
>>>>
>>>> ```
>>>> [2015-04-22 09:56:20] provider: CandidateProvider1
>>>> [2015-04-22 09:56:20] provider: CandidateProvider2
>>>> [2015-04-22 09:56:21] Got timeout while requesting CandidateProvider1
>>>> [2015-04-22 09:56:21] Got timeout while requesting CandidateProvider2
>>>> [2015-04-22 09:56:22] provider: CandidateProvider1
>>>> [2015-04-22 09:56:22] provider: CandidateProvider2
>>>> [2015-04-22 09:56:23] Got timeout while requesting CandidateProvider1
>>>> [2015-04-22 09:56:23] Got timeout while requesting CandidateProvider2
>>>> ... and so on
>>>> ```
>>>>
>>>> What's also weird is that, even after timeout is logged, the string
>>>> ""Got responses!" never gets logged also! So hanging happens somewhere
>>>> in-between.
>>>>
>>>> I have to say I'm sorry that I don't have strace output now, I'll have
>>>> to wait until this situation happens once again, but I'll get later to you
>>>> with this info.
>>>>
>>>> So, how is this possible that almost-pure code gets timed-out? And why
>>>> does it hang afterwards?
>>>>
>>>> CPU and other resource usage is quite low, number of open
>>>> file-descriptors also (it seems).
>>>>
>>>> Thanks for all the suggestions in advance!
>>>>
>>>> _______________________________________________
>>>> Haskell-Cafe mailing list
>>>> Haskell-Cafe at haskell.org
>>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>>>>
>>>>
>>>
>>>
>>> --
>>> Gregory Collins <greg at gregorycollins.net>
>>>
>>
>>
>
>
> --
> Gregory Collins <greg at gregorycollins.net>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20150423/5795c4fb/attachment.html>