[Haskell-cafe] Timeout on pure code

Kostiantyn Rybnikov k-bx at k-bx.com
Fri Apr 24 19:25:54 UTC 2015

An update for everyone interested (and not). Turned out it's neither GHC
RTS, Snap or networking issues, it's hslogger being very slow. I thought
it's slow when used concurrently, but just did a test when it writes 2000
5kb messages sequentially and that finishes in 111 seconds (while minimal
program that writes same 2000 messages finishes in 0.12s).

I hope I'll have a chance to investigate why hslogger is so slow in future,
but meanwhile will just remove logging.

On Thu, Apr 23, 2015 at 4:08 PM, Kostiantyn Rybnikov <k-bx at k-bx.com> wrote:

> All right, good news!
> After adding ekg, gathering its data via bosun and seeing nothing useful I
> actually figured out that I could try harder to reproduce issue by myself
> instead of waiting for users to do that. And I succeeded! :)
> So, after launching 20 infinite curl loops to that handler's url I was
> quickly able to reproduce the issue, so the task seems clear now: keep
> reducing the code, reproduce locally, possibly without external services
> etc. I'll write up after I get to something.
> Thanks.
> On Wed, Apr 22, 2015 at 11:09 PM, Gregory Collins <greg at gregorycollins.net
> > wrote:
>> Maybe but it would be helpful to rule the scenario out. Johan's ekg
>> library is also useful, it exports a webserver on a different port that you
>> can use to track metrics like gc times, etc.
>> Other options for further debugging include gathering strace logs from
>> the binary. You'll have to do some data gathering to narrow down the cause
>> unfortunately -- http client? your code? Snap server? GHC event manager
>> (System.timeout is implemented here)? GC? etc
>> G
>> On Wed, Apr 22, 2015 at 10:14 AM, Kostiantyn Rybnikov <k-bx at k-bx.com>
>> wrote:
>>> Gregory,
>>> Servers are far from being highly-overloaded, since they're currently
>>> under a much less load they used to be. Memory consumption is stable and
>>> low, and there's a lot of free RAM also.
>>> Would you say that given these factors this scenario is unlikely?
>>> On Wed, Apr 22, 2015 at 7:56 PM, Gregory Collins <
>>> greg at gregorycollins.net> wrote:
>>>> Given your gist, the timeout on your requests is set to a half-second
>>>> so it's conceivable that a highly-loaded server might have GC pause times
>>>> approaching that long. Smells to me like a classic Haskell memory leak
>>>> (that's why the problem occurs after the server has been up for a while):
>>>> run your program with the heap profiler, and audit any shared
>>>> tables/IORefs/MVars to make sure you are not building up thunks there.
>>>> Greg
>>>> On Wed, Apr 22, 2015 at 9:14 AM, Kostiantyn Rybnikov <k-bx at k-bx.com>
>>>> wrote:
>>>>> Hi!
>>>>> Our company's main commercial product is a Snap-based web app which we
>>>>> compile with GHC 7.8.4. It works on four app-servers currently
>>>>> load-balanced behind Haproxy.
>>>>> I recently implemented a new piece of functionality, which led to
>>>>> weird behavior which I have no idea how to debug, so I'm asking here for
>>>>> help and ideas!
>>>>> The new functionality is this: on specific url-handler, we need to
>>>>> query n external services concurrently with a timeout, gather and render
>>>>> results. Easy (in Haskell)!
>>>>> The implementation looks, as you might imagine, something like this
>>>>> (sorry for almost-real-haskell, I'm sure I forgot tons of imports and other
>>>>> things, but I hope everything is clear as-is, if not -- I'll be glad to
>>>>> update gist to make things more specific):
>>>>> https://gist.github.com/k-bx/0cf7035aaf1ad6306e76
>>>>> Now, this works wonderful for some time, and in logs I can see both,
>>>>> successful fetches of external-content, and also lots of timeouts from our
>>>>> external providers. Life is good.
>>>>> But! After several days of work (sometimes a day, sometimes couple
>>>>> days), apps on all 4 servers go crazy. It might take some interval (like 20
>>>>> minutes) before they're all crazy, so it's not super-synchronous. Now: how
>>>>> crazy, exactly?
>>>>> First of all, this endpoint timeouts. Haproxy requests for a response,
>>>>> and response times out, so they "hang".
>>>>> Secondly, logs are interesting. If you look at the code from gist once
>>>>> again, you can see, that some of CandidateProvider's don't actually require
>>>>> any networking work, so all they do is actually just logging that they're
>>>>> working (I added this as part of debugging actually) and return pure data.
>>>>> So what's weird is that they timeout also! Here's how output of our logs
>>>>> starts to look like after the bug happens:
>>>>> ```
>>>>> [2015-04-22 09:56:20] provider: CandidateProvider1
>>>>> [2015-04-22 09:56:20] provider: CandidateProvider2
>>>>> [2015-04-22 09:56:21] Got timeout while requesting CandidateProvider1
>>>>> [2015-04-22 09:56:21] Got timeout while requesting CandidateProvider2
>>>>> [2015-04-22 09:56:22] provider: CandidateProvider1
>>>>> [2015-04-22 09:56:22] provider: CandidateProvider2
>>>>> [2015-04-22 09:56:23] Got timeout while requesting CandidateProvider1
>>>>> [2015-04-22 09:56:23] Got timeout while requesting CandidateProvider2
>>>>> ... and so on
>>>>> ```
>>>>> What's also weird is that, even after timeout is logged, the string
>>>>> ""Got responses!" never gets logged also! So hanging happens somewhere
>>>>> in-between.
>>>>> I have to say I'm sorry that I don't have strace output now, I'll have
>>>>> to wait until this situation happens once again, but I'll get later to you
>>>>> with this info.
>>>>> So, how is this possible that almost-pure code gets timed-out? And why
>>>>> does it hang afterwards?
>>>>> CPU and other resource usage is quite low, number of open
>>>>> file-descriptors also (it seems).
>>>>> Thanks for all the suggestions in advance!
>>>>> _______________________________________________
>>>>> Haskell-Cafe mailing list
>>>>> Haskell-Cafe at haskell.org
>>>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
>>>> --
>>>> Gregory Collins <greg at gregorycollins.net>
>> --
>> Gregory Collins <greg at gregorycollins.net>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20150424/21cfa309/attachment-0001.html>

More information about the Haskell-Cafe mailing list