[Haskell-cafe] Timeout on pure code
Kostiantyn Rybnikov
k-bx at k-bx.com
Sat Apr 25 22:44:21 UTC 2015
Hi David.
I planned to create a detailed bug-report at hslogger's issues to start
investigation there (as a better place) on Monday, but since I have code
prepared already, it's easy to share it right now:
https://gist.github.com/k-bx/ccf6fd1c73680c8a4345
I'm launching it as:
time ./dist/build/seq/seq &> /dev/null
We don't use syslog driver, instead we have a separate file-to-syslog
worker to decouple these components.
On Sun, Apr 26, 2015 at 12:08 AM, David Turner <
dct25-561bs at mythic-beasts.com> wrote:
> Hi,
>
> I've had a look at this as we use hslogger too, so I'm keen to avoid
> this kind of performance issue. I threw a quick Criterion benchmark
> together:
>
> https://gist.github.com/DaveCTurner/f977123b4498c4c64569
>
> The headline result on my test machine are that each log call takes
> ~540us, so 2000 should take about a second. Would be interested if you
> could run the same benchmark on your setup as it's possible that
> there's something else downstream that's causing you a problem.
>
> A couple of things that might be worth bearing in mind: if you're
> talking to syslog over /dev/log then that can block if the log daemon
> falls behind: unix datagram sockets don't drop datagrams when they're
> congested. If the /dev/log test is slow but the UDP test is fast then
> it could be that your syslog can't handle the load.
>
> I'm using rsyslogd and have enabled the feature that combines
> identical messages, so this test doesn't generate much disk IO and it
> keeps up easily, so the UDP and /dev/log tests run about equally fast
> for me. Is your syslog writing out every message? It may be flushing
> to disk after every message too, which would be terribly slow.
>
> If you're not logging to syslog, what's your hslogger config?
>
> Cheers,
>
> David
>
>
> On 24 April 2015 at 20:25, Kostiantyn Rybnikov <k-bx at k-bx.com> wrote:
> > An update for everyone interested (and not). Turned out it's neither GHC
> > RTS, Snap or networking issues, it's hslogger being very slow. I thought
> > it's slow when used concurrently, but just did a test when it writes 2000
> > 5kb messages sequentially and that finishes in 111 seconds (while minimal
> > program that writes same 2000 messages finishes in 0.12s).
> >
> > I hope I'll have a chance to investigate why hslogger is so slow in
> future,
> > but meanwhile will just remove logging.
> >
> > On Thu, Apr 23, 2015 at 4:08 PM, Kostiantyn Rybnikov <k-bx at k-bx.com>
> wrote:
> >>
> >> All right, good news!
> >>
> >> After adding ekg, gathering its data via bosun and seeing nothing
> useful I
> >> actually figured out that I could try harder to reproduce issue by
> myself
> >> instead of waiting for users to do that. And I succeeded! :)
> >>
> >> So, after launching 20 infinite curl loops to that handler's url I was
> >> quickly able to reproduce the issue, so the task seems clear now: keep
> >> reducing the code, reproduce locally, possibly without external services
> >> etc. I'll write up after I get to something.
> >>
> >> Thanks.
> >>
> >> On Wed, Apr 22, 2015 at 11:09 PM, Gregory Collins
> >> <greg at gregorycollins.net> wrote:
> >>>
> >>> Maybe but it would be helpful to rule the scenario out. Johan's ekg
> >>> library is also useful, it exports a webserver on a different port
> that you
> >>> can use to track metrics like gc times, etc.
> >>>
> >>> Other options for further debugging include gathering strace logs from
> >>> the binary. You'll have to do some data gathering to narrow down the
> cause
> >>> unfortunately -- http client? your code? Snap server? GHC event manager
> >>> (System.timeout is implemented here)? GC? etc
> >>>
> >>> G
> >>>
> >>> On Wed, Apr 22, 2015 at 10:14 AM, Kostiantyn Rybnikov <k-bx at k-bx.com>
> >>> wrote:
> >>>>
> >>>> Gregory,
> >>>>
> >>>> Servers are far from being highly-overloaded, since they're currently
> >>>> under a much less load they used to be. Memory consumption is stable
> and
> >>>> low, and there's a lot of free RAM also.
> >>>>
> >>>> Would you say that given these factors this scenario is unlikely?
> >>>>
> >>>> On Wed, Apr 22, 2015 at 7:56 PM, Gregory Collins
> >>>> <greg at gregorycollins.net> wrote:
> >>>>>
> >>>>> Given your gist, the timeout on your requests is set to a half-second
> >>>>> so it's conceivable that a highly-loaded server might have GC pause
> times
> >>>>> approaching that long. Smells to me like a classic Haskell memory
> leak
> >>>>> (that's why the problem occurs after the server has been up for a
> while):
> >>>>> run your program with the heap profiler, and audit any shared
> >>>>> tables/IORefs/MVars to make sure you are not building up thunks
> there.
> >>>>>
> >>>>> Greg
> >>>>>
> >>>>> On Wed, Apr 22, 2015 at 9:14 AM, Kostiantyn Rybnikov <k-bx at k-bx.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hi!
> >>>>>>
> >>>>>> Our company's main commercial product is a Snap-based web app which
> we
> >>>>>> compile with GHC 7.8.4. It works on four app-servers currently
> load-balanced
> >>>>>> behind Haproxy.
> >>>>>>
> >>>>>> I recently implemented a new piece of functionality, which led to
> >>>>>> weird behavior which I have no idea how to debug, so I'm asking
> here for
> >>>>>> help and ideas!
> >>>>>>
> >>>>>> The new functionality is this: on specific url-handler, we need to
> >>>>>> query n external services concurrently with a timeout, gather and
> render
> >>>>>> results. Easy (in Haskell)!
> >>>>>>
> >>>>>> The implementation looks, as you might imagine, something like this
> >>>>>> (sorry for almost-real-haskell, I'm sure I forgot tons of imports
> and other
> >>>>>> things, but I hope everything is clear as-is, if not -- I'll be
> glad to
> >>>>>> update gist to make things more specific):
> >>>>>>
> >>>>>> https://gist.github.com/k-bx/0cf7035aaf1ad6306e76
> >>>>>>
> >>>>>> Now, this works wonderful for some time, and in logs I can see both,
> >>>>>> successful fetches of external-content, and also lots of timeouts
> from our
> >>>>>> external providers. Life is good.
> >>>>>>
> >>>>>> But! After several days of work (sometimes a day, sometimes couple
> >>>>>> days), apps on all 4 servers go crazy. It might take some interval
> (like 20
> >>>>>> minutes) before they're all crazy, so it's not super-synchronous.
> Now: how
> >>>>>> crazy, exactly?
> >>>>>>
> >>>>>> First of all, this endpoint timeouts. Haproxy requests for a
> response,
> >>>>>> and response times out, so they "hang".
> >>>>>>
> >>>>>> Secondly, logs are interesting. If you look at the code from gist
> once
> >>>>>> again, you can see, that some of CandidateProvider's don't actually
> require
> >>>>>> any networking work, so all they do is actually just logging that
> they're
> >>>>>> working (I added this as part of debugging actually) and return
> pure data.
> >>>>>> So what's weird is that they timeout also! Here's how output of our
> logs
> >>>>>> starts to look like after the bug happens:
> >>>>>>
> >>>>>> ```
> >>>>>> [2015-04-22 09:56:20] provider: CandidateProvider1
> >>>>>> [2015-04-22 09:56:20] provider: CandidateProvider2
> >>>>>> [2015-04-22 09:56:21] Got timeout while requesting
> CandidateProvider1
> >>>>>> [2015-04-22 09:56:21] Got timeout while requesting
> CandidateProvider2
> >>>>>> [2015-04-22 09:56:22] provider: CandidateProvider1
> >>>>>> [2015-04-22 09:56:22] provider: CandidateProvider2
> >>>>>> [2015-04-22 09:56:23] Got timeout while requesting
> CandidateProvider1
> >>>>>> [2015-04-22 09:56:23] Got timeout while requesting
> CandidateProvider2
> >>>>>> ... and so on
> >>>>>> ```
> >>>>>>
> >>>>>> What's also weird is that, even after timeout is logged, the string
> >>>>>> ""Got responses!" never gets logged also! So hanging happens
> somewhere
> >>>>>> in-between.
> >>>>>>
> >>>>>> I have to say I'm sorry that I don't have strace output now, I'll
> have
> >>>>>> to wait until this situation happens once again, but I'll get later
> to you
> >>>>>> with this info.
> >>>>>>
> >>>>>> So, how is this possible that almost-pure code gets timed-out? And
> why
> >>>>>> does it hang afterwards?
> >>>>>>
> >>>>>> CPU and other resource usage is quite low, number of open
> >>>>>> file-descriptors also (it seems).
> >>>>>>
> >>>>>> Thanks for all the suggestions in advance!
> >>>>>>
> >>>>>> _______________________________________________
> >>>>>> Haskell-Cafe mailing list
> >>>>>> Haskell-Cafe at haskell.org
> >>>>>> http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> >>>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>> --
> >>>>> Gregory Collins <greg at gregorycollins.net>
> >>>>
> >>>>
> >>>
> >>>
> >>>
> >>> --
> >>> Gregory Collins <greg at gregorycollins.net>
> >>
> >>
> >
> >
> > _______________________________________________
> > Haskell-Cafe mailing list
> > Haskell-Cafe at haskell.org
> > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
> >
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.haskell.org/pipermail/haskell-cafe/attachments/20150426/ebca2941/attachment.html>
More information about the Haskell-Cafe
mailing list