<div dir="ltr">Gregory,<div><br></div><div>The servers are far from overloaded: they're currently under much less load than they used to be. Memory consumption is stable and low, and there's plenty of free RAM.</div><div><br></div><div>Would you say that, given these factors, this scenario is unlikely?</div></div><div class="gmail_extra"><br><div class="gmail_quote">On Wed, Apr 22, 2015 at 7:56 PM, Gregory Collins <span dir="ltr">&lt;<a href="mailto:greg@gregorycollins.net" target="_blank">greg@gregorycollins.net</a>&gt;</span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr">Given your gist, the timeout on your requests is set to half a second, so it's conceivable that a highly loaded server might have GC pause times approaching that long. Smells to me like a classic Haskell memory leak (that's why the problem occurs after the server has been up for a while): run your program with the heap profiler, and audit any shared tables/IORefs/MVars to make sure you are not building up thunks there.<div><br></div><div>Greg</div></div><div class="gmail_extra"><br><div class="gmail_quote"><div><div class="h5">On Wed, Apr 22, 2015 at 9:14 AM, Kostiantyn Rybnikov <span dir="ltr">&lt;<a href="mailto:k-bx@k-bx.com" target="_blank">k-bx@k-bx.com</a>&gt;</span> wrote:<br></div></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div><div class="h5"><div dir="ltr">Hi!<div><br></div><div>Our company's main commercial product is a Snap-based web app which we compile with GHC 7.8.4. 
It currently runs on four app servers load-balanced behind HAProxy.</div><div><br></div><div>I recently implemented a new piece of functionality, which led to weird behavior that I have no idea how to debug, so I'm asking here for help and ideas!</div><div><br></div><div>The new functionality is this: in a specific URL handler, we need to query n external services concurrently with a timeout, then gather and render the results. Easy (in Haskell)!</div><div><br></div><div>The implementation looks, as you might imagine, something like this (sorry for the almost-real Haskell, I'm sure I forgot tons of imports and other things, but I hope everything is clear as-is; if not, I'll be glad to update the gist to make things more specific):</div><div><br></div><div><a href="https://gist.github.com/k-bx/0cf7035aaf1ad6306e76" target="_blank">https://gist.github.com/k-bx/0cf7035aaf1ad6306e76</a><br></div><div><br></div><div>Now, this works wonderfully for some time, and in the logs I can see both successful fetches of external content and lots of timeouts from our external providers. Life is good.</div><div><br></div><div>But! After some time in production (sometimes a day, sometimes a couple of days), the apps on all 4 servers go crazy. It might take some interval (like 20 minutes) before they're all crazy, so it's not super-synchronous. Now: how crazy, exactly?</div><div><br></div><div>First of all, this endpoint times out. HAProxy requests a response and the response times out, so the requests "hang".</div><div><br></div><div>Secondly, the logs are interesting. If you look at the code from the gist once again, you can see that some of the CandidateProviders don't actually require any networking work, so all they do is log that they're working (I added this as part of debugging) and return pure data. So what's weird is that they time out too! 
Here's what our log output looks like after the bug happens:</div><div><br></div><div>```</div><div>[2015-04-22 09:56:20] provider: CandidateProvider1</div><div>[2015-04-22 09:56:20] provider: CandidateProvider2</div><div>[2015-04-22 09:56:21] Got timeout while requesting CandidateProvider1</div><div>[2015-04-22 09:56:21] Got timeout while requesting CandidateProvider2</div><div>[2015-04-22 09:56:22] provider: CandidateProvider1</div><div>[2015-04-22 09:56:22] provider: CandidateProvider2</div><div>[2015-04-22 09:56:23] Got timeout while requesting CandidateProvider1</div><div>[2015-04-22 09:56:23] Got timeout while requesting CandidateProvider2</div><div>... and so on</div><div>```</div><div><br></div><div>What's also weird is that, even after the timeout is logged, the string "Got responses!" never gets logged either! So the hang happens somewhere in between.</div><div><br></div><div>I'm sorry that I don't have strace output right now. I'll have to wait until this situation happens again, but I'll get back to you later with this info.</div><div><br></div><div>So, how is it possible that almost-pure code times out? And why does it hang afterwards?</div><div><br></div><div>CPU and other resource usage is quite low, and so is the number of open file descriptors (it seems).</div><div><br></div><div>Thanks in advance for all the suggestions!</div></div>
<br></div></div><span class="">_______________________________________________<br>
Haskell-Cafe mailing list<br>
<a href="mailto:Haskell-Cafe@haskell.org" target="_blank">Haskell-Cafe@haskell.org</a><br>
<a href="http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe" target="_blank">http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe</a><br>
<br></span></blockquote></div><span class="HOEnZb"><font color="#888888"><br><br clear="all"><div><br></div>-- <br><div>Gregory Collins &lt;<a href="mailto:greg@gregorycollins.net" target="_blank">greg@gregorycollins.net</a>&gt;</div>
</font></span></div>
</blockquote></div><br></div>