[Haskell-cafe] Project postmortem

Thu Nov 17 08:43:14 EST 2005

Folks,

I have done a lot of experiments over the past few weeks and came to  
a few interesting conclusions. First some background, then issues,  
solutions and conclusions.

I wrote a test harness for a poker server that understands the  
different binary packets and can send and receive them. The harness  
launches each "script" in a separate unbound thread that connects to  
the server via TCP and does its work.

The main goals of the project were: easy scripting, very high number  
of connections from the harness (a few thousand) and running on  
Windows. I develop on Mac OSX but have a Windows machine for testing  
and to run the poker server.

Another key goal was to support the server encryption. SSL encryption
is done in a weird way that requires attaching read/write OpenSSL
BIOs to the SSL descriptor so that SSL encrypts to/from memory.
Encrypted chunks are then taken from the BIOs and sent as payload in
server packets.

Overall, I probably spent about 4 weeks writing the server and about
2 more weeks grappling with the various issues. The issues centered
around 1) the program burning through memory like there was no
tomorrow, 2) intermittent crashes on Windows and 3) not being able to
launch a high number of connections on Windows before crashing.

I significantly reduced memory usage by switching from nested lists
of wxHaskell-style properties (attr := value) to plain Haskell
structures. Intermittent crashes were harder to troubleshoot,
especially given that things were running smoothly on Mac OSX.
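
For illustration, here is a minimal sketch of the two styles. All
names (Login, Attr, Prop, set, user, seat) are hypothetical and not
taken from the harness, and the property encoding is only modelled on
the attr := value syntax, not on wxHaskell's actual implementation.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

import Data.Word (Word8)

-- Plain structure: one record, fields stored directly.
data Login = Login
  { loginUser :: String
  , loginSeat :: Word8
  } deriving (Eq, Show)

-- Property style: an attribute knows how to write itself into w,
-- and a property pairs an attribute with a value of the right type.
data Attr w a = Attr (a -> w -> w)
data Prop w = forall a. Attr w a := a
infixr 0 :=

-- Apply a list of properties to a value, left to right.
set :: w -> [Prop w] -> w
set = foldl (\w (Attr f := v) -> f v w)

user :: Attr Login String
user = Attr (\v l -> l { loginUser = v })

seat :: Attr Login Word8
seat = Attr (\v l -> l { loginSeat = v })

-- The same value built both ways.
viaProps :: Login
viaProps = set (Login "" 0) [user := "joel", seat := 3]

direct :: Login
direct = Login { loginUser = "joel", loginSeat = 3 }
```

The property version builds an intermediate list and one extra
constructor per field before the record exists, which is at least a
plausible source of extra allocation when packets are nested.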

Stack traces pointed into libcrypto (part of OpenSSL) and thus to the
BIOs that I was allocating. I guessed that OpenSSL was maxing out
some resources and closed the leak by explicitly freeing the SSL
descriptor which freed the associated BIO structures. Then things got
weirder as my program started crashing in a different place entirely
with stack traces like this:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x3139322e
0x0027c174 in s8j1_info ()
(gdb) where
#0  0x0027c174 in s8j1_info ()
#1  0x0021c9f4 in StgRunIsImplementedInAssembler () at StgCRun.c:576
#2  0x0021cdc4 in schedule (mainThread=0x1100360,  
initialCapability=0x308548) at Schedule.c:932
#3  0x0021dd6c in waitThread_ (m=0x1100360, initialCapability=0x0) at  
Schedule.c:2156
#4  0x0021dc50 in scheduleWaitThread (tso=0x13c0000, ret=0x0,  
initialCapability=0x0) at Schedule.c:2050
#5  0x00219548 in rts_evalLazyIO (p=0x29b47c, ret=0x0) at RtsAPI.c:459
#6  0x001e4768 in main (argc=2262116, argv=0x308548) at Main.c:104

I took waitThread_ as a clue and started digging deeper.

Whenever I connect to the server or send a command I wait for X
seconds, and if the connection is not established or the desired
command is not received I throw an exception which fails the script.
I implemented the timeout combinator a couple of different ways,
including the one from the Asynchronous Exceptions paper, but it did
not help. I think the issue has to do with killing threads that are
using FFI. Although I'm killing threads that call the Haskell
connectTo, hGetBuf, etc., I think it's still FFI.
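
A minimal sketch of a thread-killing timeout combinator of this kind
follows. It is not the code from the paper or from the harness; the
name timeoutAfter and the MVar-based race are assumptions made for
illustration.

```haskell
import Control.Concurrent (forkIO, killThread, threadDelay)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, throwIO, try)

-- Hypothetical combinator: run the action in its own thread and race
-- it against a timer. Whichever finishes first fills the MVar.
timeoutAfter :: Int -> IO a -> IO (Maybe a)
timeoutAfter micros action = do
  result <- newEmptyMVar
  worker <- forkIO $ try action >>= putMVar result . Just
  timer  <- forkIO $ threadDelay micros >> putMVar result Nothing
  outcome <- takeMVar result
  -- killThread delivers an asynchronous exception. It can only be
  -- received once the target is back in Haskell code, so a worker
  -- stuck inside a foreign call is the awkward case.
  killThread worker
  killThread timer
  case outcome of
    Nothing        -> return Nothing               -- timer won
    Just (Left e)  -> throwIO (e :: SomeException) -- action failed
    Just (Right v) -> return (Just v)              -- action won
```

A script step would then look like timeoutAfter 5000000 someStep,
failing the script when the result is Nothing.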

I disposed of timeouts entirely, leaving connectTo as it is and using
hWaitForInput on my socket handle to simulate timeouts. This improved
things tremendously and I'm now able to run a few thousand
unbound script threads on Windows with OpenSSL FFI and everything.
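
Roughly, the hWaitForInput approach can be sketched as below. The
helper name readWithTimeout and its return type are assumptions for
illustration, not the harness code.

```haskell
import Data.Word (Word8)
import Foreign.Marshal.Alloc (allocaBytes)
import Foreign.Marshal.Array (peekArray)
import System.IO (Handle, hGetBuf, hWaitForInput)

-- Hypothetical helper: wait up to ms milliseconds for input on the
-- handle, then read up to len bytes. No thread is ever killed.
-- Nothing means the wait timed out. hWaitForInput itself throws an
-- EOF error if the other side has closed the connection.
readWithTimeout :: Handle -> Int -> Int -> IO (Maybe [Word8])
readWithTimeout h ms len = do
  ready <- hWaitForInput h ms
  if not ready
    then return Nothing
    else allocaBytes len $ \buf -> do
      -- Caveat: hGetBuf keeps reading until len bytes have arrived
      -- or EOF is hit, so len should match the expected packet size.
      n <- hGetBuf h buf len
      Just <$> peekArray n buf
```

The timeout here only covers the wait for the first byte, which is
the price of not interrupting a thread in the middle of a read.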

Memory usage is still higher than I would have liked and crashes in  
OpenSSL still happen when the number of threads/memory usage is  
really high so there's still room for improvement. I should probably  
go back to using a foreign finalizer (SSL_free) on the SSL  
descriptors rather than freeing them explicitly as the freeing does  
not happen if a script fails mid-way.
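
A minimal sketch of the finalizer approach, under stated assumptions:
to stay runnable without linking OpenSSL, it uses malloc'd memory and
finalizerFree as a stand-in for an SSL descriptor and SSL_free. The
names Descriptor, newDescriptor, withDescriptor and demo are
hypothetical.

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, withForeignPtr)
import Foreign.Marshal.Alloc (finalizerFree, mallocBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek, poke)

-- With OpenSSL linked in, the finalizer would be imported along
-- these lines (SSL being an opaque tag type declared as: data SSL):
--   foreign import ccall "&SSL_free"
--     p_SSL_free :: FunPtr (Ptr SSL -> IO ())

-- Stand-in for an SSL descriptor: a foreign pointer with a finalizer.
newtype Descriptor = Descriptor (ForeignPtr Word8)

-- Allocate the resource and attach the finalizer right away, so it
-- is released by the garbage collector even if the script that owns
-- it dies mid-way with an exception.
newDescriptor :: Int -> IO Descriptor
newDescriptor size = do
  p <- mallocBytes size
  Descriptor <$> newForeignPtr finalizerFree p

-- All access goes through withForeignPtr, which keeps the descriptor
-- alive for the duration of the callback.
withDescriptor :: Descriptor -> (Ptr Word8 -> IO a) -> IO a
withDescriptor (Descriptor fp) = withForeignPtr fp

demo :: IO Word8
demo = do
  d <- newDescriptor 16
  withDescriptor d $ \p -> poke p 42 >> peek p
```

The trade-off is that finalizers run whenever the garbage collector
gets to them, so descriptors may stay allocated longer than with an
explicit free.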

I'm quite satisfied with my first Haskell project. I love Haskell and  
will continue hacking away with it. This list is invaluable in the  
depth of offered help whereas #haskell (IRC) is invaluable when speed  
matters. I'm quite amazed at the things I have been able to do, the  
expressiveness of Haskell and the clean looks.

Clean looks can be deceptive, though, as they can hide code of  
amazing complexity. Fundeps, existential types, HList take a while to  
grasp. Also, I feel somewhat like a pioneer and I definitely got more  
than a fair share of arrows in my back.

I had GHC run out of memory during compilation (fixed by SPJ), had it  
quit midway during compilation with an error about generated extents  
being too large in assembler code. I had GHC crash at runtime with an  
error like "fromJust not returning Just, this could not be  
happening!". Yesterday's error topped them all:

internal error: update_fwd: unknown/strange object  0
    Please report this as a bug to glasgow-haskell-bugs at haskell.org,
    or http://www.sourceforge.net/projects/ghc/

I think I got this when using +RTS -C0 -c.

Overall, the experience with Haskell has been exhilarating and I'm  
already preparing to use it on my next projects like detecting  
collusion in poker as well as rake optimization (Dazzle paper very  
helpful here!). Still, I think that GHC can be a bit rough around the  
edges and I would think twice about writing high-performance network  
apps with it.

	Thanks, Joel

P.S. The Glasgow Distributed Haskell (GdH) people are supposed to
have a mailing list and I would love to share my findings with them
but I could not find the mailing list itself.

--
http://wagerlabs.com/