[Haskell-cafe] Project postmortem
joelr1 at gmail.com
Thu Nov 17 08:43:14 EST 2005
I have done a lot of experiments over the past few weeks and came to
a few interesting conclusions. First some background, then issues,
solutions and conclusions.
I wrote a test harness for a poker server that understands the
different binary packets and can send and receive them. The harness
launches each "script" in a separate unbound thread that connects to
the server via TCP and does its work.
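A minimal sketch of the thread-per-script layout described above, assuming each script can be modelled as a plain IO action. The names `Script` and `runScripts` are hypothetical, and the TCP connection itself is left out so the example only depends on base:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, try)
import Control.Monad (forM)

-- A "script" is just an IO action here; a real one would open a TCP
-- connection to the server and exchange packets (not shown).
type Script = IO ()

-- Run every script in its own unbound (forkIO) thread and wait for all
-- of them. Each result is Left on an exception and Right () on success.
runScripts :: [Script] -> IO [Either SomeException ()]
runScripts scripts = do
  dones <- forM scripts $ \script -> do
    done <- newEmptyMVar
    _ <- forkIO (try script >>= putMVar done)
    return done
  mapM takeMVar dones

main :: IO ()
main = do
  results <- runScripts
    [return (), ioError (userError "script failed"), return ()]
  print (map (either (const "failed") (const "ok")) results)
```

Catching the exception inside each thread keeps one failing script from affecting the others, and the MVar per thread lets the harness wait for everything to finish.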
The main goals of the project were: easy scripting, very high number
of connections from the harness (a few thousand) and running on
Windows. I develop on Mac OSX but have a Windows machine for testing
and to run the poker server.
Another key goal was to support the server encryption. SSL encryption
is done in a weird way that requires attaching read/write OpenSSL
BIOs to the SSL descriptor so that SSL encrypts to/from memory.
Encrypted chunks are then taken from the BIOs and sent as payload in
Overall, I probably spent about 4 weeks writing the server and about
2 more weeks grappling with the various issues. The issues centered
around 1) the program trashing memory like no tomorrow, 2)
intermittent crashes on Windows and 3) not being able to launch a
high number of connections on Windows before crashing.
I significantly improved trashing of memory by switching to plain
Haskell structures from nested lists of wxHaskell-style properties
(attr := value). Intermittent crashes were harder to troubleshoot,
especially given that things were running smoothly on Mac OSX.
Stack traces pointed into libcrypto (part of OpenSSL) and thus to the
BIOs that I was allocating. I guessed that OpenSSL was maxing out
some resources and closed the leak by explicitly freeing the SSL
descriptor which freed the associated BIO structures. Then things got
weirder as my program started crashing in a different place entirely
with stack traces like this:
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x3139322e
0x0027c174 in s8j1_info ()
#0 0x0027c174 in s8j1_info ()
#1 0x0021c9f4 in StgRunIsImplementedInAssembler () at StgCRun.c:576
#2 0x0021cdc4 in schedule (mainThread=0x1100360,
initialCapability=0x308548) at Schedule.c:932
#3 0x0021dd6c in waitThread_ (m=0x1100360, initialCapability=0x0) at
#4 0x0021dc50 in scheduleWaitThread (tso=0x13c0000, ret=0x0,
initialCapability=0x0) at Schedule.c:2050
#5 0x00219548 in rts_evalLazyIO (p=0x29b47c, ret=0x0) at RtsAPI.c:459
#6 0x001e4768 in main (argc=2262116, argv=0x308548) at Main.c:104
I took waitThread_ as a clue and started digging deeper.
Whenever I connect to the server or send a command I wait for X
seconds, and if the connection is not established or the desired
command is not received I throw an exception which fails the script.
I implemented the timeout
combinator a couple of different ways, including that in the
Asynchronous Exceptions paper but it did not help. I think the issue
has to do with killing threads that are using FFI. Although I'm
killing threads that call the Haskell connectTo, hGetBuf, etc. I
think it's still FFI.
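For reference, a timeout combinator of this kind might look like the sketch below. It uses `System.Timeout.timeout` from current versions of base, which is a stand-in and not necessarily what was available or used at the time; `withTimeout` is a hypothetical name. The combinator works by throwing an asynchronous exception at the running thread, and such an exception is not delivered while the thread is inside a foreign call, which may be relevant to the suspicion above:

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (throwIO)
import System.Timeout (timeout)

-- Run an action with a limit in seconds and throw if it does not
-- finish in time. `withTimeout` is a hypothetical helper name.
withTimeout :: Int -> String -> IO a -> IO a
withTimeout secs what action = do
  r <- timeout (secs * 1000000) action   -- timeout takes microseconds
  case r of
    Just x  -> return x
    Nothing -> throwIO (userError (what ++ " timed out"))

main :: IO ()
main = do
  -- Finishes well inside the limit, so the value is returned.
  x <- withTimeout 1 "fast action" (return (42 :: Int))
  print x
  -- Sleeps for 2 s under a 0.1 s limit, so timeout yields Nothing.
  r <- timeout 100000 (threadDelay 2000000)
  print r
```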
I disposed of timeouts entirely, leaving connectTo as it is and using
hWaitForInput on my socket handle to simulate timeouts. This improved
things tremendously and I'm now able to run a few thousand
unbound script threads on Windows with OpenSSL FFI and everything.
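The hWaitForInput approach can be sketched roughly as follows. This is an illustration, not the harness code: `getLineWithin` is a hypothetical helper, and a temporary file stands in for the socket handle so the example runs without a server:

```haskell
import Control.Exception (throwIO)
import System.Directory (getTemporaryDirectory, removeFile)
import System.IO

-- Read one line, failing if no input arrives within the given number
-- of milliseconds. `getLineWithin` is a hypothetical helper name.
getLineWithin :: Int -> Handle -> IO String
getLineWithin ms h = do
  ready <- hWaitForInput h ms
  if ready
    then hGetLine h
    else throwIO (userError "timed out waiting for input")

main :: IO ()
main = do
  -- A temporary file plays the role of the socket handle here.
  tmp <- getTemporaryDirectory
  (path, hw) <- openTempFile tmp "packet.txt"
  hPutStrLn hw "hello"
  hClose hw
  hr <- openFile path ReadMode
  line <- getLineWithin 1000 hr
  putStrLn line
  hClose hr
  removeFile path
```

The thread doing the read decides for itself when to give up, so no other thread has to kill it.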
Memory usage is still higher than I would have liked and crashes in
OpenSSL still happen when the number of threads/memory usage is
really high so there's still room for improvement. I should probably
go back to using a foreign finalizer (SSL_free) on the SSL
descriptors rather than freeing them explicitly as the freeing does
not happen if a script fails mid-way.
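The foreign-finalizer pattern looks roughly like the sketch below. To keep it runnable without linking OpenSSL it attaches the C `free` function (via `finalizerFree`) to a malloc'd buffer; the SSL_free import shown in the comment is an assumption about how the same pattern would be applied to SSL descriptors, and `newBuffer` is a hypothetical name:

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, withForeignPtr)
import Foreign.Marshal.Alloc (finalizerFree, mallocBytes)
import Foreign.Storable (peek, poke)

-- For OpenSSL one would import the finalizer instead, for example
-- (assumption, not taken from the harness):
--   data SSL
--   foreign import ccall "&SSL_free"
--     sslFree :: FunPtr (Ptr SSL -> IO ())
-- and pass sslFree to newForeignPtr in place of finalizerFree.

-- Allocate n bytes with malloc and attach free() as the finalizer.
newBuffer :: Int -> IO (ForeignPtr Word8)
newBuffer n = mallocBytes n >>= newForeignPtr finalizerFree

main :: IO ()
main = do
  buf <- newBuffer 16
  v <- withForeignPtr buf $ \p -> poke p 7 >> peek p
  print v
```

The finalizer runs at some point after the ForeignPtr becomes unreachable, including when a script dies mid-way, but not at a predictable time, so this trades prompt release for not leaking on failure.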
I'm quite satisfied with my first Haskell project. I love Haskell and
will continue hacking away with it. This list is invaluable in the
depth of offered help whereas #haskell (IRC) is invaluable when speed
matters. I'm quite amazed at the things I have been able to do, the
expressiveness of Haskell and the clean looks.
Clean looks can be deceptive, though, as they can hide code of
amazing complexity. Fundeps, existential types, HList take a while to
grasp. Also, I feel somewhat like a pioneer and I definitely got more
than a fair share of arrows in my back.
I had GHC run out of memory during compilation (fixed by SPJ), had it
quit midway during compilation with an error about generated extents
being too large in assembler code. I had GHC crash at runtime with an
error like "fromJust not returning Just, this could not be
happening!". Yesterday's error topped them all:
internal error: update_fwd: unknown/strange object 0
Please report this as a bug to glasgow-haskell-bugs at haskell.org,
I think I got this when using +RTS -C0 -c.
Overall, the experience with Haskell has been exhilarating and I'm
already preparing to use it on my next projects like detecting
collusion in poker as well as rake optimization (Dazzle paper very
helpful here!). Still, I think that GHC can be a bit rough around the
edges and I would think twice about writing high-performance network
apps with it.
P.S. The Glasgow Distributed Haskell (GdH) people are supposed to
have a mailing list and I would love to share my findings with them
but I could not find the mailing list itself.