GHC 6.4.3 is stalled

Gregory Wright
Fri Jul 28 05:32:50 EDT 2006

On Jul 28, 2006, at 3:58 AM, Simon Marlow wrote:

> Hi Greg,
> Gregory Wright wrote:
>> Some data and a few questions:
>> 1. The failure on FreeBSD is not the same as on OS X.  I built 6.4.2
>> from cvs on FreeBSD 6.1, and ran the ghc-regress tests. The tests
>> took a long time to run (about 14 hours on a dual Xeon 2.8 GHz
>> with 2 GB of memory). Towards the end of the tests, there were
>> about 30 "timeout" processes running, apparently doing nothing
>> but consuming cpu cycles.
> Ok, this is certainly a problem with forkOS in the threaded RTS in  
> 6.4.2 on FreeBSD.  I probably need to get access to a FreeBSD box  
> to fix this myself, the code is pretty delicate (and sadly it has  
> completely changed in 6.6, too).
> It might be worth trying with -lthr instead of -lpthread, according  
> to Robert Watson.  This switches to an alternative, 1:1, threading  
> library.

I can try this.  If you need access to a FreeBSD 6.1 box (dual 2.8 GHz
Xeon, 2 GB RAM), I can set up ssh access for you.  Let me know.

>> 2. Notes on reproducing the FreeBSD 6.4.2 build:  I used
>>     fpconfig from the ghc-6-4 branch;
>>     ghc, libraries, hslibs and testsuite from the ghc-6-4-2 branch;
>>     gnu make 3.80;
>>     autoconf 2.59.
>> Gnu make 3.81 went into an infinite loop, much as gnu make 3.79
>> did when building ghc on OS X.
> That's odd, the fix for make 3.79 is in the 6.4.2 tree (rev.  
> of mk/  Something else must be happening with  
> 3.81, sigh.

Yes, seems to be one of those things.  I'm not going to look at it,
since using 3.80 seems to work well enough at the moment.

>> 3. Did the threaded RTS work on 6.4.1?  Was it used by default?
> Presumably not.  In 6.4.2 we switched to using the threaded RTS by  
> default for GHC itself, which has forced the problem to the  
> surface.  Also there were some changes to the timeout program in  
> the testsuite, which have apparently forced some other problems to  
> the surface.
>> I can provide an RTS thread listing (+RTS -Ds) if that would be a
>> starting point.  Someone would have to explain what it means to me,
>> though.
>> 4. When running with debugging turned on, I have seen the assertion
>> failure
>> ghc-6.4.2: internal error: ASSERTION FAILED: file GC.c, line 4356
>>     Please report this as a compiler bug.  See:
>> This points toward the stack being corrupted.  Maybe a thread
>> overflowing its stack?  I'm not sure.  The assertion that fails is
>>     ASSERT(frame < bottom);
>> It looks as if something has messed up the stack before this.
> Ok, it would help to find a smaller program that crashes with
> -threaded: debugging GHC itself is quite hard because it's difficult
> to get a deterministic run and hence reproducibility.  Look at
> your testsuite failures and find threaded failures that aren't due  
> to the compiler crashing (or just build stage2 without -threaded  
> and run the testsuite again).  Tests in concurrent/ are a good bet.
> When we have a smallish program that crashes, we can start debugging.

I will do a build and look at the failing tests to isolate a simple
test case.

Here's another data point: Joel Reymont said that his OS X/Intel builds
do not crash during the testsuite (nothing in the CrashReporter logs).
But he mentioned that he saw the accumulation of "timeout" processes.
Earlier this week, I acquired a MacBook and have just finished loading
ghc onto it.  I will try to reproduce his result.

That information, if true, is a bit discouraging.  It seems to say that
the problem on Intel may be different from that on ppc.  In particular,
the compiler crashes may only be happening on ppc.  Yuck.  I will verify
whether this is so.

Best Wishes,

>> I am willing to dig into this, but I need a bit more help with
>> where to start.
> Thanks for your help!
> Cheers,
> 	Simon
