[Haskell-cafe] Re: Mining Twitter data in Haskell and Clojure

Simon Marlow marlowsd at gmail.com
Wed Jun 16 05:47:41 EDT 2010


On 15/06/2010 20:43, braver wrote:
> On Jun 15, 6:27 am, Simon Marlow<marlo... at gmail.com>  wrote:
>> On 15/06/2010 06:09, braver wrote:
>>
>>> In fact, the tag cafe2, when run on the full dataset, gets stuck at 11
>>> days, with RAM slowly getting into 50 GB; a previous version caused
>>> ghc 6.12.1 to segfault around day 12 -- -debug showing an assert
>>> failure in Storage.c.  ghc 6.10 got stuck at 30 days for good, and
>>> when profiling crashed twice with  a "strange closure" or a stack
>>> overflow.  So allocation is a problem still.
>>
>> I'd be happy to help you track this down, but I don't have a machine big
>> enough.  Do you have any runs that display a problem with a smaller heap
>> (<  16GB)?
>>
>> If the program is apparently hung, try connecting to it with 'gdb
>> --pid=<pid>' and doing 'info thread' and 'where'.  That might give me
>> enough clues to find out where the problem is.
>>
>> Is this with -threaded, BTW?  With residency on that scale, I'd expect
>> the parallel GC to help quite a lot.  But obviously getting it to not
>> crash/hang is the first priority :)
>
> Simon - thanks for the tips, this is what gdb says when it's stuck at
> 45 GB when limited with -A5G -M40G:
>
> ...
> 0x00000000004c3c21 in free_mega_group ()
> (gdb) info thread
> * 1 Thread 0x2b21c1da4dc0 (LWP 10210)  0x00000000004c3c21 in
> free_mega_group ()
> (gdb) where
> #0  0x00000000004c3c21 in free_mega_group ()
> #1  0x00000000004c3ff9 in freeChain ()
> #2  0x00000000004c5ab0 in GarbageCollect ()
> #3  0x00000000004bff96 in scheduleDoGC ()
> #4  0x00000000004c0b25 in scheduleWaitThread ()
> #5  0x00000000004bea09 in real_main ()
> #6  0x00000000004beb17 in hs_main ()
> #7  0x00000037d5a1d974 in __libc_start_main () from /lib64/libc.so.6
> #8  0x0000000000402ca9 in _start ()

Thanks.  I don't see anything obviously wrong in free_mega_group().  It's
the part of the memory manager that returns a multi-MB block to the
internal free list: it walks the free list to find the right place to
insert the block, coalescing with adjacent free blocks if possible.  If
it is looping here, that means the free list has a cycle, which is very
bad indeed.
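
To illustrate why a cycle would show up as a hang at exactly this point,
here is a minimal sketch of an address-ordered free list with coalescing.
It is not the RTS code: the Block type and the names free_list_insert and
free_list_has_cycle are hypothetical, and the real block descriptors are
more involved.  The point is only that the search loop relies on reaching
either a later block or the end of the list, so a cyclic list can keep it
spinning forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical free-list node: a run of `count` contiguous blocks
   beginning at block index `start`.  Not the RTS's real descriptor. */
typedef struct Block {
    size_t start;
    size_t count;
    struct Block *next;
} Block;

/* Insert `b` into the address-ordered list `*head`, merging it with a
   neighbour whenever the two runs touch.  If the list is cyclic and every
   node on the cycle starts below `b`, the search loop never terminates. */
void free_list_insert(Block **head, Block *b)
{
    assert(head != NULL && b != NULL);

    Block *prev = NULL;
    Block *cur = *head;

    /* Walk to the first node that starts at or after `b`. */
    while (cur != NULL && cur->start < b->start) {
        prev = cur;
        cur = cur->next;
    }

    /* Link `b` between `prev` and `cur`. */
    b->next = cur;
    if (prev == NULL) {
        *head = b;
    } else {
        prev->next = b;
    }

    /* Coalesce with the following run if the two are adjacent. */
    if (cur != NULL && b->start + b->count == cur->start) {
        b->count += cur->count;
        b->next = cur->next;
    }

    /* Coalesce with the preceding run if the two are adjacent. */
    if (prev != NULL && prev->start + prev->count == b->start) {
        prev->count += b->count;
        prev->next = b->next;
    }
}

/* Floyd's tortoise-and-hare check: true if following `next` pointers
   from `head` never reaches NULL. */
bool free_list_has_cycle(const Block *head)
{
    const Block *slow = head;
    const Block *fast = head;

    while (fast != NULL && fast->next != NULL) {
        slow = slow->next;
        fast = fast->next->next;
        if (slow == fast) {
            return true;
        }
    }
    return false;
}
```

A check along the lines of free_list_has_cycle is the sort of invariant
one would want to assert when hunting for this kind of corruption.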

Could you try a few more things for me?

  - type 'display /i $pc' and then single step with 'si' for a while
    when it is in this state.  That will tell us whether it's looping
    here or not.

  - compile with -debug and run again.  That turns on a bunch of
    assertions.  You could also try adding +RTS -DS, which turns on
    more sanity checking (and will slow things down a lot).

If you are comfortable giving me a login on your machine, I could debug
it directly.  Let me know.

Cheers,
	Simon

