More on FreeBSD/amd64

Gregory Wright gwright at comcast.net
Wed Mar 28 10:06:36 EDT 2007


Hi Ian,

I have made some more progress on understanding the build
failure on FreeBSD/amd64.  I could use a check on my understanding
of the problem, though.

The setup:  I have an unregisterized ghc-6.4.2 successfully built
on FreeBSD/amd64.  It was bootstrapped from .hc files compiled
on FreeBSD/i386.  I am attempting to use this compiler (the ghc-inplace
from the unregisterized build, not a fully installed compiler) to
build a recent ghc-6.6 branch from darcs (20070314).

The build of ghc-6.6-20070314 fails when compiling rts/Linker.c.
The failure is mostly reproducible (more about that below).

It's also worth remembering that when I tried to build an unregisterized
ghc-6.6 on FreeBSD/amd64 using .hc files from ghc-6.6 built on FreeBSD/i386,
I had a crash at the same place, while trying to build rts/Linker.c.

The failure comes from trying to allocate a huge amount of memory.
newPinnedByteArrayzh_fast is called with a giant argument, 0x4000000010.
So it looks like we're after 16 bytes, but the upper 32 bits have some
junk in them.
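
As a quick sanity check on that reading of the value, here is a minimal Python sketch (not part of the debugging session) that splits 0x4000000010 into its 32-bit halves. The helper name set_bits is made up for illustration.

```python
# Split the 64-bit request size into its 32-bit halves.
request = 0x4000000010

low32 = request & 0xFFFFFFFF    # the plausible byte count
high32 = request >> 32          # expected to be zero for a 16-byte request


def set_bits(value):
    """Return the positions of the bits that are set in value."""
    return [i for i in range(value.bit_length()) if (value >> i) & 1]


print(f"low 32 bits : {low32} bytes")
print(f"high 32 bits: {high32:#x}")
print(f"bits set    : {set_bits(request)}")
print(f"as a size   : {request / 2**30:.1f} GiB")
```

Only bits 4 and 38 are set, so the value is 16 plus a single stray high bit, which is consistent with a small request whose upper half was never cleared.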

The above was the state of things just over a week ago.  Since then,
I've worked to track down whether the bug is in the ghc-6.4.2 runtime
or in the ghc-6.6 code.  The compiler that fails is the 6.6 stage1
compiler, which is a 6.6 compiler linked with the runtime from the
unregisterized 6.4.2 (let me know if I'm wrong about that).

I have rebuilt the unregisterized 6.4.2 with optimization turned off.
I haven't got any more information this way; it seems the problem is
really on the 6.6 side.  Here is my reasoning:

I run the 6.6 compiler under the debugger:


greenhouse-george> gdb /tmp/ghc/compiler/stage1/ghc-6.6.20070314
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "amd64-marcel-freebsd"...
(gdb) dir /tmp/ghc-6.4.2/ghc/rts
Source directories searched: /tmp/ghc-6.4.2/ghc/rts:$cdir:$cwd
(gdb) b newPinnedByteArrayzh_fast
Breakpoint 1 at 0x163cbb4
(gdb) run -B/tmp/ghc -v  -optc-O -optc-Wall -optc-W -optc-Wstrict-prototypes
    -optc-Wmissing-prototypes -optc-Wmissing-declarations -optc-Winline
    -optc-Waggregate-return -optc-Wbad-function-cast -optc-I../includes
    -optc-I. -optc-Iparallel -optc-DCOMPILING_RTS -optc-fomit-frame-pointer
    -optc-I/usr/local/include -optc-fno-strict-aliasing -H16m -O -optc-O2
    -static -I/usr/local/include -I. -#include HCIncludes.h -fvia-C
    -dcmm-lint     -c Linker.c -o Linker.o
Starting program: /tmp/ghc/compiler/stage1/ghc-6.6.20070314 -B/tmp/ghc -v
    -optc-O -optc-Wall -optc-W -optc-Wstrict-prototypes
    -optc-Wmissing-prototypes -optc-Wmissing-declarations -optc-Winline
    -optc-Waggregate-return -optc-Wbad-function-cast -optc-I../includes
    -optc-I. -optc-Iparallel -optc-DCOMPILING_RTS -optc-fomit-frame-pointer
    -optc-I/usr/local/include -optc-fno-strict-aliasing -H16m -O -optc-O2
    -static -I/usr/local/include -I. -#include HCIncludes.h -fvia-C
    -dcmm-lint     -c Linker.c -o Linker.o

Breakpoint 1, 0x000000000163cbb4 in newPinnedByteArrayzh_fast ()


By playing around with it, I have isolated, more or less, when the
failure occurs.  In this run, it was after 973 calls to
newPinnedByteArrayzh_fast.  If I quit gdb and re-run it, the number of
calls is consistently the same.  However, from one system boot to the
next, it varies a bit.  Yesterday morning, I had to skip 976 calls.


(gdb) c 973
Will ignore next 972 crossings of breakpoint 1.  Continuing.
Glasgow Haskell Compiler, Version 6.6.20070314, for Haskell 98, compiled by GHC version 6.4.2
Using package config file: /tmp/ghc/driver/package.conf.inplace
wired-in package base not found.
wired-in package rts mapped to rts-1.0
wired-in package haskell98 not found.
wired-in package template-haskell not found.
Hsc static flags: -static -static
Created temporary directory: /tmp/ghc1073_0
*** C Compiler:
gcc -x c Linker.c -o /tmp/ghc1073_0/ghc1073_0.s -v -S -Wimplicit -O
    -D__GLASGOW_HASKELL__=606 -O -Wall -W -Wstrict-prototypes
    -Wmissing-prototypes -Wmissing-declarations -Winline -Waggregate-return
    -Wbad-function-cast -I../includes -I. -Iparallel -DCOMPILING_RTS
    -fomit-frame-pointer -I/usr/local/include -fno-strict-aliasing -O2
    -I /usr/local/include -I . -I /tmp/ghc/includes -fwrapv

Breakpoint 1, 0x000000000163cbb4 in newPinnedByteArrayzh_fast ()


OK, I should be at the call which is going to blow up:


(gdb) bt
#0  0x000000000163cbb4 in newPinnedByteArrayzh_fast ()
#1  0x00000000016377ea in StgRun (f=0x163cbb0 <newPinnedByteArrayzh_fast>,
     basereg=0x260fcd0) at StgCRun.c:93
#2  0x0000000001631f48 in schedule (mainThread=0x2611080,
     initialCapability=0x0) at Schedule.c:932
#3  0x0000000001633190 in waitThread_ (m=0x2611080, initialCapability=0x0)
     at Schedule.c:2156
#4  0x0000000001633085 in scheduleWaitThread (tso=0x8021c0000, ret=0x0,
     initialCapability=0x0) at Schedule.c:2050
#5  0x000000000162d60f in rts_evalLazyIO (p=0x1d409d0, ret=0x0) at RtsAPI.c:459
#6  0x000000000162ca24 in main (argc=33, argv=0x7fffffffe960) at Main.c:104
(gdb) f 1
#1  0x00000000016377ea in StgRun (f=0x163cbb0 <newPinnedByteArrayzh_fast>,
     basereg=0x260fcd0) at StgCRun.c:93
93              f = (StgFunPtr) (f)();


Since the 6.4.2 runtime is unregisterized, the parameters are passed
through the virtual registers, not the actual ones (is this correct?).  The
code for newPinnedByteArrayzh_fast says that the requested number of bytes
is passed in through R1, so take a look:


(gdb) p basereg->rR1
$1 = {w = 274877906960, a = 0x4000000010, c = 16, i8 = 16 '\020',
   f = 2.24207754e-44, i = 274877906960, p = 0x4000000010, cl = 0x4000000010,
   offset = 274877906960,
   b = 0x4000000010 <Error reading address 0x4000000010: Bad address>,
   t = 0x4000000010}
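
The different fields gdb prints for R1 are all views of the same eight bytes. The following Python sketch reproduces a few of them with the standard struct module. It assumes a little-endian layout (as on amd64) and that the c, i8 and f fields read the low-order bytes of the word; those widths are inferred from the output above, not taken from the RTS headers.

```python
import struct

# The eight bytes held in R1, laid out little-endian as on amd64.
raw = struct.pack("<Q", 0x4000000010)

as_word = struct.unpack("<Q", raw)[0]        # full 64-bit word (fields w, i, offset)
as_u32 = struct.unpack("<I", raw[:4])[0]     # low 32 bits as unsigned (field c)
as_i8 = struct.unpack("<b", raw[:1])[0]      # low byte as signed 8-bit (field i8)
as_float = struct.unpack("<f", raw[:4])[0]   # low 32 bits as IEEE single (field f)

print("w  =", as_word)
print("c  =", as_u32)
print("i8 =", as_i8)
print("f  =", as_float)
```

Every view narrower than 64 bits sees a clean 16; only the full-width views pick up the stray high bit.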


This request should certainly wedge the system by gobbling up all of the
memory.  The question now is what put the bad value into R1?  It seems
as if newPinnedByteArrayzh_fast is just getting too big a request.  I
looked at the assembly code for newPinnedByteArrayzh_fast, and the
arguments passed down to the functions it calls are calculated correctly,
but are just too big.

I also tried setting the breakpoint one call earlier (972 instead of 973),
and then stepping through until I reached the next breakpoint.
newPinnedByteArrayzh_fast calls allocatePinned successfully and control
comes back to StgCRun.c at line 87.  This is the miniinterpreter loop.
If I keep stepping, control just stays in the miniinterpreter loop until
newPinnedByteArrayzh_fast is called again.  (I used the gdb "until" command
to run the loop until it either finished or hit a breakpoint.  It hit the
breakpoint.)

This leads me to believe that the problem is in the Haskell code, not the
C code of the runtime system.

Looking at the RTS stack,


(gdb) pmem basereg->rSp 32
0x802dfe808:    0x0
0x802dfe800:    0x0
0x802dfe7f8:    0x2216f80 <stg_stop_thread_info>
0x802dfe7f0:    0x2217000 <stg_noforceIO_info>
0x802dfe7e8:    0x2149b90 <GHCziConc_childHandler_closure>
0x802dfe7e0:    0x0
0x802dfe7d8:    0x2214b60 <stg_catch_frame_info>
0x802dfe7d0:    0x2214b20 <stg_unblockAsyncExceptionszh_ret_info>
0x802dfe7c8:    0x802a121b0
0x802dfe7c0:    0x1d61f60 <s99O_info>
0x802dfe7b8:    0x802a15848
0x802dfe7b0:    0x1
0x802dfe7a8:    0x2214b60 <stg_catch_frame_info>
0x802dfe7a0:    0x2214b40 <stg_blockAsyncExceptionszh_ret_info>
0x802dfe798:    0x22179e0 <stg_ap_v_info>
0x802dfe790:    0x802a17af0
0x802dfe788:    0x802a17b08
0x802dfe780:    0x802a121b0
0x802dfe778:    0x802a15870
0x802dfe770:    0x1d61a40 <s99v_info>
0x802dfe768:    0x217aef0 <s6Z8_info>
0x802dfe760:    0x802a17b88
0x802dfe758:    0x802d04e98
0x802dfe750:    0x2125f40 <s3eo_info>
0x802dfe748:    0x802a18970
0x802dfe740:    0x802a17b08
0x802dfe738:    0x802a17b78
0x802dfe730:    0x1d61620 <r37t_closure>
0x802dfe728:    0x400000000
0x802dfe720:    0x802d04ed0
0x802dfe718:    0x400000000
0x802dfe710:    0x220db80 <s31b_info>
(gdb)



I'm guessing that the culpable module defines the symbol "s31b".  I added
-ddump-stg to the options for the 6.6 build.  (In fact, I deleted
everything, added the -ddump-stg option, rebuilt and did the run under the
debugger shown above, to make certain that everything was consistent.)
The symbol "s31b" is from the module GHC.Compat.Unicode.
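
To make that reading of the stack dump explicit, here is a small Python sketch that parses a few lines copied from the pmem output and reports the word at the lowest address. It assumes that the lowest address printed is Sp, i.e. the top of a downward-growing stack. It also flags words whose low 32 bits are all zero; whether those are related to the bad value in R1 is not established here. The regular expression and the function name parse are illustrative only.

```python
import re

# The last five lines of the pmem output (lowest address last).
DUMP = """\
0x802dfe730:    0x1d61620 <r37t_closure>
0x802dfe728:    0x400000000
0x802dfe720:    0x802d04ed0
0x802dfe718:    0x400000000
0x802dfe710:    0x220db80 <s31b_info>
"""

LINE = re.compile(r"^(0x[0-9a-f]+):\s+(0x[0-9a-f]+)(?:\s+<(\w+)>)?$")


def parse(dump):
    """Return (address, value, symbol-or-None) tuples, lowest address first."""
    rows = []
    for line in dump.splitlines():
        m = LINE.match(line.strip())
        if m:
            rows.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return sorted(rows, key=lambda row: row[0])


rows = parse(DUMP)
top_addr, top_value, top_symbol = rows[0]   # assumed to be the word at Sp
print(f"word at {top_addr:#x}: {top_value:#x} <{top_symbol}>")

# Words that carry no symbol and have only upper bits set.
suspicious = [addr for addr, value, symbol in rows
              if symbol is None and value & 0xFFFFFFFF == 0 and value >> 32]
for addr in suspicious:
    print(f"{addr:#x}: low 32 bits are zero")
```

The word at the lowest address resolves to s31b_info, which is what points at the module defining s31b.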

This may be getting somewhere.  Compat.Unicode does not exist in 6.4.2,
which may explain why the unregisterized 6.4.2 is quite reliable: I used
it to build alex and happy with no trouble.  Also, the Unicode module
makes foreign calls to C library functions.  FreeBSD has its own
implementation of libc, not the same as Linux, which may explain why
FreeBSD/amd64 chokes and Linux/amd64 works.

So, I am asking if this argument makes sense, especially my interpretation
of the RTS stack dump.  Any suggestions are appreciated.

Best Wishes,
Greg

