[GHC] #13624: loadObj() does not respect alignment

Fri Apr 28 03:22:39 UTC 2017

#13624: loadObj() does not respect alignment
-------------------------------------+-------------------------------------
           Reporter:  tmcdonell      |             Owner:  (none)
               Type:  bug            |            Status:  new
           Priority:  normal         |         Milestone:
          Component:  Runtime        |           Version:  8.0.1
  System (Linker)                    |
           Keywords:                 |  Operating System:  Unknown/Multiple
       Architecture:                 |   Type of failure:  None/Unknown
  Unknown/Multiple                   |
          Test Case:                 |        Blocked By:
           Blocking:                 |   Related Tickets:
Differential Rev(s):                 |         Wiki Page:
-------------------------------------+-------------------------------------
 This is perhaps known, but I'll write it down here in case somebody else
 runs into this problem as well.

 Since `loadObj()` just `mmap()`s the entire object file and decodes it
 ''in place'', it does not respect the alignment requirements specified in
 the section headers. This is problematic for instructions which require
 aligned memory operands, e.g. some SSE and AVX loads and stores.
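
 To make the failure condition concrete, here is a minimal sketch of the
 check a loader could perform. It assumes a section ends up at
 `base + offset`, where `base` is the address of the mapping and `offset` is
 the section's offset within the file. The function name and parameters are
 hypothetical and are not part of the RTS linker.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the GHC RTS.
 * Returns 1 if a section located at `base + offset` satisfies the
 * alignment `align` taken from its section header, 0 otherwise.
 * An alignment of 0 or 1 is treated as "no constraint". */
int section_is_aligned(uintptr_t base, size_t offset, size_t align)
{
    if (align <= 1)
        return 1;
    /* Section alignments are expected to be powers of two. */
    assert((align & (align - 1)) == 0);
    return ((base + offset) & (align - 1)) == 0;
}
```

 Even when `base` is page aligned, the result depends entirely on `offset`,
 so a section that asks for 32-byte alignment can easily land on an address
 that is only 8- or 16-byte aligned.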

 The attached `map.ll` program is `map (+1)` over an array of
 floating-point numbers. In particular, the core loop is 8-way SIMD
 vectorised and 4-way unrolled, processing 32 elements per iteration. A
 tail loop handles any remainder one element at a time.
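
 For readers who do not want to read the LLVM IR, the loop structure can be
 sketched in scalar C as follows. This is only an illustration: the name
 `map_plus_one` and its signature are assumptions and do not match the
 calling convention of the function in `map.ll`.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative scalar equivalent of the loop structure: dst[i] = src[i] + 1.
 * Hypothetical name and signature, not the one used by map.ll. */
void map_plus_one(const float *src, float *dst, size_t n)
{
    assert(n == 0 || (src != NULL && dst != NULL));

    size_t i = 0;
    /* Round n down to a multiple of 32 (8 floats per vector x 4 vectors). */
    size_t main_end = n & ~(size_t)31;

    /* Core loop: 32 elements per iteration. The compiler emits this body
     * as four 8-wide vector additions. */
    for (; i < main_end; i += 32) {
        for (size_t j = 0; j < 32; ++j)
            dst[i + j] = src[i + j] + 1.0f;
    }

    /* Tail loop: remaining elements, one at a time. */
    for (; i < n; ++i)
        dst[i] = src[i] + 1.0f;
}
```

 With fewer than 32 elements `main_end` is 0, so only the tail loop runs and
 the vector constant is never loaded.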

 You can compile it using `llc -filetype=obj -mcpu=native map.ll`. For a
 CPU with AVX instructions (Sandy Bridge or later) you should get the
 following:

 {{{
 $ objdump -d map.o
 Disassembly of section .text:

 0000000000000000 <map>:
    0:   49 89 f3                mov    %rsi,%r11
    3:   49 29 fb                sub    %rdi,%r11
    6:   0f 8e f9 00 00 00       jle    105 <map+0x105>
    c:   49 83 fb 20             cmp    $0x20,%r11
   10:   0f 82 bd 00 00 00       jb     d3 <map+0xd3>
   16:   4d 89 da                mov    %r11,%r10
   19:   49 83 e2 e0             and    $0xffffffffffffffe0,%r10
   1d:   4d 89 d9                mov    %r11,%r9
   20:   49 83 e1 e0             and    $0xffffffffffffffe0,%r9
   24:   0f 84 a9 00 00 00       je     d3 <map+0xd3>
   2a:   49 01 fa                add    %rdi,%r10
   2d:   48 8d 44 ba 60          lea    0x60(%rdx,%rdi,4),%rax
   32:   49 8d 7c b8 60          lea    0x60(%r8,%rdi,4),%rdi
   37:   c5 fc 28 05 00 00 00    vmovaps 0x0(%rip),%ymm0        # 3f <map+0x3f>
   3e:   00
   3f:   4c 89 c9                mov    %r9,%rcx
   42:   66 66 66 66 66 2e 0f    data16 data16 data16 data16 nopw %cs:0x0(%rax,%rax,1)
   49:   1f 84 00 00 00 00 00
   50:   c5 f8 10 4f a0          vmovups -0x60(%rdi),%xmm1
   55:   c5 f8 10 57 c0          vmovups -0x40(%rdi),%xmm2
   5a:   c5 f8 10 5f e0          vmovups -0x20(%rdi),%xmm3
   5f:   c5 f8 10 27             vmovups (%rdi),%xmm4
   63:   c4 e3 75 18 4f b0 01    vinsertf128 $0x1,-0x50(%rdi),%ymm1,%ymm1
   6a:   c4 e3 6d 18 57 d0 01    vinsertf128 $0x1,-0x30(%rdi),%ymm2,%ymm2
   71:   c4 e3 65 18 5f f0 01    vinsertf128 $0x1,-0x10(%rdi),%ymm3,%ymm3
   78:   c4 e3 5d 18 67 10 01    vinsertf128 $0x1,0x10(%rdi),%ymm4,%ymm4
   7f:   c5 f4 58 c8             vaddps %ymm0,%ymm1,%ymm1
   83:   c5 ec 58 d0             vaddps %ymm0,%ymm2,%ymm2
   87:   c5 e4 58 d8             vaddps %ymm0,%ymm3,%ymm3
   8b:   c5 dc 58 e0             vaddps %ymm0,%ymm4,%ymm4
   8f:   c4 e3 7d 19 48 b0 01    vextractf128 $0x1,%ymm1,-0x50(%rax)
   96:   c5 f8 11 48 a0          vmovups %xmm1,-0x60(%rax)
   9b:   c4 e3 7d 19 50 d0 01    vextractf128 $0x1,%ymm2,-0x30(%rax)
   a2:   c5 f8 11 50 c0          vmovups %xmm2,-0x40(%rax)
   a7:   c4 e3 7d 19 58 f0 01    vextractf128 $0x1,%ymm3,-0x10(%rax)
   ae:   c5 f8 11 58 e0          vmovups %xmm3,-0x20(%rax)
   b3:   c4 e3 7d 19 60 10 01    vextractf128 $0x1,%ymm4,0x10(%rax)
   ba:   c5 f8 11 20             vmovups %xmm4,(%rax)
   be:   48 83 e8 80             sub    $0xffffffffffffff80,%rax
   c2:   48 83 ef 80             sub    $0xffffffffffffff80,%rdi
   c6:   48 83 c1 e0             add    $0xffffffffffffffe0,%rcx
   ca:   75 84                   jne    50 <map+0x50>
   cc:   4d 39 cb                cmp    %r9,%r11
   cf:   75 05                   jne    d6 <map+0xd6>
   d1:   eb 32                   jmp    105 <map+0x105>
   d3:   49 89 fa                mov    %rdi,%r10
   d6:   4c 29 d6                sub    %r10,%rsi
   d9:   4a 8d 04 92             lea    (%rdx,%r10,4),%rax
   dd:   4b 8d 0c 90             lea    (%r8,%r10,4),%rcx
   e1:   c5 fa 10 05 00 00 00    vmovss 0x0(%rip),%xmm0        # e9 <map+0xe9>
   e8:   00
   e9:   0f 1f 80 00 00 00 00    nopl   0x0(%rax)
   f0:   c5 fa 58 09             vaddss (%rcx),%xmm0,%xmm1
   f4:   c5 fa 11 08             vmovss %xmm1,(%rax)
   f8:   48 83 c0 04             add    $0x4,%rax
   fc:   48 83 c1 04             add    $0x4,%rcx
  100:   48 ff ce                dec    %rsi
  103:   75 eb                   jne    f0 <map+0xf0>
  105:   c5 f8 77                vzeroupper
  108:   c3                      retq
 }}}

 The attached `test.c` will load the object file and try to execute it. The
 `#define N` on line 7 sets the size of the array. For fewer than 32
 elements this works as expected (where the input array is [0..N-1]):

 {{{
 $ ./build.sh
 + llc-4.0 -filetype=obj -mcpu=native map.ll
 + ghc --make -no-hs-main test.c

 $ ./a.out
 array size is 31
 calling function...
 ok
 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0 16.0
 17.0 18.0 19.0 20.0 21.0 22.0 23.0 24.0 25.0 26.0 27.0 28.0 29.0 30.0 31.0
 }}}

 For 32 or more elements (i.e. once the core loop is entered) the program
 will almost certainly segfault:

 {{{
 $ lldb a.out
 (lldb) target create "a.out"
 Current executable set to 'a.out' (x86_64).
 (lldb) run
 Process 7294 launched: '<snip>/a.out' (x86_64)
 array size is 32
 calling function...
 Process 7294 stopped
 * thread #1: tid = 0xc41676, 0x000000010019f207, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=EXC_I386_GPFLT)
     frame #0: 0x000000010019f207
 ->  0x10019f207: vmovaps 0xe1(%rip), %ymm0
     0x10019f20f: movq   %r9, %rcx
     0x10019f212: nopw   %cs:(%rax,%rax)
     0x10019f220: vmovups -0x60(%rdi), %xmm1
 }}}

 The `VMOVAPS` instruction requires its memory operand to be 32-byte
 aligned when used with a 256-bit (`ymm`) register. Here it is attempting
 to load 8 floats from one of the constant sections (the vector of ones for
 the `+1`), but since that section was not loaded at its required
 alignment, the load faults.
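
 One possible direction for a fix, shown purely as a sketch, is to copy each
 section that needs stricter alignment into separately allocated memory
 instead of using it in place. The helper below is hypothetical and is not
 what the RTS linker does. It also ignores everything else a real linker
 would have to handle, such as resolving relocations against the new address
 and mapping code sections as executable.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for posix_memalign */
#endif
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not part of the GHC RTS.
 * Copies `size` bytes of section data into a fresh buffer whose address is
 * a multiple of `align` (a power of two). Returns NULL on failure; the
 * caller releases the buffer with free(). */
void *copy_section_aligned(const void *src, size_t size, size_t align)
{
    void *dst = NULL;

    /* posix_memalign requires at least pointer-sized alignment. */
    if (align < sizeof(void *))
        align = sizeof(void *);

    if (posix_memalign(&dst, align, size ? size : 1) != 0)
        return NULL;

    assert(((uintptr_t)dst & (align - 1)) == 0);
    memcpy(dst, src, size);
    return dst;
}
```

 For the constant section in this ticket, `align` would be 32, so the copy
 would sit at an address that `VMOVAPS` accepts.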

 I've tested this on x86_64 macOS (Mach-O) and Ubuntu (ELF). I don't have
 any other systems to test on.

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/13624>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler