SIMD/SSE support & alignment

Nicolas Trangez nicolas at incubaid.com
Sun Mar 10 22:52:36 CET 2013


All,

I've been toying with the SSE code generation in GHC 7.7 and Geoffrey
Mainland's work to integrate this into the 'vector' library in order to
generate SIMD code from high-level Haskell code.

While working with this, I wrote some simple code for testing purposes,
then compiled it to LLVM IR and x86_64 assembly in order to figure out
how 'good' the resulting code would be.

First and foremost: I'm really impressed. Whilst there's most certainly
room for improvement (one item is touched upon in this mail; I also
noticed unnecessary constant memory reads inside a tight loop), the
initial results look very promising, especially taking into account how
high-level the source code is. This is pretty amazing!

As an example, here's 'test.hs':

{-# OPTIONS_GHC -fllvm -O3 -optlo-O3 -optlc-O=3 -funbox-strict-fields #-}
module Test (sum) where

import Prelude hiding (sum)
import Data.Int (Int32)
import Data.Vector.Unboxed (Vector)
import qualified Data.Vector.Unboxed as U

sum :: Vector Int32 -> Int32
sum v = U.mfold' (+) (+) 0 v
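
Here 'mfold'' takes a vector-level operator (used to combine SIMD-sized
chunks), a scalar operator (used for the leftover elements and for the
final horizontal reduction, as the generated code below shows) and an
initial value. Semantically the above is just a strict sum; for
comparison, a plain scalar equivalent ('sumRef' is an ad-hoc name, not
part of the library):

sumRef :: Vector Int32 -> Int32
sumRef = U.foldl' (+) 0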

When compiling this into assembly (compiler/library version details at
the end of this message), the 'sum' function yields (among other things)
this code:

.LBB2_3:                                # %c1C0
                                        # =>This Inner Loop Header: Depth=1
	prefetcht0	(%rsi)
	movdqu	-1536(%rsi), %xmm1
	paddd	%xmm1, %xmm0
	addq	$16, %rsi
	addq	$4, %rcx
	cmpq	%rdx, %rcx
	jl	.LBB2_3

The full LLVM IR and assembler output are attached to this message.

Whilst this is a nice and tight loop, I noticed the use of 'movdqu',
the unaligned 128-bit SSE load. For 16-byte-aligned memory, 'movdqa'
can be used instead, and this can have a major performance impact. The
cause is visible in the attached IR: the vector loads are annotated
'align 1', so LLVM has to assume the worst and emit unaligned accesses.

Whilst I understand why this code is currently generated as-is (also
for other sample inputs), I wondered whether there are plans or
approaches to tackle this. In some cases (e.g. in 'sum') one could
handle the elements before the first aligned boundary with scalar code,
use aligned vector loads for the bulk, and process the tail with
scalars again. OTOH I assume that's not trivial when multiple 'source'
vectors are used in the calculation, since they need not share the same
misalignment.
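
To make the prologue bookkeeping concrete, here's a minimal sketch of
the computation for Int32 elements; 'peelCount' is a hypothetical
helper, not something GHC or 'vector' provides today:

import Data.Bits ((.&.))
import Foreign.Ptr (Ptr, ptrToIntPtr)

-- Number of scalar Int32 elements to process before the remaining
-- data sits on a 16-byte boundary. Assumes the payload is at least
-- 4-byte aligned, so a boundary is reachable in whole elements.
peelCount :: Ptr a -> Int
peelCount p =
  let addr     = fromIntegral (ptrToIntPtr p) :: Int
      misalign = addr .&. 15           -- bytes past the last 16-byte boundary
  in ((16 - misalign) .&. 15) `div` 4  -- elements up to the next boundary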

This might become even more complex when using AVX code, which calls
for 256-bit (32-byte) alignment.

Whilst I can't propose an out-of-the-box solution, I'd like to point at
the 'vector-simd' code [1] I wrote some months ago, which might offer
some ideas. In this package, I created an unboxed vector-like type
whose alignment is tracked at the type level, and functions which
consume a vector declare the minimal alignment they require. As such,
every vector can be allocated at the smallest alignment that satisfies
all the code using it.
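
To give an idea of the mechanism, here's a much-reduced sketch of the
type-level part; it mirrors the names used in vector-simd, but
simplifies the details and shouldn't be read as the package's exact
code:

{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances, EmptyDataDecls #-}

-- Alignment tags, in increasing order.
data A8
data A16
data A32

-- 'AlignedToAtLeast n a' witnesses that alignment tag 'a' guarantees
-- at least 'n'-byte alignment.
class AlignedToAtLeast n a
instance AlignedToAtLeast A8  A8
instance AlignedToAtLeast A8  A16
instance AlignedToAtLeast A8  A32
instance AlignedToAtLeast A16 A16
instance AlignedToAtLeast A16 A32
instance AlignedToAtLeast A32 A32

-- The alignment tag 'o' travels with the vector in its type.
data Vector o a = Vector   -- payload omitted in this sketch

Since A32 satisfies both 'AlignedToAtLeast A16' and 'AlignedToAtLeast
A32', a vector consumed by both an SSE and an AVX function is simply
inferred to need 32-byte alignment.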

As an example, if I'd use this code (OTOH):

sseFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A16 o2)
       => Vector o1 a -> Vector o2 a
sseFoo = undefined

avxFoo :: (Storable a, AlignedToAtLeast A32 o1, AlignedToAtLeast A32 o2,
           AlignedToAtLeast A32 o3)
       => Vector o1 a -> Vector o2 a -> Vector o3 a
avxFoo = undefined

the type of

combinedFoo v = avxFoo sv sv
  where
    sv = sseFoo v

would automagically be

combinedFoo :: (Storable a, AlignedToAtLeast A16 o1, AlignedToAtLeast A32 o2)
            => Vector o1 a -> Vector o2 a

and when using this

v1 = combinedFoo (Vector.fromList [1 :: Int32, 2, 3, 4, 5, 6, 7, 8])

the allocated argument vector (the result of Vector.fromList) will be
16-byte aligned, as expected/required so the SSE function can use
aligned loads internally (assuming no unaligned slices are supported,
etc), whilst the intermediate result of 'sseFoo' ('sv') will be 32-byte
aligned, as required by 'avxFoo'.
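
Under the hood this boils down to allocating the backing memory at the
alignment the type demands. Here's a sketch of how that can be done on
top of GHC's pinned, aligned allocator ('newAlignedBuffer' is
illustrative, not vector-simd's actual API):

import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, withForeignPtr)
import Foreign.Ptr (ptrToIntPtr)
import GHC.ForeignPtr (mallocPlainForeignPtrAlignedBytes)

-- Allocate 'bytes' bytes of pinned memory aligned to 'align' bytes,
-- checking the invariant the type-level tag is supposed to encode.
newAlignedBuffer :: Int -> Int -> IO (ForeignPtr Word8)
newAlignedBuffer bytes align = do
  fp <- mallocPlainForeignPtrAlignedBytes bytes align
  withForeignPtr fp $ \p ->
    if fromIntegral (ptrToIntPtr p) `mod` align /= (0 :: Int)
      then error "allocator returned misaligned memory"
      else return fp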

Attached: test.ll and test.s, compilation results of test.hs using

$ ghc-7.7.20130302 -keep-llvm-files \
    -package-db=cabal-dev/packages-7.7.20130302.conf \
    -fforce-recomp -S test.hs

GHC from HEAD/master, compiled on my Fedora 18 system using the system
LLVM (3.1), with 'primitive' at commit
8aef578fa5e7fb9fac3eac17336b722cbae2f921 from
git://github.com/mainland/primitive.git and 'vector' at commit
e1a6c403bcca07b4c8121753daf120d30dedb1b0 from
git://github.com/mainland/vector.git.

Nicolas

[1] https://github.com/NicolasT/vector-simd
-------------- next part (test.ll) --------------
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-linux-gnu"

declare  ccc i8* @memcpy(i8*, i8*, i64)

declare  ccc i8* @memmove(i8*, i8*, i64)

declare  ccc i8* @memset(i8*, i64, i64)

declare  ccc i64 @newSpark(i8*, i8*)

!0 = metadata !{metadata !"top"}
!1 = metadata !{metadata !"stack",metadata !0}
!2 = metadata !{metadata !"heap",metadata !0}
!3 = metadata !{metadata !"rx",metadata !2}
!4 = metadata !{metadata !"base",metadata !0}
!5 = metadata !{metadata !"other",metadata !0}

%__stginit_Test_struct = type <{}>
@__stginit_Test =  global %__stginit_Test_struct<{}>

%Test_zdwa_closure_struct = type <{i64}>
@Test_zdwa_closure =  global %Test_zdwa_closure_struct<{i64 ptrtoint (void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @Test_zdwa_info to i64)}>

%Test_sum1_closure_struct = type <{i64}>
@Test_sum1_closure =  global %Test_sum1_closure_struct<{i64 ptrtoint (void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @Test_sum1_info to i64)}>

%Test_sum_closure_struct = type <{i64}>
@Test_sum_closure =  global %Test_sum_closure_struct<{i64 ptrtoint (void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @Test_sum_info to i64)}>

%S1DM_srt_struct = type <{}>
@S1DM_srt = internal constant %S1DM_srt_struct<{}>

%s1xB_entry_struct = type <{i64, i64, i64}>
@s1xB_info_itable = internal constant %s1xB_entry_struct<{i64 8589934602, i64 8589934593, i64 9}>, section "X98A__STRIP,__me1", align 8

define internal cc 10 void @s1xB_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me2"
{
c1AJ:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 %R2_Arg, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 %R3_Arg, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ls1xr = alloca i64, i32 1
  %ls1xy = alloca i64, i32 1
  %ls1xB = alloca i64, i32 1
  %ln1EB = load i64* %R3_Var
  store i64 %ln1EB, i64* %ls1xr
  %ln1EC = load i64* %R2_Var
  store i64 %ln1EC, i64* %ls1xy
  %ln1ED = load i64* %R1_Var
  store i64 %ln1ED, i64* %ls1xB
  %ln1EE = load i64* %ls1xr
  %ln1EF = load i64* %ls1xB
  %ln1EG = add i64 %ln1EF, 14
  %ln1EH = inttoptr i64 %ln1EG to i64*
  %ln1EI = load i64* %ln1EH, !tbaa !5
  %ln1EJ = icmp sge i64 %ln1EE, %ln1EI
  br i1 %ln1EJ, label %c1AN, label %c1AM

c1AM:
  %ln1EK = load i64* %ls1xr
  %ln1EL = add i64 %ln1EK, 1
  store i64 %ln1EL, i64* %R3_Var
  %ln1EM = load i64* %ls1xy
  %ln1EN = load i64* %ls1xB
  %ln1EO = add i64 %ln1EN, 6
  %ln1EP = inttoptr i64 %ln1EO to i64*
  %ln1EQ = load i64* %ln1EP, !tbaa !5
  %ln1ER = load i64* %ls1xB
  %ln1ES = add i64 %ln1ER, 22
  %ln1ET = inttoptr i64 %ln1ES to i64*
  %ln1EU = load i64* %ln1ET, !tbaa !5
  %ln1EV = load i64* %ls1xr
  %ln1EW = add i64 %ln1EU, %ln1EV
  %ln1EX = shl i64 %ln1EW, 2
  %ln1EY = add i64 %ln1EX, 16
  %ln1EZ = add i64 %ln1EQ, %ln1EY
  %ln1F0 = inttoptr i64 %ln1EZ to i32*
  %ln1F1 = load i32* %ln1F0, !tbaa !5
  %ln1F2 = sext i32 %ln1F1 to i64
  %ln1F3 = add i64 %ln1EM, %ln1F2
  %ln1F4 = trunc i64 %ln1F3 to i32
  %ln1F5 = sext i32 %ln1F4 to i64
  store i64 %ln1F5, i64* %R2_Var
  %ln1F6 = load i64* %ls1xB
  store i64 %ln1F6, i64* %R1_Var
  %ln1F7 = load i64** %Base_Var
  %ln1F8 = load i64** %Sp_Var
  %ln1F9 = load i64** %Hp_Var
  %ln1Fa = load i64* %R1_Var
  %ln1Fb = load i64* %R2_Var
  %ln1Fc = load i64* %R3_Var
  %ln1Fd = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @s1xB_info( i64* %ln1F7, i64* %ln1F8, i64* %ln1F9, i64 %ln1Fa, i64 %ln1Fb, i64 %ln1Fc, i64 undef, i64 undef, i64 undef, i64 %ln1Fd ) nounwind
  ret void

c1AN:
  %ln1Fe = load i64* %ls1xy
  store i64 %ln1Fe, i64* %R1_Var
  %ln1Ff = load i64** %Sp_Var
  %ln1Fg = getelementptr inbounds i64* %ln1Ff, i32 0
  %ln1Fh = bitcast i64* %ln1Fg to i64*
  %ln1Fi = load i64* %ln1Fh, !tbaa !1
  %ln1Fj = inttoptr i64 %ln1Fi to void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)*
  %ln1Fk = load i64** %Base_Var
  %ln1Fl = load i64** %Sp_Var
  %ln1Fm = load i64** %Hp_Var
  %ln1Fn = load i64* %R1_Var
  %ln1Fo = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* %ln1Fj( i64* %ln1Fk, i64* %ln1Fl, i64* %ln1Fm, i64 %ln1Fn, i64 undef, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1Fo ) nounwind
  ret void

}


%Test_zdwa_entry_struct = type <{i64, i64, i64}>
@Test_zdwa_info_itable =  constant %Test_zdwa_entry_struct<{i64 4294967301, i64 0, i64 15}>, section "X98A__STRIP,__me3", align 8

define  cc 10 void @Test_zdwa_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me4"
{
c1Bf:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 %R2_Arg, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 undef, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ls1xj = alloca i64, i32 1
  %ln1FV = load i64* %R2_Var
  store i64 %ln1FV, i64* %ls1xj
  %ln1FW = load i64** %Sp_Var
  %ln1FX = getelementptr inbounds i64* %ln1FW, i32 -4
  %ln1FY = ptrtoint i64* %ln1FX to i64
  %ln1FZ = load i64* %SpLim_Var
  %ln1G0 = icmp ult i64 %ln1FY, %ln1FZ
  br i1 %ln1G0, label %c1Cf, label %c1Ce

c1Ce:
  %ln1G1 = ptrtoint void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @c1Bg_info to i64
  %ln1G2 = load i64** %Sp_Var
  %ln1G3 = getelementptr inbounds i64* %ln1G2, i32 -1
  store i64 %ln1G1, i64* %ln1G3, !tbaa !1
  %ln1G4 = load i64* %ls1xj
  store i64 %ln1G4, i64* %R1_Var
  %ln1G5 = load i64** %Sp_Var
  %ln1G6 = getelementptr inbounds i64* %ln1G5, i32 -1
  %ln1G7 = ptrtoint i64* %ln1G6 to i64
  %ln1G8 = inttoptr i64 %ln1G7 to i64*
  store i64* %ln1G8, i64** %Sp_Var
  %ln1G9 = load i64** %Base_Var
  %ln1Ga = load i64** %Sp_Var
  %ln1Gb = load i64** %Hp_Var
  %ln1Gc = load i64* %R1_Var
  %ln1Gd = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @stg_ap_0_fast( i64* %ln1G9, i64* %ln1Ga, i64* %ln1Gb, i64 %ln1Gc, i64 undef, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1Gd ) nounwind
  ret void

c1Cf:
  %ln1Ge = load i64* %ls1xj
  store i64 %ln1Ge, i64* %R2_Var
  %ln1Gf = ptrtoint %Test_zdwa_closure_struct* @Test_zdwa_closure to i64
  store i64 %ln1Gf, i64* %R1_Var
  %ln1Gg = load i64** %Base_Var
  %ln1Gh = getelementptr inbounds i64* %ln1Gg, i32 -1
  %ln1Gi = bitcast i64* %ln1Gh to i64*
  %ln1Gj = load i64* %ln1Gi, !tbaa !4
  %ln1Gk = inttoptr i64 %ln1Gj to void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)*
  %ln1Gl = load i64** %Base_Var
  %ln1Gm = load i64** %Sp_Var
  %ln1Gn = load i64** %Hp_Var
  %ln1Go = load i64* %R1_Var
  %ln1Gp = load i64* %R2_Var
  %ln1Gq = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* %ln1Gk( i64* %ln1Gl, i64* %ln1Gm, i64* %ln1Gn, i64 %ln1Go, i64 %ln1Gp, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1Gq ) nounwind
  ret void

}


declare  cc 10 void @stg_ap_0_fast(i64* noalias nocapture, i64* noalias nocapture, i64* noalias nocapture, i64, i64, i64, i64, i64, i64, i64) align 8

%c1Bg_entry_struct = type <{i64, i64}>
@c1Bg_info_itable = internal constant %c1Bg_entry_struct<{i64 0, i64 32}>, section "X98A__STRIP,__me5", align 8

define internal cc 10 void @c1Bg_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me6"
{
c1Bg:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 undef, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 undef, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ls1yF = alloca i64, i32 1
  %ls1xu = alloca i64, i32 1
  %ls1xv = alloca i64, i32 1
  %ls1xs = alloca i64, i32 1
  %lc1Bn = alloca i64, i32 1
  %ls1xH = alloca i64, i32 1
  %ls1xX = alloca <4 x i32>, i32 1
  %ls1xL = alloca i64, i32 1
  %ln1I5 = load i64** %Hp_Var
  %ln1I6 = getelementptr inbounds i64* %ln1I5, i32 4
  %ln1I7 = ptrtoint i64* %ln1I6 to i64
  %ln1I8 = inttoptr i64 %ln1I7 to i64*
  store i64* %ln1I8, i64** %Hp_Var
  %ln1I9 = load i64* %R1_Var
  store i64 %ln1I9, i64* %ls1yF
  %ln1Ia = load i64** %Hp_Var
  %ln1Ib = ptrtoint i64* %ln1Ia to i64
  %ln1Ic = load i64** %Base_Var
  %ln1Id = getelementptr inbounds i64* %ln1Ic, i32 35
  %ln1Ie = bitcast i64* %ln1Id to i64*
  %ln1If = load i64* %ln1Ie, !tbaa !4
  %ln1Ig = icmp ugt i64 %ln1Ib, %ln1If
  br i1 %ln1Ig, label %c1Cb, label %c1BR

c1BR:
  %ln1Ih = load i64* %ls1yF
  %ln1Ii = add i64 %ln1Ih, 7
  %ln1Ij = inttoptr i64 %ln1Ii to i64*
  %ln1Ik = load i64* %ln1Ij, !tbaa !5
  store i64 %ln1Ik, i64* %ls1xu
  %ln1Il = load i64* %ls1yF
  %ln1Im = add i64 %ln1Il, 15
  %ln1In = inttoptr i64 %ln1Im to i64*
  %ln1Io = load i64* %ln1In, !tbaa !5
  store i64 %ln1Io, i64* %ls1xv
  %ln1Ip = load i64* %ls1yF
  %ln1Iq = add i64 %ln1Ip, 23
  %ln1Ir = inttoptr i64 %ln1Iq to i64*
  %ln1Is = load i64* %ln1Ir, !tbaa !5
  store i64 %ln1Is, i64* %ls1xs
  %ln1It = ptrtoint void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @s1xB_info to i64
  %ln1Iu = load i64** %Hp_Var
  %ln1Iv = getelementptr inbounds i64* %ln1Iu, i32 -3
  store i64 %ln1It, i64* %ln1Iv, !tbaa !2
  %ln1Iw = load i64* %ls1xu
  %ln1Ix = load i64** %Hp_Var
  %ln1Iy = getelementptr inbounds i64* %ln1Ix, i32 -2
  store i64 %ln1Iw, i64* %ln1Iy, !tbaa !2
  %ln1Iz = load i64* %ls1xs
  %ln1IA = load i64** %Hp_Var
  %ln1IB = getelementptr inbounds i64* %ln1IA, i32 -1
  store i64 %ln1Iz, i64* %ln1IB, !tbaa !2
  %ln1IC = load i64* %ls1xv
  %ln1ID = load i64** %Hp_Var
  %ln1IE = getelementptr inbounds i64* %ln1ID, i32 0
  store i64 %ln1IC, i64* %ln1IE, !tbaa !2
  %ln1IF = load i64** %Hp_Var
  %ln1IG = ptrtoint i64* %ln1IF to i64
  %ln1IH = add i64 %ln1IG, -22
  store i64 %ln1IH, i64* %lc1Bn
  %ln1II = load i64* %ls1xs
  %ln1IJ = load i64* %ls1xs
  %ln1IK = srem i64 %ln1IJ, 4
  %ln1IL = sub i64 %ln1II, %ln1IK
  store i64 %ln1IL, i64* %ls1xH
  %ln1IM = insertelement <4 x i32> < i32 0, i32 0, i32 0, i32 0 >, i32 0, i32 0
  %ln1IN = insertelement <4 x i32> %ln1IM, i32 0, i32 1
  %ln1IO = insertelement <4 x i32> %ln1IN, i32 0, i32 2
  %ln1IP = insertelement <4 x i32> %ln1IO, i32 0, i32 3
  %ln1IQ = bitcast <4 x i32> %ln1IP to <4 x i32>
  store <4 x i32> %ln1IQ, <4 x i32>* %ls1xX, align 1
  store i64 0, i64* %ls1xL
  br label %s1xV

s1xV:
  %ln1IR = load i64* %ls1xL
  %ln1IS = load i64* %ls1xH
  %ln1IT = icmp sge i64 %ln1IR, %ln1IS
  br i1 %ln1IT, label %c1C1, label %c1C0

c1C0:
  %ln1IU = load i64* %ls1xu
  %ln1IV = add i64 %ln1IU, 16
  %ln1IW = load i64* %ls1xv
  %ln1IX = load i64* %ls1xL
  %ln1IY = add i64 %ln1IW, %ln1IX
  %ln1IZ = shl i64 %ln1IY, 2
  %ln1J0 = add i64 %ln1IZ, 1536
  %ln1J1 = add i64 %ln1IV, %ln1J0
  %ln1J2 = inttoptr i64 %ln1J1 to i8*
  store i64 undef, i64* %R3_Var
  store i64 undef, i64* %R4_Var
  store i64 undef, i64* %R5_Var
  store i64 undef, i64* %R6_Var
  store float undef, float* %F1_Var
  store double undef, double* %D1_Var
  store float undef, float* %F2_Var
  store double undef, double* %D2_Var
  store float undef, float* %F3_Var
  store double undef, double* %D3_Var
  store float undef, float* %F4_Var
  store double undef, double* %D4_Var
  store float undef, float* %F5_Var
  store double undef, double* %D5_Var
  store float undef, float* %F6_Var
  store double undef, double* %D6_Var
  call ccc void (i8*,i32,i32,i32)* @llvm.prefetch( i8* %ln1J2, i32 0, i32 3, i32 1 )
  %ln1J3 = load <4 x i32>* %ls1xX, align 1
  %ln1J4 = load i64* %ls1xu
  %ln1J5 = add i64 %ln1J4, 16
  %ln1J6 = load i64* %ls1xv
  %ln1J7 = load i64* %ls1xL
  %ln1J8 = add i64 %ln1J6, %ln1J7
  %ln1J9 = shl i64 %ln1J8, 2
  %ln1Ja = add i64 %ln1J5, %ln1J9
  %ln1Jb = inttoptr i64 %ln1Ja to <4 x i32>*
  %ln1Jc = load <4 x i32>* %ln1Jb, align 1, !tbaa !5
  %ln1Jd = add <4 x i32> %ln1J3, %ln1Jc
  %ln1Je = bitcast <4 x i32> %ln1Jd to <4 x i32>
  store <4 x i32> %ln1Je, <4 x i32>* %ls1xX, align 1
  %ln1Jf = load i64* %ls1xL
  %ln1Jg = add i64 %ln1Jf, 4
  store i64 %ln1Jg, i64* %ls1xL
  br label %s1xV

c1C1:
  %ln1Jh = ptrtoint void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @c1Bm_info to i64
  %ln1Ji = load i64** %Sp_Var
  %ln1Jj = getelementptr inbounds i64* %ln1Ji, i32 -3
  store i64 %ln1Jh, i64* %ln1Jj, !tbaa !1
  %ln1Jk = load i64* %ls1xL
  store i64 %ln1Jk, i64* %R3_Var
  store i64 0, i64* %R2_Var
  %ln1Jl = load i64* %lc1Bn
  store i64 %ln1Jl, i64* %R1_Var
  %ln1Jm = load <4 x i32>* %ls1xX, align 1
  %ln1Jn = load i64** %Sp_Var
  %ln1Jo = getelementptr inbounds i64* %ln1Jn, i32 -2
  %ln1Jp = bitcast i64* %ln1Jo to <4 x i32>*
  store <4 x i32> %ln1Jm, <4 x i32>* %ln1Jp, align 1, !tbaa !1
  %ln1Jq = load i64** %Sp_Var
  %ln1Jr = getelementptr inbounds i64* %ln1Jq, i32 -3
  %ln1Js = ptrtoint i64* %ln1Jr to i64
  %ln1Jt = inttoptr i64 %ln1Js to i64*
  store i64* %ln1Jt, i64** %Sp_Var
  %ln1Ju = load i64** %Base_Var
  %ln1Jv = load i64** %Sp_Var
  %ln1Jw = load i64** %Hp_Var
  %ln1Jx = load i64* %R1_Var
  %ln1Jy = load i64* %R2_Var
  %ln1Jz = load i64* %R3_Var
  %ln1JA = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @s1xB_info( i64* %ln1Ju, i64* %ln1Jv, i64* %ln1Jw, i64 %ln1Jx, i64 %ln1Jy, i64 %ln1Jz, i64 undef, i64 undef, i64 undef, i64 %ln1JA ) nounwind
  ret void

c1Cb:
  %ln1JB = load i64** %Base_Var
  %ln1JC = getelementptr inbounds i64* %ln1JB, i32 41
  store i64 32, i64* %ln1JC, !tbaa !4
  %ln1JD = load i64* %ls1yF
  store i64 %ln1JD, i64* %R1_Var
  %ln1JE = load i64** %Base_Var
  %ln1JF = load i64** %Sp_Var
  %ln1JG = load i64** %Hp_Var
  %ln1JH = load i64* %R1_Var
  %ln1JI = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @stg_gc_unpt_r1( i64* %ln1JE, i64* %ln1JF, i64* %ln1JG, i64 %ln1JH, i64 undef, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1JI ) nounwind
  ret void

}


declare  ccc void @llvm.prefetch(i8*, i32, i32, i32)

declare  cc 10 void @stg_gc_unpt_r1(i64* noalias nocapture, i64* noalias nocapture, i64* noalias nocapture, i64, i64, i64, i64, i64, i64, i64) align 8

%c1Bm_entry_struct = type <{i64, i64}>
@c1Bm_info_itable = internal constant %c1Bm_entry_struct<{i64 451, i64 32}>, section "X98A__STRIP,__me7", align 8

define internal cc 10 void @c1Bm_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me8"
{
c1Bm:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 undef, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 undef, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ls1xX = alloca <4 x i32>, i32 1
  %ln1Kr = load i64** %Sp_Var
  %ln1Ks = getelementptr inbounds i64* %ln1Kr, i32 1
  %ln1Kt = bitcast i64* %ln1Ks to <4 x i32>*
  %ln1Ku = load <4 x i32>* %ln1Kt, align 1, !tbaa !1
  %ln1Kv = bitcast <4 x i32> %ln1Ku to <4 x i32>
  store <4 x i32> %ln1Kv, <4 x i32>* %ls1xX, align 1
  %ln1Kw = load i64* %R1_Var
  %ln1Kx = load <4 x i32>* %ls1xX, align 1
  %ln1Ky = extractelement <4 x i32> %ln1Kx, i32 0
  %ln1Kz = sext i32 %ln1Ky to i64
  %ln1KA = add i64 %ln1Kw, %ln1Kz
  %ln1KB = trunc i64 %ln1KA to i32
  %ln1KC = sext i32 %ln1KB to i64
  %ln1KD = load <4 x i32>* %ls1xX, align 1
  %ln1KE = extractelement <4 x i32> %ln1KD, i32 1
  %ln1KF = sext i32 %ln1KE to i64
  %ln1KG = add i64 %ln1KC, %ln1KF
  %ln1KH = trunc i64 %ln1KG to i32
  %ln1KI = sext i32 %ln1KH to i64
  %ln1KJ = load <4 x i32>* %ls1xX, align 1
  %ln1KK = extractelement <4 x i32> %ln1KJ, i32 2
  %ln1KL = sext i32 %ln1KK to i64
  %ln1KM = add i64 %ln1KI, %ln1KL
  %ln1KN = trunc i64 %ln1KM to i32
  %ln1KO = sext i32 %ln1KN to i64
  %ln1KP = load <4 x i32>* %ls1xX, align 1
  %ln1KQ = extractelement <4 x i32> %ln1KP, i32 3
  %ln1KR = sext i32 %ln1KQ to i64
  %ln1KS = add i64 %ln1KO, %ln1KR
  %ln1KT = trunc i64 %ln1KS to i32
  %ln1KU = sext i32 %ln1KT to i64
  store i64 %ln1KU, i64* %R1_Var
  %ln1KV = load i64** %Sp_Var
  %ln1KW = getelementptr inbounds i64* %ln1KV, i32 4
  %ln1KX = ptrtoint i64* %ln1KW to i64
  %ln1KY = inttoptr i64 %ln1KX to i64*
  store i64* %ln1KY, i64** %Sp_Var
  %ln1KZ = load i64** %Sp_Var
  %ln1L0 = getelementptr inbounds i64* %ln1KZ, i32 0
  %ln1L1 = bitcast i64* %ln1L0 to i64*
  %ln1L2 = load i64* %ln1L1, !tbaa !1
  %ln1L3 = inttoptr i64 %ln1L2 to void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)*
  %ln1L4 = load i64** %Base_Var
  %ln1L5 = load i64** %Sp_Var
  %ln1L6 = load i64** %Hp_Var
  %ln1L7 = load i64* %R1_Var
  %ln1L8 = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* %ln1L3( i64* %ln1L4, i64* %ln1L5, i64* %ln1L6, i64 %ln1L7, i64 undef, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1L8 ) nounwind
  ret void

}


%Test_sum1_entry_struct = type <{i64, i64, i64}>
@Test_sum1_info_itable =  constant %Test_sum1_entry_struct<{i64 4294967301, i64 0, i64 15}>, section "X98A__STRIP,__me9", align 8

define  cc 10 void @Test_sum1_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me10"
{
c1Dh:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 %R2_Arg, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 undef, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ls1yn = alloca i64, i32 1
  %ln1LG = load i64* %R2_Var
  store i64 %ln1LG, i64* %ls1yn
  %ln1LH = load i64** %Sp_Var
  %ln1LI = getelementptr inbounds i64* %ln1LH, i32 -1
  %ln1LJ = ptrtoint i64* %ln1LI to i64
  %ln1LK = load i64* %SpLim_Var
  %ln1LL = icmp ult i64 %ln1LJ, %ln1LK
  br i1 %ln1LL, label %c1Dw, label %c1Dv

c1Dv:
  %ln1LM = ptrtoint void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)* @c1Di_info to i64
  %ln1LN = load i64** %Sp_Var
  %ln1LO = getelementptr inbounds i64* %ln1LN, i32 -1
  store i64 %ln1LM, i64* %ln1LO, !tbaa !1
  %ln1LP = load i64* %ls1yn
  store i64 %ln1LP, i64* %R2_Var
  %ln1LQ = load i64** %Sp_Var
  %ln1LR = getelementptr inbounds i64* %ln1LQ, i32 -1
  %ln1LS = ptrtoint i64* %ln1LR to i64
  %ln1LT = inttoptr i64 %ln1LS to i64*
  store i64* %ln1LT, i64** %Sp_Var
  %ln1LU = load i64** %Base_Var
  %ln1LV = load i64** %Sp_Var
  %ln1LW = load i64** %Hp_Var
  %ln1LX = load i64* %R1_Var
  %ln1LY = load i64* %R2_Var
  %ln1LZ = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @Test_zdwa_info( i64* %ln1LU, i64* %ln1LV, i64* %ln1LW, i64 %ln1LX, i64 %ln1LY, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1LZ ) nounwind
  ret void

c1Dw:
  %ln1M0 = load i64* %ls1yn
  store i64 %ln1M0, i64* %R2_Var
  %ln1M1 = ptrtoint %Test_sum1_closure_struct* @Test_sum1_closure to i64
  store i64 %ln1M1, i64* %R1_Var
  %ln1M2 = load i64** %Base_Var
  %ln1M3 = getelementptr inbounds i64* %ln1M2, i32 -1
  %ln1M4 = bitcast i64* %ln1M3 to i64*
  %ln1M5 = load i64* %ln1M4, !tbaa !4
  %ln1M6 = inttoptr i64 %ln1M5 to void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)*
  %ln1M7 = load i64** %Base_Var
  %ln1M8 = load i64** %Sp_Var
  %ln1M9 = load i64** %Hp_Var
  %ln1Ma = load i64* %R1_Var
  %ln1Mb = load i64* %R2_Var
  %ln1Mc = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* %ln1M6( i64* %ln1M7, i64* %ln1M8, i64* %ln1M9, i64 %ln1Ma, i64 %ln1Mb, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1Mc ) nounwind
  ret void

}


%c1Di_entry_struct = type <{i64, i64}>
@c1Di_info_itable = internal constant %c1Di_entry_struct<{i64 0, i64 32}>, section "X98A__STRIP,__me11", align 8

define internal cc 10 void @c1Di_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me12"
{
c1Di:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 undef, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 undef, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ls1yp = alloca i64, i32 1
  %ln1MU = load i64** %Hp_Var
  %ln1MV = getelementptr inbounds i64* %ln1MU, i32 2
  %ln1MW = ptrtoint i64* %ln1MV to i64
  %ln1MX = inttoptr i64 %ln1MW to i64*
  store i64* %ln1MX, i64** %Hp_Var
  %ln1MY = load i64* %R1_Var
  store i64 %ln1MY, i64* %ls1yp
  %ln1MZ = load i64** %Hp_Var
  %ln1N0 = ptrtoint i64* %ln1MZ to i64
  %ln1N1 = load i64** %Base_Var
  %ln1N2 = getelementptr inbounds i64* %ln1N1, i32 35
  %ln1N3 = bitcast i64* %ln1N2 to i64*
  %ln1N4 = load i64* %ln1N3, !tbaa !4
  %ln1N5 = icmp ugt i64 %ln1N0, %ln1N4
  br i1 %ln1N5, label %c1Ds, label %c1Dp

c1Dp:
  %ln1N6 = ptrtoint [0 x i64]* @base_GHCziInt_I32zh_con_info to i64
  %ln1N7 = load i64** %Hp_Var
  %ln1N8 = getelementptr inbounds i64* %ln1N7, i32 -1
  store i64 %ln1N6, i64* %ln1N8, !tbaa !2
  %ln1N9 = load i64* %ls1yp
  %ln1Na = load i64** %Hp_Var
  %ln1Nb = getelementptr inbounds i64* %ln1Na, i32 0
  store i64 %ln1N9, i64* %ln1Nb, !tbaa !2
  %ln1Nc = load i64** %Hp_Var
  %ln1Nd = ptrtoint i64* %ln1Nc to i64
  %ln1Ne = add i64 %ln1Nd, -7
  store i64 %ln1Ne, i64* %R1_Var
  %ln1Nf = load i64** %Sp_Var
  %ln1Ng = getelementptr inbounds i64* %ln1Nf, i32 1
  %ln1Nh = ptrtoint i64* %ln1Ng to i64
  %ln1Ni = inttoptr i64 %ln1Nh to i64*
  store i64* %ln1Ni, i64** %Sp_Var
  %ln1Nj = load i64** %Sp_Var
  %ln1Nk = getelementptr inbounds i64* %ln1Nj, i32 0
  %ln1Nl = bitcast i64* %ln1Nk to i64*
  %ln1Nm = load i64* %ln1Nl, !tbaa !1
  %ln1Nn = inttoptr i64 %ln1Nm to void (i64*, i64*, i64*, i64, i64, i64, i64, i64, i64, i64)*
  %ln1No = load i64** %Base_Var
  %ln1Np = load i64** %Sp_Var
  %ln1Nq = load i64** %Hp_Var
  %ln1Nr = load i64* %R1_Var
  %ln1Ns = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* %ln1Nn( i64* %ln1No, i64* %ln1Np, i64* %ln1Nq, i64 %ln1Nr, i64 undef, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1Ns ) nounwind
  ret void

c1Ds:
  %ln1Nt = load i64** %Base_Var
  %ln1Nu = getelementptr inbounds i64* %ln1Nt, i32 41
  store i64 16, i64* %ln1Nu, !tbaa !4
  %ln1Nv = load i64* %ls1yp
  store i64 %ln1Nv, i64* %R1_Var
  %ln1Nw = load i64** %Base_Var
  %ln1Nx = load i64** %Sp_Var
  %ln1Ny = load i64** %Hp_Var
  %ln1Nz = load i64* %R1_Var
  %ln1NA = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @stg_gc_unbx_r1( i64* %ln1Nw, i64* %ln1Nx, i64* %ln1Ny, i64 %ln1Nz, i64 undef, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1NA ) nounwind
  ret void

}


@base_GHCziInt_I32zh_con_info = external global [0 x i64]

declare  cc 10 void @stg_gc_unbx_r1(i64* noalias nocapture, i64* noalias nocapture, i64* noalias nocapture, i64, i64, i64, i64, i64, i64, i64) align 8

%Test_sum_entry_struct = type <{i64, i64, i64}>
@Test_sum_info_itable =  constant %Test_sum_entry_struct<{i64 4294967301, i64 0, i64 15}>, section "X98A__STRIP,__me13", align 8

define  cc 10 void @Test_sum_info(i64* noalias nocapture %Base_Arg, i64* noalias nocapture %Sp_Arg, i64* noalias nocapture %Hp_Arg, i64 %R1_Arg, i64 %R2_Arg, i64 %R3_Arg, i64 %R4_Arg, i64 %R5_Arg, i64 %R6_Arg, i64 %SpLim_Arg) align 8 nounwind section "X98A__STRIP,__me14"
{
c1DE:
  %Base_Var = alloca i64*, i32 1
  store i64* %Base_Arg, i64** %Base_Var
  %Sp_Var = alloca i64*, i32 1
  store i64* %Sp_Arg, i64** %Sp_Var
  %Hp_Var = alloca i64*, i32 1
  store i64* %Hp_Arg, i64** %Hp_Var
  %R1_Var = alloca i64, i32 1
  store i64 %R1_Arg, i64* %R1_Var
  %R2_Var = alloca i64, i32 1
  store i64 %R2_Arg, i64* %R2_Var
  %R3_Var = alloca i64, i32 1
  store i64 undef, i64* %R3_Var
  %R4_Var = alloca i64, i32 1
  store i64 undef, i64* %R4_Var
  %R5_Var = alloca i64, i32 1
  store i64 undef, i64* %R5_Var
  %R6_Var = alloca i64, i32 1
  store i64 undef, i64* %R6_Var
  %SpLim_Var = alloca i64, i32 1
  store i64 %SpLim_Arg, i64* %SpLim_Var
  %F1_Var = alloca float, i32 1
  store float undef, float* %F1_Var
  %D1_Var = alloca double, i32 1
  store double undef, double* %D1_Var
  %XMM1_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM1_Var, align 1
  %F2_Var = alloca float, i32 1
  store float undef, float* %F2_Var
  %D2_Var = alloca double, i32 1
  store double undef, double* %D2_Var
  %XMM2_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM2_Var, align 1
  %F3_Var = alloca float, i32 1
  store float undef, float* %F3_Var
  %D3_Var = alloca double, i32 1
  store double undef, double* %D3_Var
  %XMM3_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM3_Var, align 1
  %F4_Var = alloca float, i32 1
  store float undef, float* %F4_Var
  %D4_Var = alloca double, i32 1
  store double undef, double* %D4_Var
  %XMM4_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM4_Var, align 1
  %F5_Var = alloca float, i32 1
  store float undef, float* %F5_Var
  %D5_Var = alloca double, i32 1
  store double undef, double* %D5_Var
  %XMM5_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM5_Var, align 1
  %F6_Var = alloca float, i32 1
  store float undef, float* %F6_Var
  %D6_Var = alloca double, i32 1
  store double undef, double* %D6_Var
  %XMM6_Var = alloca <4 x i32>, i32 1
  store <4 x i32> undef, <4 x i32>* %XMM6_Var, align 1
  %ln1NI = load i64* %R2_Var
  store i64 %ln1NI, i64* %R2_Var
  %ln1NJ = load i64** %Base_Var
  %ln1NK = load i64** %Sp_Var
  %ln1NL = load i64** %Hp_Var
  %ln1NM = load i64* %R1_Var
  %ln1NN = load i64* %R2_Var
  %ln1NO = load i64* %SpLim_Var
  tail call cc 10 void (i64*,i64*,i64*,i64,i64,i64,i64,i64,i64,i64)* @Test_sum1_info( i64* %ln1NJ, i64* %ln1NK, i64* %ln1NL, i64 %ln1NM, i64 %ln1NN, i64 undef, i64 undef, i64 undef, i64 undef, i64 %ln1NO ) nounwind
  ret void

}


@llvm.used = appending global [4 x i8*] [i8* bitcast (%c1Di_entry_struct* @c1Di_info_itable to i8*), i8* bitcast (%c1Bm_entry_struct* @c1Bm_info_itable to i8*), i8* bitcast (%c1Bg_entry_struct* @c1Bg_info_itable to i8*), i8* bitcast (%s1xB_entry_struct* @s1xB_info_itable to i8*)], section "llvm.metadata"

-------------- next part (test.s) --------------
	.file	"/tmp/ghc19964_0/ghc19964_0.bc"
	.data
	.type	Test_zdwa_closure,@object # @Test_zdwa_closure
	.globl	Test_zdwa_closure
	.align	8
Test_zdwa_closure:
	.quad	Test_zdwa_info
	.size	Test_zdwa_closure, 8

	.type	Test_sum1_closure,@object # @Test_sum1_closure
	.globl	Test_sum1_closure
	.align	8
Test_sum1_closure:
	.quad	Test_sum1_info
	.size	Test_sum1_closure, 8

	.type	Test_sum_closure,@object # @Test_sum_closure
	.globl	Test_sum_closure
	.align	8
Test_sum_closure:
	.quad	Test_sum_info
	.size	Test_sum_closure, 8

	.section	".note.GNU-stack","",@progbits

	.text
	.type	s1xB_info_itable,@object # @s1xB_info_itable
	.align	8
s1xB_info_itable:
	.quad	8589934602              # 0x20000000a
	.quad	8589934593              # 0x200000001
	.quad	9                       # 0x9
	.size	s1xB_info_itable, 24

	.text
	.align	8, 0x90
	.type	s1xB_info,@function
s1xB_info:                              # @s1xB_info
# BB#0:                                 # %c1AJ
	movq	%r14, %rax
	movq	14(%rbx), %rcx
	cmpq	%rsi, %rcx
	jle	.LBB0_3
# BB#1:                                 # %c1AM.lr.ph
	movq	22(%rbx), %rdx
	addq	%rsi, %rdx
	movq	6(%rbx), %rdi
	leaq	16(%rdi,%rdx,4), %rdx
	.align	16, 0x90
.LBB0_2:                                # %c1AM
                                        # =>This Inner Loop Header: Depth=1
	addl	(%rdx), %eax
	movslq	%eax, %rax
	addq	$4, %rdx
	incq	%rsi
	cmpq	%rsi, %rcx
	jg	.LBB0_2
.LBB0_3:                                # %c1AN
	movq	(%rbp), %rcx
	movq	%rax, %rbx
	jmpq	*%rcx  # TAILCALL
.Ltmp0:
	.size	s1xB_info, .Ltmp0-s1xB_info

	.text
	.type	Test_zdwa_info_itable,@object # @Test_zdwa_info_itable
	.globl	Test_zdwa_info_itable
	.align	8
Test_zdwa_info_itable:
	.quad	4294967301              # 0x100000005
	.quad	0                       # 0x0
	.quad	15                      # 0xf
	.size	Test_zdwa_info_itable, 24

	.text
	.globl	Test_zdwa_info
	.align	8, 0x90
	.type	Test_zdwa_info,@function
Test_zdwa_info:                         # @Test_zdwa_info
# BB#0:                                 # %c1Bf
	leaq	-32(%rbp), %rax
	cmpq	%r15, %rax
	jae	.LBB1_1
# BB#2:                                 # %c1Cf
	movq	-8(%r13), %rax
	movl	$Test_zdwa_closure, %ebx
	jmpq	*%rax  # TAILCALL
.LBB1_1:                                # %c1Ce
	movq	$c1Bg_info, -8(%rbp)
	addq	$-8, %rbp
	movq	%r14, %rbx
	jmp	stg_ap_0_fast           # TAILCALL
.Ltmp1:
	.size	Test_zdwa_info, .Ltmp1-Test_zdwa_info

	.text
	.type	c1Bg_info_itable,@object # @c1Bg_info_itable
	.align	8
c1Bg_info_itable:
	.quad	0                       # 0x0
	.quad	32                      # 0x20
	.size	c1Bg_info_itable, 16

	.text
	.align	8, 0x90
	.type	c1Bg_info,@function
c1Bg_info:                              # @c1Bg_info
# BB#0:                                 # %c1Bg
	movq	%r12, %rax
	leaq	32(%rax), %r12
	cmpq	280(%r13), %r12
	jbe	.LBB2_1
# BB#8:                                 # %c1Cb
	movq	$32, 328(%r13)
	jmp	stg_gc_unpt_r1          # TAILCALL
.LBB2_1:                                # %c1BR
	movq	23(%rbx), %rcx
	movq	7(%rbx), %rsi
	movq	15(%rbx), %rdi
	movq	$s1xB_info, 8(%rax)
	movq	%rsi, 16(%rax)
	movq	%rcx, 24(%rax)
	movq	%rcx, %rdx
	sarq	$63, %rdx
	shrq	$62, %rdx
	addq	%rcx, %rdx
	movq	%rdi, (%r12)
	andq	$-4, %rdx
	pxor	%xmm0, %xmm0
	xorl	%eax, %eax
	testq	%rdx, %rdx
	movq	%rax, %rcx
	jle	.LBB2_4
# BB#2:                                 # %c1C0.lr.ph
	leaq	1552(%rsi,%rdi,4), %rsi
	pxor	%xmm0, %xmm0
	xorl	%ecx, %ecx
	.align	16, 0x90
.LBB2_3:                                # %c1C0
                                        # =>This Inner Loop Header: Depth=1
	prefetcht0	(%rsi)
	movdqu	-1536(%rsi), %xmm1
	paddd	%xmm1, %xmm0
	addq	$16, %rsi
	addq	$4, %rcx
	cmpq	%rdx, %rcx
	jl	.LBB2_3
.LBB2_4:                                # %c1C1
	movq	$c1Bm_info, -24(%rbp)
	movdqu	%xmm0, -16(%rbp)
	movq	-8(%r12), %rdx
	cmpq	%rcx, %rdx
	jle	.LBB2_7
# BB#5:                                 # %c1AM.lr.ph.i
	subq	%rcx, %rdx
	addq	(%r12), %rcx
	movq	-16(%r12), %rax
	leaq	16(%rax,%rcx,4), %rcx
	xorl	%eax, %eax
	.align	16, 0x90
.LBB2_6:                                # %c1AM.i
                                        # =>This Inner Loop Header: Depth=1
	addl	(%rcx), %eax
	movslq	%eax, %rax
	addq	$4, %rcx
	decq	%rdx
	jne	.LBB2_6
.LBB2_7:                                # %s1xB_info.exit
	pextrd	$3, %xmm0, %ecx
	addl	%eax, %ecx
	pextrd	$2, %xmm0, %eax
	addl	%ecx, %eax
	pextrd	$1, %xmm0, %ecx
	addl	%eax, %ecx
	movd	%xmm0, %eax
	addl	%ecx, %eax
	movslq	%eax, %rbx
	movq	8(%rbp), %rax
	addq	$8, %rbp
	jmpq	*%rax  # TAILCALL
.Ltmp2:
	.size	c1Bg_info, .Ltmp2-c1Bg_info

	.text
	.type	c1Bm_info_itable,@object # @c1Bm_info_itable
	.align	8
c1Bm_info_itable:
	.quad	451                     # 0x1c3
	.quad	32                      # 0x20
	.size	c1Bm_info_itable, 16

	.text
	.align	8, 0x90
	.type	c1Bm_info,@function
c1Bm_info:                              # @c1Bm_info
# BB#0:                                 # %c1Bm
	movdqu	8(%rbp), %xmm0
	pextrd	$3, %xmm0, %eax
	addl	%ebx, %eax
	pextrd	$2, %xmm0, %ecx
	addl	%eax, %ecx
	pextrd	$1, %xmm0, %eax
	addl	%ecx, %eax
	movd	%xmm0, %ecx
	addl	%eax, %ecx
	movslq	%ecx, %rbx
	movq	32(%rbp), %rax
	addq	$32, %rbp
	jmpq	*%rax  # TAILCALL
.Ltmp3:
	.size	c1Bm_info, .Ltmp3-c1Bm_info

	.text
	.type	Test_sum1_info_itable,@object # @Test_sum1_info_itable
	.globl	Test_sum1_info_itable
	.align	8
Test_sum1_info_itable:
	.quad	4294967301              # 0x100000005
	.quad	0                       # 0x0
	.quad	15                      # 0xf
	.size	Test_sum1_info_itable, 24

	.text
	.globl	Test_sum1_info
	.align	8, 0x90
	.type	Test_sum1_info,@function
Test_sum1_info:                         # @Test_sum1_info
# BB#0:                                 # %c1Dh
	leaq	-8(%rbp), %rax
	cmpq	%r15, %rax
	jae	.LBB4_1
# BB#3:                                 # %c1Dw
	movq	-8(%r13), %rax
	movl	$Test_sum1_closure, %ebx
	jmpq	*%rax  # TAILCALL
.LBB4_1:                                # %c1Dv
	movq	$c1Di_info, -8(%rbp)
	leaq	-40(%rbp), %rcx
	cmpq	%r15, %rcx
	jae	.LBB4_4
# BB#2:                                 # %c1Cf.i
	movq	-8(%r13), %rcx
	movq	%rax, %rbp
	movl	$Test_zdwa_closure, %ebx
	jmpq	*%rcx  # TAILCALL
.LBB4_4:                                # %c1Ce.i
	movq	$c1Bg_info, -16(%rbp)
	addq	$-16, %rbp
	movq	%r14, %rbx
	jmp	stg_ap_0_fast           # TAILCALL
.Ltmp4:
	.size	Test_sum1_info, .Ltmp4-Test_sum1_info

	.text
	.type	c1Di_info_itable,@object # @c1Di_info_itable
	.align	8
c1Di_info_itable:
	.quad	0                       # 0x0
	.quad	32                      # 0x20
	.size	c1Di_info_itable, 16

	.text
	.align	8, 0x90
	.type	c1Di_info,@function
c1Di_info:                              # @c1Di_info
# BB#0:                                 # %c1Di
	movq	%r12, %rax
	leaq	16(%rax), %r12
	cmpq	280(%r13), %r12
	jbe	.LBB5_1
# BB#2:                                 # %c1Ds
	movq	$16, 328(%r13)
	jmp	stg_gc_unbx_r1          # TAILCALL
.LBB5_1:                                # %c1Dp
	movq	$base_GHCziInt_I32zh_con_info, 8(%rax)
	movq	%rbx, 16(%rax)
	movq	8(%rbp), %rax
	addq	$8, %rbp
	leaq	-7(%r12), %rbx
	jmpq	*%rax  # TAILCALL
.Ltmp5:
	.size	c1Di_info, .Ltmp5-c1Di_info

	.text
	.type	Test_sum_info_itable,@object # @Test_sum_info_itable
	.globl	Test_sum_info_itable
	.align	8
Test_sum_info_itable:
	.quad	4294967301              # 0x100000005
	.quad	0                       # 0x0
	.quad	15                      # 0xf
	.size	Test_sum_info_itable, 24


	.text
	.globl	Test_sum_info
	.align	8, 0x90
	.type	Test_sum_info,@function
Test_sum_info:                          # @Test_sum_info
# BB#0:                                 # %c1DE
	leaq	-8(%rbp), %rax
	cmpq	%r15, %rax
	jae	.LBB6_1
# BB#3:                                 # %c1Dw.i
	movq	-8(%r13), %rax
	movl	$Test_sum1_closure, %ebx
	jmpq	*%rax  # TAILCALL
.LBB6_1:                                # %c1Dv.i
	movq	$c1Di_info, -8(%rbp)
	leaq	-40(%rbp), %rcx
	cmpq	%r15, %rcx
	jae	.LBB6_4
# BB#2:                                 # %c1Cf.i.i
	movq	-8(%r13), %rcx
	movq	%rax, %rbp
	movl	$Test_zdwa_closure, %ebx
	jmpq	*%rcx  # TAILCALL
.LBB6_4:                                # %c1Ce.i.i
	movq	$c1Bg_info, -16(%rbp)
	addq	$-16, %rbp
	movq	%r14, %rbx
	jmp	stg_ap_0_fast           # TAILCALL
.Ltmp6:
	.size	Test_sum_info, .Ltmp6-Test_sum_info

	.type	__stginit_Test,@object  # @__stginit_Test
	.bss
	.globl	__stginit_Test
	.align	8
__stginit_Test:
	.size	__stginit_Test, 0


