>      > solution for your needs would be for us to
>      > use the LTO support in LLVM to inline across module boundaries - in
>      > particular to inline primop implementations into their call sites.
>      > LLVM would then probably deal with unrolling small loops with
>      > statically known bounds.
>     Could we simply use this?
> Might be easier to implement a PrimOp inlining pass, and to run it
> before LLVM's built-in MemCpyOptimization pass [0]. This wouldn't
> generally be as good as LTO but would work without gold.
> [0]

Ideally you'd want the heap check in the primop to be aggregated into 
the calling function's heap check, and the primop should allocate 
directly from the heap instead of calling out to the RTS allocate(). 
All this is a bit much to expect LLVM to do, but we could do it in the 
Glorious New Code Generator...
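
To make the idea concrete, here is a rough sketch in C of what an aggregated heap check looks like. This is a toy model only: ToyHeap, toy_heap_check, toy_bump and toy_caller are names made up for illustration and do not correspond to the real RTS data structures or to the code generator's output. The caller does one check covering both its own 3-word closure and the n words the inlined primop needs, then bumps the heap pointer twice with no further checks and no call out to an allocator function.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy model of a bump-allocated nursery; NOT the real GHC RTS layout. */
typedef struct {
    uintptr_t *hp;      /* next free word */
    uintptr_t *hp_lim;  /* one past the last usable word */
} ToyHeap;

/* A single heap check covering every allocation in the function body. */
static int toy_heap_check(const ToyHeap *h, size_t total_words) {
    return (size_t)(h->hp_lim - h->hp) >= total_words;
}

/* Bump allocation with no check of its own: the caller must already
   have checked for at least `words` words. */
static uintptr_t *toy_bump(ToyHeap *h, size_t words) {
    uintptr_t *p = h->hp;
    h->hp += words;
    return p;
}

/* A caller that allocates a 3-word closure and has an inlined "primop"
   that allocates an n-word array. One check for 3 + n words, then two
   bumps. Returns 0 on success, -1 if the check fails (the point where
   generated code would hand control to the GC instead). */
static int toy_caller(ToyHeap *h, size_t n,
                      uintptr_t **closure, uintptr_t **array) {
    if (!toy_heap_check(h, 3 + n)) {
        return -1;
    }
    *closure = toy_bump(h, 3);  /* caller's own allocation */
    *array   = toy_bump(h, n);  /* inlined primop's allocation */
    return 0;
}
```

The point of the sketch is just that the primop's allocation costs a pointer bump once its size has been folded into the caller's check; everything else about a real implementation (GC entry, object headers, large objects) is left out.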

For small arrays like this maybe we should have a new array type that 
leaves out all the card-marking stuff too (or just use tuples, as Roman 


