Request for assistance from Haskell-oriented startup: GHCi performance

Tue Jan 20 21:45:31 UTC 2015

Hi Konrad,

I was spending a little bit of time examining this just this morning,
and during my investigation, I followed your example from Stack
Overflow, but I find myself needing a little guidance.

One question I have about your example: Are your snippets so small
that they are prohibitively impacted by dynamic linking?

In short, I was testing your little example here. I don't have an
NVidia card, but I used your example from StackOverflow and slightly
modified it:

---------------------------------------------
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Data.Array.Accelerate as A
--import Data.Array.Accelerate.CUDA as C
import Data.Array.Accelerate.Interpreter as C
import Data.Time.Clock       (diffUTCTime, getCurrentTime)

main :: IO ()
main = do
    start <- getCurrentTime
    print $ C.run $ A.maximum $ A.map (+1) $ A.use (fromList
(Z:.1000000) [1..1000000] :: Vector Double)
    end   <- getCurrentTime
    print $ diffUTCTime end start
---------------------------------------------

OK, so now compile it (GHC 7.8.4; note I use -dynamic for consistency):

$ ghc -O2 -fforce-recomp -threaded Test1.hs -dynamic
[1 of 1] Compiling Main             ( Test1.hs, Test1.o )
Linking Test1 ...

$ ./Test1
Array (Z) [100001.0]
0.391819s

OK, .39 seconds. Now try it interpreted:

$ ghci Test1
GHCi, version 7.8.4: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Ok, modules loaded: Main.
Prelude Main> :main
... linking messages ...
Array (Z) [100001.0]
0.462821s

OK, .46seconds, 30% slower. But now run `:main` again without terminating GHCi:

Prelude Main> :main
Array (Z) [100001.0]
0.000471s

It got much faster! This is probably because GHC (when optimizing)
lifted out the constant expression from your accelerate program (the
`C.run ...` part) into a CAF, which was already evaluated once; so
subsequent evaluations are much simpler (you can fix this by modifying
the example to take a command line argument representing the `Exp
Double` to use `A.map` over, to make sure GHC can't float it out).

So this compelled me to run with -v3, when loading GHCi (I didn't
suspect the CAF behavior at first). And you get a very informative
message when you run `:main` -

$ ghci Test1
GHCi, version 7.8.4: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Ok, modules loaded: Main.
Prelude Main> :set -v3
wired-in package ghc-prim mapped to
ghc-prim-0.3.1.0-a24f9c14c632d75b683d0f93283aea37
wired-in package integer-gmp mapped to
integer-gmp-0.5.1.0-26579559b3647acf4f01d5edd9491a46
wired-in package base mapped to base-4.7.0.2-bfd89587617e381ae01b8dd7b6c7f1c1
wired-in package rts mapped to builtin_rts
wired-in package template-haskell mapped to
template-haskell-2.9.0.0-6d27c2b362b15abb1822f2f34b9ae7f9
wired-in package dph-seq not found.
wired-in package dph-par not found.
Prelude Main> :main
*** Parser:
*** Desugar:
*** Simplify:
*** CorePrep:
*** ByteCodeGen:
Loading package pretty-1.1.1.1 ... linking ... done.
Loading package array-0.5.0.0 ... linking ... done.
Loading package deepseq-1.3.0.2 ... linking ... done.
Loading package old-locale-1.0.0.6 ... linking ... done.
Loading package time-1.4.2 ... linking ... done.
Loading package containers-0.5.5.1 ... linking ... done.
Loading package bytestring-0.10.4.0 ... linking ... done.
Loading package transformers-0.3.0.0 ... linking ... done.
Loading package mtl-2.1.3.1 ... linking ... done.
Loading package template-haskell ... linking ... done.
Loading package fclabels-2.0.2.2 ... linking ... done.
Loading package text-1.2.0.4 ... linking ... done.
Loading package hashable-1.2.3.1 ... linking ... done.
Loading package primitive-0.5.4.0 ... linking ... done.
Loading package vector-0.10.12.2 ... linking ... done.
Loading package hashtables-1.2.0.2 ... linking ... done.
Loading package unordered-containers-0.2.5.1 ... linking ... done.
Loading package accelerate-0.15.0.0 ... linking ... done.
*** Linker:
/usr/bin/gcc -fno-stack-protector -DTABLES_NEXT_TO_CODE -o
/tmp/ghc16243_0/ghc16243_2.so Test1.o -shared -Wl,-Bsymbolic
-Wl,-h,ghc16243_2.so -L/home/a/ghc-7.8.4/lib/ghc-7.8.4/base-4.7.0.2
-Wl,-rpath -Wl,/home/a/ghc-7.8.4/lib/ghc-7.8.4/base-4.7.0.2
-L/home/a/ghc-7.8.4/lib/ghc-7.8.4/integer-gmp-0.5.1.0 -Wl,-rpath
-Wl,/home/a/ghc-7.8.4/lib/ghc-7.8.4/integer-gmp-0.5.1.0
-L/home/a/ghc-7.8.4/lib/ghc-7.8.4/ghc-prim-0.3.1.0 -Wl,-rpath
-Wl,/home/a/ghc-7.8.4/lib/ghc-7.8.4/ghc-prim-0.3.1.0
-L/home/a/ghc-7.8.4/lib/ghc-7.8.4/rts-1.0 -Wl,-rpath
-Wl,/home/a/ghc-7.8.4/lib/ghc-7.8.4/rts-1.0 -lHSbase-4.7.0.2-ghc7.8.4
-lHSinteger-gmp-0.5.1.0-ghc7.8.4 -lHSghc-prim-0.3.1.0-ghc7.8.4 -lgmp
'-Wl,--hash-size=31' -Wl,--reduce-memory-overheads
Array (Z) [100001.0]
0.410921s

So with GHCi, we have to dynamically invoke GCC to create a shared
object file which GHC then loads (as it contains the expression we
entered to evaluate).

OK, so how *much* overhead is this? Well, it takes long enough for me
to see a pause, and we don't need to recompile this object file in the
background multiple times, only once. So we can modify the test:

---------------------------------------------
{-# LANGUAGE OverloadedStrings #-}

module Main where

import System.Environment (getArgs)

import Data.Array.Accelerate as A
--import Data.Array.Accelerate.CUDA as C
import Data.Array.Accelerate.Interpreter as C
import Data.Time.Clock       (diffUTCTime, getCurrentTime)

main :: IO ()
main = do
    n <- (constant . read . head) `fmap` getArgs :: IO (Exp Double)
    start <- getCurrentTime
    print $ C.run $ A.maximum $ A.map (+n) $ A.use (fromList
(Z:.100000) [1..100000] :: Vector Double)
    end   <- getCurrentTime
    print $ diffUTCTime end start
---------------------------------------------

Compile, run, interpret twice:

$ ghc -O2 -fforce-recomp -threaded Test2.hs -dynamic
[1 of 1] Compiling Main             ( Test2.hs, Test2.o )
Linking Test2 ...

$ ./Test2 10
Array (Z) [100010.0]
0.29926s

$ ghci Test2
GHCi, version 7.8.4: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Ok, modules loaded: Main.
Prelude Main> :main 10
...
Array (Z) [100010.0]
0.37202s

Prelude Main> :main 10
Array (Z) [100010.0]
0.299433s

Which is about 100ms slower or somewhere thereabouts, within the
threshold of human perception I'd guess.

However, this benchmark is very unscientific (`diffUTCTime` isn't a
very reliable metric to be quite honest), and the Core is really hard
to read due to it being so verbose from Accelerate (although you can
pretty easily see the CAF-ified `Array Double` at the top-level
wrapped in an `unsafeDupablePerformIO`).

So what this is telling me is that I think I need more information I'm
afraid. There are some key points to enumerate:

  - I'm not sure if the example program is actually very
representative of the slowdowns you're seeing; what we'd really need
to see is what output your compiler produces and GHC then consumes
through the API.
    - Also, GHCi is dynamically linked, but GHC generally builds
static programs; I'm not sure if you've built your loader (the program
that uses the GHC API) statically or dynamically. I'm not sure what
the speed difference would be here between dynamically linked GHC
loader and a statically linked one, but it's not like ours is free. It
may be faster though (but less correct).
    - NB: I'd guess this probably hurts a lot of you interactively run
lots of fresh Stmts, since each of these may be generating dynamic
files in the background (if you're linking dynamically).

  - It's clear the overhead of dynamic loading is nowhere near free;
even with a statically linked GHC RTS, I don't think it's free either.
To answer your original question, there's not really a smaller API to
handle a lot of this I'm afraid; and the current one has really only
been 'designed' for GHCi's use - things like fast, long-term
interactive loading of foreign code for dynamic applications have not
been the priority; a normally-short-lived REPL for users has (for
example, we couldn't even *unload* object files until very recently).

  - It's also not entirely clear if this slowdown is always constant
or it grows with the input. For the interpreter, it didn't seem to,
but like I said above, I think the original example was slightly
busted anyway.

Can you provide any more information about your build, configuration,
or perhaps a boiled down test case? (Even machine-generated test cases
are fine, if they work for you!) Maybe I can get further with an
Amazon GPU instance and a little more information.

Let me know if you need any help getting information from GHC itself.

On Wed, Jan 14, 2015 at 5:16 PM, Simon Peyton Jones
<simonpj at microsoft.com> wrote:
> Konrad
>
>
> That does sound frustrating.
>
>
>
> I think your first port of call should be Manuel Chakravarty, the author of
> accelerate.  The example you give in your stackoverflow post can only be
> some weird systems thing.  After all, you are executing precisely the same
> code (namely compiled Accelerate code); it’s just that in one case it’s
> dynamically linked and excecuted from GHCi and in the other it’s linked and
> executed by the shell.  I have no clue what could cause that.  I wonder if
> you are using a GPU and whether that might somehow behave differently.
> Could it be the difference between static linking and dynamic linking (which
> could plausibly account for some startup delay)?  Is it a fixed overhead (eg
> takes 100ms extra) or does it run a factor of two slower (increase the size
> of your test case to see)?
>
>
>
> I’d be happy to have a Skype call with you, but I am rather unlikely to know
> anything helpful because it doesn’t sound like a core Haskell issue at all.
> You are executing the very same machine instructions!
>
>
>
> The overheads of the GHC API to compile and run the expression “main” are
> pretty small.
>
>
>
> I’m copying ghc-devs in case anyone else has any ideas.
>
>
>
> Simon
>
>
>
>
>
>
>
> From: Konrad Gądek [mailto:kgadek at gmail.com]
> Sent: 14 January 2015 13:59
> To: Simon Peyton Jones
> Cc: Piotr Młodawski; kgadek at flowbox.io
> Subject: Request for assistance from Haskell-oriented startup: GHCi
> performance
>
>
>
> Dear Mr Jones,
>
>
>
> My name is Konrad Gądek and I'm one of the programmers at Flowbox (
> http://flowbox.io ), a startup that is to bring a fresh view on image
> composition in movie industry. We proudly use Haskell in nearly all of our
> development. I believe you may remember our CEO, Wojciech Daniło, from
> discussions like in this thread: https://phabricator.haskell.org/D69 .
>
> What can be interesting for you is that to achieve our goals as a company,
> we started developing a new programming language - Luna. Long story short,
> we believe that Luna could be as beneficial for the Haskell community as
> Elixir is for Erlang.
>
> However, we found some major performance problems with the code that are as
> critical for us as they are cryptic. We have found difficulties in
> pinpointing the actual issue, not to mention solving it. We're getting a bit
> desperate about that, nobody so far has been able to help us, and so we
> would like to ask you for help. We would be really really grateful if you
> could take a look, maybe your fresh ideas could shed some light on the
> issue. Details are attached below.
>
> Is there any chance we could arrange eg. a Skype call so we could further
> discuss the matter?
>
>
>
> Thank you in advance!
>
>
>
> Background
>
> Currently Luna is trans-compiled to Haskell and then compiled to bytecode by
> GHC. Furthermore, we use ghci to evaluate expressions (the flow graph)
> interactively. We use accelerate library to perform high-performance
> computations with the help of graphic cards.
>
> The problem
>
> Executing some of the functions from libraries compiled with -O2 (especially
> from accelerate) is much slower than calling it from compiled executable
> (see
> http://stackoverflow.com/questions/27541609/difference-in-performance-of-compiled-accelerate-code-ran-from-ghci-and-shell
> and https://github.com/AccelerateHS/accelerate/issues/227).
>
> Maybe there is some other way to interactively evaluate Haskell code, which
> is more lightweight/more customizable ie. would not require all ghc-api
> features which are probably slowing down the whole process? Is it possible
> to just use ghc linker and make function calls simpler and more time
> efficient?
>
>
>
> Details
>
> We feed ghci with statements (using ghc-api) and declarations (using runStmt
> and runDecls). We can also change imports and language extensions before
> each call. The overall process is as follows:
>
> on init:
>
> ·
>
> set ghcpath to one with our custom installation of ghc with preinstalled
> graphic libraries
> set imports to our libraries
> enable/disable appropriate language extensions
>
> for each run:
>
> ·
>
> generate haskell code (including datatype declarations, using lenses and
> TemplateHaskell) and load it to ghci using runDecls
> for each expression:
>
> o
>
> run statements that use freshly generated code
> bind (lazy) results to variables
> evaluate values from bound variables, and get it from GhcMonad to runtime of
> our interpreter (see
> http://hackage.haskell.org/package/hint–0.4.2.1/docs/Language-Haskell-Interpreter.html#v:interpret)
>
> This behaviour was observed when using GHC 7.8.3 (with D69 patch) on Fedora
> 20 (x86-64), Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz
>
> Tried so far
>
> Specializing nearly everything in accelerate library, specializing calls to
> accelearate methods (no speedup).
> Load precompiled, optimised code to ghci (no speedup).
> Truth to be told, we have no idea what to try next.
>
>
>
>
>
>
>
> --
>
> Konrad Gądek
>
> typechecker team-leader in Flowbox
>
>
> _______________________________________________
> ghc-devs mailing list
> ghc-devs at haskell.org
> http://www.haskell.org/mailman/listinfo/ghc-devs
>

-- 
Regards,

Austin Seipp, Haskell Consultant
Well-Typed LLP, http://www.well-typed.com/