[Haskell-cafe] Profiling slow multiplication of elements read from vector

Wed Dec 12 15:56:14 CET 2012

Hey! Thanks a lot for your reply!

> Disclaimer: I am compiling GHC 7.6.1-rc1 while testing this, so my
> measurements might be unreliable. Best try it out yourself.
> 
> Also, this blind-stab-optimization is /not/ best practice, I just
> enjoyed fiddling around.

What /is/ best practice in regards to performance optimization of code
such as mine? What other options than blind stabbing do I have? 

As for applying your suggestions: I have improved the speed of my
overall program by about 15% (see the new profiling I have obtained),
but it still seems as if the program spends too much time reading from
the vector and multiplying 5 integers --- but probably the cost centers
are allocated wrongly?

I tried various combinations of the changes you suggest, and had some
very interesting results.

It seems like there are some performance differences using GHC
7.4.2 (which I use due to cabal-dev) and GHC 7.6.*.

Using my initial code, I get a runtime of around 15.7s.

On 12/11/12 at 10:43pm, Joachim Breitner wrote:
> neighProd vs l !i = lukUp vs l (i-2) * lukUp vs l (i-1) *
> 		lukUp vs l (i) * lukUp vs l (i+1) * lukUp vs l (i+2)
> 

Using that code rather than a map, actually leads to slower results! I
am getting an average of 18.0s for this.

Interestingly, passing the index `!i' as an unboxed Int leads to a
time-increase in my map version --- 16.8s.

However:

> Using V.unsafeIndex there gives another noticable speedup.

This helps a lot, giving me 14.8s and 15.1 for your suggestion; but
combining it with

> lukUp :: V.Vector Int -> Int -> Int -> Int
> lukUp vs l i = let !i' = i `mod` l
>                 in V.unsafeIndex vs i'
> 
> did, according to core, prevent some boxing, but did not speed up the
> program. Allocating Ints that are unused before the next GC are very
> cheap.

gives me a time of about 12.8s with your suggestions (explicit calls to
`lukUp', unsafeIndex'ing), which is the fastest result I got so far!

> You can save more time by not passing l around, the size of the vector
> is already present in vs and can cheaply be accessed by "V.length vs".

I decided to pass l around explicitly in order to safe on expensive
length-calculations. As it turns out, checking the bounds gives me a
decrease in speed in all cases.

Furthermore, using `mod' is actually a feature of
my underlying model, as it enforces periodic boundary conditions (think
of the vector representing a circle). Therefore the boundary checking
just happens accidentally.

> Oh, and you are creating a new byte array in every invocation of
> randVec. That is of course expensive. Using mutable arrays seems to be a
> more natural choice. Or is that really what you want to do? Judging from
> your code I would expect that you want to create a new array consisting
> only of values returned by neighProd; with the current code, when
> updating the second value you are including the _updated_ first entry in
> your product.

I will try to use a mutable vector to see if I can speed the thing up
further, since I do indeed want to update my vector one element at a
time, taking into account the elements 4 nearest and next-to-nearest
neighbours (as you can see, every iteration of the code acts on the
updated vector).

I just decided to go with an unmutable structure (a list at first,
followed by a Data.Map structure), in order to work without side
effects.

I'd be happy to get more input about how to increase the performance of
my code!

Cheers

Janis
-------------- next part --------------
	Wed Dec 12 15:42 2012 Time and Allocation Profiling Report  (Final)

	   vec +RTS -p -RTS

	total time  =       12.46 secs   (12463 ticks @ 1000 us, 1 processor)
	total alloc = 8,080,059,488 bytes  (excludes profiling overheads)

COST CENTRE MODULE  %time %alloc

neighProd   Main     27.7    5.9
iter        Main     25.1    9.9
randVec.vs' Main     20.5   84.2
lukUp       Main     15.2    0.0
randVec     Main     11.5    0.0

                                                                     individual     inherited
COST CENTRE     MODULE                             no.     entries  %time %alloc   %time %alloc

MAIN            MAIN                                66           0    0.0    0.0   100.0  100.0
 main           Main                               133           0    0.0    0.0   100.0  100.0
  main.vs       Main                               141           1    0.0    0.0     0.0    0.0
  iter          Main                               134    10000001   25.1    9.9   100.0  100.0
   lukUp        Main                               138    10000000    0.0    0.0     0.0    0.0
   neighProd    Main                               136    10000000    0.0    0.0     0.0    0.0
   randVec      Main                               135    10000000   11.5    0.0    74.9   90.1
    randVec.vs' Main                               143    10000000   20.5   84.2    20.5   84.2
    neighProd   Main                               139           0   27.7    5.9    42.8    5.9
     lukUp      Main                               140    40000000   15.2    0.0    15.2    0.0
 CAF            Main                               131           0    0.0    0.0     0.0    0.0
  main          Main                               132           1    0.0    0.0     0.0    0.0
   main.vs      Main                               142           0    0.0    0.0     0.0    0.0
   main.l       Main                               137           1    0.0    0.0     0.0    0.0
 CAF            GHC.IO.Encoding                    112           0    0.0    0.0     0.0    0.0
 CAF            GHC.IO.Handle.FD                   110           0    0.0    0.0     0.0    0.0
 CAF            GHC.Conc.Signal                     96           0    0.0    0.0     0.0    0.0
 CAF            GHC.IO.Encoding.Iconv               93           0    0.0    0.0     0.0    0.0
 CAF            GHC.Integer.Logarithms.Internals    74           0    0.0    0.0     0.0    0.0