Replacement for GMP: Update

Peter Tanski p.tanski at gmail.com
Wed Jan 3 00:35:10 EST 2007


On Dec 29, 2006, at 8:32 AM, Thorkil Naur wrote:
> On Friday 01 September 2006 06:43, Peter Tanski wrote:
>> ...
>> For a brief overview of the speed of the libraries I looked at
>> carefully, see
>> http://hackage.haskell.org/trac/ghc/wiki/ReplacingGMPNotes
>> (I added a few charts to show the speed of ARPREC, OpenSSL's BN, GMP
>> and LibTomMath.  ....)  I tested GMP and OpenSSL's BN for reference.
>
> I must confess that it took me a while to understand what you were  
> doing here
> and I am still uncertain: For example, how many multiplications  
> were actually
> performed for bit size 4096?

Bit size 4096 is size index 4 (counting from zero:
sizes[5] = {256, 512, 1024, 2048, 4096}), so the test ran
1,000,000 / (4 * 3) = 83,333 rounds.
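
To make the arithmetic concrete, here is a minimal sketch of that
round-count scaling in C++.  Only the sizes[] array and the
1,000,000 / (4 * 3) figure come from the tests quoted above; the loop
and everything else are reconstructions, not the actual harness.

// Sketch only: reconstructs the round-count scaling from the figures
// quoted above; the real test harness was more involved.
#include <cstdio>

static const int sizes[5] = {256, 512, 1024, 2048, 4096};

int main()
{
    for (int i = 1; i < 5; ++i) {        // index 0 would divide by zero
        int rounds = 1000000 / (i * 3);  // e.g. index 4 -> 83,333 rounds
        std::printf("bit size %5d: %d rounds\n", sizes[i], rounds);
    }
    return 0;
}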

My reason for doing this was simple: as a basic comparison, rather
than an absolute measurement, the number of rounds doesn't matter as
long as the results are measurable.  Besides, more rounds means more
time running each test.  I did a lot of tweaking, especially with
ARPREC, to get each library to (1) a generally representative speed
and (2) a configuration similar to what it would be when used with
GHC.  I could have measured the speed in nanoseconds, with one
iteration per calculation on random numbers of a specified size, and
posted the mean over a number of trials, but that would have required
OS X-specific profiling software such as Shark to get reliable
measurements--a bit more labour-intensive, since it would have
required me to perform each test manually.  (I did use random numbers
of a specified size.)

> In addition, for "Powers", the markings on the
> horizontal axis ("256 pow(n,3)", "512 pow(n,4)", "1024 pow 
> (n5)" (missing a
> "," here?), ...) on your graph seem to indicate that you are  
> changing two
> different variables (the bit size and the exponent) at the same time.

Yes, the testing is a bit sloppy there (so is the graph; that is an
ugly typo).  The graph shows a general trend more than anything else.
I actually tested the exponentials (Powers) individually for each
size and timing, but posting a 3-D graph, or plotting time against
exponent/size, seemed like overkill and would have conflicted with
the previous measurements.  Not a bad idea, though, just for clarity.

> I would suggest that you also quoted the original measurements  
> directly. And perhaps
> (especially in case of the "Powers" and "Division") some more  
> details about
> what was actually measured.

I did quote the original measurements directly.  There wasn't much
variance overall, and I took what seemed like median results from a
number of tests.  What matters is the relative time to initialise and
perform each computation, since in a GHC implementation each
computation would require some minimal initialisation.  ARPREC was
built for this, and in ARPREC-only tests the major factor in the
speed of initialisation was the time to compute the architecture- and
precision-specific constants for PI, Log_2 and Log_10 (the Epsilon
constant doesn't require much time).  Log_2 and Log_10 are necessary
for printing Integers because computations in ARPREC are performed as
floating-point values and must be converted to decimal digits by
multiplying by Log_10.  (Note that printing Integers also requires a
size increase, since the mantissa holds the Integer value and must be
further multiplied out by the float exponent.)

Details on differences between algorithms used in each library would  
be fairly complex: as you already know, each library (ARPREC, GMP,  
LibTomMath, etc.) uses a different algorithm based on the size of the  
number-array *and* each may have a different implementation of an  
algorithm--LibTomMath uses a simpler Karatsuba algorithm, for example.
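
As an illustration of the kind of algorithmic difference involved,
here is a minimal Karatsuba sketch on a single 32x32 -> 64-bit
multiply, splitting each operand into 16-bit halves.  It only shows
the recurrence; it is not LibTomMath's implementation, which works on
arrays of digits and falls back to a "baseline" multiply below a size
cutoff.

// Karatsuba recurrence demonstrated on one 32x32 -> 64-bit multiply:
// x*y = z2*2^32 + (cross terms)*2^16 + z0, with the cross terms
// computed as (x1+x0)(y1+y0) - z2 - z0, i.e. three multiplications
// instead of four.  A bignum library applies this recursively to
// digit arrays above some size threshold.
#include <cassert>
#include <cstdint>

uint64_t karatsuba32(uint32_t x, uint32_t y)
{
    uint32_t x1 = x >> 16, x0 = x & 0xFFFF;
    uint32_t y1 = y >> 16, y0 = y & 0xFFFF;
    uint64_t z2 = (uint64_t)x1 * y1;                          // high * high
    uint64_t z0 = (uint64_t)x0 * y0;                          // low * low
    uint64_t z1 = (uint64_t)(x1 + x0) * (y1 + y0) - z2 - z0;  // cross terms
    return (z2 << 32) + (z1 << 16) + z0;
}

int main()
{
    assert(karatsuba32(123456789u, 987654321u) ==
           (uint64_t)123456789u * 987654321u);
    return 0;
}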

> It is distinctly odd that squaring seems to be more expensive than
> multiplication for some bit sizes in 3 of the 4 measured cases.

This is also an implementation matter: the particular algorithms  
change with size and squaring may require some extra initialisation  
for, say, computing the size of the result and the number of  
operations.  All of the libraries provide special squaring functions  
and I used those.  LibTomMath is a good example: it uses its  
"baseline" squaring algorithm for small sizes and Comba-optimised  
Toom-Cook or Karatsuba algorithms for larger sizes. (I purposely did  
not tune LibTomMath or any of the libraries because I wanted a more- 
or-less average-case comparison, so the Karatsuba algorithm was used  
for size=512 bytes (128 digits) and Toom-Cook was used for size=2048  
bytes (512 digits).)  So where you see LibTomMath's time dip in the  
middle of the 'Squaring' graph it is using the Karatsuba algorithm.   
ARPREC uses an FFT for everything above size=256 and calculates with
fewer actual digits (it stores extra "size" as an exponent, just like  
ordinary floating point numbers).  The trends in the graphs for  
ARPREC versus GMP generally hold true until GMP's FFT kicks in, at  
size > 32768.
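
The size-based selection itself looks roughly like the caricature
below.  The cutoff values are invented for illustration; they are not
LibTomMath's or GMP's tuned thresholds.

#include <cstdio>

enum SqrAlgo { Baseline, Karatsuba, ToomCook, FFT };

// Caricature of size-based algorithm selection for squaring; real
// libraries tune these cutoffs per platform.
SqrAlgo chooseSquaring(int digits)
{
    if (digits < 128)  return Baseline;    // schoolbook / Comba
    if (digits < 512)  return Karatsuba;
    if (digits < 8192) return ToomCook;
    return FFT;
}

int main()
{
    std::printf("128 digits -> algorithm %d\n", chooseSquaring(128));
    return 0;
}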

> Also, I
> wonder what divisions these libraries are capable of carrying out  
> so much
> faster than comparable multiplications. For the division  
> measurements, I may
> again simply be presuming something about your experiments that  
> simply isn't
> true.

As for division, here is the short answer why ARPREC was so much
slower than the rest and why LibTomMath was so fast where a
corresponding LibTomMath multiplication was relatively slower.
ARPREC computes division as long division for small operands
(size=256) and as multiplication by the reciprocal (Newton-Raphson
iteration followed by Karp's trick to eliminate trailing errors in
the mantissa) for larger operands (>=512): that is why ARPREC is so
slow at size=256 and grows progressively faster at greater sizes.
LibTomMath's division relies on subtraction at first to compute the
approximate number of iterations--it computes the *entire* number as
an approximation--rather than using GMP's recursive division from the
paper C. Burnikel and J. Ziegler, "Fast Recursive Division" (1998),
http://citeseer.ist.psu.edu/correct/246850 .  For both LibTomMath and
ARPREC, if the divisor is greater than the dividend the integer
"division" (without remainder) is very fast since the result is 0, so
for those two I had to check that the random-number divisor was less
than the dividend before timing the calculation.
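
The reciprocal approach is easy to show in ordinary double precision.
This is only the recurrence x_{k+1} = x_k * (2 - d * x_k) that
division-by-reciprocal is built on; a bignum library runs the same
iteration on its own representation, roughly doubling the number of
correct digits per step and then (as with Karp's trick) cleaning up
the low-order error.

// Newton-Raphson reciprocal iteration in plain double precision,
// converging quadratically to 1/d from a crude initial guess.
#include <cstdio>

int main()
{
    double d = 7.0;
    double x = 0.1;                    // crude initial guess for 1/7
    for (int k = 0; k < 6; ++k) {
        x = x * (2.0 - d * x);         // x_{k+1} = x_k * (2 - d * x_k)
        std::printf("step %d: x = %.17g\n", k, x);
    }
    std::printf("d * x = %.17g (should be ~1)\n", d * x);
    return 0;
}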

>> ...
>> I keep talking about ARPREC--why?  For three reasons:
>> (1) I trust its level of precision--this has been tested, see FFTW's
>> benchmark page for accuracy:
>> http://www.fftw.org/accuracy/G4-1.06GHz-gcc4/
>
> I am not sure I understand here: With respect to Haskell Integers,  
> I would not
> say that there is any concept of a "level of precision": If Integer
> computations are not accurate, they are wrong.

FFTs must be performed with floating-point or fixed-point numbers
(unless you want to deal with some mad algorithm involving continued
fractions): there will always be some approximation, so I always
consider precision.  The same might be said of division with
remainder under some algorithms.  All of this holds especially true
when you must round the approximation to what users would expect to
be an exact integer.
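
That rounding step is the part I worry about.  A toy illustration of
the concern (the coefficient values here are invented): every
convolution coefficient comes back from a floating-point FFT as a
double and must be snapped to the nearest integer, and the result is
only trustworthy if the error is comfortably below 0.5.

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    // Pretend these came back from an inverse FFT of a digit
    // convolution; in an exact result each would be near an integer.
    std::vector<double> coeffs = {12.000000001, 34.999999998, 7.0000000003};
    for (double c : coeffs) {
        double r   = std::nearbyint(c);
        double err = std::fabs(c - r);
        if (err > 0.25)                // safety margin well below 0.5
            std::printf("coefficient %g is too far from an integer\n", c);
        else
            std::printf("%.10f -> %.0f (err %.2e)\n", c, r, err);
    }
    return 0;
}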

> Again, some additional details about these measurements would be  
> most welcome.

I hope the above explanations help a little.

> I looked briefly at ARPREC: It seems that it works with an  
> explicitly set
> maximum bound on the number of digits desired. Although it also  
> appears that
> it is possible to change this bound as computations proceed, such  
> additional
> administration seems difficult to embrace.

The size relation for operations corresponds to magnitude, so
addition and subtraction operations grow or shrink linearly while
multiplication, division and powers (exponents) change
logarithmically (roughly, log_2).  It would add an extra check on
initialisation for every operation (or for a string of operations if
they were chained together), but it would not be a large price.  This
is actually done internally by all the other libraries anyway.  The
simple solution is to put the check into each function in the ARPREC
file c_mp.cpp, as in the sketch below.
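
A minimal sketch of the kind of check I mean follows.  The names
'currentWords' and 'raisePrecision' are hypothetical stand-ins for
whatever ARPREC keeps internally, and the wrapper is not an actual
c_mp.cpp entry point; the point is only that each wrapper would grow
the working precision before an operation whose result might not fit.

#include <cstdio>

static int currentWords = 64;           // assumed working precision

static void raisePrecision(int words)   // assumed hook into the library
{
    if (words > currentWords)
        currentWords = words;           // real code would also recompute
                                        // the cached constants
}

// The kind of wrapper each function in c_mp.cpp could get.
void c_mp_mul_checked(int xWords, int yWords /*, operands, result */)
{
    raisePrecision(xWords + yWords + 1);  // a product needs roughly the
                                          // sum of the operand sizes
    // ... call the underlying multiplication here ...
}

int main()
{
    c_mp_mul_checked(32, 48);
    std::printf("working precision is now %d words\n", currentWords);
    return 0;
}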

>> This is the problem I face: GHC unfortunately does not use Integer as
>> a mathematical operation but as a primitive type, complete with
>> bitwise operations.
>
> I do not understand what problem you are referring to here.

It is both an aesthetic and an implementation problem.  If Integer
were a mathematical library, the primary consideration would be
mathematical operations and the user would choose operations to
perform a more complex calculation or series of calculations.  With
GMP or LibTomMath you may know the general size of the operands in
your program and use different functions for multiplication,
division, squaring, powers, roots or other things like modular
arithmetic and number theory.  As a pedestrian primitive, Integer
must support the basic operations (and only the basic operations) of
a primitive type.  All multiplication goes through one function and
the user relies on the library to choose the appropriate algorithm;
the user cannot select a special multiplication algorithm because he
knows the numbers will be, say, larger than 256 bits in size or
modularly related.  This puts the burden on the compiler to optimise
the library for the user, if possible.  Meanwhile, as a primitive,
Integer in Haskell supports only a basic set of operations: any extra
operations would have to be provided in a separate library.  (Though
I may add extra primitive operations to GHC, this would be
nonstandard.)  As an aesthetic consideration, I mentioned bitwise
operations because they are atomic, primitive operations typically
associated with machine types (int32_t, uint64_t, etc.).

>> Modifying ARPREC will  result in a dynamic library (or DLL) that  
>> outputs simple arrays of  doubles--this is how ARPREC currently  
>> functions; there is no need to  wrap a C++ class in an opaque C  
>> pointer you cannot operate on  independently. The huge downside to  
>> ARPREC and one of the least  trivial of my problems is how to  
>> efficiently convert an array of  doubles to precise integers,  
>> especially for cryptographic purposes.
>
> Could you explain exactly what this problem is? If it still makes  
> sense to
> spend time on it, of course.

This point is somewhat moot since I am no longer using ARPREC :)  The
first problem is opaque data types from C++/C.  If ARPREC stored the
resulting number as a class object with members of different types
(ints and doubles), like this:

// apint.hpp

class APInt {
public:
// ...

private:

	int size;
	double exponent;
	double* mantissa;
};

Then, in a separate .cpp file I would create functions callable from C:

// c_apint.h

extern "C" {

APInt*
newAPInt (int size, ...);

void
destroyAPInt (APInt* x);

int
addAPInt (APInt* x, APInt* y, APInt* res);

// ...

} // end extern "C"

The way to work with this in C (and Haskell) would be an opaque pointer:

typedef struct APInt* pAPInt;

From C, I could never access the members of the struct directly, so
any special Haskell functions for modifying those members would have
to be special functions in the C-interface file, 'c_apint.h'.
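
Concretely, all a caller on the C side can do with that interface is
shuttle the opaque pointer between the entry points declared in
'c_apint.h'.  A sketch (it assumes 'c_apint.h' also forward-declares
the APInt type for non-C++ callers, so it is not compilable against
the fragments above as given):

#include "c_apint.h"

void example(void)
{
    APInt* x = newAPInt(4 /* , ... */);
    APInt* y = newAPInt(4 /* , ... */);
    APInt* r = newAPInt(8 /* , ... */);

    addAPInt(x, y, r);     // r = x + y; status comes back in the int

    // No direct access to size, exponent or mantissa is possible
    // here--every such operation needs its own entry point.
    destroyAPInt(x);
    destroyAPInt(y);
    destroyAPInt(r);
}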

An additional problem, somewhat related to the opaque-object problem,
is converting an array of doubles to a (precise) array of uint32_t.
Individual conversions between doubles and ints are expensive and you
have to be very careful about rounding.  Alternatives that avoid
rounding are:

(1) normalise the arbitrary precision double (and all the doubles in  
the array) and use a union to access the mantissa of each double in  
the array directly, and convert each to a uint32_t while carrying the  
overflow from each mantissa to the next uint32_t; or,

(2) convert the array of doubles to a string and convert the string  
to an array of uint32_t.

Both methods are expensive--you would have to multiply the arbitrary  
precision double by log_10 and normalise it, to say nothing of the  
conversion.  Worse, for Haskell Integers some conversion would be  
necessary, unless you want to use arithmetic means to perform the  
bitwise primitive operations.
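
Option (1) hinges on being able to get at the bits of each double
directly.  A sketch of that extraction for a single IEEE-754 double
follows (the 64-bit/52-bit-mantissa layout is the usual assumption;
in strict C++ the memcpy is the well-defined way to do the type pun
that a union performs):

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
    double d = 123456789.0;

    uint64_t bits;
    std::memcpy(&bits, &d, sizeof bits);   // reinterpret the 64 bits

    uint64_t mantissa =  bits        & ((1ULL << 52) - 1);
    int      exponent = (int)((bits >> 52) & 0x7FF) - 1023;

    // For a normalised double the integer value is (2^52 + mantissa)
    // shifted by (exponent - 52); carrying that, with overflow, into
    // an array of uint32_t is the conversion described above.
    std::printf("mantissa = 0x%013llx, exponent = %d\n",
                (unsigned long long)mantissa, exponent);
    return 0;
}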


>>> I better be specific here: What I do is integer factorization  
>>> (splitting integers into their prime factors). And, for example,  
>>> I have implemented a version of the ECM (Elliptic Curve Method)  
>>> in Haskell. Having done that (independently and mainly to learn  
>>> what it was about), I eventually compared it in performance to  
>>> GMP-ECM which Paul Zimmermann wrote. And I was initially  
>>> disappointed to find that GMP-ECM was about 7 times faster than  
>>> my  ECM code. Fortunately, various sessions of "hand-waving"  
>>> brought this factor down to about 1.5. So, whenever GMP-ECM used  
>>> 2 minutes, my program would use 3 minutes.
>>
>> One of the reasons I am working on a library written in C++ is that
>> Haskell may be too slow for such things.
>
> I am not sure what you mean here. I would happily accept a price  
> of, say, 10
> or 20 percent reduced performance, if I got the ease of programming  
> (mostly)
> in Haskell in return. And I am still not convinced that, in a  
> computation
> that involves heavy Integer work, multiplications, divisions,  
> greatest common
> divisor computations with, say, 200-1000 digit numbers, that the  
> Haskell
> reduced performance price could not be kept as low as this.

You could be right--at present I am fairly pessimistic about the
various overheads imposed by Haskell for such things as arrays, but
that does not mean I am correct.  You may have noticed my mention of
the BigFloat library in a somewhat related thread, "bignums, gmp,
bytestring ... ?," at
http://www.haskell.org/pipermail/glasgow-haskell-users/2006-November/011601.html .
BigFloat <http://bignum.sourceforge.net/> is an arbitrary precision
library Martin Guy originally wrote in Miranda and ported to Haskell:
it is a good library but a little slow (see the timing results for
the Many Digits "Friendly" Competition, where BigFloat competed as
Bignum, at
http://www.cs.ru.nl/~milad/manydigits/final_timings_rankings.html ).
Of course this is only one library and it is performing some complex
calculations.

It is interesting to note that BigFloat *is* precise and uses [Int]  
for the mantissa.  There is a note on the BigFloat page about coming  
close to finding the highest computed power of 2 after more than a  
month of computation (before someone turned the power off).

Of course, you are correct in assuming some of this is moot.  I am no
longer working on a C++ library: the library is written in C with
some assembler, and in my spare time I have been toying with a hybrid
Haskell-C system where the Haskell portion works to optimise basic
bits of C code.  (I mentioned the idea for this in the November 2006
thread linked above.)  This should give users the full flexibility of
Haskell with the speed of C, though writing it portably is proving to
be a real hassle.  I have gotten a little side-tracked porting GHC to
be Windows-native in my quest to use a better compiler than gcc 3.4
(the best available for MinGW), namely CL, but it is all to the good.
I had a lot to learn about Cmm and the Native Code Generator
(NCG)--at one time I thought I knew something about Cmm; that was
naive.  There is a lot of room for NCG optimisations, such as
auto-vectorisation and better support for floating-point exceptions.

>> My best question is: are you using pure mathematical formulas for
>> this?
>
> I am not sure what this question means. Please elaborate.

It was a question out of curiosity: for your ECM work, are you able  
to implement the mathematical formulae directly in Haskell?  (Yes, I  
believe "formulae" is the correct plural of "formula.")  C programs  
require many special operations for dealing with the data and a fast  
Haskell program would probably have to use a few (maybe I am a bad  
Haskell programmer?).

>> If you are using pure mathematical functions, ARPREC is fine (at  
>> least it is precise).  Mathematically, there are alternative  
>> algorithms for floating point numbers that are not available for  
>> integers...
>
> Again, I am not sure what you mean by this. It is certainly  
> possible to
> compute useful inverses of integers without using a floating point
> representation. To be sure, the inverse (reciprocal) of an integer  
> is not an
> integer, but it could still be computed in a suitable fixed point
> representation. See, for example, Knuth's "Seminumerical  
> Algorithms" ("The
> art of computer programming", vol 2), end of section 4.3.1 and  
> algorithm R in
> section 4.3.3.

Fixed point is certainly an option, but it is not a "natural"
operation over ints and may add more overhead than necessary,
considering the relative speed advantages of modern floating-point
processors.  (By "not available," I was talking about hardware
support, though I'll admit my typo-riddled prose may have thrown you
off--"Mathematically," may have been an artifact.)  None of the
fastest integer division algorithms I am aware of uses fixed point
for division by inverse multiplication.

> I still believe that the excessive memory usage was caused by GMP  
> allocating a
> lot of storage using the GHC allocator during a single mpz_powm  
> call, so that
> the GHC garbage collector never got the opportunity to actually  
> free the
> allocated storage. The mpz_powm function of gmp-4.2 that I am  
> looking at
> calls mpn_sqr_n inside its main loop and that function itself  
> allocates
> storage in some cases. In other cases, it calls mpn_mul_fft_full  
> which,
> again, allocates storage.

O.K.  I misunderstood your program: I was under the impression that  
you wrote the GMP program in C.  That is an interesting insight  
concerning the GHC garbage collector.

> so it may very well be that some of the above remarks are not  
> relevant any
> more. In addition, you wrote
> (http://www.haskell.org/pipermail/glasgow-haskell-users/2006-November/011601.html):
>
>> ...
>> Integers are mostly a convenience.
>> ...
>
> ...
>
> My highest priority is still correctness. The thought that there  
> might, in
> some special case, have crept an error into my basic arithmetic  
> operators
> makes me shudder. In the present case of multiprecision integer  
> operations,
> what worries me is the complications and the many special cases that
> implementors deal with. And not least special cases that address some
> particular architecture: This opens the possibility that code  
> working on one
> machine stops working on another.

The general trend for floating- or fixed-point libraries that seek
precision is low to medium speed; there are many libraries out there
that are faster but not as precise.  One example is the Boost uBLAS
library: it is slower than BLAS but more precise.

> It seems that my level of tolerance of such risks is not as high as  
> most other
> people's. Personally, if I were to write a multiprecision library  
> myself, I
> would build in all sorts of possibilities for verifying that  
> everything
> remains in place and results computed were indeed correct.

Ideally, these should be handled by the compiler system rather than
by run-time checks in code.

> For example, I understand that memory errors occur, not often, but  
> often
> enough to be significant for such longwinding computations (days,  
> months,
> years) that I envision. Other people are doing long computations (e.g.
> http://www.mersenne.org/) and they exercise some care in the form  
> of test
> programs that are intended to verify that a particular machine is
> sufficiently stable. I would probably carry this even further, by  
> including,
> even in "production" runs, various checks (individual arithmetic  
> operations
> verified by checks modulo some number, say).

That is an excellent point.  No arbitrary (multi) precision library I  
know of provides a facility for performing such checks.  It would be  
a nice thing to add.
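
For what it's worth, the check itself is cheap to sketch: verify each
result against the same operation carried out modulo a small number.
Plain 64-bit integers are used below just to show the idea (and the
check modulus is arbitrary); a bignum library would run the residue
arithmetic over its own digit arrays.

#include <cassert>
#include <cstdint>
#include <cstdio>

static const uint64_t M = 1000003;     // any convenient check modulus

uint64_t checkedMul(uint64_t a, uint64_t b)
{
    uint64_t product = a * b;          // the "real" operation; operands
                                       // chosen so this does not wrap
    // Residue check: (a*b) mod M must equal ((a mod M)*(b mod M)) mod M.
    assert(product % M == ((a % M) * (b % M)) % M);
    return product;
}

int main()
{
    std::printf("%llu\n", (unsigned long long)checkedMul(123456789u, 3333u));
    return 0;
}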

> Another thing that bothers me is the unpredictability of the  
> performance of
> these multiprecision packages. When I look at the GMP code, there are
> countless special cases that are taken into account. Each of them,  
> of course,
> represents the implementor's desire to provide me, the user, with  
> the best
> service (performance) possible. But it makes measuring performance a
> nightmare, because you never know, for sure, whether some special  
> case has
> been taken into account in a particular situation.

That is one reason I kept the comparison of the different libraries
very basic and general.  The best performance comes from special
tuning tied to a particular equation, as with FFTW.

> This was brought home to me most forcefully some months back: I have
> GHC-compiled programs running on a couple of Linux machines (not  
> the same
> Linux distribution), but had used some precompiled GMP package on  
> both of
> these machines. After looking at the GMP code, I realised (as I  
> should have
> much earlier, of course) that I should use a package that was  
> compiled for
> the specific machine I was working on to ensure that all the  
> various special
> performance tricks could come into play. So I did. And performance  
> improved
> on one machine, but degraded on another. I am still not sure what the
> explanation of this behaviour really is. But one thing I am certainly
> concerned about is whether, in fact, the GMP build process is  
> actually able
> to guess some machine architecture correctly and ensure that its  
> potential is
> used to the fullest extent.

This is a very big problem with every fast mathematical library.  The  
GMP build system is primitive (it tends to rely on assembler blocks  
written for particular types of machines but does not perform further  
testing).  You know how ATLAS uses a specialised build process to  
automatically optimise code for a particular machine and  
architecture.  It is very delicate and once you have built the  
library as a developer you cannot expect the same performance if you  
package the library with a program designed for one operating system  
and many different (but related) machines: a program for OS X 10.4  
built on a ppc7450 (G4) will run differently on a G3 or G5.  My own
problem involves the different variations of AltiVec and SSE on
different PPC and Intel/AMD processors, especially if I include any
assembler code, but also if the library is compiled by a different
compiler or compiler version.  I should probably make precision the
highest priority, portability second and speed third.

> One thing that seems missing in many packages is the possibility to  
> experiment
> with the tuning parameters more dynamically. For example, GMP  
> includes a
> tuning program that is able to suggest values for certain  
> parameters. But
> these parameters then have to be compiled into the package for  
> final use. Not
> particularly elegant and something which, surely, could be changed,  
> without
> significant overhead.

GMP is unfortunately an open-source project, not an academic or
commercial one: some developer would have to spend the time doing the
tedious work of weighing the different parameter combinations so that
the tuning program could determine the best-performing choices, apply
them and report back.

> The conclusion, for me at least, is this: If you are a naive user  
> of Integers,
> you are much better off with a completely simple, no-special-cases,  
> solid and
> working, multiprecision package. Which might even be fairly portable.
>
> For heavy users (such as me) with special requirements, it seems  
> that they can
> never get the ultimate performance without having considerable  
> insight into
> the details of the matter anyway. And, hence, they need to do some  
> complex
> work themselves to get the desired performance.

A hybrid solution that gave basic Integer functionality but allowed  
for variations in a special library would be better.  Integer may be  
a Primitive but it should really be a separate Haskell library.   
Claus Reinke had a good comment about not automatically importing  
Integer when a user doesn't need it.

> For Haskell, this would mean ensuring that the interface between  
> GHC-compiled
> code and some multiprecision package was reasonably comprehensible and
> allowed the multiprecision support to be replaced. And here I try  
> to avoid
> the usual dream of "easy replacement", that seems entirely  
> unrealistic. But
> something reasonable that the rest of GHC (and not least: The GHC  
> developers)
> can live with without too much work. And that heavy users, with  
> sufficient
> insight and probably a significant dose of hard work (such as that  
> you have
> put into this) would be able to use for their own purposes.

GMP *would* have been easy to replace but, like so much else in GHC,  
it was hacked in and then hacked in again.  No offense to those who  
have spent more than a decade working on GHC--there is still working  
code from the early 90's, at least in the Haskell libraries--but I  
guess it suffers from how fun and easy it is to write in Haskell and  
the comparative drudgery of cleaning things up.  Modularity is key.   
I really have to stop playing around and post a prototype so people  
(like you) can have something to work with and criticise :)

Cheers,
Pete


