Replacement for GMP: Update
Peter Tanski
p.tanski at gmail.com
Wed Jan 3 00:35:10 EST 2007
On Dec 29, 2006, at 8:32 AM, Thorkil Naur wrote:
> On Friday 01 September 2006 06:43, Peter Tanski wrote:
>> ...
>> For a brief overview of the speed of the libraries I looked at
>> carefully, see http://hackage.haskell.org/trac/ghc/wiki/
>> ReplacingGMPNotes (I added a few charts to show the speed of
>> ARPREC, OpenSSL's BN, GMP and LibTomMath. ....) I tested GMP and
>> OpenSSL's BN for reference.
>
> I must confess that it took me a while to understand what you were
> doing here
> and I am still uncertain: For example, how many multiplications
> were actually
> performed for bit size 4096?
4096 is size index 4 (counting from zero: sizes[5] = {256, 512, 1024, 2048, 4096}), so:
1,000,000 / (4 * 3) = 83,333 rounds
My reason for doing this was simple: as a basic comparison, rather
than an absolute measurement, the number of rounds doesn't matter as
long as the results are measurable. Besides, more rounds means more
time running each test.  I did a lot of tweaking, especially with
ARPREC, to get each library to (1) the speed of a generally available
(untuned) build and (2) a configuration similar to what it would be
when used with GHC.  I could have measured the speed in nanoseconds,
with one iteration for each calculation using random numbers of a
specified size, and reported the mean over a number of trials, but
that would have required me to use OS X-specific profiling software
like Shark in order to get reliable measurements--a bit more labour
intensive, as it would require me to manually perform each test.  (I
did use random numbers of a specified size.)
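To make that setup concrete, each test had roughly the following shape
(a minimal sketch only: rand_operand and mul_under_test are placeholders
for the real random-number generator and the library call being timed,
and the round-count scaling for the smaller size indices is illustrative
rather than exactly what I ran):

/* timing-loop sketch -- illustrative only */
#include <cstdio>
#include <ctime>

static const int sizes[5] = { 256, 512, 1024, 2048, 4096 };   /* bits */

static void rand_operand(int bits)   { (void)bits; /* fill operands here */ }
static void mul_under_test(int bits) { (void)bits; /* library call here  */ }

int main(void)
{
    for (int i = 0; i < 5; ++i) {
        long rounds = 1000000L / (i == 0 ? 1 : i * 3);  /* 83,333 at i == 4 */
        rand_operand(sizes[i]);
        std::clock_t start = std::clock();
        for (long r = 0; r < rounds; ++r)
            mul_under_test(sizes[i]);
        double secs = (double)(std::clock() - start) / CLOCKS_PER_SEC;
        std::printf("%4d bits: %ld rounds, %.3f s\n", sizes[i], rounds, secs);
    }
    return 0;
}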
> In addition, for "Powers", the markings on the
> horizontal axis ("256 pow(n,3)", "512 pow(n,4)", "1024 pow
> (n5)" (missing a
> "," here?), ...) on your graph seem to indicate that you are
> changing two
> different variables (the bit size and the exponent) at the same time.
Yes, the testing is a bit sloppy there (so is the graph; ugly typo).
The graph shows a general trend more than anything else.  I actually
tested the Exponentials (Powers) individually, timing each size and
exponent, but posting a 3-D graph (or plotting time against
exponent/size) seemed like overkill and would have conflicted with the
previous measurements.  Not a bad idea, though, just for clarity.
> I would suggest that you also quoted the original measurements
> directly. And perhaps
> (especially in case of the "Powers" and "Division") some more
> details about
> what was actually measured.
I did quote the original measurements directly. There wasn't much
variance overall and I took what seemed like median results from a
number of tests. What matters is the relative time to initialise and
perform each computation, since in a GHC implementation each
computation would require some minimal initialisation.  ARPREC was
built for this, and in ARPREC-only tests the major factor in the speed
of initialisation was the time to compute the architecture- and
precision-specific constants for PI, Log_2 and Log_10 (the Epsilon
constant doesn't require much time).  Log_2 and Log_10 are necessary
for printing Integers because computations in ARPREC are performed as
floating-point values and must be converted to decimal digits by
multiplying by Log_10.  (Note that printing Integers also requires a
size increase, since the mantissa holds the Integer value, requiring
further multiplication by the float-exponent.)
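(If it helps to see why a logarithm constant is involved at all: an
n-bit Integer needs about n * log10(2) decimal digits, so any binary or
floating-point representation has to do some logarithmic bookkeeping
before printing.  A trivial sketch of the digit-count estimate only--this
is not ARPREC's conversion code:)

/* decimal digit count for an n-bit integer -- illustration only */
#include <cmath>
#include <cstdio>

int main(void)
{
    for (int bits = 256; bits <= 4096; bits *= 2) {
        int digits = (int)std::floor(bits * std::log10(2.0)) + 1;
        std::printf("%4d bits -> at most %d decimal digits\n", bits, digits);
    }
    return 0;
}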
Details on differences between algorithms used in each library would
be fairly complex: as you already know, each library (ARPREC, GMP,
LibTomMath, etc.) uses a different algorithm based on the size of the
number-array *and* each may have a different implementation of an
algorithm--LibTomMath uses a simpler Karatsuba algorithm, for example.
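(For reference, the Karatsuba identity itself is tiny: split each
operand in half and trade one of the four half-size multiplications for
a few additions.  Here is a self-contained sketch on a single
64x64-bit product using 32-bit halves and the gcc/clang __int128
extension--real bignum code, LibTomMath's included, applies the same
identity recursively to whole digit arrays once a size cutoff is
passed:)

/* Karatsuba on one 64x64 -> 128-bit product, 32-bit halves */
#include <cstdint>
#include <cstdio>

static unsigned __int128 karatsuba64(uint64_t x, uint64_t y)
{
    uint64_t x1 = x >> 32, x0 = x & 0xffffffffULL;
    uint64_t y1 = y >> 32, y0 = y & 0xffffffffULL;
    unsigned __int128 z2 = (unsigned __int128)x1 * y1;   /* high halves */
    unsigned __int128 z0 = (unsigned __int128)x0 * y0;   /* low halves  */
    /* one multiplication replaces the two cross products x1*y0 and x0*y1 */
    unsigned __int128 z1 = (unsigned __int128)(x1 + x0) * (y1 + y0) - z2 - z0;
    return (z2 << 64) + (z1 << 32) + z0;
}

int main(void)
{
    uint64_t a = 0x123456789abcdef0ULL, b = 0xfedcba9876543210ULL;
    unsigned __int128 p = karatsuba64(a, b);
    unsigned __int128 q = (unsigned __int128)a * b;
    std::printf("karatsuba %s the builtin product\n",
                p == q ? "matches" : "does not match");
    return 0;
}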
> It is distinctly odd that squaring seems to be more expensive than
> multiplication for some bit sizes in 3 of the 4 measured cases.
This is also an implementation matter: the particular algorithms
change with size and squaring may require some extra initialisation
for, say, computing the size of the result and the number of
operations. All of the libraries provide special squaring functions
and I used those. LibTomMath is a good example: it uses its
"baseline" squaring algorithm for small sizes and Comba-optimised
Toom-Cook or Karatsuba algorithms for larger sizes. (I purposely did
not tune LibTomMath or any of the libraries because I wanted a more-
or-less average-case comparison, so the Karatsuba algorithm was used
for size=512 bytes (128 digits) and Toom-Cook was used for size=2048
bytes (512 digits).) So where you see LibTomMath's time dip in the
middle of the 'Squaring' graph it is using the Karatsuba algorithm.
ARPREC uses an FFT for everything above size=256 and calculates with
fewer actual digits (it stores extra "size" as an exponent, just like
ordinary floating point numbers). The trends in the graphs for
ARPREC versus GMP generally hold true until GMP's FFT kicks in, at
size > 32768.
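(The selection itself is nothing more than a size test at the top of
the multiply routine.  The cutoffs and names below are made up for
illustration--they are not LibTomMath's or GMP's actual tuning
constants--but this is the shape of what happens inside every one of
these libraries:)

/* schematic size-based dispatch; cutoff values are illustrative only */
#include <cstdio>

enum Algo { BASELINE, KARATSUBA, TOOMCOOK, FFT };

static Algo chooseMulAlgo(int digits)
{
    const int KARATSUBA_CUTOFF = 80;     /* illustrative values */
    const int TOOMCOOK_CUTOFF  = 350;
    const int FFT_CUTOFF       = 8192;

    if (digits < KARATSUBA_CUTOFF) return BASELINE;   /* schoolbook / Comba */
    if (digits < TOOMCOOK_CUTOFF)  return KARATSUBA;
    if (digits < FFT_CUTOFF)       return TOOMCOOK;
    return FFT;
}

int main(void)
{
    static const char* names[] = { "baseline", "karatsuba", "toom-cook", "fft" };
    const int ds[] = { 32, 128, 512, 16384 };
    for (int i = 0; i < 4; ++i)
        std::printf("%6d digits -> %s\n", ds[i], names[chooseMulAlgo(ds[i])]);
    return 0;
}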
> Also, I
> wonder what divisions these libraries are capable of carrying out
> so much
> faster than comparable multiplications. For the division
> measurements, I may
> again simply be presuming something about your experiments that
> simply isn't
> true.
As for Division, here is the short answer to why ARPREC was so much
slower than the rest and why LibTomMath division was so fast where the
corresponding LibTomMath multiplication was relatively slow.
ARPREC computes division as long division for small operands
(size=256) and as multiplication by the reciprocal (Newton-Raphson
iteration followed by Karp's trick to eliminate trailing errors in the
mantissa) for larger operands (>=512): that is why ARPREC is so slow
at size=256 and grows progressively faster at greater sizes.
LibTomMath division relies on subtraction at first to compute the
approximate number of iterations--it computes the *entire* number as
an approximation--rather than using GMP's recursive division from the
paper C. Burnikel and J. Ziegler, "Fast Recursive Division" (1998),
<http://citeseer.ist.psu.edu/correct/246850>.  For both LibTomMath and
ARPREC, if the divisor is greater than the dividend the "division"
calculation of an integer (without remainder) is very fast since the
result is 0, so for those two I had to check that the random-number
divisor was less than the dividend before calculation.
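(The Newton-Raphson step behind reciprocal-based division is just
x' = x * (2 - d * x), doubling the number of correct digits each time.
A toy double-precision version of the idea, for positive d--ARPREC of
course iterates on its own multiprecision type and finishes off with
Karp's trick:)

/* Newton-Raphson reciprocal: x' = x * (2 - d * x); toy version, d > 0 */
#include <cstdio>
#include <cmath>

static double reciprocal(double d)
{
    double x = 1.0 / std::ldexp(1.0, std::ilogb(d) + 1);  /* rough first guess */
    for (int i = 0; i < 6; ++i)        /* quadratic convergence: a few steps */
        x = x * (2.0 - d * x);
    return x;
}

int main(void)
{
    double d = 123456.789;
    std::printf("1/d = %.15g  (error %g)\n",
                reciprocal(d), std::fabs(reciprocal(d) - 1.0 / d));
    return 0;
}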
>> ...
>> I keep talking about ARPREC--why? For three reasons:
>> (1) I trust its level of precision--this has been tested, see FFTW's
>> benchmark page for accuracy: http://www.fftw.org/accuracy/G4-1.06GHz-
>> gcc4/
>
> I am not sure I understand here: With respect to Haskell Integers,
> I would not
> say that there is any concept of a "level of precision": If Integer
> computations are not accurate, they are wrong.
FFTs must be performed with floating point or fixed point numbers
(unless you want to deal with some mad algorithm involving continued
fractions): there will always be some approximation so I always
consider precision. The same might be said for division with
remainder for some algorithms. All this holds especially true when
you must round the approximation to what users would expect to be an
exact integer.
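(A toy example of the rounding step I mean: the coefficient products of
a floating-point, FFT-style multiply come back as doubles that are only
*near* integers and have to be rounded and carried to recover the exact
result.  The convolution below is a direct O(n^2) loop rather than an
FFT, but the final rounding-and-carry pass is the same:)

/* rounding a floating-point convolution back to exact digits */
#include <cstdio>
#include <cmath>
#include <cstdint>
#include <vector>

int main(void)
{
    const uint32_t BASE = 10000;                /* 4 decimal digits per limb */
    std::vector<uint32_t> a = { 9999, 9999 };   /* 99999999, low limb first  */
    std::vector<uint32_t> b = { 9999, 9999 };

    std::vector<double> conv(a.size() + b.size(), 0.0);
    for (size_t i = 0; i < a.size(); ++i)
        for (size_t j = 0; j < b.size(); ++j)
            conv[i + j] += (double)a[i] * (double)b[j];

    /* round each coefficient to the nearest integer and propagate carries */
    std::vector<uint32_t> result(conv.size());
    uint64_t carry = 0;
    for (size_t k = 0; k < conv.size(); ++k) {
        uint64_t v = (uint64_t)std::llround(conv[k]) + carry;
        result[k] = (uint32_t)(v % BASE);
        carry = v / BASE;
    }
    for (size_t k = result.size(); k-- > 0; )
        std::printf("%04u", (unsigned)result[k]);   /* 9999999800000001 */
    std::printf("\n");
    return 0;
}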
> Again, some additional details about these measurements would be
> most welcome.
I hope the above explanations help a little.
> I looked briefly at ARPREC: It seems that it works with an
> explicitly set
> maximum bound on the number of digits desired. Although it also
> appears that
> it is possible to change this bound as computations proceed, such
> additional
> administration seems difficult to embrace.
The size relation for operations corresponds to magnitude, so
addition and subtraction operations grow or shrink linearly while
multiplication, division and powers (exponents) change
logarithmically (roughly, log_2).  It would add an extra check on
initialisation for every operation (or a string of operations if they
were chained together) but it would not be a large price.  This is
actually done internally by all the other libraries, anyway. The
simple solution is to put the check into each function in the ARPREC
file c_mp.cpp.
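(Something along these lines is all each entry point would need--the
names and the sizing rule here are entirely hypothetical, just to show
how small the check is; a real version would also recompute the
PI/Log_2/Log_10 constants when the bound grows:)

/* hypothetical precision guard -- not ARPREC's actual API */
#include <cstdio>

static int currentPrecisionWords = 128;   /* stand-in for the library bound */

static void ensurePrecision(int requiredWords)
{
    if (requiredWords > currentPrecisionWords) {
        currentPrecisionWords = requiredWords;
        std::printf("grew working precision to %d words\n", requiredWords);
    }
}

static void mulEntryPoint(int aWords, int bWords)
{
    ensurePrecision(aWords + bWords);  /* size the bound from the operands */
    /* ... perform the multiplication ... */
}

int main(void)
{
    mulEntryPoint(100, 100);   /* grows the bound once */
    mulEntryPoint(40, 40);     /* bound already sufficient, no work */
    return 0;
}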
>> This is the problem I face: GHC unfortunately does not use Integer as
>> a mathematical operation but as a primitive type, complete with
>> bitwise operations.
>
> I do not understand what problem you are referring to here.
It is both an aesthetic and an implementation problem. If Integer
were a mathematical library the primary consideration would be
mathematical operations and the user would choose operations to
perform a more complex calculation or series of calculations. For
GMP or LibTomMath you may know the general size of the operands in
your program and use different functions for multiplication,
division, squaring, powers, roots or other things like modular
arithmetic and number theory.  As a pedestrian primitive, Integer must
support the basic operations (and only the basic operations) of a
primitive type.  All multiplication goes through one function and the
user relies on the library to choose the appropriate one.  The user
would not be able to choose a special multiplication algorithm
because he knows the numbers will be, say, larger than 256 bits in
size or modularly related.  This puts the burden on the compiler to
optimise the library for the user, if possible.  Meanwhile, as a
primitive, Integers in Haskell only support a basic set of operations:
any extra operations would have to be provided in a special library.
(Though I may add extra primitive operations to GHC, this would be
nonstandard.)  As an aesthetic consideration, I mentioned bitwise
operations because they are atomic, primitive operations typically
associated with machine types (int32_t, uint64_t, etc.).
>> Modifying ARPREC will result in a dynamic library (or DLL) that
>> outputs simple arrays of doubles--this is how ARPREC currently
>> functions; there is no need to wrap a C++ class in an opaque C
>> pointer you cannot operate on independently. The huge downside to
>> ARPREC and one of the least trivial of my problems is how to
>> efficiently convert an array of doubles to precise integers,
>> especially for cryptographic purposes.
>
> Could you explain exactly what this problem is? If it still makes
> sense to
> spend time on it, of course.
This point is somewhat moot since I am no longer using ARPREC :) The
first problem is opaque data types from C++/C. If ARPREC stored the
resulting number as a Class object with members of different types
(ints and doubles), like this:
// apint.hpp
class APInt {
public:
    // ...
private:
    int     size;
    double  exponent;
    double* mantissa;
};
Then, in a separate .cpp file I would create functions callable from C:
// c_apint.h
extern "C" {
    APInt* newAPInt (int size, ...);
    void   destroyAPInt (APInt* x);
    int    addAPInt (APInt* x, APInt* y, APInt* res);
    // ...
} // end extern "C"
The way to work with this in C (and Haskell) would be an opaque pointer:
typedef struct APInt* pAPInt;
From C, I could never access the members of the struct directly so
any special Haskell functions for modifying those members would have
to be special functions in the c-interface file, 'c_apint.h'.
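(To show what the caller's side looks like: everything goes through the
handle and the struct layout never appears in client code.  The toy
below fakes the implementation half so it compiles on its own, and the
signatures are simplified from the sketch above:)

/* client-side view of the opaque handle -- toy stand-in, not ARPREC */
#include <cstdio>

extern "C" {
    typedef struct APInt APInt;     /* incomplete type: opaque to the caller */
    APInt* newAPInt (int size);
    void   destroyAPInt (APInt* x);
    int    addAPInt (APInt* x, APInt* y, APInt* res);
}

/* ---- implementation half (would live in the C++ .cpp file) ---- */
struct APInt { int size; double exponent; double* mantissa; };

extern "C" APInt* newAPInt (int size)
{
    APInt* p = new APInt;
    p->size = size;
    p->exponent = 0.0;
    p->mantissa = new double[size]();
    return p;
}
extern "C" void destroyAPInt (APInt* x) { delete [] x->mantissa; delete x; }
extern "C" int  addAPInt (APInt*, APInt*, APInt*) { return 0; /* ... */ }

/* ---- what a C (or Haskell FFI) caller can do: only call the API ---- */
int main(void)
{
    APInt* a = newAPInt(8);
    APInt* b = newAPInt(8);
    APInt* r = newAPInt(8);
    addAPInt(a, b, r);
    destroyAPInt(a); destroyAPInt(b); destroyAPInt(r);
    return 0;
}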
An additional problem, somewhat related to the opaque-object problem,
is converting an array of doubles to a (precise) array of uint32_t.
Individual conversions between doubles and ints are expensive and you
have to be very careful about rounding. Alternatives that avoid
rounding are:
(1) normalise the arbitrary precision double (and all the doubles in
the array) and use a union to access the mantissa of each double in
the array directly, and convert each to a uint32_t while carrying the
overflow from each mantissa to the next uint32_t; or,
(2) convert the array of doubles to a string and convert the string
to an array of uint32_t.
Both methods are expensive--you would have to multiply the arbitrary
precision double by log_10 and normalise it, to say nothing of the
conversion. Worse, for Haskell Integers some conversion would be
necessary, unless you want to use arithmetic means to perform the
bitwise primitive operations.
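(For a single limb the conversion is at least safe, because a double
holds integers below 2^53 exactly; the expense described above is in
doing this across a whole normalised array while carrying between
limbs.  A minimal sketch of the easy, single-double case only:)

/* one double holding an exact integer < 2^53 splits losslessly into words */
#include <cstdint>
#include <cstdio>

static void doubleToWords(double d, uint32_t &lo, uint32_t &hi)
{
    uint64_t v = (uint64_t)d;            /* exact for integers in [0, 2^53) */
    lo = (uint32_t)(v & 0xffffffffu);
    hi = (uint32_t)(v >> 32);
}

int main(void)
{
    uint32_t lo, hi;
    doubleToWords(9007199254740991.0, lo, hi);    /* 2^53 - 1 */
    std::printf("hi = 0x%08x, lo = 0x%08x\n", (unsigned)hi, (unsigned)lo);
    return 0;
}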
>>> I better be specific here: What I do is integer factorization
>>> (splitting integers into their prime factors). And, for example,
>>> I have implemented a version of the ECM (Elliptic Curve Method)
>>> in Haskell. Having done that (independently and mainly to learn
>>> what it was about), I eventually compared it in performance to
>>> GMP-ECM which Paul Zimmermann wrote. And I was initially
>>> disappointed to find that GMP-ECM was about 7 times faster than
>>> my ECM code. Fortunately, various sessions of "hand-waving"
>>> brought this factor down to about 1.5. So, whenever GMP-ECM used
>>> 2 minutes, my program would use 3 minutes.
>>
>> One of the reasons I am working on a library written in C++ is that
>> Haskell may be too slow for such things.
>
> I am not sure what you mean here. I would happily accept a price
> of, say, 10
> or 20 percent reduced performance, if I got the ease of programming
> (mostly)
> in Haskell in return. And I am still not convinced that, in a
> computation
> that involves heavy Integer work, multiplications, divisions,
> greatest common
> divisor computations with, say, 200-1000 digit numbers, that the
> Haskell
> reduced performance price could not be kept as low as this.
You could be right--at present I am fairly pessimistic about the
various overheads imposed by Haskell for such things as arrays but
that does not mean I am correct. You may have noticed my mention of
the BigFloat library in a somewhat related thread, "bignums, gmp,
bytestring ... ?," at
http://www.haskell.org/pipermail/glasgow-haskell-users/2006-November/011601.html .
BigFloat <http://bignum.sourceforge.net/> is an arbitrary precision
library Martin Guy originally wrote in Miranda and ported to Haskell:
it is a good library but a little slow (see timing results for the
Many Digits "Friendly" Competition, where BigFloat competed as Bignum,
at http://www.cs.ru.nl/~milad/manydigits/final_timings_rankings.html ).  Of
course this is only one library and it is performing some complex
calculations.
It is interesting to note that BigFloat *is* precise and uses [Int]
for the mantissa. There is a note on the BigFloat page about coming
close to finding the highest computed power of 2 after more than a
month of computation (before someone turned the power off).
Of course, you are correct in assuming some of this is moot.  I am no
longer working on a C++ library: it is written in C with some
assembler and in my spare time I have been toying with a hybrid
Haskell-C system where the Haskell portion of the system works to
optimise basic bits of C code. (I mentioned the idea for this in the
November 2006 thread linked above.) This should give users the full
flexibility of Haskell with the speed of C, though writing it
portably is proving to be a real hassle. I have gotten a little side-
tracked porting GHC to be Windows-native in my quest to use a better
compiler than gcc 3.4 (the best available for MinGW), namely CL, but
it is all to the good. I had a lot to learn about Cmm and the Native
Code Generator (NCG)--at one time I thought I knew something about
Cmm; that was naive.  There is a lot of room for NCG improvements,
such as auto-vectorization and support for floating point exceptions.
>> My best question is: are you using pure mathematical formulas for
>> this?
>
> I am not sure what this question means. Please elaborate.
It was a question out of curiosity: for your ECM work, are you able
to implement the mathematical formulae directly in Haskell? (Yes, I
believe "formulae" is the correct plural of "formula.") C programs
require many special operations for dealing with the data and a fast
Haskell program would probably have to use a few (maybe I am a bad
Haskell programmer?).
>> If you are using pure mathematical functions, ARPREC is fine (at
>> least it is precise). Mathematically, there are alternative
>> algorithms for floating point numbers that are not available for
>> integers...
>
> Again, I am not sure what you mean by this. It is certainly
> possible to
> compute useful inverses of integers without using a floating point
> representation. To be sure, the inverse (reciprocal) of an integer
> is not an
> integer, but it could still be computed in a suitable fixed point
> representation. See, for example, Knuth's "Seminumerical
> Algorithms" ("The
> art of computer programming", vol 2), end of section 4.3.1 and
> algorithm R in
> section 4.3.3.
Fixed point is certainly an option but it is not a "natural"
operation over ints and may add more overhead than necessary
considering the relative speed advantages of modern floating point
processors. (By "not available," I was talking about hardware
support though I'll admit my typo-riddled prose may have thrown you
off--"Mathematically," may have been an artifact.) All of the
fastest integer division algorithms I am aware of do not use fixed
point for division by inverse multiplication.
> I still believe that the excessive memory usage was caused by GMP
> allocating a
> lot of storage using the GHC allocator during a single mpz_powm
> call, so that
> the GHC garbage collector never got the opportunity to actually
> free the
> allocated storage. The mpz_powm function of gmp-4.2 that I am
> looking at
> calls mpn_sqr_n inside its main loop and that function itself
> allocates
> storage in some cases. In other cases, it calls mpn_mul_fft_full
> which,
> again, allocates storage.
O.K. I misunderstood your program: I was under the impression that
you wrote the GMP program in C. That is an interesting insight
concerning the GHC garbage collector.
> so it may very well be that some of the above remarks are not
> relevant any
> more. In addition, you wrote
> (http://www.haskell.org/pipermail/glasgow-haskell-users/2006-
> November/011601.html):
>
>> ...
>> Integers are mostly a convenience.
>> ...
>
> ...
>
> My highest priority is still correctness. The thought that there
> might, in
> some special case, have crept an error into my basic arithmetic
> operators
> makes me shudder. In the present case of multiprecision integer
> operations,
> what worries me is the complications and the many special cases that
> implementors deal with. And not least special cases that address some
> particular architecture: This opens the possibility that code
> working on one
> machine stops working on another.
The general trend for floating or fixed point libraries that seek
precision is low to medium speed; there are many libraries out there
which are faster but not as precise.  One example is the Boost uBLAS
library: it is slower than BLAS but more precise.
> It seems that my level of tolerance of such risks is not as high as
> most other
> people's. Personally, if I were to write a multiprecision library
> myself, I
> would build in all sorts of possibilities for verifying that
> everything
> remains in place and results computed were indeed correct.
Ideally, these should be handled by the compiler system rather than
by run-time checks in code.
> For example, I understand that memory errors occur, not often, but
> often
> enough to be significant for such longwinding computations (days,
> months,
> years) that I envision. Other people are doing long computations (e.g.
> http://www.mersenne.org/) and they exercise some care in the form
> of test
> programs that are intended to verify that a particular machine is
> sufficiently stable. I would probably carry this even further, by
> including,
> even in "production" runs, various checks (individual arithmetic
> operations
> verified by checks modulo some number, say).
That is an excellent point. No arbitrary (multi) precision library I
know of provides a facility for performing such checks. It would be
a nice thing to add.
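(For what it is worth, the check itself is cheap: reduce each operand
and the result modulo a small number and compare.  A sketch over
base-2^32 limb arrays--the product limbs here were worked out by hand so
the example stands alone, but in practice they would come from the
library under test:)

/* sanity check for a bignum multiplication: compare residues mod m */
#include <cstdint>
#include <cstdio>
#include <vector>

static uint64_t residue(const std::vector<uint32_t>& limbs, uint64_t m)
{
    uint64_t r = 0;
    for (size_t i = limbs.size(); i-- > 0; )   /* most significant first */
        r = ((r << 32) + limbs[i]) % m;        /* r < m <= 1e6: no overflow */
    return r;
}

int main(void)
{
    const uint64_t m = 1000003;                  /* small prime modulus */
    std::vector<uint32_t> a = { 5, 1 };          /* 2^32 + 5, low limb first */
    std::vector<uint32_t> b = { 3 };             /* 3 */
    std::vector<uint32_t> p = { 15, 3 };         /* (2^32 + 5) * 3 */

    bool ok = (residue(a, m) * residue(b, m)) % m == residue(p, m);
    std::printf("multiplication %s the mod-%llu check\n",
                ok ? "passes" : "fails", (unsigned long long)m);
    return 0;
}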
> Another thing that bothers me is the unpredictability of the
> performance of
> these multiprecision packages. When I look at the GMP code, there are
> countless special cases that are taken into account. Each of them,
> of course,
> represents the implementor's desire to provide me, the user, with
> the best
> service (performance) possible. But it makes measuring performance a
> nightmare, because you never know, for sure, whether some special
> case has
> been taken into account in a particular situation.
That is one reason I kept the comparison of the different libraries
very basic and general.  The best performance comes from special
tuning tied to a particular calculation and machine, as with FFTW.
> This was brought home to me most forcefully some months back: I have
> GHC-compiled programs running on a couple of Linux machines (not
> the same
> Linux distribution), but had used some precompiled GMP package on
> both of
> these machines. After looking at the GMP code, I realised (as I
> should have
> much earlier, of course) that I should use a package that was
> compiled for
> the specific machine I was working on to ensure that all the
> various special
> performance tricks could come into play. So I did. And performance
> improved
> on one machine, but degraded on another. I am still not sure what the
> explanation of this behaviour really is. But one thing I am certainly
> concerned about is whether, in fact, the GMP build process is
> actually able
> to guess some machine architecture correctly and ensure that its
> potential is
> used to the fullest extent.
This is a very big problem with every fast mathematical library. The
GMP build system is primitive (it tends to rely on assembler blocks
written for particular types of machines but does not perform further
testing). You know how ATLAS uses a specialised build process to
automatically optimise code for a particular machine and
architecture. It is very delicate and once you have built the
library as a developer you cannot expect the same performance if you
package the library with a program designed for one operating system
and many different (but related) machines: a program for OS X 10.4
built on a ppc7450 (G4) will run differently on a G3 or G5. My own
problem involves the different variations for Altivec and SSE on
different PPC and Intel/AMD processors, especially if I include any
assembler code but also if the library is compiled by a different
compiler or compiler version.  I should probably make precision the
highest priority, portability second, and speed third.
> One thing that seems missing in many packages is the possibility to
> experiment
> with the tuning parameters more dynamically. For example, GMP
> includes a
> tuning program that is able to suggest values for certain
> parameters. But
> these parameters then have to be compiled into the package for
> final use. Not
> particularly elegant and something which, surely, could be changed,
> without
> significant overhead.
GMP is unfortunately an open-source project, not an academic or
commercial one.  Some developer would have to volunteer the time to do
the tedious work of weighing the different parameter combinations so
the tuning program could determine and input the best performance
choices and report back.
> The conclusion, for me at least, is this: If you are a naive user
> of Integers,
> you are much better off with a completely simple, no-special-cases,
> solid and
> working, multiprecision package. Which might even be fairly portable.
>
> For heavy users (such as me) with special requirements, it seems
> that they can
> never get the ultimate performance without having considerable
> insight into
> the details of the matter anyway. And, hence, they need to do some
> complex
> work themselves to get the desired performance.
A hybrid solution that gave basic Integer functionality but allowed
for variations in a special library would be better. Integer may be
a Primitive but it should really be a separate Haskell library.
Claus Reinke had a good comment about not automatically importing
Integer when a user doesn't need it.
> For Haskell, this would mean ensuring that the interface between
> GHC-compiled
> code and some multiprecision package was reasonably comprehensible and
> allowed the multiprecision support to be replaced. And here I try
> to avoid
> the usual dream of "easy replacement", that seems entirely
> unrealistic. But
> something reasonable that the rest of GHC (and not least: The GHC
> developers)
> can live with without too much work. And that heavy users, with
> sufficient
> insight and probably a significant dose of hard work (such as that
> you have
> put into this) would be able to use for their own purposes.
GMP *would* have been easy to replace but, like so much else in GHC,
it was hacked in and then hacked in again. No offense to those who
have spent more than a decade working on GHC--there is still working
code from the early 90's, at least in the Haskell libraries--but I
guess it suffers from how fun and easy it is to write in Haskell and
the comparative drudgery of cleaning things up. Modularity is key.
I really have to stop playing around and post a prototype so people
(like you) can have something to work with and criticise :)
Cheers,
Pete