Replacement for GMP: Update

Thorkil Naur naur at post11.tele.dk
Fri Dec 29 08:32:33 EST 2006


Hello,

Thanks a lot for your reply; here are some comments on it. As is customary, 
I must apologise for the late reply (the response time in this conversation 
seems to increase exponentially with time), which may also very well make some 
of the comments completely outdated.

On Friday 01 September 2006 06:43, Peter Tanski wrote:
> ...
> For a brief overview of the speed of the libraries I looked at
> carefully, see
> http://hackage.haskell.org/trac/ghc/wiki/ReplacingGMPNotes
> (I added a few charts to show the speed of ARPREC, OpenSSL's BN, GMP  
> and LibTomMath.  ....)  I tested GMP and OpenSSL's BN for reference.

I must confess that it took me a while to understand what you were doing here 
and I am still uncertain: For example, how many multiplications were actually 
performed for bit size 4096? In addition, for "Powers", the markings on the 
horizontal axis ("256 pow(n,3)", "512 pow(n,4)", "1024 pow(n5)" (missing a 
"," here?), ...) on your graph seem to indicate that you are changing two 
different variables (the bit size and the exponent) at the same time. I would 
suggest that you also quote the original measurements directly, and perhaps 
(especially in the case of "Powers" and "Division") give some more details 
about what was actually measured.
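
To be concrete about the kind of raw data I would find easy to relate to the 
graphs, a small, self-contained GHC-side harness along the following lines 
would do (this is only a sketch; the operand generation, the repetition counts 
and the operations timed are arbitrary choices of mine, not a reconstruction 
of your setup):

    import System.CPUTime (getCPUTime)
    import System.Random  (mkStdGen, randomR)
    import Control.Monad  (forM_)
    import Text.Printf    (printf)

    -- A deterministic n-bit operand (fixed seed, so runs are repeatable).
    testNumber :: Int -> Integer
    testNumber bits = fst (randomR (2 ^ (bits - 1), 2 ^ bits - 1) (mkStdGen 42))

    -- Time `reps` evaluations of `f`, forcing each result.  The operand is
    -- varied slightly per iteration so the compiler cannot share results.
    timeOp :: String -> Int -> (Integer -> Integer) -> Integer -> IO ()
    timeOp label reps f n = do
      t0 <- getCPUTime
      let loop :: Int -> Integer -> Integer
          loop 0 acc = acc
          loop i acc = loop (i - 1) $! acc + (f (n + fromIntegral i) `rem` 7)
      loop reps 0 `seq` return ()
      t1 <- getCPUTime
      printf "%-20s %d ops in %.6f s\n" label reps
             (fromIntegral (t1 - t0) / 1.0e12 :: Double)

    main :: IO ()
    main = forM_ [256, 512, 1024, 2048, 4096] $ \bits -> do
      let n = testNumber bits
      n `seq` return ()                    -- force the operand outside the timing
      timeOp (show bits ++ " bits, n*n")  1000 (\x -> x * x)          n
      timeOp (show bits ++ " bits, n^7")  1000 (\x -> x ^ (7 :: Int)) n
      timeOp (show bits ++ " bits, div")  1000 (\x -> x `div` 12345)  n

With the actual numbers quoted next to the graphs, questions such as "how 
many multiplications for bit size 4096" would answer themselves.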

It is distinctly odd that squaring seems to be more expensive than 
multiplication for some bit sizes in 3 of the 4 measured cases. Also, I 
wonder what divisions these libraries are capable of carrying out so much 
faster than comparable multiplications. For the division measurements, I may 
again be presuming something about your experiments that simply isn't true.

> ...
> I keep talking about ARPREC--why?  For three reasons:
> (1) I trust its level of precision--this has been tested, see FFTW's
> benchmark page for accuracy:
> http://www.fftw.org/accuracy/G4-1.06GHz-gcc4/

I am not sure I understand here: With respect to Haskell Integers, I would not 
say that there is any concept of a "level of precision": If Integer 
computations are not accurate, they are wrong.
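
To make the all-or-nothing nature of this concrete: the properties I would 
expect any Integer implementation to satisfy are exact algebraic identities, 
for instance (a QuickCheck sketch of my own, not anything from your tests):

    import Test.QuickCheck

    -- Exact laws; there is no useful notion of being "almost" right here.
    prop_mulDiv :: Integer -> Integer -> Property
    prop_mulDiv a b = b /= 0 ==> (a * b) `div` b == a

    prop_distrib :: Integer -> Integer -> Integer -> Bool
    prop_distrib a b c = a * (b + c) == a * b + a * c

    main :: IO ()
    main = do
      quickCheck prop_mulDiv
      quickCheck prop_distrib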

> (2) if you look at the charts, although ARPREC is bad in division it  
> simply blows GMP, OpenSSL and LibTomMath away: at 4096 bits (85  
> doubles--each double has conservatively only 48.3 or so bits of  
> integer precision), ARPREC can take a full random number to pow(n,7)  
> in .98 seconds, compared to 77.61 or so seconds for the leader of the  
> Integer-based libraries, GMP.  (I ran the tests many times to make  
> sure readings weren't flukes.)

Again, some additional details about these measurements would be most welcome.

> (3) of all the FFT-based arbitrary precision libraries available,  
> ARPREC is the only BSD-licensed one--Takuya Ooura's library (used in  
> MAPM) is only a FFT algorithm and not necessarily either fast or  
> accurate.  The rest of the FFT libraries available are essentially  
> GMP-licensed.
> 
> So I am in an unenviable position: I intend to fulfill my promise and  
> get a replacement for GHC (and maybe more), but I have to essentially  
> build better functionality into the front-runner, ARPREC.  At present  
> I have been working with vector-based algorithms that would enable me  
> to use hardware-tuned code for Single Instruction Multiple Data  
> (SIMD) chipsets.  Currently I am researching algorithms based on  
> operating-system supplied vector libraries.  Part of this  
> modification involves a fast transformation between a vector of large  
> integers and an array of doubles, without loss of precision (although  
> vectors of doubles are also working well, they do not have the same  
> library support.)  I am also working on enhancing ARPREC's division  
> algorithm.
> 

I looked briefly at ARPREC: It seems that it works with an explicitly set 
maximum bound on the number of digits desired. Although it also appears that 
it is possible to change this bound as computations proceed, such additional 
administration seems difficult to embrace.
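
To illustrate the kind of administration I mean: with a fixed digit bound, 
the caller has to know, ahead of each operation, how large the result can 
become. The bounds themselves are elementary, for example (decimal digits, a 
sketch only):

    -- Upper bounds on the number of decimal digits of results: the sort of
    -- bookkeeping a preset precision bound forces on the caller.
    numDigits :: Integer -> Int
    numDigits = length . show . abs

    digitsNeededMul :: Integer -> Integer -> Int
    digitsNeededMul a b = numDigits a + numDigits b   -- d1-digit * d2-digit: at most d1+d2 digits

    digitsNeededPow :: Integer -> Int -> Int
    digitsNeededPow a k = max 1 (k * numDigits a)     -- a^k: at most k * digits(a) digits

Threading such bounds through a long computation, and growing the working 
precision whenever they are exceeded, is exactly the administration I would 
rather not see in an Integer implementation.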

> This is the problem I face: GHC unfortunately does not use Integer as  
> a mathematical operation but as a primitive type, complete with  
> bitwise operations.  

I do not understand what problem you are referring to here.
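
For what it is worth, I assume the bitwise operations referred to are the 
ones Integer supports through Data.Bits, e.g.:

    import Data.Bits

    -- Bitwise operations on arbitrary-precision Integers (all exact).
    example :: Integer
    example = (big .&. mask) `shiftR` 16
      where
        big  = 2 ^ (200 :: Int) + 12345    -- well beyond machine word size
        mask = (1 `shiftL` 64) - 1         -- the low 64 bits

So a replacement that handled only the arithmetic operations would still 
leave these to be provided somehow.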

> ...
> On Aug 24, 2006, at 2:39 PM, Thorkil Naur wrote:
> 
> > I am truly unable to tell what I would consider the ideal situation.
> > On the one hand, I value greatly the freedom of choosing my
> > circumstances without restraints. And I appreciate the desire of noble
> > people like the GHC developers to be equally free of such restraints.
> > This would point in the direction of basing Integer computations on a
> > fairly generic implementation. Which insightful consideration would
> > then force us to admit would probably cost a factor of, say, 2 or more
> > in performance in some cases.
> 
> The problem I face is: performance at which level?  The low-bit-size  
> (low precision) range or the high precision range?
> 
> > Two problems seem to be lurking here: The first is that, although
> > taking a copy of some existing library and massaging it to become
> > working under some presently available circumstances may be fine, what
> > is really needed is something that keeps on working, even under
> > changing and future circumstances. The second is that if I wish to mix
> > Haskell and non-Haskell code that processes the same data, I may find
> > myself in need of a way to convert between whatever is required by
> > this copy of some existing library that has been created and the
> > presently available, but potentially quite different, version of that
> > same library.
> 
> Absolutely correct.  As you may have observed from your own  
> experience, there are advantages to having GHC's RunTime System (RTS)  
> handle the operations, either because garbage collected memory is  
> faster to allocate or because the natural laziness between normally  
> eager mathematical operations (mathematical operators are strict in  
> their arguments but Haskell functions wrapping those operators may be  
> lazy) means you can use GHC to strategically cache results of higher  
> level polynomial operations.  What I am working on is intended to be  
> an independent library, designed to be stable for the long-term and  
> interoperable with non-Haskell programs.  Modifying ARPREC will  
> result in a dynamic library (or DLL) that outputs simple arrays of  
> doubles--this is how ARPREC currently functions; there is no need to  
> wrap a C++ class in an opaque C pointer you cannot operate on  
> independently.  The huge downside to ARPREC and one of the least  
> trivial of my problems is how to efficiently convert an array of  
> doubles to precise integers, especially for cryptographic purposes.

Could you explain exactly what this problem is? If it still makes sense to 
spend time on it, of course.
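
Just so I am sure I understand the problem: if each double in the array holds 
an exact non-negative integer limb of at most 48 bits (the "48.3 bits of 
integer precision" you mention above), least significant limb first, then 
reassembling the Integer is in principle straightforward, something like this 
sketch (my assumptions, not ARPREC's actual representation):

    limbBase :: Integer
    limbBase = 2 ^ (48 :: Int)

    -- Doubles assumed to carry exact integer limbs in [0, 2^48),
    -- least significant limb first.
    limbsToInteger :: [Double] -> Integer
    limbsToInteger = foldr (\d acc -> truncate d + acc * limbBase) 0

    integerToLimbs :: Integer -> [Double]
    integerToLimbs 0 = []
    integerToLimbs n = fromIntegral (n `mod` limbBase) : integerToLimbs (n `div` limbBase)

Is the difficulty, then, that the doubles coming out of ARPREC are not exact 
integer limbs of this kind, or that a conversion at this level is simply too 
slow for cryptographic sizes?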

> 
> > I better be specific here: What I do is integer factorization
> > (splitting integers into their prime factors). And, for example, I
> > have implemented a version of the ECM (Elliptic Curve Method) in
> > Haskell. Having done that (independently and mainly to learn what it
> > was about), I eventually compared it in performance to GMP-ECM which
> > Paul Zimmermann wrote. And I was initially disappointed to find that
> > GMP-ECM was about 7 times faster than my ECM code. Fortunately,
> > various sessions of "hand-waving" brought this factor down to about
> > 1.5. So, whenever GMP-ECM used 2 minutes, my program would use 3
> > minutes.
> 
> One of the reasons I am working on a library written in C++ is that  
> Haskell may be too slow for such things.

I am not sure what you mean here. I would happily accept a price of, say, 10 
or 20 percent reduced performance, if I got the ease of programming (mostly) 
in Haskell in return. And I am still not convinced that, in a computation 
involving heavy Integer work (multiplications, divisions, greatest common 
divisor computations with, say, 200-1000 digit numbers), the performance 
price of Haskell could not be kept as low as that.

> ...
> My best question is: are you using pure mathematical formulas for  
> this? 

I am not sure what this question means. Please elaborate.

> GMP, LibTomMath and of course OpenSSL's BN, the pure-integer based
> libraries, at the polynomial level mostly use modular arithmetic
> algorithms, Karatsuba multiplication, Montgomery or Barrett reduction.
> On the low level they use bitwise operations (say, right shift and add
> for multiplication--the gcc compiler seems to use this as the default).
> If you are using pure mathematical functions, ARPREC is fine (at least
> it is precise).  Mathematically, there are alternative algorithms for
> floating point numbers that are not available for integers.  Floating
> point numbers may use division by inverse (converting division into a
> multiplication operation by taking the inverse of the divisor (1/d) at
> a low precision and using addition, multiplication and subtraction
> until you reach the desired precision).

Again, I am not sure what you mean by this. It is certainly possible to 
compute useful inverses of integers without using a floating point 
representation. To be sure, the inverse (reciprocal) of an integer is not an 
integer, but it could still be computed in a suitable fixed point 
representation. See, for example, Knuth's "Seminumerical Algorithms" ("The 
Art of Computer Programming", Vol. 2), end of Section 4.3.1 and Algorithm R 
in Section 4.3.3.
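
As a sketch of what I mean, here is a fixed point reciprocal computed 
entirely with Integers by Newton iteration (scale factor 2^k, starting 
estimate below the true value; a final correction step, as in Knuth, would be 
needed to guarantee exactly floor(2^k/d)):

    -- Approximate floor(2^k / d) for d >= 1 and k at least the bit length
    -- of d, using only Integer arithmetic.
    recipFixed :: Int -> Integer -> Integer
    recipFixed k d = go (2 ^ (k - bitLen d))   -- initial estimate, below 2^k/d
      where
        bitLen n = length (takeWhile (<= n) (iterate (* 2) 1))
        go x
          | x' == x   = x                      -- no further progress
          | otherwise = go x'
          where
            -- Newton step X <- X*(2*2^k - d*X) `div` 2^k, monotonically
            -- increasing towards 2^k/d when started below it
            x' = (x * (2 * 2 ^ k - d * x)) `div` (2 ^ k)

A division by d then becomes a multiplication by this reciprocal followed by 
a small adjustment, with no floating point hardware involved.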

> Integer-based numbers may use Montgomery reduction (which you are no
> doubt already familiar with).
> 
> > Some additional experience is this: To gain some potentially
> > easier-to-compare measurements between Haskell-with-GMP and
> > C-with-GMP, I implemented the GMP function mpz_powm (raising a number
> > to a power modulo a third number) as a primitive operation in GHC. And
> > compared it to my own power-raising operation that was implemented in
> > Haskell.
> >
> > Overall, the GMP implementation of the power operation was 2 times
> > faster than my Haskell implementation. But, to make sure, using some
> > techniques that my code certainly didn't use.
> >
> > An interesting thing happened, however, for very large numbers: The
> > GMP power operation started using a lot of memory, thrashing, and
> > then, eventually, with sufficiently large operands, crashed for lack
> > of memory. Where my Haskell-implemented version continued to provide
> > usable results.
> >
> > The explanation of this should undoubtedly be found in the method
> > (some FFT-thing) used for the power operation by GMP for very large
> > numbers: My guess is that GMP when using this method issues a large
> > number of allocate/free requests. And, again I am guessing, assuming
> > that each allocate request causes the GHC runtime to allocate, whereas
> > each free only causes a "we'll handle this at the next GC", then such
> > large computations may very well display such behaviour.
> 
> Agreed.  This is one of those cases where I think Haskell's laziness  
> may have cached your intermediate results. I couldn't tell without    
> looking at the source code.  I do not think GMP uses an FFT for  
> multiplication; I have not investigated this extensively.

I still believe that the excessive memory usage was caused by GMP allocating a 
lot of storage using the GHC allocator during a single mpz_powm call, so that 
the GHC garbage collector never got the opportunity to actually free the 
allocated storage. The mpz_powm function of gmp-4.2 that I am looking at 
calls mpn_sqr_n inside its main loop and that function itself allocates 
storage in some cases. In other cases, it calls mpn_mul_fft_full which, 
again, allocates storage.
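
For reference, the kind of Haskell-level power-raising operation discussed 
above can be written as ordinary square-and-multiply, along these lines (a 
simplified sketch for illustration, not my actual code):

    -- b^e mod m by repeated squaring, for e >= 0.  Intermediate results are
    -- ordinary heap values, so the garbage collector gets a chance to run
    -- between operations, unlike inside a single long mpz_powm call.
    powMod :: Integer -> Integer -> Integer -> Integer
    powMod _ 0 m = 1 `mod` m
    powMod b e m
      | even e    = r
      | otherwise = (r * b) `mod` m
      where
        half = powMod b (e `div` 2) m
        r    = (half * half) `mod` m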

> 
> -Pete
> ...

Elsewhere (http://www.haskell.org/pipermail/cvs-ghc/2006-October/032345.html), 
you wrote:
> ...
> I have  been working on an arbitrary precision integer library from scratch 
> but I have a version of GHC without any GMP in it right now 
> ...

so it may very well be that some of the above remarks are not relevant any 
more. In addition, you wrote 
(http://www.haskell.org/pipermail/glasgow-haskell-users/2006-November/011601.html):

> ...
> Integers are mostly a convenience.
> ...

I fully agree with that. Therefore, as I outline what I now consider the ideal 
situation with respect to Haskell and Integer support, keep in mind that I 
have very specific desires that may not be shared by any other Haskell users 
and which should be prioritized accordingly.

My highest priority is still correctness. The thought that an error might, in 
some special case, have crept into my basic arithmetic operators makes me 
shudder. In the present case of multiprecision integer operations, what 
worries me are the complications and the many special cases that implementors 
deal with, and not least the special cases that address some particular 
architecture: These open the possibility that code working on one machine 
stops working on another.

It seems that my level of tolerance of such risks is not as high as most other 
people's. Personally, if I were to write a multiprecision library myself, I 
would build in all sorts of possibilities for verifying that everything 
remains in place and results computed were indeed correct.

For example, I understand that memory errors occur, not often, but often 
enough to be significant for the long-running computations (days, months, 
years) that I envision. Other people are doing long computations (e.g. 
http://www.mersenne.org/) and they exercise some care in the form of test 
programs that are intended to verify that a particular machine is 
sufficiently stable. I would probably carry this even further by including, 
even in "production" runs, various checks (individual arithmetic operations 
verified by checks modulo some number, say).
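
As an example of the sort of cheap check I mean: a multiprecision product can 
be spot-checked modulo a few small primes without redoing the multiplication 
(a sketch; the moduli are arbitrary known primes):

    -- Verify a claimed product c = a*b modulo a few small primes.  A memory
    -- error or implementation bug is very unlikely to survive all of these.
    checkProduct :: Integer -> Integer -> Integer -> Bool
    checkProduct a b c =
      and [ ((a `mod` m) * (b `mod` m)) `mod` m == c `mod` m
          | m <- [2147483647, 65521, 251] ]    -- 2^31-1 and two smaller primes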

Another thing that bothers me is the unpredictability of the performance of 
these multiprecision packages. When I look at the GMP code, there are 
countless special cases that are taken into account. Each of them, of course, 
represents the implementor's desire to provide me, the user, with the best 
service (performance) possible. But it makes measuring performance a 
nightmare, because you never know, for sure, whether some special case has 
been taken into account in a particular situation.

This was brought home to me most forcefully some months back: I have 
GHC-compiled programs running on a couple of Linux machines (not the same 
Linux distribution), but had used some precompiled GMP package on both of 
these machines. After looking at the GMP code, I realised (as I should have 
much earlier, of course) that I should use a package that was compiled for 
the specific machine I was working on to ensure that all the various special 
performance tricks could come into play. So I did. And performance improved 
on one machine, but degraded on another. I am still not sure what the 
explanation of this behaviour really is. But one thing I am certainly 
concerned about is whether, in fact, the GMP build process is actually able 
to guess some machine architecture correctly and ensure that its potential is 
used to the fullest extent.

One thing that seems to be missing in many packages is the possibility of 
experimenting with the tuning parameters more dynamically. For example, GMP 
includes a tuning program that is able to suggest values for certain 
parameters. But these parameters then have to be compiled into the package 
for final use. This is not particularly elegant, and something which, surely, 
could be changed without significant overhead.

The conclusion, for me at least, is this: If you are a naive user of Integers, 
you are much better off with a completely simple, no-special-cases, solid and 
working multiprecision package, which might even be fairly portable.

For heavy users (such as me) with special requirements, it seems that they can 
never get the ultimate performance without having considerable insight into 
the details of the matter anyway. And, hence, they need to do some complex 
work themselves to get the desired performance.

For Haskell, this would mean ensuring that the interface between GHC-compiled 
code and some multiprecision package was reasonably comprehensible and 
allowed the multiprecision support to be replaced. Here I try to avoid the 
usual dream of "easy replacement", which seems entirely unrealistic, in 
favour of something reasonable that the rest of GHC (and, not least, the GHC 
developers) can live with without too much work, and that heavy users, with 
sufficient insight and probably a significant dose of hard work (such as the 
work you have put into this), would be able to use for their own purposes.
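
To indicate how narrow such an interface could be, here is a sketch of the 
handful of operations the rest of GHC would essentially need from a 
replaceable multiprecision package (the class name and methods are 
hypothetical, not an existing GHC interface):

    import Data.Word (Word8)

    -- Hypothetical minimal interface for a replaceable Integer back end.
    class BigNum a where
      fromInt   :: Int -> a
      add, mul  :: a -> a -> a
      divMod'   :: a -> a -> (a, a)   -- quotient and remainder
      cmp       :: a -> a -> Ordering
      toBytes   :: a -> [Word8]       -- little-endian magnitude, for interoperability
      fromBytes :: [Word8] -> a

Bitwise operations and the various conversions would still have to be added, 
but even so the surface is small enough that "replaceable, with some hard 
work" seems realistic, while "easily replaceable" does not.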

Best regards
Thorkil

