[Haskell-cafe] Re: Code and Perf. Data for PrimeFinders(was:Genuine Eratosthenes sieve)

Sat Mar 3 13:59:57 EST 2007

Hello all felllow primefinders,

I have followed this discussion of Haskell programs to generate small primes 
with the greatest interest. The thing is, I have been running my Haskell 
implementation of the so-called Elliptic Curve Method (ECM) of factoring for 
a number of years. And that method uses a sequence of small primes, precisely 
like the one discussed here.

I have been using the method of trial division to generate primes for the ECM. 
I have been aware that sieving would very likely be more efficient, but since 
the amount of time spent generating small primes is, after all, not an 
extremely large part of running the ECM program, I have postponed the task of 
looking into this. So this discussion of Eratostenes' sieve has provided a 
most wonderful opportunity to do something about this.

Initially, I wanted to establish some sort of base line for generating small 
primes. This I did by writing my own C program for generating small primes, 
both using sieving and trial division.

The result was most sobering: Finding primes with sieving with a C program is 
much faster than finding them with trial division. The speed advantage factor 
of sieving over trial division for small numbers (say, finding the first 
couple of hundred thousand primes) is around 5 and it gets larger with larger 
numbers. (I will not quote specific measurements here, only rough tendencies. 
Details clutter and Melissa is much better at this anyway. I would like to 
state, however, that I am, in fact, measuring performance accurately, but I 
use ancient machinery that runs slowly: Measuring performance on fast 
equipment often tends to cloud significant issues, in my experience.)

So there is my goal now: Find a Haskell program that generates small primes 
that is around 5 times faster than the one I have already that uses trial 
division. And I mean a Haskell program, not a C program that gets called from 
Haskell. Sure, if I wanted ultimate performance, I could perhaps just call 
some C code from Haskell. But the insistance on pure Haskell widens the 
perspective and makes it more likely that the programmer or the language 
designer and implementer learn something valuable.

The C sieving program uses an array to attain its impressive performance, so 
let us try that in Haskell as well. Something like this has been on my mind 
for a long time:

  accumArray (\x ->\y ->False) True (m,n) [(q,()) | q<-ps ]

Haskell arrays, like every other value, cannot be changed, but accumArray 
provides a facility that we just might manage to use for our purpose. The 
particular array computed by the above expression using accumArray requires a 
low and high limit (m,n) of the numbers to be sieved as well as a list of 
sieving values - ps - numbers that need to be crossed out as composite in the 
interval (m,n) given.

[The overhead involved in evaluating this expression with its unused () value 
and an indicator that is changed from True to False through throwing away two 
dummy values seems excessive. Any suggestion to write this up in a way that 
is more in agreement with the small amount of work that is actually required 
to do this would be most welcome. Or perhaps GHC (which is the compiler I use 
when I want things to run fast) takes care of things for me? In any case, 
this is my inner loop and even small improvements will matter.]

When I tried this initially, I was actually not particularly surprised to find 
that it performed rather worse than trial division. Although I am unable to 
give a precise explanation, the GHC profiling runs indicated that, somehow, 
too much memory was accumulating before it was used, leading to thrashing and 
perhaps also a significant amount of overhead spent administering this 
excessive amount of memory.

The initial sieve that I used had a fixed size that covered all the numbers 
that I wanted to test. An improvement might come about by splitting this 
sieve into smaller pieces that were thrown away when no longer needed.

OK, what pieces? Well, a possibility would be to use the splitting of all 
integers into subsequences that matched the interval between squares of 
consecutive primes, because that would match the entry of a new prime into 
the set of primes that needs to be taken into consideration.

And this actually worked rather well. In addition, the problem solved was now 
the one of expressing the generation of an infinite sequence of primes, not 
just the primes within some predetermined interval.

Further (yes, you will get the code eventually, but be patient, I don't want 
to have to present more than the final version), there was the wheel idea 
also used by Melissa of somehow eliminating from consideration all the 
numbers divisible by a few of the smallest primes: If we, for example, could 
get away with considering only the odd numbers as candidates for being 
primes, this would potentially save half the work.

This I didn't find to be an easy idea to implement. Essentially what I did 
was, both for the sequence of candidates and for the sequence of potential 
divisors, changing the representation into one in which consecutive numbers 
represented only the numbers that didn't have some basic set of primes as 
divisors.

An example may help here. Let's say that we have selected 2 and 3 as factors 
to be eliminated from consideration. Then we have the numbers 1, 5, 7, 11, 
13, ... that are neither divisible by 2 nor 3. And we establish a 
correspondence 0<->1, 1<->5, 2<->7, 3<->11, 4<->13, ... so that every number 
not divisible by 2 or 3 has precisely one representaton in [0..]. I call this 
the wheel representation. So, once we have emitted the special primes 2 and 
3, we can limit our attention to the numbers whose wheel represention is the 
list [1..].

The situation becomes more complicated when considering the effective 
generation of the sequence of numbers that we use to cross off the numbers in 
this basic list. Taking the next prime 5, for example, we have to generate 
the list of divisors [25,30,..] in the wheel representation. This task is 
made more complicated by the fact that some of these numbers (like 30) should 
first be eliminated, because they are divisible by 2 or 3. The end result is 
a finite list of differences which can be repeated endlessly to generate the 
actual divisors in the wheel representation.

Then there is the use of specialization, something that GHC can use to 
generate more efficient code for special instances of generic functions. The 
way I understand this, GHC can generate a special Int version of code that 
otherwise handles the general Integral class. But in order for this code to 
be selected for use by GHC, it must be called under circumstances where it is 
clear that the involved type is really Int. To capture the full potential of 
this seems to require some special handling in general.

My trial division code was speeded up by a factor of about 5 using this 
technique. For the present sieving code, the factor was only about 2. But I 
believe that additional opportunities for speeding up using this technique 
remain.

A final complication concerns the need to be able to specify a lower limit on 
the list of primes generated. This is for possible use of this code to 
generate primes for the ECM.

Enough talk, the code can be found at

  thorkilnaur.dk/~tn/T64_20070303_1819.tar.gz

and includes both the C code that I have used (ErathoS.c), a Haskell module 
ErathoS (ErathoS.hs), a simple driver for ErathoS (T64.hs), and a Shell 
script that demonstrates the basic possibilities (T64.sh).

Overall, the speed compares well with the fastest of the sieves that we have 
seen in this discussion. And I am able to use sieving in Haskell, as in C, to 
gain a speed advantage over trial division. The Haskell code is still 
considerably slower than the C code in spite of the fact that the C code has 
been written mostly with an attention to correctness and could probably be 
improved rather easily.

Best regards
Thorkil