[Haskell-beginners] Performance of Prime Generator

Daniel Fischer daniel.is.fischer at googlemail.com
Sun Feb 5 03:30:49 CET 2012

Sorry for the late reply, haven't checked my mail for a while.

On Tuesday 24 January 2012, 13:50:32, Ertugrul Söylemez wrote:
> Daniel Fischer <daniel.is.fischer at googlemail.com> wrote:
> > > > Well, thanks, so far I have tried wheel only and Sieve of
> > > > Eratosthenes only approaches. They both failed. When the numbers
> > > > is between 999900000 and 1000000000, they can take more than a
> > > > hour on my laptop.
> > 
> > They shouldn't. But you have to use the right data structures.
> > 
> > For a prime sieve, you need unboxed mutable arrays if you want it to
> > be fast. You can use STUArrays from Data.Array.ST or mutable unboxed
> > vectors from Data.Vector.Mutable.Unboxed.
> That's what I've used.  You find the code on hpaste [1].  It's a
> carefully optimized Sieve of Eratosthenes and needs around 20 seconds
> for the primes up to 10^9.  See the refinement in the annotation, which
> I've added just now.  Before that it took around 35 seconds.
> I considered that to be "too slow".

Right you are. Unless you need the primes to perform some time-consuming 
calculations with them after sieving, that is too slow.

But let us compare it with a similar C implementation.

> [1] http://hpaste.org/56866

The interesting parts are:
soeST :: forall s. Int -> ST s (STUArray s Int Bool)
soeST n = do
    arr <- newArray (0, n) True
    mapM_ (\i -> writeArray arr i False) [0, 1]
    let n2 = floor (sqrt (fromIntegral n))

    let loop :: Int -> ST s ()
        loop i | i > n2 = return ()
        loop i = do
            b <- readArray arr i

            let reset :: Int -> ST s ()
                reset j | j > n = return ()
                reset j = writeArray arr j False >> reset (j + i)

            when b (reset (i*i))

            loop (succ i)

    loop 2
    return arr

soeCount :: Int -> Int
soeCount = length . filter id . elems . soeA

With ghc-7.2.2 and -O2, that took 16.8 seconds here to count the 50847534 
primes up to 10^9. That's in the same ballpark as your time, maybe a bit 
faster, depends on what 'around 20 seconds' means exactly.

Let's be unfriendly first:

#include <stdint.h>

inline int readArray(uint64_t *arr, int n, int i);
inline void writeArray(uint64_t *arr, int n, int i, int b);

#include <stdlib.h>
#include "Arracc.h"

inline int readArray(uint64_t *arr, int n, int i){
    if (i < 0 || i > n){
        perror("Error in array index\n");
    return (arr[i >> 6] >> (i & 63)) & 1;

inline void writeArray(uint64_t *arr, int n, int i, int b){
    if (i < 0 || i > n){
        perror("Error in array index\n");
    if (b) {
        arr[i >> 6] |= (1ull << (i&63));
    } else {
        arr[i >> 6] &= ~(1ull << (i&63));

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <math.h>
#include "Arracc.h"

uint64_t *soeA(int n);
int pCount(uint64_t *arr, int n);

int main(int argc, char *argv[]){
    int limit = (argc > 1) ? strtoul(argv[1],NULL,0) : 100000000;
    uint64_t *sieve = soeA(limit);
    return EXIT_SUCCESS;

uint64_t *soeA(int n){
    int s = (n >> 6)+1;
    uint64_t *arr = malloc(s*sizeof *arr);
    if (arr == NULL){
        perror("Allocation failed\n");
    int i, j, n2 = (int)sqrt(n);
    for(i = 0; i < s; ++i){
        arr[i] = -1;
    for(i = 2; i <= n2; ++i){
        if (readArray(arr,n,i)){
            for(j = i*i; j <= n; j += i){
    return arr;

int pCount(uint64_t *arr, int n){
    int i, count = 0;
    for(i = 0; i <= n; ++i){
        if (readArray(arr,n,i)){
    return count;

$ gcc -c -O3 Arracc.c
$ gcc -c -O3 main.c
$ gcc -O3 main.o Arracc.o -lm

Takes 18.3 seconds. Oops, slower than the Haskell programme.

Okay, ghc does some cross-module inlining and that enables further 
optimisations which we denied gcc here, so recompile everything with -flto 
additionally. That brings the C version down to 12.8 seconds.

Much better, but still not overwhelming. gcc can do a little better with 
everything in one file, 12.72 seconds.

We can squeeze out a little more by omitting the bounds checks for array 
indexing, 12.63 seconds.

But if we do the same for our Haskell programme, replace 
readArray/writeArray with unsafeRead/unsafeWrite, that becomes faster too, 
14.3 seconds.

I have to admit that I don't know why the bounds-checking costs very little 
in C and a substantial amount in Haskell, but hey, who refuses free 

Lesson 1: Omit pointless array bounds checks.
An array bounds check is pointless if you have just checked the validity of 
the index on the line above.

Where are we?

We have two not overwhelmingly fast programmes, Haskell (ghc-7.2.2) about 
13% slower than C (gcc-4.5.1).

Not a bad result for ghc, but the algorithm is less than optimal.

What's the deal?

We cross off multiples of primes 2549489372 times, and we need (10^9+1)/8 
bytes of memory for the sieve.

Both numbers are higher than is good for us. Since the sieve is so large 
and we cross off the multiples of each prime sequentially until we reach 
the sieve limit, we have lots of cache misses. Bad for performance.
And each crossing-off takes a couple of clock cycles,

arr[i >> 6] &= ~(1ull << (i&63));

shift index, fetch value, (index & 63), shift 1, complement, and with 
value, store value.

I'm no CPU expert, so I don't know how many cycles that takes, but even 
though some of the operations can be done in parallel on modern CPUs, it's 
still several cycles.

Lesson 2: Avoid duplicate work and redundant data.
It's easy to separate even and odd numbers, there aren't many even primes, 
and it's unnecessary to mark even numbers as multiples of odd primes.

Separating the crossing-off of even and odd numbers, crossing off only odd 
multiples of odd primes, reduces the number of crossings-off by almost 40% 
and the running time for the C programme to 8.63 seconds (I haven't done 
that for the Haskell programme).

If we go the small step further and eliminate the even numbers from the 
sieve completely, we reduce the required memory by half and the crossings-
off by almost 60% (the fraction of eliminated crossings-off slowly 
decreases to 50% for higher limits, but as far as today's RAM takes us, it 
remains close to 60%).

At the cost of very slightly more complicated code, we can thus reduce the 
running time to 6.06 seconds (C) resp. 6.49 seconds (Haskell) (about 7% 
slower than C, not bad at all).

That's a pretty good start, for some purposes it's already good enough.

If we need more, the next step is to eliminate multiples of 3 from the 
sieve. That reduces the memory requirements by a further factor of 2/3, and 
the number of crossings-off by a further (mumble, didn't do an exact 
calculation, educated guess) roughly 40% (that fraction decreases to 1/3 
for higher limits, but again very slowly).

The code becomes a bit more complicated again - more than in the previous 
step, but still fairly straightforward.

The running time reduces by another factor of 0.65-0.7 or so. We'd be in 
the region of 4-5 seconds then.

One can eliminate the multiples of further small primes, for each prime the 
additional code complexity increases and the reduction in work/running time 
decreases. At some point, the slowdown from the more complex code 
annihilates the gain from the reduced work. I stopped after eliminating 
multiples of 5, 7 might be worth it, but I'm not convinced.

Using a segmented sieve provides better locality at the cost of more 
complex code. If the sieve is small enough to fit in the L2 cache and large 
enough that some significant work can be done per segment, it's a net gain.

tl,dr: The naive algorithm we started from is too slow for higher limits, 
to get goodish performance, it has to be tweaked heavily. But Haskell plays 
in the same league as C (at least the C I've written).

> > > I haven't tried it, but an equivalent C implementation can easily
> > > compute the first 10^9 prime numbers within seconds.
> > 
> > You mean the primes up to 10^9 here?
> Yes, sorry.  And I was referring to the Sieve of Atkin there, but you
> are right.  That one is only slightly faster.

Well, D.J. Bernstein's primegen is much faster. With the sieve size adapted 
to my L1 cache, it counts the primes to 10^9 in 0.68 seconds.
But it's only heavily optimised for primes to 2^32. Above that, my sieve 
catches up (and overtakes it around 5*10^11, yay).

> Greets,
> Ertugrul


More information about the Beginners mailing list