[Haskell-beginners] Performance of Prime Generator

Zhi-Qiang Lei zhiqiang.lei at gmail.com
Sun Feb 5 06:35:52 CET 2012


Hi Guys,

The STUArray you are using is unfamiliar to me. Is there a tutorial for it? Google didn't help. Thanks.

On Feb 5, 2012, at 10:30 AM, Daniel Fischer wrote:

> Sorry for the late reply, haven't checked my mail for a while.
> 
> On Tuesday 24 January 2012, 13:50:32, Ertugrul Söylemez wrote:
>> Daniel Fischer <daniel.is.fischer at googlemail.com> wrote:
>>>>> Well, thanks. So far I have tried wheel-only and Sieve-of-
>>>>> Eratosthenes-only approaches. They both failed. When the numbers
>>>>> are between 999900000 and 1000000000, they can take more than an
>>>>> hour on my laptop.
>>> 
>>> They shouldn't. But you have to use the right data structures.
>>> 
>>> For a prime sieve, you need unboxed mutable arrays if you want it to
>>> be fast. You can use STUArrays from Data.Array.ST or mutable unboxed
>>> vectors from Data.Vector.Unboxed.Mutable.
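>>> 
>>> (For illustration, a minimal sketch of the vector-based variant; soeV
>>> and sieveV are just names made up here, and this assumes the vector
>>> package is installed:)
>>> 
>>> import Control.Monad (forM_, when)
>>> import Control.Monad.ST (ST)
>>> import qualified Data.Vector.Unboxed as V
>>> import qualified Data.Vector.Unboxed.Mutable as MV
>>> 
>>> -- Sieve of Eratosthenes on a mutable unboxed vector; in the frozen
>>> -- result, index i is True iff i is prime.  Assumes n >= 2.
>>> soeV :: Int -> V.Vector Bool
>>> soeV n = V.create (sieveV n)
>>> 
>>> sieveV :: Int -> ST s (MV.MVector s Bool)
>>> sieveV n = do
>>>     arr <- MV.replicate (n + 1) True
>>>     MV.write arr 0 False
>>>     MV.write arr 1 False
>>>     let n2 = floor (sqrt (fromIntegral n :: Double))
>>>     forM_ [2 .. n2] $ \i -> do
>>>         b <- MV.read arr i
>>>         when b (forM_ [i*i, i*i + i .. n] (\j -> MV.write arr j False))
>>>     return arr
>>> 
>>> (Counting is then e.g. V.length (V.filter id (soeV n)).)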
>> 
>> That's what I've used.  You find the code on hpaste [1].  It's a
>> carefully optimized Sieve of Eratosthenes and needs around 20 seconds
>> for the primes up to 10^9.  See the refinement in the annotation, which
>> I've added just now.  Before that it took around 35 seconds.
>> 
>> I considered that to be "too slow".
> 
> Right you are. Unless you need the primes to perform some time-consuming 
> calculations with them after sieving, that is too slow.
> 
> But let us compare it with a similar C implementation.
> 
>> 
>> [1] http://hpaste.org/56866
> 
> The interesting parts are:
> ========================
> soeST :: forall s. Int -> ST s (STUArray s Int Bool)
> soeST n = do
>    arr <- newArray (0, n) True
>    mapM_ (\i -> writeArray arr i False) [0, 1]
>    let n2 = floor (sqrt (fromIntegral n))
> 
>    let loop :: Int -> ST s ()
>        loop i | i > n2 = return ()
>        loop i = do
>            b <- readArray arr i
> 
>            let reset :: Int -> ST s ()
>                reset j | j > n = return ()
>                reset j = writeArray arr j False >> reset (j + i)
> 
>            when b (reset (i*i))
> 
>            loop (succ i)
> 
>    loop 2
>    return arr
> 
> soeCount :: Int -> Int
> soeCount = length . filter id . elems . soeA
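> 
> -- Not shown in this excerpt: the imports (Control.Monad, Control.Monad.ST,
> -- Data.Array.ST, Data.Array.Unboxed) and soeA itself, which presumably
> -- just wraps the ST computation, e.g. soeA n = runSTUArray (soeST n)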
> ========================
> 
> With ghc-7.2.2 and -O2, that took 16.8 seconds here to count the 50847534 
> primes up to 10^9. That's in the same ballpark as your time, maybe a bit 
> faster; it depends on what 'around 20 seconds' means exactly.
> 
> Let's be unfriendly first:
> 
> Arracc.h:
> ============
> #include <stdint.h>
> 
> inline int readArray(uint64_t *arr, int n, int i);
> inline void writeArray(uint64_t *arr, int n, int i, int b);
> ============
> 
> Arracc.c:
> ============
> #include <stdlib.h>
> #include "Arracc.h"
> 
> inline int readArray(uint64_t *arr, int n, int i){
>    if (i < 0 || i > n){
>        perror("Error in array index\n");
>        exit(EXIT_FAILURE);
>    }
>    return (arr[i >> 6] >> (i & 63)) & 1;
> }
> 
> inline void writeArray(uint64_t *arr, int n, int i, int b){
>    if (i < 0 || i > n){
>        perror("Error in array index\n");
>        exit(EXIT_FAILURE);
>    }
>    if (b) {
>        arr[i >> 6] |= (1ull << (i&63));
>    } else {
>        arr[i >> 6] &= ~(1ull << (i&63));
>    }
> }
> ============
> 
> main.c:
> ============
> #include <stdlib.h>
> #include <stdio.h>
> #include <stdint.h>
> #include <math.h>
> #include "Arracc.h"
> 
> uint64_t *soeA(int n);
> int pCount(uint64_t *arr, int n);
> 
> int main(int argc, char *argv[]){
>    int limit = (argc > 1) ? strtoul(argv[1],NULL,0) : 100000000;
>    uint64_t *sieve = soeA(limit);
>    printf("%d\n",pCount(sieve,limit));
>    free(sieve);
>    return EXIT_SUCCESS;
> }
> 
> uint64_t *soeA(int n){
>    int s = (n >> 6)+1;
>    uint64_t *arr = malloc(s*sizeof *arr);
>    if (arr == NULL){
>        perror("Allocation failed\n");
>        exit(EXIT_FAILURE);
>    }
>    int i, j, n2 = (int)sqrt(n);
>    for(i = 0; i < s; ++i){
>        arr[i] = -1;
>    }
>    writeArray(arr,n,0,0);
>    writeArray(arr,n,1,0);
>    for(i = 2; i <= n2; ++i){
>        if (readArray(arr,n,i)){
>            for(j = i*i; j <= n; j += i){
>                writeArray(arr,n,j,0);
>            }
>        }
>    }
>    return arr;
> }
> 
> int pCount(uint64_t *arr, int n){
>    int i, count = 0;
>    for(i = 0; i <= n; ++i){
>        if (readArray(arr,n,i)){
>            ++count;
>        }
>    }
>    return count;
> }
> ============
> 
> 
> $ gcc -c -O3 Arracc.c
> $ gcc -c -O3 main.c
> $ gcc -O3 main.o Arracc.o -lm
> 
> Takes 18.3 seconds. Oops, slower than the Haskell programme.
> 
> Okay, ghc does some cross-module inlining and that enables further 
> optimisations which we denied gcc here, so recompile everything with -flto 
> additionally. That brings the C version down to 12.8 seconds.
> 
> Much better, but still not overwhelming. gcc can do a little better with 
> everything in one file, 12.72 seconds.
> 
> We can squeeze out a little more by omitting the bounds checks for array 
> indexing, 12.63 seconds.
> 
> But if we do the same for our Haskell programme and replace 
> readArray/writeArray with unsafeRead/unsafeWrite, it becomes faster too: 
> 14.3 seconds.
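> 
> (Concretely, the unchecked version of the sieve looks like this; soeST' is 
> just my name for it, and unsafeRead/unsafeWrite come from Data.Array.Base 
> in the array package. They skip the bounds check and index from 0, which 
> coincides with the Ix index here because the lower bound is 0:)
> ========================
> {-# LANGUAGE ScopedTypeVariables #-}
> import Control.Monad (when)
> import Control.Monad.ST (ST)
> import Data.Array.ST (STUArray, newArray)
> import Data.Array.Base (unsafeRead, unsafeWrite)
> 
> -- soeST with the checked readArray/writeArray swapped for the unchecked
> -- operations; the guards validate every index first, so dropping the
> -- second bounds check is safe.
> soeST' :: forall s. Int -> ST s (STUArray s Int Bool)
> soeST' n = do
>     arr <- newArray (0, n) True
>     mapM_ (\i -> unsafeWrite arr i False) [0, 1]
>     let n2 = floor (sqrt (fromIntegral n :: Double))
> 
>     let loop :: Int -> ST s ()
>         loop i | i > n2 = return ()
>         loop i = do
>             b <- unsafeRead arr i
> 
>             let reset :: Int -> ST s ()
>                 reset j | j > n = return ()
>                 reset j = unsafeWrite arr j False >> reset (j + i)
> 
>             when b (reset (i*i))
> 
>             loop (succ i)
> 
>     loop 2
>     return arr
> ========================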
> 
> I have to admit that I don't know why the bounds-checking costs very little 
> in C and a substantial amount in Haskell, but hey, who refuses a free 
> optimisation?
> 
> Lesson 1: Omit pointless array bounds checks.
> An array bounds check is pointless if you have just checked the validity of 
> the index on the line above.
> 
> Where are we?
> 
> We have two not overwhelmingly fast programmes, Haskell (ghc-7.2.2) about 
> 13% slower than C (gcc-4.5.1).
> 
> Not a bad result for ghc, but the algorithm is less than optimal.
> 
> What's the deal?
> 
> We cross off multiples of primes 2549489372 times, and we need (10^9+1)/8 
> bytes of memory for the sieve.
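> 
> (For the record, that count falls straight out of the inner loop: each 
> prime p <= sqrt(N) crosses off p*p, p*p+p, ..., N, i.e. N/p - p + 1 cells. 
> A throwaway check, with crossings just an illustrative name; summing it 
> for N = 10^9 should reproduce the figure above:)
> ========================
> -- Total crossings-off of the naive sieve up to n: for each prime
> -- p <= sqrt n, the loop touches p*p, p*p+p, ..., n, which is
> -- (n `div` p) - p + 1 cells.  Lazy trial division is plenty fast
> -- for the ~3400 primes below sqrt (10^9).
> crossings :: Int -> Integer
> crossings n = sum [ fromIntegral (n `div` p - p + 1)
>                   | p <- takeWhile (\p -> p*p <= n) primes ]
>   where
>     primes = 2 : filter isPrime [3, 5 ..]
>     isPrime m = all (\p -> m `mod` p /= 0)
>                     (takeWhile (\p -> p*p <= m) primes)
> ========================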
> 
> Both numbers are higher than is good for us. Since the sieve is so large 
> and we cross off the multiples of each prime sequentially until we reach 
> the sieve limit, we have lots of cache misses. Bad for performance.
> And each crossing-off takes a couple of clock cycles,
> 
> arr[i >> 6] &= ~(1ull << (i&63));
> 
> shift index, fetch value, (index & 63), shift 1, complement, and with 
> value, store value.
> 
> I'm no CPU expert, so I don't know how many cycles that takes, but even 
> though some of the operations can be done in parallel on modern CPUs, it's 
> still several cycles.
> 
> Lesson 2: Avoid duplicate work and redundant data.
> It's easy to separate even and odd numbers, there aren't many even primes, 
> and it's unnecessary to mark even numbers as multiples of odd primes.
> 
> Separating the crossing-off of even and odd numbers, crossing off only odd 
> multiples of odd primes, reduces the number of crossings-off by almost 40% 
> and the running time for the C programme to 8.63 seconds (I haven't done 
> that for the Haskell programme).
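> 
> Sketched but not timed, the corresponding change on the Haskell side could 
> look like this (soeSTodd is just an illustrative name):
> ========================
> {-# LANGUAGE ScopedTypeVariables #-}
> import Control.Monad (forM_, when)
> import Control.Monad.ST (ST)
> import Data.Array.ST (STUArray, newArray, readArray, writeArray)
> 
> -- Even numbers are wiped in one pass on behalf of the prime 2; odd
> -- primes then step by 2*p, so they only ever touch odd cells.
> soeSTodd :: forall s. Int -> ST s (STUArray s Int Bool)
> soeSTodd n = do
>     arr <- newArray (0, n) True
>     mapM_ (\i -> writeArray arr i False) [0, 1]
>     forM_ [4, 6 .. n] $ \i -> writeArray arr i False
>     let n2 = floor (sqrt (fromIntegral n :: Double))
> 
>     let loop :: Int -> ST s ()
>         loop i | i > n2 = return ()
>         loop i = do
>             b <- readArray arr i
> 
>             let reset :: Int -> ST s ()
>                 reset j | j > n = return ()
>                 reset j = writeArray arr j False >> reset (j + 2*i)
> 
>             when b (reset (i*i))   -- i*i is odd, and the step 2*i keeps it odd
> 
>             loop (i + 2)
> 
>     loop 3
>     return arr
> ========================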
> 
> If we go the small step further and eliminate the even numbers from the 
> sieve completely, we reduce the required memory by half and the crossings-
> off by almost 60% (the fraction of eliminated crossings-off slowly 
> decreases to 50% for higher limits, but as far as today's RAM takes us, it 
> remains close to 60%).
> 
> At the cost of very slightly more complicated code, we can thus reduce the 
> running time to 6.06 seconds for C and 6.49 seconds for Haskell (about 7% 
> slower than C, not bad at all).
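> 
> An odds-only version of the Haskell sieve could look roughly like this 
> (a sketch under the convention cell i <-> 2*i+1; soeSTodds and countOdds 
> are made-up names, not the code behind the 6.49 seconds):
> ========================
> {-# LANGUAGE ScopedTypeVariables #-}
> import Control.Monad (when)
> import Control.Monad.ST (ST)
> import Data.Array.ST (STUArray, newArray, readArray, writeArray, runSTUArray)
> import Data.Array.Unboxed (UArray, elems)
> 
> -- Cell i stands for the odd number 2*i+1, so the array is half the size
> -- and even numbers never appear at all.  For p = 2*i+1 the square p*p
> -- sits at index (p*p - 1) `div` 2 = 2*i*(i+1), and a step of p in index
> -- space is a step of 2*p among the odd numbers.
> soeSTodds :: forall s. Int -> ST s (STUArray s Int Bool)
> soeSTodds n = do
>     let lastIdx = (n - 1) `div` 2      -- largest index, 2*lastIdx+1 <= n
>     arr <- newArray (0, lastIdx) True
>     writeArray arr 0 False             -- 1 is not prime
>     let sqrtIdx = (floor (sqrt (fromIntegral n :: Double)) - 1) `div` 2
> 
>     let loop :: Int -> ST s ()
>         loop i | i > sqrtIdx = return ()
>         loop i = do
>             b <- readArray arr i
>             let p     = 2*i + 1
>                 start = 2*i*(i + 1)
>                 reset :: Int -> ST s ()
>                 reset j | j > lastIdx = return ()
>                 reset j = writeArray arr j False >> reset (j + p)
>             when b (reset start)
>             loop (succ i)
> 
>     loop 1
>     return arr
> 
> -- The +1 accounts for the prime 2, which the odds-only array cannot hold.
> countOdds :: Int -> Int
> countOdds n
>     | n < 2     = 0
>     | otherwise = 1 + length (filter id (elems (runSTUArray (soeSTodds n))))
> ========================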
> 
> That's a pretty good start, for some purposes it's already good enough.
> 
> If we need more, the next step is to eliminate multiples of 3 from the 
> sieve. That reduces the memory requirements by a further factor of 2/3, and 
> the number of crossings-off by a further (mumble, didn't do an exact 
> calculation, educated guess) roughly 40% (that fraction decreases to 1/3 
> for higher limits, but again very slowly).
> 
> The code becomes a bit more complicated again - more than in the previous 
> step, but still fairly straightforward.
> 
> The running time reduces by another factor of 0.65-0.7 or so. We'd be in 
> the region of 4-5 seconds then.
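> 
> To make that step concrete: after eliminating multiples of 2 and 3, the 
> only candidates left (besides 2 and 3 themselves) are the numbers congruent 
> to 1 or 5 mod 6, i.e. two cells per block of six numbers, which is where 
> the extra 2/3 memory factor comes from. A small illustration of the 
> candidate enumeration (wheel23 and primesViaWheel are made-up names; this 
> is just the wheel, not the optimised sieve):
> ========================
> -- Numbers coprime to 6, starting at 5: 5, 7, 11, 13, 17, 19, 23, 25, ...
> wheel23 :: [Int]
> wheel23 = 5 : 7 : map (+ 6) wheel23
> 
> -- Slow trial-division check over the wheel candidates, only to verify
> -- that no prime other than 2 and 3 is skipped.
> primesViaWheel :: Int -> [Int]
> primesViaWheel n = takeWhile (<= n) (2 : 3 : filter isPrime wheel23)
>   where
>     isPrime m = all (\d -> m `mod` d /= 0)
>                     (takeWhile (\d -> d*d <= m) (2 : 3 : wheel23))
> ========================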
> 
> One can eliminate the multiples of further small primes, for each prime the 
> additional code complexity increases and the reduction in work/running time 
> decreases. At some point, the slowdown from the more complex code 
> annihilates the gain from the reduced work. I stopped after eliminating 
> multiples of 5; eliminating multiples of 7 as well might be worth it, but 
> I'm not convinced.
> 
> Using a segmented sieve provides better locality at the cost of more 
> complex code. If the sieve is small enough to fit in the L2 cache and large 
> enough that some significant work can be done per segment, it's a net gain.
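> 
> Schematically, a segmented count could be organised like this (a sketch 
> only: segmentedCount, smallSieve and segSize are my own names, and the 
> window size is a placeholder, not a tuned value):
> ========================
> import Control.Monad (forM_, when)
> import Control.Monad.ST (ST)
> import Data.Array.ST (STUArray, newArray, readArray, writeArray, runSTUArray)
> import Data.Array.Unboxed (UArray, assocs, elems)
> 
> -- Sieve the base primes up to sqrt n once, then walk over fixed-size
> -- windows [lo, lo+segSize-1] that fit in cache, crossing off multiples
> -- of the base primes inside each window and counting the survivors.
> segmentedCount :: Int -> Int
> segmentedCount n
>     | n < 2     = 0
>     | otherwise = go 2 0
>   where
>     segSize    = 256 * 1024                  -- placeholder window size
>     sqrtN      = floor (sqrt (fromIntegral n :: Double))
>     basePrimes = [p | (p, True) <- assocs (smallSieve sqrtN)]
> 
>     go lo acc
>         | lo > n    = acc
>         | otherwise = go (hi + 1) (acc + inSegment lo hi)
>       where hi = min n (lo + segSize - 1)
> 
>     inSegment lo hi =
>         length (filter id (elems (runSTUArray (sieveSegment lo hi))))
> 
>     sieveSegment :: Int -> Int -> ST s (STUArray s Int Bool)
>     sieveSegment lo hi = do
>         seg <- newArray (lo, hi) True
>         forM_ basePrimes $ \p -> do
>             -- first multiple of p inside the window, but never p itself
>             let start = max (p * p) (((lo + p - 1) `div` p) * p)
>             forM_ [start, start + p .. hi] $ \j -> writeArray seg j False
>         return seg
> 
> -- Plain little sieve for the base primes (index = number).
> smallSieve :: Int -> UArray Int Bool
> smallSieve m = runSTUArray build
>   where
>     build :: ST s (STUArray s Int Bool)
>     build = do
>         arr <- newArray (0, max 1 m) True
>         writeArray arr 0 False
>         writeArray arr 1 False
>         forM_ [2 .. floor (sqrt (fromIntegral m :: Double))] $ \i -> do
>             b <- readArray arr i
>             when b (forM_ [i*i, i*i + i .. m] (\j -> writeArray arr j False))
>         return arr
> ========================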
> 
> tl;dr: The naive algorithm we started from is too slow for higher limits; 
> to get goodish performance, it has to be tweaked heavily. But Haskell plays 
> in the same league as C (at least the C I've written).
> 
>> 
>>>> I haven't tried it, but an equivalent C implementation can easily
>>>> compute the first 10^9 prime numbers within seconds.
>>> 
>>> You mean the primes up to 10^9 here?
>> 
>> Yes, sorry.  And I was referring to the Sieve of Atkin there, but you
>> are right.  That one is only slightly faster.
> 
> Well, D.J. Bernstein's primegen is much faster. With the sieve size adapted 
> to my L1 cache, it counts the primes to 10^9 in 0.68 seconds.
> But it's only heavily optimised for primes to 2^32. Above that, my sieve 
> catches up (and overtakes it around 5*10^11, yay).
> 
>> 
>> 
>> Greets,
>> Ertugrul
> 
> Cheers,
> Daniel
> 
> _______________________________________________
> Beginners mailing list
> Beginners at haskell.org
> http://www.haskell.org/mailman/listinfo/beginners


Best regards,
Zhi-Qiang Lei
zhiqiang.lei at gmail.com



