[Haskell-cafe] Re: FASTER primes

Sat Jan 2 23:08:45 EST 2010

Am Samstag 02 Januar 2010 14:13:29 schrieb Will Ness:
> Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > Am Mittwoch 30 Dezember 2009 20:46:57 schrieb Will Ness:
> > > Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > > > Am Dienstag 29 Dezember 2009 20:16:59 schrieb Daniel Fischer:
> > > > > > especially the claim that going by primes squares
> > > > > > is "a pleasing but minor optimization",
> > > > >
> > > > > Which it is not. It is a major optimisation. It reduces the
> > > > > algorithmic complexity *and* reduces the constant factors
> > > > > significantly.
> > > >
> > > > D'oh! Thinko while computing sum (takeWhile (<= n) primes) without
> > > > paper. It doesn't change the complexity, and the constant factors are
> > > > reduced far less than I thought.
> > >
> > > I do not understand. Turner's sieve is
> > >
> > >   primes = sieve [2..]
> > >    where
> > >     sieve (p:xs) = p : sieve [x | x<-xs, x `mod` p /= 0]
> > >
> > > and the Postponed Filters is
> > >
> > >   primes = 2: 3: sieve (tail primes) [5,7..]
> > >    where
> > >     sieve (p:ps) xs = h ++ sieve ps [x | x<-t, x `rem` p /= 0]
> > >                       where (h,~(_:t)) = span (< p*p) xs
> > >
> > > Are you saying they both exhibit same complexity?
> >
> > No. They don't.
> > But if you're looking at an imperative (mutable array) sieve (that's
> > simpler to analyse because you don't have to take the book-keeping costs
> > of your priority queue, heap or whatever into account), if you start
> > crossing out
>
> The key question then, is _*WHEN*_ and not _*WHAT*_. As is clearly
> demonstrated by the case of Turner/Postponed filters, the work that is done
> (of crossing numbers off) is the same

Crossing off is only part of the work. Most of the work is checking whether to cross off 
the number in this round. And Turner does a lot more of that than the postponed filters.

> - _when_ it is actually done - but
> Turner's starts out so _prematurely_ that it is busy doing nothing most of
> the time.

It's not doing nothing. It's just doing a lot of superfluous work. It is _achieveing_ 
nothing most of the time, though.
Take 7919, the thousandth prime. The postponed filters decide to keep it when fitering out 
the multiples of 89, the twenty-fourth prime. Turner also divides it by all 975 primes in 
between. That is a lot of real but futile work.

> Thus its function call overhead costs pile up enormously,
> overstaging the actual calculation.

It's not only the function calls. Division is expensive, too.

>
> So analyzing that calculation in the premature execution setting is missing
> the point, although helpful after we fix this, with the Postponed Filters.
> _Only then_ the finer points of this algorithm's analysis can be applied -
> namely, of avoiding testing primes divisibility altogether. And _if_ a fast
> cheap primality test were to have existed, the filtering versions would win

Sorry, I can't follow. What's the point of a primality test here? Every number whose 
multiples we want to remove is prime, what's left to test?

> over, because they progressively cull the input sequence so there would be
> no double hits as we have when merging the multiples (whether from lists or
> inside the PQ).

But there's a lot of list constructuion and deconstruction necessary for the Euler sieve. 
That may be more work than the multiple hits cause.

>
> > the multiples of p with 2*p, you have
> >
> > sum [bound `div` p - 1 | p <- takeWhile (<= sqrt bound) primes]
> >
> > crossings-out, that is Theta(bound*log (log bound)). If you eliminate
> > multiples of some small primes a priori (wheel), you can reduce the
> > constant factor significantly, but the complexity remains the same (you
> > drop a few terms from the front of the sum and multiply the remaining
> > terms with phi(n)/n, where n is the product of the excluded primes).
> >
> > If you start crossing out at p^2, the number is
> >
> > sum [bound `div` p - (p-1) | p <- takeWhile (<= sqrt bound) primes].
> >
> > The difference is basically sum (takeWhile (<= sqrt bound) primes), which
> > I stupidly - I don't remember how - believed to cancel out the main term.
> > It doesn't, it's O(bound/log bound), so the complexity is the same.
> >
> > Now if you take a stream of numbers from which you remove composites,
> > having a priority queue of multiples of primes, things are a little
> > different. If you start crossing out at 2*p, when you are looking at n,
> > you have more multiples in your PQ than if you start crossing out at p^2
> > (about pi(n/2) vs. pi(sqrt n)), so updating the PQ will be more
> > expensive. But updating the PQ is O(log size), I believe, and log pi(n)
> > is O(log pi(sqrt n)), so I think it shouldn't change the complexity here
> > either. I think this would have complexity O(bound*log bound*log (log
> > bound)).
>
> There are two questions here - where to start crossing off numbers, and
> when. If you'd start at 2*p maybe the overall complexity would remain the
> same but it'll add enormous overhead with all those duplicate multiples.

The additional duplicate multiples aren't the problem. Sure, the numbers having a prime 
divisor larger than the square root would be crossed off one additional time,  but that 
isn't so much per se. The additional crossings off are O(bound), so they don't harm the 
time complexity. But they can have effects which multiply the running time by a large 
constant.

> No, the question is not where to start, but when. PQ might hide the problem
> until the memory blows up. Anything that we add that won't have any chance
> of contributing to the final result, is added for nothing and only drives
> the total cost up needlessly.

Well, if you start crossing off at p^2 and feed your PQ from a separate prime generator, 
the memory blow-up is far away. E.g., when the main sieve is at 10^12, the PQ contains 
less than 80000 multiples. No space problem in sight. At 10^14, the PQ's size is still 
less than 700000.  That's noticeable, but there's still plenty of space.
If, on the other hand, you start crossung off at 2*p, when the main sieve is at 10^7, the 
size of the PQ is > 650000, at 10^8, the size is more than 5.5 million. That starts to 
become a memory problem rather soon.

>
> > > I was under impression that the first shows O(n^2) approx., and the
> > > second one O(n^1.5) (for n primes produced).
> >
> > In Turner/postponed filters, things are really different. Actually, Ms.
> > O'Neill is right, it is a different algorithm. In the above, we match
> > each
>
> what _is_ different is divisibility testing vs composites removal, which
> follows from her in-depth analysis although is never quite formulated in
> such words in her article. But nothing matters until the premature starting
> up is eliminated, and that key observation is missing for the article
> either - worse, it is brushed off with the casual remark that it is "a
> pleasing but minor optimization". Which remark, as you show here, is true
> in the imperative, mutable-storage setting, but is made in an article abut
> functional code, in regard to the functional code of Turner's sieve.

I think that remark was meant to apply to composite removal, not Turner's sieve.
But even then, the 'minor' isn't adequate considering her PQ approach where, although it 
doesn't change time complexity (in theory), it changes space complexity - and that may 
affect running time drastically. Probably she was thinking of the typical array sieve when 
she made that remark.

> So the key understanding is overlooked.
>
> Her article even adds the primes into the PQ prematurely itself, as soon as
> the prime is discovered (she fixes that in her ZIP package). With the PQ
> keeping these prematurely added elements deep inside its guts, the problem
> might not even manifest itself immediately, but the memory blow-up would
> eventually hit the wall (having the PQ contain all the preceding primes,
> instead of just those below the square root of a limit - all the entries
> above the square root not contributing to the calculation at all).
>
> And what we're after here is the insight anyway. Skipping steps of natural
> development does not contribute to garnering an insight. Most prominent
> problem with Turner's /code/ _specification_, is the premature start ups,
> and divisibility testing of primes (as Melissa O'Neill's analysis shows).
> Fix one, and you get Postponed Filters (which should really be used as a
> basis reference point for all the rest). Fix the other - and you've got the
> Euler's sieve:
>
>     primes = 2: 3: sieve (tail primes) [5,7..]
>      where
>       sieve (p:ps) xs = h ++ sieve ps [t `minus` tail [p*p, p*p+2*p..]]
>                         where (h,~(_:t)) = span (< p*p) xs
>
> Clear, succinct, and plenty efficient for an introductory textbook
> functional lazy code, of the same order of magnitude performance as PQ code
> on odds only.
>
> Now comparing the PQ performance against _that_ would only be fare, and IMO
> would only add to its value - it is faster, has better asymptotics, and is
> greatly amenable to the wheel optimization right away.
>
> > prime only with its multiples (plus the next composite in the PQ
> > version). In Turner's algorithm, we match each prime with all smaller
> > primes (and each composite with all primes <= its smallest prime factor,
> > but that's less work than the primes). That gives indeed a complexity of
> > O(pi(bound)^2).
> >
> > In the postponed filters, we match each prime p with all primes <= sqrt p
> > (again, the composites contribute less). I think that gives a complexity
> > of O(pi(bound)^1.5*log (log bound)) or O(n^1.5*log (log n)) for n primes
> > produced.
>
> Empirically, it was always _below_ 1.5, and above 1.4 , and PQ/merged
> multiples removal around 1.25..1.17 .

I should really stop doing those calculations in my head, it seems I'm getting too old for 
that. Doing it properly, on paper, the cost for identifying p_k is pi(sqrt p_k) [trial 
division by all primes less than sqrt p_k].
p_k is approximately k*log k, so pi(sqrt p_k) is approximately 
(sqrt $ k* log k)/log (sqrt $ k*log k) ~ 2*sqrt (k/log k).
Summing that, the cost for identifying the first n primes is Theta(n^1.5/log n).
Using a feeder, i.e. primes = 2:3.sieve feederprimes [5, 7 .. ], where feederprimes are 
generated as above, to reduce memory pressure and GC impact, that fits pretty well with my 
measurements of the time to find p_k for several k between 500,000 and 20,000,000.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20100102/0654e4d2/attachment-0001.html