[Haskell-cafe] Re: FASTER primes

Mon Jan 4 16:25:28 EST 2010

Am Montag 04 Januar 2010 18:08:42 schrieb Will Ness:
> Daniel Fischer <daniel.is.fischer <at> web.de> writes:
> > Am Sonntag 03 Januar 2010 09:54:37 schrieb Will Ness:
> > > The quesion of a memory blowup with the treefolding merge still
> > > remains. For some reason using its second copy for a feeder doesn't
> > > reduce the memory (as reported by standalone compiled program, GHCi
> > > reported values are useless) - it causes it to increase twice.....
> >
> > I have a partial solution. The big problem is that the feeder holds on to
> > the beginning of comps while the main runner holds on to a later part.
> > Thus the entire segment between these two points must be in memory.
> >
> > So have two lists of composites (yeah, you know that, but it didn't work
> > so far).
> >
> > But you have to force the compiler not to share them: enter -fno-cse.
> > The attached code does that (I've also expanded the wheel), it reduces
> > the memory requirements much (a small part is due to the larger wheel, a
> > factor of ~5 due to the non-sharing).
>
> I don't understand. What is there to be shared? Each multiples list is
> consumed only at one point; there's nothing to be shared. Do you mean the
> compiler still hangs on to them? If so, why??

Okay, let's look at the code.
Your Primes.hs says

{-# SPECIALIZE primes :: () -> [Int] #-}
{-# SPECIALIZE primes :: () -> [Integer] #-}
primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
   where
    primes' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps
    (comps,_)  = tfold mergeSP (pairwise mergeSP mults)
    mults   = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes'
    fromList (x:xs) = ([x],xs)

What's going on evaluating primes'?
First, we know primes' is 11 : 13 : more.
Then we can start looking at 'more'. For that, we need 
rollFrom 11 `minus` comps.
rollFrom 11 is easy. What about comps? We need the start of primes' for that. Cool, we 
already know the first two elements, start tfolding, we know comps is 
121:143:169:notYetKnown

So the rollFrom 11 produces happily, minus doesn't veto anything for a while, then says no 
to 121, 143, 169. Now rollFrom 11 says "look at 173". Fine, what's next in comps?
To find that out, we need the multiples of 17 and 19 (and 23 and 29, but those can still 
be deferred for a while). That gives ([289,323,361],whatever), then the mergeSP with the 
merged multiples of 11 and 13 produces (187:209:221:...) for notYetKnown.
Everything's going on fine, but the problem is, when primes' produces primes in the region 
of n, it needs the comps in that region. Those contain the multiples of primes in the 
region of sqrt n, thus the small primes cannot yet be released.
Bottom line, to produce p, all the primes from about sqrt p to p must be in memory. Oops.

Enter the feeder. Say

primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
   where
    primes' = (rollFrom 11) `minus` comps
    primes'' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps
    (comps,_)  = tfold mergeSP (pairwise mergeSP mults)
    mults   = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes''
    fromList (x:xs) = ([x],xs)

Now primes'', the feeder of composites advances much more slowly and for primes' to 
produce p, primes'' must produce the next prime after sqrt p, thus holds on to all primes 
between sqrt (sqrt p) and sqrt p - not much of a problem. primes' doesn't hold on to 
smaller primes, great.
But! primes' needs the part of comps around p. primes'' needs the part of comps around 
sqrt p. That means we must keep the part of the list of composites from sqrt p to p in 
memory. Even after removing the multiples of 2, 3, 5 and 7, it doesn't take long until 
there are far more composites between sqrt p and p than primes. Double oops.

So we must make sure that the list of composites that primes' consumes is not the same as 
that which primes'' consumes.

You can try

primes :: Integral a => () -> [a]
primes () = 2:3:5:7:primes'
   where
    primes' = (rollFrom 11) `minus` comps
    primes'' = [11,13] ++ drop 2 (rollFrom 11) `minus` comps2
    (comps,_)  = tfold mergeSP (pairwise mergeSP mults)
    mults   = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes''
    (comps2,_)  = tfold mergeSP (pairwise mergeSP mults2)
    mults2   = map (\p-> fromList $ map (p*) $ rollFrom p) $ primes''
    fromList (x:xs) = ([x],xs)

If you compile without optimisations, it works. It just is rather slow. Unfortunately, the 
optimiser sees that comps2 and comps are the same and thinks, "why should I produce it 
twice if they're the same?" - whoops, back to square two.
Tell the compiler that you do *not* want it shared via -fno-cse and it isn't shared, you 
really have two distinct lists of composites, happiness ensues - almost.

>
> I used the switch; it didn't help at all. The only thing I can see is
> different is that all my interim data which I named with inner vars you
> moved out to the top level as functions. Is that what did the trick?

Not really, the above works too, with inner local bindings.
If you introduce separate names for the lists of multiples and for the lists of 
composites, -fno-cse ought to do the trick. Maybe it would even work with one list of 
multiples. Can I see what you did that -fno-cse didn't manage to separate the lists?

> What would be the reason to hang on to the already consumed data that is
> inaccessible to any active consumer?

No reason. And that's not what happened. We always had two consumers gnawing on the same 
list, one proceeding fast, the other slow. The part the slow consumer was done with could 
be dropped (and probably was) - the part between the slow and the fast: Not.

> Why not make the forgetful behaviour the norm -
> especially where remembering is pointless??

It is.

>
> > It still uses much more memory than the PQ, and while the PQ's memory
> > requirements grow very slowly, the tree-fold merge's still grow rather
> > fast (don't go much beyond the 10,000,000th prime), I'm not sure why.
>
> You did it! It's now 7M for 1,000,000th prime, instead of 52M before.
> Making the pattern lazy in mergeSP was probably an important fix too. :)

Did I forget to remove them? (checks) Yes. 
No, I put them in without thinking, then realised "Hey, stupid, where-bindings *are* 
already lazy!" But somehow I have forgotten to remove them.
Since let- and where-bindings are lazy, all those could do is make the compiler say 
"Sheesh".

>
> Unfortunately it grows, as you've said - 23MB for 2 mln. :|

And I've found out why. Change the definition of tfold to

tfold f (a: ~(b: ~(c:xs)))
                     = (a `f` (b `f` c)) `f` tfold f xs

and memory stays low (things are going much slower, though).
By always doing a (pairwise f xs), you're getting the composites faster, but at the 
expense of pretty huge thunks (I think, if the merging isn't lazy enough, you get some 
pretty long lists).

You can make a compromise by using the above tfold (which is no longer a tree-fold) and 
grouping (and merging) the multiples in a slower-growing manner, e.g.

compos ps = fst (tfold mergeSP $ nwise 1 mergeSP (pairwise mergeSP (multip ps)))

tfold f (a: ~(b: ~(c:xs)))
                     = (a `f` (b `f` c)) `f` tfold f xs

nwise k f xs = let (ys,zs) = splitAt k xs in rfold f ys : nwise (k+1) f zs

rfold f [x] = x
rfold f (x:xs) = x `f` rfold f xs

memory still grows, but much slower, in my tests, due to the much smaller GC time, it's a 
bit faster than the version with the original tfold.

>
> PQ stays at just 2MB.
>
No doubling the size of merge trees, no big thunks, just occasionally add a filter to the 
PQ. Memory requirements are O(pi(sqrt n)) [off the top of my head, may be wrong again], 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://www.haskell.org/pipermail/haskell-cafe/attachments/20100104/bec158ae/attachment.html