help wrt semantics / primops for pure prefetches

davean davean at xkcd.com
Thu Nov 27 17:07:32 UTC 2014


Actually, he was already working on them. I just joined him because I've
wanted to have them for a while too.

I don't have a real example using the prefetches yet (we only just got them
working!), but I do have the test case we used to make sure they were
implemented correctly.

The code is http://lpaste.net/4551930812748005376 (I recommend the LLVM
backend when testing this, but it shows the improvement either way), and
here are two separate runs of 5 times each:

http://lpaste.net/5476270150657245184
http://lpaste.net/1229927210207412224

The first prefetches the location we're just about to look at; the second
prefetches 10 lookups ahead. It's completely untuned, and the GC seems to
throw a bunch of garbage into the nursery, but the prefetch version is
notably faster.
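The lpaste code isn't reproduced here, but the look-ahead pattern it uses can be sketched roughly as below. `prefetch` is a hypothetical no-op stand-in for one of GHC's prefetch primops (e.g. `prefetchValue0#` from GHC.Exts); note that a semantically inert pure function like this can simply be discarded by laziness, which is exactly why the pure-prefetch semantics discussed in this thread are subtle.

```haskell
import qualified Data.IntMap.Strict as IM

-- Hypothetical no-op standing in for a real prefetch primop. In real
-- code this would have to be a primop the simplifier cannot drop.
prefetch :: a -> b -> b
prefetch _ k = k

-- Look up every key, touching the value `dist` keys ahead of the one we
-- are about to use, so the memory system has time to fetch it.
lookupsAhead :: Int -> IM.IntMap Int -> [Int] -> [Maybe Int]
lookupsAhead dist m ks = go ks (drop dist ks)
  where
    go []       _     = []
    go (k:rest) ahead = case ahead of
      []     -> IM.lookup k m : go rest []
      (a:as) -> prefetch (IM.lookup a m) (IM.lookup k m : go rest as)
```

With the no-op stand-in the result is identical to plain lookups; only the memory-access timing would differ with a real primop.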
We were just using this to make sure the implementation worked, so we
didn't put too much effort into it. Originally we were working with binary
search and speculative look-ahead, but that was far harder to maintain as
the prefetch ops changed. (A bundled binary search is a lot easier, of
course.) Sadly the early binary search results are of no use: in their
early form, using the prefetches from pure code created massive nursery
garbage pressure. We had one version (very briefly) in IO, I believe, but
I lack any information on it.

Yes, using prefetch well is hard. Yes, the optimal use of prefetch depends
on the exact CPU and memory you have.
The best way to deal with this is of course to plan for it, and for some
code that is warranted. Code that does a lot of random access but doesn't
use too much memory bandwidth can be fairly tolerant of prefetch distance,
though. For example, the demo should drop to near-optimal quickly, and its
performance should take quite a while to start dropping again. I believe on
my system the "nearly optimal" range ran from around 5 to around 40.
One nice version of a prefetch that I talked to Carter about was "stick the
prefetch approximately the right number of instructions ahead of here". I
expect that is too complicated to implement, though.

As we don't have thunk prefetchers currently (sad), I can't give you a demo
showing a speedup in the containers package, but I could write the bundled
binary search demo for you if you like.
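To give a rough idea of what a bundled binary search looks like (this is not our demo code, just a toy sketch over boxed `Data.Array` values, with `prefetch` again a hypothetical no-op stand-in for a real primop): a whole bundle of searches is advanced one bisection step per round, and each round touches every query's midpoint before doing any of the comparisons, so the fetches can overlap.

```haskell
import Data.Array

-- Hypothetical no-op standing in for a real prefetch primop.
prefetch :: a -> b -> b
prefetch _ k = k

type Query = (Int, Int, Int)  -- (key, lo, hi): search interval [lo, hi)

-- One round: prefetch every active query's midpoint, then bisect each.
step :: Array Int Int -> [Query] -> [Query]
step arr qs = foldr pre (map advance qs) qs
  where
    pre (_, lo, hi) k
      | lo < hi   = prefetch (arr ! mid lo hi) k
      | otherwise = k
    advance q@(key, lo, hi)
      | lo >= hi              = q
      | arr ! mid lo hi < key = (key, mid lo hi + 1, hi)
      | otherwise             = (key, lo, mid lo hi)
    mid lo hi = (lo + hi) `div` 2

-- Run all queries to completion, returning each key's insertion index.
bundledSearch :: Array Int Int -> [Int] -> [Int]
bundledSearch arr keys = map (\(_, lo, _) -> lo) (go (map start keys))
  where
    (lo0, hi0) = bounds arr
    start k    = (k, lo0, hi0 + 1)
    go qs
      | all (\(_, lo, hi) -> lo >= hi) qs = qs
      | otherwise                         = go (step arr qs)
```

The bundle gives every prefetch a known lead-in time, which is what makes this shape easier to tune than speculative look-ahead inside a single search.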

On Thu, Nov 27, 2014 at 5:20 AM, Edward Kmett <ekmett at gmail.com> wrote:

> My general experience with prefetching is that it is almost never a win
> when done just on trees, as in the usual mark-sweep or copy-collection
> garbage collector walk. Why? Because the time from when you prefetch to
> when you use the data is too variable. Stack disciplines and prefetch
> don't mix nicely.
>
> If you want to see a win out of it you have to free up some of the
> ordering of your walk, and tweak your whole application to support it. e.g.
> if you want to use prefetching in garbage collection, the way to do it is
> to switch from a strict stack discipline to using a small fixed-sized queue
> on the output of the stack, then feed prefetch on the way into the queue
> rather than as you walk the stack. That paid out for me as a 10-15% speedup
> last time I used it after factoring in the overhead of the extra queue. Not
> too bad for a weekend project. =)
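The queue trick described above can be illustrated with a toy (this is not the actual GC code; `prefetch` is a hypothetical no-op stand-in for a real primop): nodes popped from the stack are prefetched and parked in a small FIFO, and each node is only visited once it falls off the queue, by which time its prefetch has had time to complete.

```haskell
import Data.Sequence (ViewL (..), viewl, (|>))
import qualified Data.Sequence as Seq

data Tree = Leaf !Int | Node Tree Tree

-- Hypothetical no-op standing in for a real prefetch primop.
prefetch :: a -> b -> b
prefetch _ k = k

-- Depth-first walk, but each node popped from the stack is prefetched
-- and parked in a FIFO of size q; it is visited q pops later.
walk :: Int -> Tree -> [Int]
walk q root = go [root] Seq.empty
  where
    go []        pend = drain pend
    go (t:stack) pend =
      prefetch t $
        let pend' = pend |> t
        in if Seq.length pend' > q
             then case viewl pend' of
                    v :< rest -> visit v stack rest
                    EmptyL    -> go stack pend'  -- unreachable: pend' nonempty
             else go stack pend'
    visit (Leaf n)   stack pend = n : go stack pend
    visit (Node l r) stack pend = go (l : r : stack) pend
    drain pend = case viewl pend of
                   EmptyL    -> []
                   v :< rest -> visit v [] rest
```

With q = 0 this degenerates to a plain depth-first walk; with q > 0 the visit order shifts, but every leaf is still reached exactly once.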
>
> Without that sort of known lead-in time, it works out that prefetching is
> usually a net loss or vanishes into the noise.
>
> As for the array ops, davean has a couple of cases w/ those for which the
> prefetching operations are a 20-25% speedup, which is what motivated Carter
> to start playing around with these again. I don't know off hand how easily
> those can be turned into public test cases though.
>
> -Edward
>
> On Thu, Nov 27, 2014 at 4:36 AM, Simon Marlow <marlowsd at gmail.com> wrote:
>
>> I haven't been watching this, but I have one question: does prefetching
>> actually *work*?  Do you have benchmarks (or better still, actual
>> library/application code) that show some improvement?  I admit to being
>> slightly sceptical - when I've tried using prefetching in the GC it has
>> always been a struggle to get something that shows an improvement, and even
>> when I get things tuned on one machine it typically makes things slower on
>> a different processor.  And that's in the GC, doing it at the Haskell level
>> should be even harder.
>>
>> Cheers,
>> Simon
>>
>>
>> On 22/11/2014 05:43, Carter Schonwald wrote:
>>
>>> Hey Everyone,
>>> in
>>> https://ghc.haskell.org/trac/ghc/ticket/9353
>>> and
>>> https://phabricator.haskell.org/D350
>>>
>>> some preliminary work to fix up how the pure versions of the prefetch
>>> primops work is laid out and prototyped.
>>>
>>> However, while it nominally fixes some of the problems with how the
>>> current pure prefetch APIs are fundamentally broken, the simple design
>>> in D350 isn't quite ideal, and I sketch out some other ideas in the
>>> associated ticket #9353.
>>>
>>> I'd like to make sure pure prefetch in 7.10 is slightly less broken
>>> than in 7.8, but either way, it's pretty clear that working out the right
>>> fixed-up design won't happen until 7.12. I.e., whatever makes 7.10, there
>>> WILL have to be breaking changes to fix those primops for 7.12.
>>>
>>> thanks and any feedback / thoughts appreciated
>>> -Carter
>>>
>>>
>>> _______________________________________________
>>> ghc-devs mailing list
>>> ghc-devs at haskell.org
>>> http://www.haskell.org/mailman/listinfo/ghc-devs
>>>
>
>