[Haskell-cafe] [14/16] SBM: Behind the measurements (rationale)
Peter Firefly Brodersen Lund
firefly at vax64.dk
Sat Dec 22 04:17:26 EST 2007
// I am getting sick and tired of working on this project and it's probably
// better to get it fired off than to polish it any further.
//
// This email could benefit from being rewritten from a rough draft into a
// well-crafted letter but that would take a couple of hours.
//
// So here it is, a lot rougher than I'd like -- but it *IS* :)
why such big input files?
Big files are the easiest way to spot non-linearity and bad memory behaviour.
In any case, the files should be big enough to overflow the caches and kick
in the GC.
(Short files are interesting, too, but the big ones provoke more complex
behaviour in the run-time system and the CPU. If the complex behaviour is
well-behaved then the simple behaviour probably is too -- although its
constant factors can still be improved. And if the complex behaviour is bad,
shouldn't that be fixed in any case?)
wait4() returns a struct rusage with info about the child program's resource
usage. Unfortunately, the peak-RSS field (ru_maxrss) is not filled in. This
seems to be a general Unix problem; I've seen complaints on the net that
Solaris doesn't fill it in, either. Another solution was needed.
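For reference, a minimal sketch of the interface in question (not SBM code):
fork a child and ask wait4() for its struct rusage. The ru_maxrss field is
where the peak RSS ought to show up but doesn't.

    /* sketch: wait4() fills a struct rusage for the reaped child, but the
       ru_maxrss field stays 0 here */
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/resource.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid;
        int status;
        struct rusage ru;

        pid = fork();
        if (pid == 0) {
            execlp("true", "true", (char *)NULL);
            _exit(127);
        }
        if (wait4(pid, &status, 0, &ru) < 0) {
            perror("wait4");
            return 1;
        }
        /* should be the child's peak resident set size (kB on Linux) */
        printf("ru_maxrss = %ld\n", ru.ru_maxrss);
        return 0;
    }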
pause-at-end, /proc/self/maps + /proc/self/status. VmHWM is the peak of
VmRSS, which is the Resident (working) Set Size. It doesn't say how much of
that is shared with other processes or the operating system, though. In our
case we don't expect to share anything but some libraries -- and nobody else
wants to share those with us anyway (except for the C library); we are their
only user.
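A minimal sketch of the /proc side of this (an illustration, not the
pause-at-end code itself): pick the VmHWM line out of /proc/self/status.
The same works on /proc/<pid>/status for a child that is held at its end.

    /* read the peak resident set size (VmHWM, in kB) from /proc/self/status */
    #include <stdio.h>
    #include <string.h>

    long vm_hwm_kb(void)
    {
        FILE *f = fopen("/proc/self/status", "r");
        char line[256];
        long kb = -1;

        if (!f)
            return -1;
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "VmHWM:", 6) == 0) {
                sscanf(line + 6, "%ld", &kb);   /* value is in kB */
                break;
            }
        }
        fclose(f);
        return kb;
    }

    int main(void)
    {
        printf("VmHWM = %ld kB\n", vm_hwm_kb());
        return 0;
    }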
I discovered about a week ago that I could probably have used waitid() with
the WNOWAIT flag, but I didn't know about it at the time. Pause-at-end was
quick to write anyway: it took about 15 minutes from wanting to know the
peak memory use to having written and tested the first cut of it.
Pause-at-end is not completely bullet-proof if dynamic libraries get
unloaded before the end of the program is reached. On the other hand, it is
plenty good enough for these tests and conceivably allows more intricate
poking around than the waitid() solution would.
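For completeness, a sketch of the waitid()/WNOWAIT route (again just an
illustration, not tested code; whether the interesting Vm* lines are still
present in /proc/<pid>/status for an exited-but-unreaped child is exactly
the part that would need checking):

    /* observe the child's exit without reaping it, so /proc/<pid>/ is
       still around for inspection before the final waitpid() */
    #define _XOPEN_SOURCE 600
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/wait.h>

    int main(void)
    {
        pid_t pid;
        siginfo_t info;
        char path[64];

        pid = fork();
        if (pid == 0) {
            execlp("true", "true", (char *)NULL);
            _exit(127);
        }
        /* wait for the child to exit but leave it unreaped (WNOWAIT) */
        if (waitid(P_PID, pid, &info, WEXITED | WNOWAIT) < 0) {
            perror("waitid");
            return 1;
        }
        snprintf(path, sizeof path, "/proc/%d/status", (int)pid);
        printf("child has exited; %s can still be poked at here\n", path);

        waitpid(pid, NULL, 0);          /* now reap it for real */
        return 0;
    }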
getting good measurements -- eatmem, dd the input file, and probably the
library should be dd'ed too. It is good to have a "sacrificial run", and
good to measure how good the measurements are (relative std.dev + the
user/sys/real check).
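The relative standard deviation part is just stddev/mean over the timed
runs; a small sketch (not the harness code), fed with the real times quoted
further down:

    /* relative standard deviation of a set of wall-clock times */
    #include <math.h>
    #include <stdio.h>

    double rel_stddev(const double *t, int n)
    {
        double mean = 0.0, var = 0.0;
        int i;

        for (i = 0; i < n; i++)
            mean += t[i];
        mean /= n;
        for (i = 0; i < n; i++)
            var += (t[i] - mean) * (t[i] - mean);
        var /= (n - 1);                  /* sample variance */
        return sqrt(var) / mean;
    }

    int main(void)
    {
        double runs[] = { 1.396, 1.396, 1.396, 1.396, 1.396, 1.397 };
        printf("rel. std.dev = %.4f%%\n", 100.0 * rel_stddev(runs, 6));
        return 0;
    }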
why average -- the disturbances are mostly interrupts, daemons that
everybody has running anyway, and slightly luckier/unluckier physical pages.
These are real
effects that nobody can control anyway. I'm not interested in the best
possible times on an ideal, undisturbed machine with a helpful kernel.
I'm interested in clean times under realistic circumstances. Therefore
average instead of minimum.
why I use real and not user/sys -- because of how blocking reads vs. mmap
vs. madvise/fadvise vs. reading in a separate thread would have to be
handled in the future. User+sys would probably give me better numbers at
the moment and I could switch to real later. Still, I choose to stay with
real (and the difference is marginal, anyway).
It is funny that the exact distribution of time between sys and user
fluctuates a lot. In space-bslc8-lenfil-2, sys varies between 0.160s and
0.244s while real is completely stable: 5x 1.396s and 1x 1.397s.
Look at /proc/interrupts, perhaps copying it before/after each run to a
.intr file? Warn if the interrupt rate is more than 100 (or 1000) Hz + 10%?
Write the date/time + runlevel to platforminfo and/or sysinfo.
barcharts
why barcharts.
Should the time and memory barcharts be scaled to equal lengths? I don't
think so: it is hard to colour them in a text file (colour would work with
less -r and on the console, but not in an email or a text editor), and a
visual difference between them is good. But they should perhaps not be
/that/ different.
Visible markers are shown if the measurements are bad (the 5% real vs.
user/sys check; it is typically within 0.1% on my old laptop when doing a
quick or thorough benchmark, occasionally up to 1% -- and 3% on c/byte-4k,
because that one only takes 56ms in total). The output also prints how
tight the user/sys/real agreement is.
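In sketch form (not the harness code itself), the 5% check amounts to
comparing user+sys against real, something like:

    /* flag a run whose user+sys strays more than 5% from real */
    #include <math.h>
    #include <stdio.h>

    int timing_suspect(double real, double user, double sys)
    {
        return fabs(real - (user + sys)) > 0.05 * real;
    }

    int main(void)
    {
        printf("%d\n", timing_suspect(1.396, 1.236, 0.160));  /* 0: fine    */
        printf("%d\n", timing_suspect(1.396, 1.000, 0.160));  /* 1: suspect */
        return 0;
    }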
microarchitecture -- performance counters. Would be interesting to look at
once the obvious performance problems have been handled. Let's fix the
memory usage of bytestrings, the performance of lazy bytestrings, and
start using registers in the machine code first.
The regularity of the input file probably means that the branch predictors
on all three CPUs can remember the pattern of spaces vs. non-spaces (or at
least part of the pattern). Branch predictors don't just use a two-bit
saturating counter for strongly non-taken/weakly non-taken/weakly
taken/strongly taken; they also try to remember the pattern of
jumps/non-jumps. A more realistic test would use a less regular input file.
This effect is very small compared to the current performance limiters,
though.
cache -- the behaviour turned out to be pretty regular (by eyeballing the
cachegrind reports). Going up a factor of 10 in file size made the number
of accesses go up by a factor of 10 as well, while the miss ratios stayed
the same. The miss ratios differed
a bit between the benchmarks but I don't think it's time to look into that
yet. The data are available, though, for those who can't wait to look
into that.
minor page faults
We gather those through /usr/bin/time -- and could also get the same info
by dumping the right file inside /proc/self/. Probably not important yet,
but it probably will be once all the low-hanging fruit has been picked.
Minor faults are more of a factor on slower OSes than on Linux.
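For what it's worth, getrusage() does fill in the fault counters (unlike
ru_maxrss, see above), so the same numbers can also be read from inside the
process; a sketch:

    /* minor/major page fault counts for the current process */
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rusage ru;

        if (getrusage(RUSAGE_SELF, &ru) == 0)
            printf("minor faults: %ld, major faults: %ld\n",
                   ru.ru_minflt, ru.ru_majflt);
        return 0;
    }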
C files. Buffer size.
Reading it all in one go is slower than (re)using a small buffer. This is
due to cache effects, both in the operating system when copying (the
destination will be cached with a small buffer but uncached with a big
buffer) and in the application (everything will be cached with the small
buffer, nothing with the big buffer). Note that at least the Core and the
Athlon64 have automatic prefetchers that try to fill the cache in advance
so we don't have to wait for the cache misses. That doesn't quite seem to
work here, though.
Older caches had a different write behaviour: they were write-through
instead of the modern (lazy) write-back. For those caches, writing to the
user-space buffer would be slow even when a small buffer is reused all the
time, because we would have to wait for all the writes to be flushed out to
main memory.
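In sketch form, the small-reused-buffer approach looks roughly like this
(not the exact benchmark source; the buffer size is just a placeholder --
the actual tests try different sizes):

    /* count spaces on stdin using a small, reused read buffer */
    #include <stdio.h>
    #include <unistd.h>

    #define BUFSIZE (64 * 1024)

    int main(void)
    {
        static char buf[BUFSIZE];
        long spaces = 0;
        ssize_t n, i;

        while ((n = read(0, buf, sizeof buf)) > 0)
            for (i = 0; i < n; i++)
                if (buf[i] == ' ')
                    spaces++;

        printf("%ld\n", spaces);
        return 0;
    }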
C files.
getchar/getchar_unlocked. A comparison with getchar() is NOT comparing like
with like: getchar() is thread-safe by default (because of libraries), which
is not what the simple Haskell program does. getchar_unlocked() is what the
Haskell programs match. Both getchar() and getchar_unlocked() use a single
buffer for stdin.
getwchar() and getwchar_unlocked() are included at the insistence of wli.
They are much slower, because the encoding depends on the locale at run
time and therefore they can't be macros the way getchar_unlocked() is. With
an indirect jump they should be the same speed as getchar() on the Core and
the Athlon64 -- but curiously they aren't.
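The getchar_unlocked() version boils down to something like this sketch
(not the exact benchmark source):

    /* count spaces on stdin one character at a time, without stream locking */
    #include <stdio.h>

    int main(void)
    {
        int c;
        long spaces = 0;

        while ((c = getchar_unlocked()) != EOF)
            if (c == ' ')
                spaces++;

        printf("%ld\n", spaces);
        return 0;
    }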
C and Haskell integer sizes and other limitations.
Haskell uses unboxed 32-bit signed integers, except in the lazy lenfil
tests. Most of the C programs are simple and just use an int for the space
count; one of them (space-megabuf) is more complicated.
off_t is 64-bit, ssize_t is 32-bit. There is a potential overflow in
c/space-megabuf and a potential 32-bit wrap-around in all my C tests. The
same problem exists in all the Haskell tests, except for the two lenfil
tests that use lazy bytestrings, because they use a 64-bit int for the
length of the intermediate string of just the spaces filtered out from
stdin. Those two can also potentially run out of memory.
In practice, they all have almost the same limit, because the lenfil tests
use about 107MB for the 143MB input file. In other words, the lenfil tests
will run out of virtual address space or RAM or swap at about the same time
that the others run out of bits in a 32-bit signed integer.
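To put numbers on the wrap-around point (illustration only, not from the
benchmark sources):

    /* the counter limits being compared above */
    #include <limits.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* plain int, as used for the space count in most of the C tests */
        printf("32-bit signed counter wraps after %d spaces\n", INT_MAX);
        /* 64-bit length, as used by the lazy bytestring lenfil tests */
        printf("64-bit signed counter wraps after %lld spaces\n",
               (long long)INT64_MAX);
        return 0;
    }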
-Peter