A Faster whatIs

Sat, 30 Nov 2002 11:10:47 +0200

Hello everyone:

I made some changes to the Hugs sources which give small runtime
speed gains.  The results are in the table below.  The gains are
minor, but I thought some people in the group might be interested.

                                   WHATIS   WHATIS1    FHEAP

Speed Increase (percent)
              gamteb                 8         9         5
              parser                 6         8       gcfail
              prolog                 9         9         2
              queens                10        13         4

Extra BSS memory (kBytes)           24        24         0

Nbr source lines changed            50       144        41

---Notes On The Table

The table headings WHATIS, WHATIS1, and FHEAP refer to three
different sets of changes.  WHATIS and WHATIS1 are described
below.  FHEAP was an experiment that simply used a fixed size
heap (an array instead of a malloc) and is impractical because
it rules out use of the Hugs -h flag.

The (motley) collection of test programs are from nofib.  
I timed runhugs for each program without command line flags,
except that for queens I used -h1000000.  Different programs
would probably give different results: the usual caveats apply.
These programs were just the nofib ones I had at hand.

I compiled with MSC 5.0 Service Pack 3 using modified Nov 2002
Hugs sources. I ran the timing tests on a Windows 98 machine.

---Strategy

I wanted to see what speed gains could be achieved by taking a
worm's eye view of the code, with no changes to fundamental
algorithms.  I also wanted to make changes that would be limited
to just a few places in the code -- a minimal force approach.

To start off, I ran the MSC profiler.  This confirmed that
whatIs() is a candidate for optimization, as already noted in
Mark's Gopher implementation document and in comments in
storage.c.  But the profiler showed a few other hotspots too.
(An interesting thing to do is sort the profiler results on
execution line-count, which immediately tells you which are the
most executed lines in the program.)

---WHATIS Change

To reduce the time spent in whatIs, I created a byte array
whatCode of whatIs codes.  I then created a whatIs macro which
replaces the whatIs function:

  #define whatIs(c) (isPair(c)?                        \
                      (isTag(fst(c)) ? fst(c) : AP ) : \
                      whatCode[c])

Negative indexing into whatCode is prevented by the isPair.  To
keep the size of whatCode reasonable, I had to reduce the range
of unboxed ints -- the bigger the range, the bigger the whatCode
array.  I settled on a range of 2048 i.e. ints between -1023 and
1024 are unboxed, all others are boxed. The extra boxing will
actually slow down execution of certain Haskell programs, but as
far as I can tell the majority of real Hugs programs would be
unaffected.  Increasing this range to, say, 10 000 would increase
memory usage by 10 000 bytes -- nothing really when you consider
that the memory footprint of winhugs is about 10 MBytes.

This change yields the speed improvements under the WHATIS
column in the table.

---WHATIS1 Change

IsTag is defined as

  #define isTag(c)   (TAGMIN<=(c) && (c)<SPECMIN)

I reduced the cost of IsTag slightly by changing defines in
storage.h so that only box-cell-tags are in the range 1 to
0x7f. (I shifted special cell values down to start at
0x80).  I then defined variants for IsTag and friends:

  #define isTag1(c)  (((c) & TAG_MASK) == 0)

  #define whatIs1(c) (isPair(c)?                         \
                        (isTag1(fst(c))? fst(c) : AP ) : \
                        whatCode[c])

  #define isAp1(c)   (isPair(c) && !isTag1(fst(c)))

whatIs1 is a faster version of whatIs, with the "bug" that it
will return the wrong value if fst(c) is NIL.  When used in
several places as a replacement for whatIs, it gains us a few
speed percentage points as shown in the WHATIS1 column above.
WHATIS1 sits on top of WHATIS so the gains attributable to
WHATIS1 alone are the differences between the percentages in the
two columns.

---Another Candidate

Another candidate for this kind of optimization is the line
in eval():

   if (!isCfun(n) && (ar=name(n).arity)<=(sp-base)) {...

This is one of the most executed lines in the entire program.  

If the stored arity of Cfun's was offset by a largish number,
say 10000, then the above test against isCfun wouldn't be
needed.  This change would introduce inefficiencies elsewhere
(we would have to un-offset the stored arity before using it
elsewhere) but the net effect would probably be a speed gain.  I
shied away from this change because it would require ubiquitous
(though easy) code changes.

---Final Comments

Optimizing code written by Mark Jones is a challenge.  It
becomes a little less daunting if we change the rules by
allowing ourselves to waste some memory in the pursuit of speed,
and to introduce some ugliness into the code. Even so, the speed
improvements I got were small.

If anyone is interested, I would be happy to send the sources.
They are modified Nov 2002 sources with all the changes
demarcated by #define's. Six files changed.

I would be interested to know what results these changes give on
other machines.

---
Stephen Milborrow