isSpace is too slow

Sun May 20 11:59:07 EDT 2007

On Sun, 2007-05-20 at 13:42 +0100, Neil Mitchell wrote:
> Hi
> 
> > Want to try also the
> >     Data.ByteString.Base.isSpaceWord8 :: Word8 -> Bool
> 
> isspace: 0.375
> iswspace: 0.400
> ByteString: 0.460
> Char: 0.672
> 
> Not as fast as isspace/iswspace, but quite a bit faster than Char.
> Perhaps someone needs to take a peek at the generated ASM for each of
> these routines and see where the Haskell one is falling down.

When I looked at this before I think it was mainly because they're doing
a lot of Unicode stuff. Here's the code

-- | Selects white-space characters in the Latin-1 range.
-- (In Unicode terms, this includes spaces and some control characters.)
isSpace                 :: Char -> Bool
-- isSpace includes non-breaking space
-- Done with explicit equalities both for efficiency, and to avoid a tiresome
-- recursion with GHC.List elem
isSpace c               =  c == ' '     ||
                           c == '\t'    ||
                           c == '\n'    ||
                           c == '\r'    ||
                           c == '\f'    ||
                           c == '\v'    ||
                           c == '\xa0'  ||
                           iswspace (fromIntegral (ord c)) /= 0

iswspace does a generic lookup in the unicode property database I think.
We could short-cut that for ascii characters. If it's less than 255 and
not one of the listed space chars then we know it's not a space char
without having to consult iswspace. In fact it might be best to check if
it's less than 255 initially and then go straight for the generic
unicode iswspace or do the specialised ascii test.

It's also worth testing if on the ascii subset if a bitvector test is
faster than several comparisons (in the average case of a non-space).

Note also that the space chars in the ascii subset are within the first
33 code points (ord ' ' = 32). I'm not sure if that fact actually helps
us to optimise a bitvector test.

Duncan