Improving Data.Char.isSpace performance

John MacFarlane jgm at berkeley.edu
Mon Oct 29 18:52:37 CET 2012


I have experimented with a couple of variants that seem
better than the definition I originally proposed.

The most promising is

isSpace_Alt6 :: Char -> Bool
{-# INLINE isSpace_Alt6 #-}
isSpace_Alt6 ' '               = True
isSpace_Alt6 '\n'              = True
isSpace_Alt6 '\t'              = True
isSpace_Alt6 '\r'              = True
isSpace_Alt6 '\x0B'            = True
isSpace_Alt6 '\x0C'            = True
isSpace_Alt6 '\xA0'            = True
isSpace_Alt6 c | c <  '\x1680' = False
               | otherwise     = iswspace (fromIntegral (C.ord c)) /= 0

Benchmarks can be found here:

the program          : http://johnmacfarlane.net/isSpace/BenchIsSpace.hs
results:
  with ghc --make    : http://johnmacfarlane.net/isSpace/unoptimized.html
  with ghc --make -O2: http://johnmacfarlane.net/isSpace/optimized.html

John

+++ John MacFarlane [Oct 28 12 12:16 ]:
> I think that 'isSpace' from Data.Char (and hence also 'words' from the
> Prelude) is not as fast as it could be.  Here is the definition
> (which is actually found in GHC.Unicode):
> 
> isSpace :: Char -> Bool
> isSpace c =
>     c == ' '     ||
>     c == '\t'    ||
>     c == '\n'    ||
>     c == '\r'    ||
>     c == '\f'    ||
>     c == '\v'    ||
>     c == '\xa0'  ||
>     iswspace (fromIntegral (C.ord c)) /= 0
> 
> I presume that the point of the disjuncts at the beginning is to
> avoid the call to iswspace for the most common space characters.
> The problem is that most characters (in most texts) are not space
> characters, and for nonspace characters iswspace will always be
> called.
> 
> So I investigated a possible optimization that would also check for
> the most common nonspace characters before calling iswspace:
> 
> isSpace_Alt :: Char -> Bool
> isSpace_Alt c | c > '\x20' && c < '\xa0' = False
>               | c == ' '                 = True
>               | '\t' <= c && c <= '\r'   = True
>               | c == '\xa0'              = True
>               | otherwise                = iswspace (fromIntegral (C.ord c)) /= 0
> 
> In my benchmarks, this function significantly outperforms isSpace.
> I also found that a version of isSpace that does not check
> for nonspace characters, but uses case matching instead of
> a disjunction of equality tests, outperformed isSpace (but
> was usually not as fast as isSpace_Alt, and the difference
> mostly disappears with -O2):
> 
> isSpace_Pattern :: Char -> Bool
> isSpace_Pattern c
>    | c == ' '                  = True
>    | '\t' <= c && c <= '\r'    = True
>    | c == '\xa0'               = True
>    | otherwise                 = iswspace (fromIntegral (C.ord c)) /= 0
> 
> I benchmarked all three functions against five types of text
> (all ascii, all Greek, Haskell code, characters 0..255, and
> all spaces), and here are the (normalized) results:
> 
> Compiled with 'ghc --make':
> --------------------------------------------------------------
> Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
> --------------- ----------------  ---------------  -----------
> ascii text      1.0               0.54             0.17
> greek text      1.0               0.57             0.71
> haskell code    1.0               0.57             0.24
> chars 0..255    1.0               0.54             0.39
> all space chars 1.0               0.70             0.90
> --------------------------------------------------------------
> 
> Compiled with 'ghc --make -O2':
> --------------------------------------------------------------
> Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
> --------------- ----------------  ---------------  -----------
> ascii text      1.0               0.93             0.40
> greek text      1.0               0.98             0.99
> haskell code    1.0               1.03             0.58
> chars 0..255    1.0               0.92             0.62
> all space chars 1.0               0.88             0.92
> --------------------------------------------------------------
> 
> My benchmark program can be found here:
> https://gist.github.com/3967761
> 
> I'd like to propose that we consider replacing the definition
> of isSpace with isSpace_Alt.
> 
> John
> 
> 
> _______________________________________________
> Libraries mailing list
> Libraries at haskell.org
> http://www.haskell.org/mailman/listinfo/libraries



More information about the Libraries mailing list