Improving Data.Char.isSpace performance

Edward A Kmett ekmett at gmail.com
Sun Oct 28 22:08:18 CET 2012


+1

It might be worth hunting for similar low hanging fruit as well.

-Edward

On Oct 28, 2012, at 3:16 PM, John MacFarlane <jgm at berkeley.edu> wrote:

> I think that 'isSpace' from Data.Char (and hence also 'words' from the
> Prelude) is not as fast as it could be.  Here is the definition
> (which is actually found in GHC.Unicode):
> 
> isSpace :: Char -> Bool
> isSpace c =
>    c == ' '     ||
>    c == '\t'    ||
>    c == '\n'    ||
>    c == '\r'    ||
>    c == '\f'    ||
>    c == '\v'    ||
>    c == '\xa0'  ||
>    iswspace (fromIntegral (C.ord c)) /= 0
> 
> I presume that the point of the disjuncts at the beginning is to
> avoid the call to iswspace for the most common space characters.
> The problem is that most characters (in most texts) are not space
> characters, and for nonspace characters iswspace will always be
> called.
> 
> So I investigated a possible optimization that would also check for
> the most common nonspace characters before calling iswspace:
> 
> isSpace_Alt :: Char -> Bool
> isSpace_Alt c | c > '\x20' && c < '\xa0' = False
>              | c == ' '                 = True
>              | '\t' <= c && c <= '\r'   = True
>              | c == '\xa0'              = True
>              | otherwise                = iswspace (fromIntegral (C.ord c)) /= 0
> 
> In my benchmarks, this function significantly outperforms isSpace.
> I also found that a version of isSpace that does not check
> for nonspace characters, but uses case matching instead of
> a disjunction of equality tests, outperformed isSpace (but
> was usually not as fast as isSpace_Alt, and the difference
> mostly disappears with -O2):
> 
> isSpace_Pattern :: Char -> Bool
> isSpace_Pattern c
>   | c == ' '                  = True
>   | '\t' <= c && c <= '\r'    = True
>   | c == '\xa0'               = True
>   | otherwise                 = iswspace (fromIntegral (C.ord c)) /= 0
> 
> I benchmarked all three functions against five types of text
> (all ascii, all Greek, Haskell code, characters 0..255, and
> all spaces), and here are the (normalized) results:
> 
> Compiled with 'ghc --make':
> --------------------------------------------------------------
> Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
> --------------- ----------------  ---------------  -----------
> ascii text      1.0               0.54             0.17
> greek text      1.0               0.57             0.71
> haskell code    1.0               0.57             0.24
> chars 0..255    1.0               0.54             0.39
> all space chars 1.0               0.70             0.90
> --------------------------------------------------------------
> 
> Compiled with 'ghc --make -O2':
> --------------------------------------------------------------
> Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
> --------------- ----------------  ---------------  -----------
> ascii text      1.0               0.93             0.40
> greek text      1.0               0.98             0.99
> haskell code    1.0               1.03             0.58
> chars 0..255    1.0               0.92             0.62
> all space chars 1.0               0.88             0.92
> --------------------------------------------------------------
> 
> My benchmark program can be found here:
> https://gist.github.com/3967761
> 
> I'd like to propose that we consider replacing the definition
> of isSpace with isSpace_Alt.
> 
> John
> 
> 
> _______________________________________________
> Libraries mailing list
> Libraries at haskell.org
> http://www.haskell.org/mailman/listinfo/libraries



More information about the Libraries mailing list