Improving Data.Char.isSpace performance
John MacFarlane
jgm at berkeley.edu
Mon Oct 29 18:52:37 CET 2012
I have experimented with a couple of variants that seem
better than the definition I originally proposed.
The most promising is
isSpace_Alt6 :: Char -> Bool
{-# INLINE isSpace_Alt6 #-}
isSpace_Alt6 ' ' = True
isSpace_Alt6 '\n' = True
isSpace_Alt6 '\t' = True
isSpace_Alt6 '\r' = True
isSpace_Alt6 '\x0B' = True
isSpace_Alt6 '\x0C' = True
isSpace_Alt6 '\xA0' = True
isSpace_Alt6 c | c < '\x1680' = False
| otherwise = iswspace (fromIntegral (C.ord c)) /= 0
Benchmarks can be found here:
the program : http://johnmacfarlane.net/isSpace/BenchIsSpace.hs
results:
with ghc --make : http://johnmacfarlane.net/isSpace/unoptimized.html
with ghc --make -O2: http://johnmacfarlane.net/isSpace/optimized.html
John
+++ John MacFarlane [Oct 28 12 12:16 ]:
> I think that 'isSpace' from Data.Char (and hence also 'words' from the
> Prelude) is not as fast as it could be. Here is the definition
> (which is actually found in GHC.Unicode):
>
> isSpace :: Char -> Bool
> isSpace c =
> c == ' ' ||
> c == '\t' ||
> c == '\n' ||
> c == '\r' ||
> c == '\f' ||
> c == '\v' ||
> c == '\xa0' ||
> iswspace (fromIntegral (C.ord c)) /= 0
>
> I presume that the point of the disjuncts at the beginning is to
> avoid the call to iswspace for the most common space characters.
> The problem is that most characters (in most texts) are not space
> characters, and for nonspace characters iswspace will always be
> called.
>
> So I investigated a possible optimization that would also check for
> the most common nonspace characters before calling iswspace:
>
> isSpace_Alt :: Char -> Bool
> isSpace_Alt c | c > '\x20' && c < '\xa0' = False
> | c == ' ' = True
> | '\t' <= c && c <= '\r' = True
> | c == '\xa0' = True
> | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
>
> In my benchmarks, this function significantly outperforms isSpace.
> I also found that a version of isSpace that does not check
> for nonspace characters, but uses case matching instead of
> a disjunction of equality tests, outperformed isSpace (but
> was usually not as fast as isSpace_Alt, and the difference
> mostly disappears with -O2):
>
> isSpace_Pattern :: Char -> Bool
> isSpace_Pattern c
> | c == ' ' = True
> | '\t' <= c && c <= '\r' = True
> | c == '\xa0' = True
> | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
>
> I benchmarked all three functions against five types of text
> (all ascii, all Greek, Haskell code, characters 0..255, and
> all spaces), and here are the (normalized) results:
>
> Compiled with 'ghc --make':
> --------------------------------------------------------------
> Input isSpace_DataChar isSpace_Pattern isSpace_Alt
> --------------- ---------------- --------------- -----------
> ascii text 1.0 0.54 0.17
> greek text 1.0 0.57 0.71
> haskell code 1.0 0.57 0.24
> chars 0..255 1.0 0.54 0.39
> all space chars 1.0 0.70 0.90
> --------------------------------------------------------------
>
> Compiled with 'ghc --make -O2':
> --------------------------------------------------------------
> Input isSpace_DataChar isSpace_Pattern isSpace_Alt
> --------------- ---------------- --------------- -----------
> ascii text 1.0 0.93 0.40
> greek text 1.0 0.98 0.99
> haskell code 1.0 1.03 0.58
> chars 0..255 1.0 0.92 0.62
> all space chars 1.0 0.88 0.92
> --------------------------------------------------------------
>
> My benchmark program can be found here:
> https://gist.github.com/3967761
>
> I'd like to propose that we consider replacing the definition
> of isSpace with isSpace_Alt.
>
> John
>
>
> _______________________________________________
> Libraries mailing list
> Libraries at haskell.org
> http://www.haskell.org/mailman/listinfo/libraries
More information about the Libraries
mailing list