Improving Data.Char.isSpace performance
Edward A Kmett
ekmett at gmail.com
Sun Oct 28 22:08:18 CET 2012
+1
It might be worth hunting for similar low hanging fruit as well.
-Edward
On Oct 28, 2012, at 3:16 PM, John MacFarlane <jgm at berkeley.edu> wrote:
> I think that 'isSpace' from Data.Char (and hence also 'words' from the
> Prelude) is not as fast as it could be. Here is the definition
> (which is actually found in GHC.Unicode):
>
> isSpace :: Char -> Bool
> isSpace c =
> c == ' ' ||
> c == '\t' ||
> c == '\n' ||
> c == '\r' ||
> c == '\f' ||
> c == '\v' ||
> c == '\xa0' ||
> iswspace (fromIntegral (C.ord c)) /= 0
>
> I presume that the point of the disjuncts at the beginning is to
> avoid the call to iswspace for the most common space characters.
> The problem is that most characters (in most texts) are not space
> characters, and for nonspace characters iswspace will always be
> called.
>
> So I investigated a possible optimization that would also check for
> the most common nonspace characters before calling iswspace:
>
> isSpace_Alt :: Char -> Bool
> isSpace_Alt c | c > '\x20' && c < '\xa0' = False
> | c == ' ' = True
> | '\t' <= c && c <= '\r' = True
> | c == '\xa0' = True
> | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
>
> In my benchmarks, this function significantly outperforms isSpace.
> I also found that a version of isSpace that does not check
> for nonspace characters, but uses case matching instead of
> a disjunction of equality tests, outperformed isSpace (but
> was usually not as fast as isSpace_Alt, and the difference
> mostly disappears with -O2):
>
> isSpace_Pattern :: Char -> Bool
> isSpace_Pattern c
> | c == ' ' = True
> | '\t' <= c && c <= '\r' = True
> | c == '\xa0' = True
> | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
>
> I benchmarked all three functions against five types of text
> (all ascii, all Greek, Haskell code, characters 0..255, and
> all spaces), and here are the (normalized) results:
>
> Compiled with 'ghc --make':
> --------------------------------------------------------------
> Input isSpace_DataChar isSpace_Pattern isSpace_Alt
> --------------- ---------------- --------------- -----------
> ascii text 1.0 0.54 0.17
> greek text 1.0 0.57 0.71
> haskell code 1.0 0.57 0.24
> chars 0..255 1.0 0.54 0.39
> all space chars 1.0 0.70 0.90
> --------------------------------------------------------------
>
> Compiled with 'ghc --make -O2':
> --------------------------------------------------------------
> Input isSpace_DataChar isSpace_Pattern isSpace_Alt
> --------------- ---------------- --------------- -----------
> ascii text 1.0 0.93 0.40
> greek text 1.0 0.98 0.99
> haskell code 1.0 1.03 0.58
> chars 0..255 1.0 0.92 0.62
> all space chars 1.0 0.88 0.92
> --------------------------------------------------------------
>
> My benchmark program can be found here:
> https://gist.github.com/3967761
>
> I'd like to propose that we consider replacing the definition
> of isSpace with isSpace_Alt.
>
> John
>
>
> _______________________________________________
> Libraries mailing list
> Libraries at haskell.org
> http://www.haskell.org/mailman/listinfo/libraries
More information about the Libraries
mailing list