Improving Data.Char.isSpace performance
Simon Peyton-Jones
simonpj at microsoft.com
Mon Oct 29 23:29:15 CET 2012
Sounds good to me. Thanks for doing this.
When you think you are ready, just submit a patch. (As others have noted, maybe isSpace isn't the only function that could benefit from this kind of attention.)
Simon
| -----Original Message-----
| From: libraries-bounces at haskell.org [mailto:libraries-bounces at haskell.org] On
| Behalf Of John MacFarlane
| Sent: 28 October 2012 19:16
| To: libraries at haskell.org
| Subject: Improving Data.Char.isSpace performance
|
| I think that 'isSpace' from Data.Char (and hence also 'words' from the
| Prelude) is not as fast as it could be. Here is the definition
| (which is actually found in GHC.Unicode):
|
| isSpace :: Char -> Bool
| isSpace c =
| c == ' ' ||
| c == '\t' ||
| c == '\n' ||
| c == '\r' ||
| c == '\f' ||
| c == '\v' ||
| c == '\xa0' ||
| iswspace (fromIntegral (C.ord c)) /= 0
|
| I presume that the point of the disjuncts at the beginning is to
| avoid the call to iswspace for the most common space characters.
| The problem is that most characters (in most texts) are not space
| characters, and for nonspace characters iswspace will always be
| called.
|
| So I investigated a possible optimization that would also check for
| the most common nonspace characters before calling iswspace:
|
| isSpace_Alt :: Char -> Bool
| isSpace_Alt c | c > '\x20' && c < '\xa0' = False
| | c == ' ' = True
| | '\t' <= c && c <= '\r' = True
| | c == '\xa0' = True
| | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
|
| In my benchmarks, this function significantly outperforms isSpace.
| I also found that a version of isSpace that does not check
| for nonspace characters, but uses case matching instead of
| a disjunction of equality tests, outperformed isSpace (but
| was usually not as fast as isSpace_Alt, and the difference
| mostly disappears with -O2):
|
| isSpace_Pattern :: Char -> Bool
| isSpace_Pattern c
| | c == ' ' = True
| | '\t' <= c && c <= '\r' = True
| | c == '\xa0' = True
| | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
|
| I benchmarked all three functions against five types of text
| (all ascii, all Greek, Haskell code, characters 0..255, and
| all spaces), and here are the (normalized) results:
|
| Compiled with 'ghc --make':
| --------------------------------------------------------------
| Input isSpace_DataChar isSpace_Pattern isSpace_Alt
| --------------- ---------------- --------------- -----------
| ascii text 1.0 0.54 0.17
| greek text 1.0 0.57 0.71
| haskell code 1.0 0.57 0.24
| chars 0..255 1.0 0.54 0.39
| all space chars 1.0 0.70 0.90
| --------------------------------------------------------------
|
| Compiled with 'ghc --make -O2':
| --------------------------------------------------------------
| Input isSpace_DataChar isSpace_Pattern isSpace_Alt
| --------------- ---------------- --------------- -----------
| ascii text 1.0 0.93 0.40
| greek text 1.0 0.98 0.99
| haskell code 1.0 1.03 0.58
| chars 0..255 1.0 0.92 0.62
| all space chars 1.0 0.88 0.92
| --------------------------------------------------------------
|
| My benchmark program can be found here:
| https://gist.github.com/3967761
|
| I'd like to propose that we consider replacing the definition
| of isSpace with isSpace_Alt.
|
| John
|
|
| _______________________________________________
| Libraries mailing list
| Libraries at haskell.org
| http://www.haskell.org/mailman/listinfo/libraries
More information about the Libraries
mailing list