Improving Data.Char.isSpace performance
Ivan Lazar Miljenovic
ivan.miljenovic at gmail.com
Mon Oct 29 23:30:43 CET 2012
On 30 October 2012 04:52, John MacFarlane <jgm at berkeley.edu> wrote:
> I have experimented with a couple of variants that seem
> better than the definition I originally proposed.
>
> The most promising is
>
> isSpace_Alt6 :: Char -> Bool
> {-# INLINE isSpace_Alt6 #-}
> isSpace_Alt6 ' ' = True
> isSpace_Alt6 '\n' = True
> isSpace_Alt6 '\t' = True
> isSpace_Alt6 '\r' = True
> isSpace_Alt6 '\x0B' = True
> isSpace_Alt6 '\x0C' = True
> isSpace_Alt6 '\xA0' = True
> isSpace_Alt6 c | c < '\x1680' = False
> | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
Is there any particular reason you're using a guard rather than a
pattern match for the \x1680 case?
>
> Benchmarks can be found here:
>
> the program : http://johnmacfarlane.net/isSpace/BenchIsSpace.hs
> results:
> with ghc --make : http://johnmacfarlane.net/isSpace/unoptimized.html
> with ghc --make -O2: http://johnmacfarlane.net/isSpace/optimized.html
>
> John
>
> +++ John MacFarlane [Oct 28 12 12:16 ]:
>> I think that 'isSpace' from Data.Char (and hence also 'words' from the
>> Prelude) is not as fast as it could be. Here is the definition
>> (which is actually found in GHC.Unicode):
>>
>> isSpace :: Char -> Bool
>> isSpace c =
>> c == ' ' ||
>> c == '\t' ||
>> c == '\n' ||
>> c == '\r' ||
>> c == '\f' ||
>> c == '\v' ||
>> c == '\xa0' ||
>> iswspace (fromIntegral (C.ord c)) /= 0
>>
>> I presume that the point of the disjuncts at the beginning is to
>> avoid the call to iswspace for the most common space characters.
>> The problem is that most characters (in most texts) are not space
>> characters, and for nonspace characters iswspace will always be
>> called.
>>
>> So I investigated a possible optimization that would also check for
>> the most common nonspace characters before calling iswspace:
>>
>> isSpace_Alt :: Char -> Bool
>> isSpace_Alt c | c > '\x20' && c < '\xa0' = False
>> | c == ' ' = True
>> | '\t' <= c && c <= '\r' = True
>> | c == '\xa0' = True
>> | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
>>
>> In my benchmarks, this function significantly outperforms isSpace.
>> I also found that a version of isSpace that does not check
>> for nonspace characters, but uses case matching instead of
>> a disjunction of equality tests, outperformed isSpace (but
>> was usually not as fast as isSpace_Alt, and the difference
>> mostly disappears with -O2):
>>
>> isSpace_Pattern :: Char -> Bool
>> isSpace_Pattern c
>> | c == ' ' = True
>> | '\t' <= c && c <= '\r' = True
>> | c == '\xa0' = True
>> | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
>>
>> I benchmarked all three functions against five types of text
>> (all ascii, all Greek, Haskell code, characters 0..255, and
>> all spaces), and here are the (normalized) results:
>>
>> Compiled with 'ghc --make':
>> --------------------------------------------------------------
>> Input isSpace_DataChar isSpace_Pattern isSpace_Alt
>> --------------- ---------------- --------------- -----------
>> ascii text 1.0 0.54 0.17
>> greek text 1.0 0.57 0.71
>> haskell code 1.0 0.57 0.24
>> chars 0..255 1.0 0.54 0.39
>> all space chars 1.0 0.70 0.90
>> --------------------------------------------------------------
>>
>> Compiled with 'ghc --make -O2':
>> --------------------------------------------------------------
>> Input isSpace_DataChar isSpace_Pattern isSpace_Alt
>> --------------- ---------------- --------------- -----------
>> ascii text 1.0 0.93 0.40
>> greek text 1.0 0.98 0.99
>> haskell code 1.0 1.03 0.58
>> chars 0..255 1.0 0.92 0.62
>> all space chars 1.0 0.88 0.92
>> --------------------------------------------------------------
>>
>> My benchmark program can be found here:
>> https://gist.github.com/3967761
>>
>> I'd like to propose that we consider replacing the definition
>> of isSpace with isSpace_Alt.
>>
>> John
>>
>>
>> _______________________________________________
>> Libraries mailing list
>> Libraries at haskell.org
>> http://www.haskell.org/mailman/listinfo/libraries
>
> _______________________________________________
> Libraries mailing list
> Libraries at haskell.org
> http://www.haskell.org/mailman/listinfo/libraries
--
Ivan Lazar Miljenovic
Ivan.Miljenovic at gmail.com
http://IvanMiljenovic.wordpress.com
More information about the Libraries
mailing list