Improving Data.Char.isSpace performance

Ivan Lazar Miljenovic ivan.miljenovic at gmail.com
Mon Oct 29 23:30:43 CET 2012


On 30 October 2012 04:52, John MacFarlane <jgm at berkeley.edu> wrote:
> I have experimented with a couple of variants that seem
> better than the definition I originally proposed.
>
> The most promising is
>
> isSpace_Alt6 :: Char -> Bool
> {-# INLINE isSpace_Alt6 #-}
> isSpace_Alt6 ' '               = True
> isSpace_Alt6 '\n'              = True
> isSpace_Alt6 '\t'              = True
> isSpace_Alt6 '\r'              = True
> isSpace_Alt6 '\x0B'            = True
> isSpace_Alt6 '\x0C'            = True
> isSpace_Alt6 '\xA0'            = True
> isSpace_Alt6 c | c <  '\x1680' = False
>                | otherwise     = iswspace (fromIntegral (C.ord c)) /= 0

Is there any particular reason you're using a guard rather than a
pattern match for the \x1680 case?

>
> Benchmarks can be found here:
>
> the program          : http://johnmacfarlane.net/isSpace/BenchIsSpace.hs
> results:
>   with ghc --make    : http://johnmacfarlane.net/isSpace/unoptimized.html
>   with ghc --make -O2: http://johnmacfarlane.net/isSpace/optimized.html
>
> John
>
> +++ John MacFarlane [Oct 28 12 12:16 ]:
>> I think that 'isSpace' from Data.Char (and hence also 'words' from the
>> Prelude) is not as fast as it could be.  Here is the definition
>> (which is actually found in GHC.Unicode):
>>
>> isSpace :: Char -> Bool
>> isSpace c =
>>     c == ' '     ||
>>     c == '\t'    ||
>>     c == '\n'    ||
>>     c == '\r'    ||
>>     c == '\f'    ||
>>     c == '\v'    ||
>>     c == '\xa0'  ||
>>     iswspace (fromIntegral (C.ord c)) /= 0
>>
>> I presume that the point of the disjuncts at the beginning is to
>> avoid the call to iswspace for the most common space characters.
>> The problem is that most characters (in most texts) are not space
>> characters, and for nonspace characters iswspace will always be
>> called.
>>
>> So I investigated a possible optimization that would also check for
>> the most common nonspace characters before calling iswspace:
>>
>> isSpace_Alt :: Char -> Bool
>> isSpace_Alt c | c > '\x20' && c < '\xa0' = False
>>               | c == ' '                 = True
>>               | '\t' <= c && c <= '\r'   = True
>>               | c == '\xa0'              = True
>>               | otherwise                = iswspace (fromIntegral (C.ord c)) /= 0
>>
>> In my benchmarks, this function significantly outperforms isSpace.
>> I also found that a version of isSpace that does not check
>> for nonspace characters, but uses case matching instead of
>> a disjunction of equality tests, outperformed isSpace (but
>> was usually not as fast as isSpace_Alt, and the difference
>> mostly disappears with -O2):
>>
>> isSpace_Pattern :: Char -> Bool
>> isSpace_Pattern c
>>    | c == ' '                  = True
>>    | '\t' <= c && c <= '\r'    = True
>>    | c == '\xa0'               = True
>>    | otherwise                 = iswspace (fromIntegral (C.ord c)) /= 0
>>
>> I benchmarked all three functions against five types of text
>> (all ascii, all Greek, Haskell code, characters 0..255, and
>> all spaces), and here are the (normalized) results:
>>
>> Compiled with 'ghc --make':
>> --------------------------------------------------------------
>> Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
>> --------------- ----------------  ---------------  -----------
>> ascii text      1.0               0.54             0.17
>> greek text      1.0               0.57             0.71
>> haskell code    1.0               0.57             0.24
>> chars 0..255    1.0               0.54             0.39
>> all space chars 1.0               0.70             0.90
>> --------------------------------------------------------------
>>
>> Compiled with 'ghc --make -O2':
>> --------------------------------------------------------------
>> Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
>> --------------- ----------------  ---------------  -----------
>> ascii text      1.0               0.93             0.40
>> greek text      1.0               0.98             0.99
>> haskell code    1.0               1.03             0.58
>> chars 0..255    1.0               0.92             0.62
>> all space chars 1.0               0.88             0.92
>> --------------------------------------------------------------
>>
>> My benchmark program can be found here:
>> https://gist.github.com/3967761
>>
>> I'd like to propose that we consider replacing the definition
>> of isSpace with isSpace_Alt.
>>
>> John
>>
>>
>> _______________________________________________
>> Libraries mailing list
>> Libraries at haskell.org
>> http://www.haskell.org/mailman/listinfo/libraries
>
> _______________________________________________
> Libraries mailing list
> Libraries at haskell.org
> http://www.haskell.org/mailman/listinfo/libraries



-- 
Ivan Lazar Miljenovic
Ivan.Miljenovic at gmail.com
http://IvanMiljenovic.wordpress.com



More information about the Libraries mailing list