Improving Data.Char.isSpace performance

Thu Nov 1 03:39:46 CET 2012

On 10/29/12 6:30 PM, Ivan Lazar Miljenovic wrote:
> On 30 October 2012 04:52, John MacFarlane <jgm at berkeley.edu> wrote:
>> I have experimented with a couple of variants that seem
>> better than the definition I originally proposed.
>>
>> The most promising is
>>
>> isSpace_Alt6 :: Char -> Bool
>> {-# INLINE isSpace_Alt6 #-}
>> isSpace_Alt6 ' '               = True
>> isSpace_Alt6 '\n'              = True
>> isSpace_Alt6 '\t'              = True
>> isSpace_Alt6 '\r'              = True
>> isSpace_Alt6 '\x0B'            = True
>> isSpace_Alt6 '\x0C'            = True
>> isSpace_Alt6 '\xA0'            = True
>> isSpace_Alt6 c | c <  '\x1680' = False
>>                 | otherwise     = iswspace (fromIntegral (C.ord c)) /= 0
>
> Is there any particular reason you're using a guard rather than a
> pattern match for the \x1680 case?

The \x1680 case comes from the fact that (at present) no characters 
below there are spaces other than the ones listed. Note that it's using 
(<) not (==). The only other thing we could do here is to replace the 
otherwise branch with:

     isSpace_Alt6 c = iswspace (fromIntegral (C.ord c)) /= 0

but I doubt that'd help either performance or clarity.

The one thing I worry about using \x1680 as the threshold[1] is that I'm 
not sure whether every character below \x1680 has been allocated or 
whether some are still free. If any of them are free, then this will 
become incorrect in subsequent versions of Unicode so it's a maintenance 
timebomb. (Whereas if they're all specified then it should be fine.) Can 
someone verify that using \x1680 is sound in this manner?

[1] I discovered \x1680 by simply checking map isSpace ['\x0'..]. 
However, in my original proposal of this particular optimization I used 
\xFF as the boundary, since this is guaranteed by the fact that Unicode 
is interoperable with ISO-8859-1 (Latin-1)

-- 
Live well,
~wren