Improving Data.Char.isSpace performance
wren ng thornton
wren at freegeek.org
Thu Nov 1 03:39:46 CET 2012
On 10/29/12 6:30 PM, Ivan Lazar Miljenovic wrote:
> On 30 October 2012 04:52, John MacFarlane <jgm at berkeley.edu> wrote:
>> I have experimented with a couple of variants that seem
>> better than the definition I originally proposed.
>>
>> The most promising is
>>
>> isSpace_Alt6 :: Char -> Bool
>> {-# INLINE isSpace_Alt6 #-}
>> isSpace_Alt6 ' ' = True
>> isSpace_Alt6 '\n' = True
>> isSpace_Alt6 '\t' = True
>> isSpace_Alt6 '\r' = True
>> isSpace_Alt6 '\x0B' = True
>> isSpace_Alt6 '\x0C' = True
>> isSpace_Alt6 '\xA0' = True
>> isSpace_Alt6 c | c < '\x1680' = False
>> | otherwise = iswspace (fromIntegral (C.ord c)) /= 0
>
> Is there any particular reason you're using a guard rather than a
> pattern match for the \x1680 case?
The \x1680 case comes from the fact that (at present) no characters
below there are spaces other than the ones listed. Note that it's using
(<) not (==). The only other thing we could do here is to replace the
otherwise branch with:
isSpace_Alt6 c = iswspace (fromIntegral (C.ord c)) /= 0
but I doubt that'd help either performance or clarity.
The one thing I worry about using \x1680 as the threshold[1] is that I'm
not sure whether every character below \x1680 has been allocated or
whether some are still free. If any of them are free, then this will
become incorrect in subsequent versions of Unicode so it's a maintenance
timebomb. (Whereas if they're all specified then it should be fine.) Can
someone verify that using \x1680 is sound in this manner?
[1] I discovered \x1680 by simply checking map isSpace ['\x0'..].
However, in my original proposal of this particular optimization I used
\xFF as the boundary, since this is guaranteed by the fact that Unicode
is interoperable with ISO-8859-1 (Latin-1)
--
Live well,
~wren
More information about the Libraries
mailing list