Improving Data.Char.isSpace performance

John MacFarlane jgm at berkeley.edu
Thu Nov 1 04:10:59 CET 2012


+++ wren ng thornton [Oct 31 12 22:39]:
> The one thing I worry about using \x1680 as the threshold[1] is that
> I'm not sure whether every character below \x1680 has been allocated
> or whether some are still free. If any of them are free, then this
> will become incorrect in subsequent versions of Unicode so it's a
> maintenance timebomb. (Whereas if they're all specified then it
> should be fine.) Can someone verify that using \x1680 is sound in
> this manner?

Really good point.  This page [1], if I read it correctly,
seems to indicate that c < \x860 would be safe.
But I'm not a Unicode expert, and maybe there's a better
place to look.

[1] http://www.unicode.org/alloc/CurrentAllocation.html
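
For concreteness, here's a rough sketch of the kind of
threshold-based fast path we've been discussing (my own
reconstruction under current Unicode, not the actual patch;
the cutoff is exactly the part in question):

    import qualified Data.Char as Char

    -- Sketch only: the False guard assumes every code point below
    -- U+1680 other than the ones handled above it is a non-space.
    -- That holds in current Unicode, but it is precisely the
    -- assumption wren is asking about.
    fastIsSpace :: Char -> Bool
    fastIsSpace c
      | c == ' '               = True
      | c >= '\t' && c <= '\r' = True            -- \t \n \v \f \r
      | c == '\xa0'            = True            -- NO-BREAK SPACE
      | c < '\x1680'           = False           -- fast path: known non-space
      | otherwise              = Char.isSpace c  -- rare: full Unicode test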

One thing I discovered is that if I use c < \xFF, the Greek
benchmark goes out the window -- we are twice as slow as
the original GHC.Unicode.isSpace.  Whatever threshold we
choose, it seems, the performance gains below the threshold
will be balanced by performance losses above the threshold.
This makes me disinclined to submit the patch.  It just
seems wrong to change the library in a way that
helps people who use western alphabets at the expense of
people who don't.
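
If anyone wants to poke at this, a minimal criterion harness
along the following lines shows the effect (a sketch, not the
actual benchmark from this thread; the sample texts are
stand-ins):

    import Criterion.Main (bench, defaultMain, nf)
    import Data.Char (isSpace)

    asciiText, greekText :: String
    asciiText = concat (replicate 1000 "the quick brown fox ")
    greekText = concat (replicate 1000 "\964\8056 \947\8048\961 \945\8016\964\8056 ")

    main :: IO ()
    main = defaultMain
      [ bench "isSpace/ascii" (nf (length . filter isSpace) asciiText)
      , bench "isSpace/greek" (nf (length . filter isSpace) greekText)
      ]

Counting spaces over an ASCII-only input and a Greek-only input
separately makes the trade-off visible: a low threshold wins the
first benchmark and loses the second.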



