Improving Data.Char.isSpace performance

Simon Peyton-Jones simonpj at microsoft.com
Mon Oct 29 23:29:15 CET 2012


Sounds good to me.  Thanks for doing this.

When you think you are ready, just submit a patch.  (As others have noted, maybe isSpace isn't the only function that could benefit from this kind of attention.)

Simon

|  -----Original Message-----
|  From: libraries-bounces at haskell.org [mailto:libraries-bounces at haskell.org] On
|  Behalf Of John MacFarlane
|  Sent: 28 October 2012 19:16
|  To: libraries at haskell.org
|  Subject: Improving Data.Char.isSpace performance
|  
|  I think that 'isSpace' from Data.Char (and hence also 'words' from the
|  Prelude) is not as fast as it could be.  Here is the definition
|  (which is actually found in GHC.Unicode):
|  
|  isSpace :: Char -> Bool
|  isSpace c =
|      c == ' '     ||
|      c == '\t'    ||
|      c == '\n'    ||
|      c == '\r'    ||
|      c == '\f'    ||
|      c == '\v'    ||
|      c == '\xa0'  ||
|      iswspace (fromIntegral (C.ord c)) /= 0
|  
|  I presume that the point of the disjuncts at the beginning is to
|  avoid the call to iswspace for the most common space characters.
|  The problem is that most characters (in most texts) are not space
|  characters, and for nonspace characters iswspace will always be
|  called.
|  
|  So I investigated a possible optimization that would also check for
|  the most common nonspace characters before calling iswspace:
|  
|  isSpace_Alt :: Char -> Bool
|  isSpace_Alt c | c > '\x20' && c < '\xa0' = False
|                | c == ' '                 = True
|                | '\t' <= c && c <= '\r'   = True
|                | c == '\xa0'              = True
|                | otherwise                = iswspace (fromIntegral (C.ord c)) /= 0
|  
|  In my benchmarks, this function significantly outperforms isSpace.
|  I also found that a version of isSpace that does not check
|  for nonspace characters, but uses case matching instead of
|  a disjunction of equality tests, outperformed isSpace (but
|  was usually not as fast as isSpace_Alt, and the difference
|  mostly disappears with -O2):
|  
|  isSpace_Pattern :: Char -> Bool
|  isSpace_Pattern c
|     | c == ' '                  = True
|     | '\t' <= c && c <= '\r'    = True
|     | c == '\xa0'               = True
|     | otherwise                 = iswspace (fromIntegral (C.ord c)) /= 0
|  
|  I benchmarked all three functions against five types of text
|  (all ascii, all Greek, Haskell code, characters 0..255, and
|  all spaces), and here are the (normalized) results:
|  
|  Compiled with 'ghc --make':
|  --------------------------------------------------------------
|  Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
|  --------------- ----------------  ---------------  -----------
|  ascii text      1.0               0.54             0.17
|  greek text      1.0               0.57             0.71
|  haskell code    1.0               0.57             0.24
|  chars 0..255    1.0               0.54             0.39
|  all space chars 1.0               0.70             0.90
|  --------------------------------------------------------------
|  
|  Compiled with 'ghc --make -O2':
|  --------------------------------------------------------------
|  Input           isSpace_DataChar  isSpace_Pattern  isSpace_Alt
|  --------------- ----------------  ---------------  -----------
|  ascii text      1.0               0.93             0.40
|  greek text      1.0               0.98             0.99
|  haskell code    1.0               1.03             0.58
|  chars 0..255    1.0               0.92             0.62
|  all space chars 1.0               0.88             0.92
|  --------------------------------------------------------------
|  
|  My benchmark program can be found here:
|  https://gist.github.com/3967761
|  
|  I'd like to propose that we consider replacing the definition
|  of isSpace with isSpace_Alt.
|  
|  John
|  
|  
|  _______________________________________________
|  Libraries mailing list
|  Libraries at haskell.org
|  http://www.haskell.org/mailman/listinfo/libraries



More information about the Libraries mailing list