[Haskell-cafe] Optimising UTF8-CString -> String marshaling, plus comments on withCStringLen/peekCStringLen

Mon Jun 4 04:39:01 EDT 2007

Hello cafe,

(Following up on my own optimisation question, and Duncan's advice
to look at http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs)

> If you want to look at some existing optimised UTF8 encoding/decoding
> code then take a look at the code used in GHC:
>
> http://darcs.haskell.org/ghc/compiler/utils/Encoding.hs
>
> Duncan

I took a look at the UTF8 decoder in GHC. This inspired me to write
one that also used unboxed types directly. Pleasingly, it goes like
a cut cat, and uses far less space than the naive version, but it's
not portable, which is a bummer.

(The docs tell me that using GHC.Exts is the "approved" way of
accessing GHC-specific extensions, but all of the useful stuff seems
to be in GHC.Prim.)

After some expriments with the simplifier, I think I have a portable
version of a direct-from-buffer decoder which seems to perform nearly
as well as one written directly against GHC primitive unboxed functions.
I'm wondering if there's anything further I can do to improve performance.
The "portable" unboxed version is within about 15% of the unboxed version
in terms of time and allocation.

Changes I made:
 - added strictness "annotations" in the form of a strictness guards
   that are always False
 - unrolled loops (they were always short loops anyway, with a maximum
   of 3 or 4 iterations)
 - replaced shiftL with multiplication, because multiplication unboxes,
   while shiftL doesn't.

Some things I've noticed in the simplifier output:
 - the shiftL call hasn't unboxed or inlined into a call to
   uncheckedShiftL#, which I would prefer.
   Would this be possible if we added unchecked versions of
   the shiftL/R functions to Data.Bits?
 - Ptrs don't get unboxed. Why is this? Some IO monad thing?
 - the chr function tests that its Int argument is less than 1114111,
   before constructing the Char. It'd be nice to avoid this test.
 - why does this code:

      | x <= 0xF7 = remaining 3 (bAND x 0x07) xs
      | otherwise = err x

   turn into this
   i.e. the <= turns into two identical case-branches, using eqword#
   and ltword#, rather than one case-branch using leword# ?

  case GHC.Prim.eqWord# a11_a2PJ __word 247 of wild25_X2SU {
    GHC.Base.False ->
      case GHC.Prim.ltWord# a11_a2PJ __word 247 of wild6_Xcw {
        GHC.Base.False -> <error call>
        GHC.Base.True ->
          $wremaining_r3dD
            3
            (__scc {fromUTF8 main:Foreign.C.UTF8 !}
             GHC.Base.I# (GHC.Prim.word2Int# (GHC.Prim.and# a11_a2PJ __word 7)))
            xs_aVm
      };
    GHC.Base.True ->
      $wremaining_r3dD
        3
        (__scc {fromUTF8 main:Foreign.C.UTF8 !}
         GHC.Base.I# (GHC.Prim.word2Int# (GHC.Prim.and# a11_a2PJ __word 7)))
        xs_aVm
  };

BTW, what's the difference between the indexXxxxOffAddr# and
readXxxxOffAddr# functions in GHC.Prim? AFAICT they are equivalent,
except that the read* functions take an extra State# s parameter.
Presumably this is to thread the IO monad's RealWorld value through,
to create some sort of data dependency between the functions (and so
to ensure ordered evaluation?)

Alistair
-------------- next part --------------
A non-text attachment was scrubbed...
Name: UTF8.hs
Type: text/x-haskell
Size: 14845 bytes
Desc: not available
Url : http://www.haskell.org/pipermail/haskell-cafe/attachments/20070604/2beadc3e/UTF8-0001.bin