[Haskell-cafe] Optimising UTF8-CString -> String marshaling, plus comments on withCStringLen/peekCStringLen

Alistair Bayley alistair at abayley.org
Mon Jun 4 08:12:03 EDT 2007

On 04/06/07, Duncan Coutts <duncan.coutts at worc.ox.ac.uk> wrote:
> On Mon, 2007-06-04 at 09:43 +0100, Alistair Bayley wrote:
> > After some experiments with the simplifier, ...
> > The "portable" unboxed version is within about 15% of the unboxed version
> > in terms of time and allocation.
> Well done.

Of course, that might be saying more about the performance of the
unboxed version...

> Yeah. In Data.ByteString.Char8 we invent these w2c & c2w functions to
> avoid the test. There should probably be a standard version of this
> unchecked conversion.

Bulat suggested unsafeChr from GHC.Exts, but I can't find it there. I
guess I could roll my own; after all, it's just (C# (chr# x)).
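A minimal sketch of such a hand-rolled unchecked conversion, assuming GHC with the MagicHash extension. The names unsafeChr' and w2c' are made up for illustration; they are not the GHC or ByteString functions themselves.

```haskell
{-# LANGUAGE MagicHash #-}

import Data.Word (Word8)
import GHC.Exts (Char (C#), Int (I#), chr#)

-- Unchecked Int -> Char conversion (hypothetical name). Unlike
-- Data.Char.chr there is no range test, so the caller must guarantee
-- that the argument is a valid code point.
unsafeChr' :: Int -> Char
unsafeChr' (I# x) = C# (chr# x)

-- Byte -> Char in the spirit of w2c (hypothetical name). A Word8 is
-- always in 0..255, so skipping the range test is harmless here.
w2c' :: Word8 -> Char
w2c' = unsafeChr' . fromIntegral
```

The range test is the only thing this removes, so it is only worth using where the input is already known to be valid, such as the result of a UTF-8 decode of well-formed input.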

> > BTW, what's the difference between the indexXxxxOffAddr# and
> > readXxxxOffAddr# functions in GHC.Prim?
> Right. So it'd only be safe to use the index ones on immutable arrays
> because there's no way to enforce sequencing with respect to array
> writes when using the index version.

In this case I'm reading from a CString buffer, which is (hopefully)
not changing during the function invocation, and never written to by
my code. So presumably it'd be pretty safe to use the index versions.
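To illustrate the index/read distinction discussed here, a small sketch, assuming GHC with MagicHash and UnboxedTuples. The wrapper names indexByte and readByte are hypothetical.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Exts (Addr#, Int (I#), indexWord8OffAddr#, readWord8OffAddr#)
import GHC.IO (IO (IO))
import GHC.Word (Word8 (W8#))

-- Pure read: there is no State# token, so nothing orders this read
-- relative to writes. Only safe if the memory never changes.
indexByte :: Addr# -> Int -> Word8
indexByte addr (I# i) = W8# (indexWord8OffAddr# addr i)

-- Sequenced read: the State# token threads the read through IO, so it
-- is ordered with respect to other IO actions on the same buffer.
readByte :: Addr# -> Int -> IO Word8
readByte addr (I# i) = IO (\s ->
  case readWord8OffAddr# addr i s of
    (# s', w #) -> (# s', W8# w #))
```

For a buffer that is not written to for the duration of the call, both give the same answer; the difference is only in what the compiler is allowed to assume about ordering.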

> >  - Ptrs don't get unboxed. Why is this? Some IO monad thing?
> Got any more detail?

OK. readUTF8Char's transformation starts with this:

$wreadUTF8Char_r3de =
  \ (ww_s33v :: GHC.Prim.Int#) (w_s33x :: GHC.Ptr.Ptr GHC.Word.Word8) ->

If it were unboxed, I'd expect the Ptr argument to become an Addr#.
Instead, w_s33x only gets unboxed later, just before it's used:

      case w_s33x of wild6_a2JM { GHC.Ptr.Ptr a_a2JO ->
      case GHC.Prim.readWord8OffAddr# @ GHC.Prim.RealWorld a_a2JO 1 s_a2Jf

readUTF8Char is called by fromUTF8Ptr, where there's a little Ptr
arithmetic. The Ptr argument to fromUTF8Ptr is unboxed, the offset is
added, and the result is reboxed so that it can be consumed by
readUTF8Char. All a bit unnecessary, I think. For example:

Foreign.C.UTF8.$wfromUTF8Ptr =
    let {
      p'_s38N [Just D(T)] :: GHC.Ptr.Ptr GHC.Word.Word8
      [Str: DmdType]
      p'_s38N =
        __scc {fromUTF8Ptr main:Foreign.C.UTF8 !}
        case w_s33J of wild11_a2DW { GHC.Ptr.Ptr addr_a2DY ->
        GHC.Ptr.Ptr @ GHC.Word.Word8 (GHC.Prim.plusAddr# addr_a2DY ww_s33H)
    } in

I'd prefer the Ptr arg to fromUTF8Ptr to also be unboxed, so that the
primitive plusAddr# can be used directly on it before it's passed to
readUTF8Char. Perhaps instead I could push this Ptr arithmetic down to
readUTF8Char, and pass it the constant Ptr to the start of the buffer,
and the offset into it, rather than a Ptr to the current position.
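A rough sketch of that last idea, using only the portable Foreign API. The names readUTF8CharAt, fromUTF8PtrLen and decodeBytes are made up for illustration and are not the actual Foreign.C.UTF8 code; the decoder assumes well-formed UTF-8 and does no validation.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)
import Foreign.Marshal.Array (withArrayLen)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff)

-- Hypothetical variant of readUTF8Char: takes the constant base pointer
-- plus a byte offset, and returns the decoded Char together with the
-- number of bytes consumed. No Ptr arithmetic is done by the caller.
readUTF8CharAt :: Ptr Word8 -> Int -> IO (Char, Int)
readUTF8CharAt base off = do
  b0 <- byteAt 0
  case () of
    _ | b0 < 0x80 -> return (chr b0, 1)
      | b0 < 0xE0 -> do
          b1 <- cont 1
          return (chr (((b0 .&. 0x1F) `shiftL` 6) .|. b1), 2)
      | b0 < 0xF0 -> do
          b1 <- cont 1
          b2 <- cont 2
          return (chr (((b0 .&. 0x0F) `shiftL` 12)
                       .|. (b1 `shiftL` 6) .|. b2), 3)
      | otherwise -> do
          b1 <- cont 1
          b2 <- cont 2
          b3 <- cont 3
          return (chr (((b0 .&. 0x07) `shiftL` 18)
                       .|. (b1 `shiftL` 12) .|. (b2 `shiftL` 6) .|. b3), 4)
  where
    -- Read the byte at (off + i) from the base pointer, widened to Int.
    byteAt :: Int -> IO Int
    byteAt i = do
      w <- peekByteOff base (off + i) :: IO Word8
      return (fromIntegral w)
    -- Payload bits of a continuation byte (low six bits).
    cont i = fmap (.&. 0x3F) (byteAt i)

-- Hypothetical driver: decode len bytes starting at base, advancing
-- only the Int offset between characters.
fromUTF8PtrLen :: Ptr Word8 -> Int -> IO String
fromUTF8PtrLen base len = go 0
  where
    go off
      | off >= len = return []
      | otherwise = do
          (c, n) <- readUTF8CharAt base off
          rest <- go (off + n)
          return (c : rest)

-- Usage helper: marshal a byte list into a temporary buffer and decode it.
decodeBytes :: [Word8] -> IO String
decodeBytes bs = withArrayLen bs (\n p -> fromUTF8PtrLen p n)
```

Whether this actually unboxes better would need checking against the simplifier output; the point is only that the loop state becomes a plain Int, which the worker/wrapper transformation already handles, while the Ptr stays constant.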

