[Haskell-cafe] Optimising UTF8-CString -> String marshaling,
plus comments on withCStringLen/peekCStringLen
Alistair Bayley
alistair at abayley.org
Mon Jun 4 08:12:03 EDT 2007
On 04/06/07, Duncan Coutts <duncan.coutts at worc.ox.ac.uk> wrote:
> On Mon, 2007-06-04 at 09:43 +0100, Alistair Bayley wrote:
>
> > After some experiments with the simplifier, ...
> > The "portable" unboxed version is within about 15% of the unboxed version
> > in terms of time and allocation.
>
> Well done.
Of course, that might be saying more about the performance of the
unboxed version...
> Yeah. In Data.ByteString.Char8 we invented the w2c & c2w functions to
> avoid the test. There should probably be a standard version of this
> unchecked conversion.
Bulat suggested unsafeChr from GHC.Exts, but I can't find it there. I
guess I could roll my own; after all, it's just (C# (chr# x)).
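
For reference, a minimal sketch of what rolling my own might look like.
The name uncheckedChr is made up for illustration; the sketch assumes GHC
with the MagicHash extension, and it skips the range check that chr
performs, so the caller has to guarantee a valid code point.

```haskell
{-# LANGUAGE MagicHash #-}
module Main where

import GHC.Exts (Char (C#), Int (I#), chr#)

-- Hypothetical helper (not a library function): convert an Int to a Char
-- without the bounds check done by Data.Char.chr. Passing a value outside
-- the valid Unicode range yields an invalid Char.
uncheckedChr :: Int -> Char
uncheckedChr (I# i) = C# (chr# i)

main :: IO ()
main = putStrLn (map uncheckedChr [72, 105])  -- prints "Hi"
```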
> > BTW, what's the difference between the indexXxxxOffAddr# and
> > readXxxxOffAddr# functions in GHC.Prim?
>
> Right. So it'd only be safe to use the index ones on immutable arrays
> because there's no way to enforce sequencing with respect to array
> writes when using the index version.
In this case I'm reading from a CString buffer, which is (hopefully)
not changing during the function invocation, and never written to by
my code. So presumably it'd be pretty safe to use the index functions.
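
To make the distinction concrete, here is a small sketch contrasting the
two primop families. The wrappers indexByte and readByte are hypothetical
names; this assumes GHC with MagicHash and UnboxedTuples. The index
variant is pure, so nothing orders it relative to writes or to the
buffer's lifetime; the read variant threads the state token and is
sequenced like any other IO action.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}
module Main where

import Foreign.C.String (withCString)
import GHC.Exts (Int (I#), indexWord8OffAddr#, readWord8OffAddr#)
import GHC.IO (IO (IO))
import GHC.Ptr (Ptr (Ptr))
import GHC.Word (Word8 (W8#))

-- Hypothetical wrapper: pure read. Only sensible if the memory is
-- effectively immutable and still alive when the result is forced.
indexByte :: Ptr a -> Int -> Word8
indexByte (Ptr a) (I# i) = W8# (indexWord8OffAddr# a i)

-- Hypothetical wrapper: state-threaded read, ordered with other IO.
readByte :: Ptr a -> Int -> IO Word8
readByte (Ptr a) (I# i) = IO (\s ->
  case readWord8OffAddr# a i s of
    (# s', w #) -> (# s', W8# w #))

main :: IO ()
main = withCString "Hi" $ \p -> do
  h <- readByte p 0
  let i = indexByte p 1   -- lazy; nothing is read yet
  print (h, i)            -- forced here, while the buffer is still alive
```

The lazy let is the catch: if the thunk escaped withCString before being
forced, the pure read would touch freed memory.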
> > - Ptrs don't get unboxed. Why is this? Some IO monad thing?
>
> Got any more detail?
OK. readUTF8Char's transformation starts with this:
    $wreadUTF8Char_r3de =
      \ (ww_s33v :: GHC.Prim.Int#) (w_s33x :: GHC.Ptr.Ptr GHC.Word.Word8) ->
If it were unboxed, I'd expect the Ptr argument to become an Addr#.
Instead, w_s33x only gets unboxed later, just before it's used:
    case w_s33x of wild6_a2JM { GHC.Ptr.Ptr a_a2JO ->
    case GHC.Prim.readWord8OffAddr# @ GHC.Prim.RealWorld a_a2JO 1 s_a2Jf
readUTF8Char is called by fromUTF8Ptr, where there's a little Ptr
arithmetic. The Ptr argument to fromUTF8Ptr is unboxed, the offset is
added, and the result is reboxed so that it can be consumed by
readUTF8Char. That all seems a bit unnecessary, e.g.:
    Foreign.C.UTF8.$wfromUTF8Ptr =
      ...
      let {
        p'_s38N [Just D(T)] :: GHC.Ptr.Ptr GHC.Word.Word8
        [Str: DmdType]
        p'_s38N =
          __scc {fromUTF8Ptr main:Foreign.C.UTF8 !}
          case w_s33J of wild11_a2DW { GHC.Ptr.Ptr addr_a2DY ->
          GHC.Ptr.Ptr @ GHC.Word.Word8 (GHC.Prim.plusAddr# addr_a2DY ww_s33H)
          }
      } in
      ...
I'd prefer the Ptr arg to fromUTF8Ptr to also be unboxed, so that the
primitive plusAddr# can be used directly on it before it's passed to
readUTF8Char. Perhaps instead I could push this Ptr arithmetic down to
readUTF8Char, and pass it the constant Ptr to the start of the buffer,
and the offset into it, rather than a Ptr to the current position.
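
A rough sketch of that last idea, to show the shape rather than the real
code: the reader takes the fixed base Ptr plus an Int offset, so the
caller never builds a new Ptr per character. The names readCharAt and
fromUTF8 are made up for illustration and are not the actual
Foreign.C.UTF8 functions. The sketch uses the portable peekByteOff, does
no validation, and assumes well-formed input that is not truncated
mid-sequence.

```haskell
module Main where

import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)
import Foreign.Marshal.Array (withArrayLen)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peekByteOff)

-- Hypothetical reader: decode one character starting at base + off and
-- return it with the number of bytes consumed. No Ptr arithmetic happens
-- in the caller; only the Int offset changes.
readCharAt :: Ptr Word8 -> Int -> IO (Char, Int)
readCharAt base off = byteAt 0 >>= decode
  where
    byteAt :: Int -> IO Int
    byteAt i = do
      w <- peekByteOff base (off + i) :: IO Word8
      return (fromIntegral w)

    -- Payload bits of a continuation byte (10xxxxxx).
    cont :: Int -> IO Int
    cont i = fmap (.&. 0x3F) (byteAt i)

    decode :: Int -> IO (Char, Int)
    decode b0
      | b0 < 0x80 = return (chr b0, 1)
      | b0 < 0xE0 = do
          b1 <- cont 1
          return (chr (((b0 .&. 0x1F) `shiftL` 6) .|. b1), 2)
      | b0 < 0xF0 = do
          b1 <- cont 1
          b2 <- cont 2
          return (chr (((b0 .&. 0x0F) `shiftL` 12)
                       .|. (b1 `shiftL` 6) .|. b2), 3)
      | otherwise = do
          b1 <- cont 1
          b2 <- cont 2
          b3 <- cont 3
          return (chr (((b0 .&. 0x07) `shiftL` 18)
                       .|. (b1 `shiftL` 12) .|. (b2 `shiftL` 6) .|. b3), 4)

-- Hypothetical driver: walk the buffer by offset only.
fromUTF8 :: Ptr Word8 -> Int -> IO String
fromUTF8 base len = go 0
  where
    go off
      | off >= len = return []
      | otherwise = do
          (c, n) <- readCharAt base off
          cs <- go (off + n)
          return (c : cs)

main :: IO ()
main =
  -- UTF-8 bytes for 'h', U+00E9 and U+20AC.
  withArrayLen [0x68, 0xC3, 0xA9, 0xE2, 0x82, 0xAC] $ \len p -> do
    s <- fromUTF8 p len
    print (map fromEnum s)  -- prints [104,233,8364]
```

Whether GHC then keeps the base pointer unboxed in the worker would still
need checking in the Core; the sketch only removes the per-character
reboxing from the source.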
Alistair