[Haskell-cafe] Optimising UTF8-CString -> String marshaling,
plus comments on withCStringLen/peekCStringLen
Alistair Bayley
alistair at abayley.org
Tue Jun 5 04:36:51 EDT 2007
> > {- Arity: 4 Strictness: LSSL -}
>
> Right. Unboxed args are always given the annotation L. So that function
> is strict in that pointer arg, but GHC is choosing not to unbox it. I'm
> not sure why that's the case.
I thought maybe it was because I hadn't said -funbox-strict-fields,
but it didn't change when I did.
> > Is there some semantic advantage to bang-patterns, or is it just a
> > syntactic convenience?
>
> It's syntactic convenience.
I've noticed differences between strictness guards and bang-patterns
with GHC. This is a problem, I think, because bang-patterns are GHC
only, and I wanted to keep the code portable. This strictness guard:
readUTF8Char :: Int -> Int -> Ptr Word8 -> IO Char
readUTF8Char x offset p
| () !x !offset !p !False = undefined
| otherwise =
...
where
x ! y = seq x y
results in this simplifier output:
[Arity 3
Str: DmdType LSS]
$wreadUTF8Char_r38I =
\ (ww_s30B :: GHC.Prim.Int#)
(w_s30D :: GHC.Base.Int)
(w1_s30E :: GHC.Ptr.Ptr GHC.Word.Word8) ->
However, if I change this to:
{-# OPTIONS_GHC -fbang-patterns #-}
...
readUTF8Char :: Int -> Int -> Ptr Word8 -> IO Char
readUTF8Char !x !offset !p
| otherwise =
then I get this simpifier output:
[Arity 3
Str: DmdType LLL]
$wreadUTF8Char_r38n =
\ (ww_s2Zk :: GHC.Prim.Int#)
(ww1_s2Zo :: GHC.Prim.Int#)
(ww2_s2Zs :: GHC.Prim.Addr#) ->
Also, with bang-patterns I've noticed that fromUTF8Ptr transforms
into two functions, which contain very similar code:
[Arity 5]
Foreign.C.UTF8.$s$wfromUTF8Ptr =
\ (acc_X151 :: GHC.Base.String)
(new_s_a2GQ :: GHC.Prim.State# GHC.Prim.RealWorld)
(a87_a2GR :: GHC.Base.Char)
(ww_s2ZH :: GHC.Prim.Addr#)
(sc_s36j :: GHC.Prim.Int#) ->
[Arity 4
Str: DmdType LLSL]
Foreign.C.UTF8.$wfromUTF8Ptr =
\ (ww_s2ZD :: GHC.Prim.Int#)
(ww1_s2ZH :: GHC.Prim.Addr#)
(w_s2ZJ :: GHC.Base.String)
(w1_s2ZK :: GHC.Prim.State# GHC.Prim.RealWorld) ->
$wfromUTF8Ptr calls itself and $s$wfromUTF8Ptr, but $s$wfromUTF8Ptr
only every calls itself, so $wfromUTF8Ptr could be considered the
wrapper, and $s$wfromUTF8Ptr the worker, I guess.
AFAICT it's a transformation of the various cases in fromUTF8Ptr:
| x <= 0x7F -> fromUTF8Ptr (bytes-1) p (chr x:acc)
| x <= 0xBF && bytes == 0 -> error "fromUTF8Ptr: ..."
| x <= 0xBF -> fromUTF8Ptr (bytes-1) p acc
| otherwise -> do
c <- readUTF8Char x bytes p
fromUTF8Ptr (bytes-1) p (c:acc)
The first case, x <= 0x7F, results in a call to $s$wfromUTF8Ptr.
The third case, x <= 0xBF, results in a call to $wfromUTF8Ptr.
The last case, otherwise, results in a call to $s$wfromUTF8Ptr.
The calls to $s$wfromUTF8Ptr pass the newly constructed Char and the
rest of the String separately, and they are cons'ed in $s$wfromUTF8Ptr.
Not sure what benefit this gives...
I don't know what transformation causes this, but it was a bit of a surprise.
Alistair
More information about the Haskell-Cafe
mailing list