[GHC] #5218: Add unpackCStringLen# to create Strings from string literals
GHC
ghc-devs at haskell.org
Wed Jul 26 15:38:13 UTC 2017
-------------------------------------+-------------------------------------
Reporter: tibbe | Owner: thoughtpolice
Type: feature request | Status: patch
Priority: normal | Milestone:
Component: Compiler | Version: 7.0.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: Runtime performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #5877 #10064 #11312 | Differential Rev(s): Phab:D2443
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by bgamari):
Here is where we stand on this:
This bug seeks to address the fact that we currently have few good ways of
encoding literal strings verbatim (i.e. as raw, unchanged bytes) in object
code. This is because we insist on encoding primitive strings as
null-terminated modified UTF-8. As a result, libraries like `bytestring` and
`text` handle these literals in a rather complicated and inefficient way.
The inefficiency has two causes:
* One needs to look for and correctly handle the U+0000 codepoints
(encoded as `0xc0 0x80`) in the primitive string
* It's impossible to know what the length of the string is without
walking it
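To make the two costs concrete, here is a minimal sketch that models a primitive string as a plain list of bytes. The names `cStringLength` and `unescapeNuls` are hypothetical and are not taken from `bytestring`, `text`, or GHC; real implementations work on raw memory rather than lists.

```haskell
import Data.Word (Word8)

-- | Length of a NUL-terminated buffer. Every byte up to the terminator
-- has to be inspected, so this is linear in the length of the string.
cStringLength :: [Word8] -> Int
cStringLength = length . takeWhile (/= 0)

-- | Recover the raw bytes from modified UTF-8 by turning the two-byte
-- encoding of U+0000 (0xC0 0x80) back into a single zero byte.
-- All other bytes are passed through unchanged.
unescapeNuls :: [Word8] -> [Word8]
unescapeNuls (0xC0 : 0x80 : rest) = 0 : unescapeNuls rest
unescapeNuls (b : rest)           = b : unescapeNuls rest
unescapeNuls []                   = []
```

For instance, `cStringLength [104, 105, 0]` is `2`, and `unescapeNuls [97, 0xC0, 0x80, 98]` is `[97, 0, 98]`. Both functions have to touch every byte, which is why a literal cannot simply be copied in one step.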
The solution here is to rework our desugaring of primitive strings such
that,
{{{#!hs
"hello"#
}}}
will be desugared as,
{{{#!hs
let x = "hello"# :: Addr#
in (# 5#, x #)
}}}
This means that we can encode the string contents in plain UTF-8 without a
NUL terminator. The type of `unpackCString#` then becomes,
{{{#!hs
unpackCString# :: (# Int#, Addr# #) -> String
}}}
and the implementation gets a tiny bit simpler (since it simply decodes a
fixed number of bytes instead of scanning for a NUL terminator).
Consequently, libraries can then provide rules matching on
`unpackCString#` applications, replacing them with what is essentially
`memcpy`.
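As a rough illustration of the length-prefixed decoding described above, the following sketch reads a fixed number of bytes from an `Addr#`. The name `unpackLenAscii` is hypothetical and this is not the code from Phab:D2443. For brevity it assumes the literal is pure ASCII, so each byte maps to one `Char`; a real implementation would decode UTF-8. Note also that current GHC still emits a NUL terminator after `"hello"#`; the sketch just never looks at it.

```haskell
{-# LANGUAGE MagicHash, UnboxedTuples #-}

import GHC.Exts

-- | Decode exactly 'len' bytes starting at 'addr'.
-- Assumption: every byte is ASCII, so no UTF-8 decoding is performed.
unpackLenAscii :: (# Int#, Addr# #) -> String
unpackLenAscii (# len, addr #) = go 0#
  where
    go :: Int# -> String
    go i
      | isTrue# (i >=# len) = []   -- stop after 'len' bytes, no NUL scan
      | otherwise           = C# (indexCharOffAddr# addr i) : go (i +# 1#)

-- | Roughly what a use of the proposed desugaring might look like:
-- the length is supplied alongside the address.
greeting :: String
greeting = unpackLenAscii (# 5#, "hello"# #)
```

Because the length is known up front, a library could in principle replace the byte-by-byte loop with a single bulk copy, which is the `memcpy`-style rewrite the comment refers to.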
This is for the most part a simple change, with the exception being GHCi
support due to the need for unboxed tuples. jscholl started implementing
this nearly a year ago, but the work stalled. I recently rebased his work
(Phab:D2443) and addressed several of the issues that came up in review.
Unfortunately, GHCi currently crashes with a segmentation fault, which
will take some time to work out.
Note that there are two related problems that this does not address:
* pure ASCII literals (which might be used to, for instance, encode a
binary representation of a static `Array`)
* `ByteArray#` literals, as requested in ticket:11312
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/5218#comment:73>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler