[GHC] #5218: Add unpackCStringLen# to create Strings from string literals
GHC
ghc-devs at haskell.org
Sun Jul 31 10:40:42 UTC 2016
#5218: Add unpackCStringLen# to create Strings from string literals
-------------------------------------+-------------------------------------
Reporter: tibbe | Owner: thoughtpolice
Type: feature request | Status: new
Priority: normal | Milestone:
Component: Compiler | Version: 7.0.3
Resolution: | Keywords:
Operating System: Unknown/Multiple | Architecture: Unknown/Multiple
Type of failure: Runtime performance bug | Test Case:
Blocked By: | Blocking:
Related Tickets: #5877 #10064 | Differential Rev(s):
Wiki Page: |
-------------------------------------+-------------------------------------
Comment (by jscholl):
I tried implementing a {{{String#}}} type which would carry the length as
an {{{Int#}}} at its beginning and two functions to extract the length and
address of the string literal. However, it quickly got a little bit out of
hand:
- {{{unpackCString#}}} etc. had to be adapted, breaking backwards
compatibility. To avoid this, I tried to create wrapper functions
{{{unpackCStringLit#}}}, which would extract the address and call the
original {{{unpackCString#}}} function.
- I could not work out how to adapt the rewrite rules dealing
with strings without duplicating them for the {{{Addr#}}} and
{{{String#}}} versions. I could also not figure out when
{{{unpackCStringLit#}}} should inline to avoid the overhead of the new
address computation.
- It took a while to find all (most?) of the places (library, some
hardcoded types in {{{Id}}}s, a place in the type checker, generating
record selector errors) where types were wired in, especially for
exceptions like {{{absentError}}} and {{{recSelError}}}.
- Implementing a new {{{String#}}} also raised the question whether
{{{"foo"##}}} should be the corresponding literal for it. However, adding
it from the parser to the backend seemed quite complex, so I tried a
different approach.
Instead of creating a new type {{{String#}}}, I rewrote
{{{unpackCStringLit#}}} to have the type {{{Addr# -> Int# -> [Char]}}}. It
would then just throw its second argument away and inline in some phase.
However, it still meant duplicating rewrite rules, which did not seem like
an ideal solution.
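A minimal sketch of this second approach, assuming the wrapper simply
forwards to the existing {{{unpackCString#}}}. The {{{INLINE [0]}}} phase,
the {{{litLength}}} consumer and its rule are hypothetical and only
illustrate why each consumer would need a second, length-aware rule; the
length {{{3#}}} is written by hand where the compiler would supply it.

```haskell
{-# LANGUAGE MagicHash #-}
module Main (main) where

import GHC.Exts (Addr#, Int (I#), Int#, unpackCString#)

-- Hypothetical wrapper: carries the literal's length but ignores it.
-- The phase number is an assumption; the comment only says "some phase".
unpackCStringLit# :: Addr# -> Int# -> [Char]
unpackCStringLit# addr _len = unpackCString# addr
{-# INLINE [0] unpackCStringLit# #-}

-- Hypothetical consumer that could profit from a statically known length.
litLength :: String -> Int
litLength = length
{-# NOINLINE litLength #-}

-- Length-aware rule. It has to be written in addition to any rule that
-- already matches on unpackCString#, which is the duplication at issue.
{-# RULES
"litLength/unpackCStringLit#" forall a n.
    litLength (unpackCStringLit# a n) = I# n
  #-}

-- The result is the same whether or not the rule fires, because the
-- hand-written length agrees with the literal.
example :: Int
example = litLength (unpackCStringLit# "foo"# 3#)

main :: IO ()
main = print example
```

If the hand-written length disagreed with the literal, the rule would
change the program's result, which is why the length would have to come
from the compiler.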
My next idea was to push the length information into an ignored argument
to a function giving us the address: {{{cStringLitAddr# :: Addr# -> Int#
-> Addr#}}}. This could just be passed as an argument to
{{{unpackCString#}}}, thus I was quite confident that it would remain
backwards compatible and no extra rewrite rules were needed to maintain
the current behavior (extra rules are needed to use the length
information, e.g. to construct bytestrings, but this seems like an
acceptable cost).
However, I did not anticipate the let/app invariant, thus my original
design of {{{unpackCString# (cStringLitAddr# "foo"# 3#)}}} caused lint to
warn me. After reading up about the invariant, I decided that
{{{cStringLitAddr#}}}, applied to two literals, should be okay for
speculation, as it has no side effects and cannot fail.
However, while the generated core was now accepted, it was useless, as it
would not match the rewrite rules written by a user. Their rules would be
translated to something like {{{case cStringLitAddr# addr len of { tmp ->
unpackCString# tmp } }}}.
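At source level, the design can be sketched as follows. The definition of
{{{cStringLitAddr#}}} and the {{{NOINLINE}}} pragma are assumptions made
for illustration (the real version would be wired into the compiler), and
the Core shape in the comment is only an approximation of what the
desugarer produces for an unlifted argument that is not known to be
okay for speculation.

```haskell
{-# LANGUAGE MagicHash #-}
module Main (main) where

import GHC.Exts (Addr#, Int#, unpackCString#)

-- Hypothetical: returns the address unchanged and ignores the length,
-- so existing consumers of unpackCString# keep working.
cStringLitAddr# :: Addr# -> Int# -> Addr#
cStringLitAddr# addr _len = addr
{-# NOINLINE cStringLitAddr# #-}

-- Source form of the original design. Because the argument is unlifted,
-- the desugared Core looks roughly like
--   case cStringLitAddr# "foo"# 3# of tmp -> unpackCString# tmp
-- rather than a plain nested application.
foo :: String
foo = unpackCString# (cStringLitAddr# "foo"# 3#)

main :: IO ()
main = putStrLn foo
```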
Thus, I decided to generate matching core and removed my fix to make
{{{cStringLitAddr#}}} okay for speculation. In the current version, it is
possible to create a bytestring in O(1) with rewrite rules. However, I
have broken the general list fusion (or at least the built-in rules
{{{match_eq_string}}} and {{{match_append_lit}}}), as the case expression
gets in the way between {{{foldr}}} and {{{build}}}, so they are not
optimized out. Maybe this is a missed opportunity in general: if I have
{{{foo (case something of { tmp -> bar tmp }) }}}, maybe it should be
possible to apply a rule {{{foo (bar x) = baz x}}} anyway, leading to
{{{case something of { tmp -> baz tmp } }}}, iff {{{something}}} is safe
to evaluate with regard to time, space and exceptions (this is
okay-for-speculation, right?).
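For context, this is the kind of O(1) construction such a rule could
target. The sketch uses {{{unsafePackAddressLen}}} from the bytestring
package, which already exists; the length is written by hand here,
whereas the point of the proposal is that the compiler would supply it.
Whether the actual rules look like this is an assumption.

```haskell
{-# LANGUAGE MagicHash #-}
module Main (main) where

import qualified Data.ByteString as B
import Data.ByteString.Unsafe (unsafePackAddressLen)

-- Wraps the literal's bytes without copying or scanning for a NUL.
-- The caller is responsible for passing the correct length.
fooBytes :: IO B.ByteString
fooBytes = unsafePackAddressLen 3 "foo"#

main :: IO ()
main = do
  bs <- fooBytes
  print (B.length bs)
```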
So right now I am stuck. Maybe it is okay to break backwards compatibility
and just change the types of {{{unpackCString#}}} etc. to include an
additional (ignored) {{{Int#}}} argument, pushing some #ifs to everyone
using {{{unpackCString#}}} (I think this is basically text, bytestring and
ghc itself) for the next few years. However, {{{unpackCString#}}} is
called at some additional places, namely when constructing modules for
{{{Typeable}}}. Right now the types only carry the {{{Addr#}}} to call it,
but would then also need the length information (or there would be the
risk that something rewrites it and gets a bogus length, if one just
passes {{{0#}}} as length information). On the other hand, maybe it would
be a good thing to actually pass the length along to {{{unpackCString#}}},
making it mandatory, as this would avoid the need to null-terminate the
strings, allowing {{{'\NUL'}}} characters to be encoded with one byte
instead of two (which may be of interest for bytestring). Then again, I
could imagine this breaking things in subtle ways if strings are no
longer null-terminated...
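The difference between a terminator-based and a length-based unpack can
be seen with two functions that already exist in {{{GHC.Exts}}}. This is
only a sketch: the primitive literal and the hand-written length {{{3#}}}
stand in for what the compiler would generate.

```haskell
{-# LANGUAGE MagicHash #-}
module Main (main) where

import GHC.Exts (unpackCString#, unpackNBytes#)

-- Stops at the first NUL byte, so the embedded NUL truncates the string.
terminated :: String
terminated = unpackCString# "a\0b"#

-- Reads exactly three bytes, so the embedded NUL survives as one byte.
withLength :: String
withLength = unpackNBytes# "a\0b"# 3#

main :: IO ()
main = do
  print terminated
  print withLength
```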
--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/5218#comment:41>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler