[GHC] #5218: Add unpackCStringLen# to create Strings from string literals

Sun Jul 31 10:40:42 UTC 2016

#5218: Add unpackCStringLen# to create Strings from string literals
-------------------------------------+-------------------------------------
        Reporter:  tibbe             |                Owner:  thoughtpolice
            Type:  feature request   |               Status:  new
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  7.0.3
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #5877 #10064      |  Differential Rev(s):
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by jscholl):

 I tried implementing a {{{String#}}} type which would carry the length as
 an {{{Int#}}} at its beginning and two functions to extract the length and
 address of the string literal. However, it quickly got a little bit out of
 hand:

 - {{{unpackCString#}}} etc. had to be adapted, breaking backwards
 compatibility. To avoid this, I tried to create wrapper functions
 {{{unpackCStringLit#}}}, which would extract the address and call the
 original {{{unpackCString#}}} function.
 - I could not work out how to adapt the rewrite rules dealing
 with strings without duplicating them for the {{{Addr#}}} and
 {{{String#}}} versions. I could also not figure out when
 {{{unpackCStringLit#}}} should inline to avoid the overhead of the new
 address computation.
 - It took a while to find all (most?) of the places (library, some
 hardcoded types in {{{Id}}}s, a place in the type checker, generating
 record selector errors) where types were wired in, especially for
 exceptions like {{{absentError}}} and {{{recSelError}}}.
 - Implementing a new {{{String#}}} also raised the question whether
 {{{"foo"##}}} should be the corresponding literal for it. However, adding
 it from the parser to the backend seemed quite complex, so I tried a
 different approach.

 Instead of creating a new type {{{String#}}}, I rewrote
 {{{unpackCStringLit#}}} to have the type {{{Addr# -> Int# -> [Char]}}}. It
 would then just throw its second argument away and inline in some phase.
 However, it still meant duplicating rewrite rules, which did not seem like
 an ideal solution.
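
 A minimal sketch of that second attempt, assuming only the type given
 above. The wrapper body, the choice of inlining phase {{{[1]}}} and the
 name {{{demoLit}}} are illustrative assumptions, not the actual patch:

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.CString (unpackCString#)
import GHC.Exts (Addr#, Int#)

-- Hypothetical wrapper with the type described above: it takes the
-- literal's address and its length, ignores the length, and defers to
-- the existing unpackCString#.
unpackCStringLit# :: Addr# -> Int# -> [Char]
unpackCStringLit# addr _len = unpackCString# addr
-- Assumption: delay inlining until phase 1 so that rules mentioning
-- unpackCStringLit# get a chance to fire first.
{-# INLINE [1] unpackCStringLit# #-}

-- What a desugared "foo" might look like under this scheme.
demoLit :: String
demoLit = unpackCStringLit# "foo"# 3#
```

 Loaded into GHCi, {{{demoLit}}} evaluates to {{{"foo"}}}. The length is
 only available to rules written against {{{unpackCStringLit#}}}, which is
 why the existing {{{unpackCString#}}} rules would have to be duplicated.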

 My next idea was to push the length information into an ignored argument
 to a function giving us the address: {{{cStringLitAddr# :: Addr# -> Int#
 -> Addr#}}}. This could just be passed as an argument to
 {{{unpackCString#}}}, thus I was quite confident that it would remain
 backwards compatible and no extra rewrite rules were needed to maintain
 the current behavior. Extra rules would still be needed to use the length
 information, e.g. to construct bytestrings, but this seems like an
 acceptable cost.
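
 A sketch of this design, assuming only the type of {{{cStringLitAddr#}}}
 given above. The {{{NOINLINE}}} pragma, the consumer {{{litLength}}} and
 its rule are hypothetical stand-ins for a library function (such as a
 bytestring constructor) that wants the length; whether such a rule
 actually fires depends on the Core that GHC generates for the nested
 application:

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.CString (unpackCString#)
import GHC.Exts (Addr#, Int#, Int (I#))

-- Returns the address unchanged; the length argument is ignored at run
-- time and exists only so that rewrite rules can see it.
cStringLitAddr# :: Addr# -> Int# -> Addr#
cStringLitAddr# addr _len = addr
-- Assumption: keep the call visible so rules can match on it.
{-# NOINLINE cStringLitAddr# #-}

-- Hypothetical consumer that benefits from knowing the length.
litLength :: String -> Int
litLength = length
{-# NOINLINE litLength #-}

-- Hypothetical user rule: take the length from the literal instead of
-- walking the unpacked list. The result is the same either way.
{-# RULES
"litLength/cStringLitAddr#" forall a n.
    litLength (unpackCString# (cStringLitAddr# a n)) = I# n
  #-}

-- Existing consumers of unpackCString# keep working unchanged.
demoString :: String
demoString = unpackCString# (cStringLitAddr# "foo"# 3#)

demoLength :: Int
demoLength = litLength (unpackCString# (cStringLitAddr# "foo"# 3#))
```

 Here {{{demoString}}} is {{{"foo"}}} and {{{demoLength}}} is {{{3}}}
 whether or not the rule fires, since the rule only replaces an O(n)
 traversal with the stored length.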

 However, I did not anticipate the let/app invariant, thus my original
 design of {{{unpackCString# (cStringLitAddr# "foo"# 3#)}}} caused lint to
 warn me. After reading up about the invariant, I decided that
 {{{cStringLitAddr#}}}, applied to two literals, should be okay for
 speculation, as it has no side effects and cannot fail.
 However, while the generated Core was now accepted, it was useless, as it
 would not match the rewrite rules written by a user. Their rules would be
 translated to something like {{{case cStringLitAddr# addr len of { tmp ->
 unpackCString# tmp } }}}.

 Thus, I decided to generate matching core and removed my fix to make
 {{{cStringLitAddr#}}} okay for speculation. In the current version, it is
 possible to create a bytestring in O(1) with rewrite rules. However, I
 have broken the general list fusion (or at least the built-in rules
 {{{match_eq_string}}} and {{{match_append_lit}}}), as the case expression
 gets in the way between {{{foldr}}} and {{{build}}}, so they are not
 optimized out. Maybe this is a missed opportunity in general: if I have
 {{{foo (case something of { tmp -> bar tmp }) }}}, maybe it should be
 possible to apply a rule {{{foo (bar x) = baz x}}} anyway, leading to
 {{{case something of { tmp -> baz tmp } }}}, iff {{{something}}} is safe
 to evaluate with regards to time, space and exceptions (this is
 okay-for-speculation, right?).

 So right now I am stuck. Maybe it is okay to break backwards compatibility
 and just change the types of {{{unpackCString#}}} etc. to include an
 additional (ignored) {{{Int#}}} argument, pushing some #ifs to everyone
 using {{{unpackCString#}}} (I think this is basically text, bytestring and
 ghc itself) for the next few years. However, {{{unpackCString#}}} is
 called at some additional places, namely when constructing modules for
 {{{Typeable}}}. Right now the types only carry the {{{Addr#}}} to call it,
 but would then also need the length information (or there would be the
 risk that something rewrites it and gets a bogus length, if one just
 passes {{{0#}}} as length information). On the other hand, maybe it would
 be a good thing to actually pass the length along to {{{unpackCString#}}},
 making it mandatory, as this would avoid the need to null-terminate the
 strings, allowing {{{'\NUL'}}} characters to be encoded with one byte
 instead of two (which may be of interest for bytestring). Then again, I
 could imagine this breaking stuff in subtle ways if strings are no longer
 null-terminated...
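
 To illustrate why an explicit length removes the need for the terminator,
 here is a small sketch using the existing {{{unpackCString#}}} and
 {{{unpackNBytes#}}} from {{{GHC.CString}}} on a primitive literal with an
 embedded zero byte. The names {{{nulTerminated}}} and {{{withLength}}}
 are made up for the example:

```haskell
{-# LANGUAGE MagicHash #-}
import GHC.CString (unpackCString#, unpackNBytes#)

-- Relies on the terminator: unpacking stops at the first zero byte,
-- so everything after the embedded NUL is lost.
nulTerminated :: String
nulTerminated = unpackCString# "a\0b"#

-- Relies on an explicit length: all three bytes are unpacked,
-- including the embedded NUL.
withLength :: String
withLength = unpackNBytes# "a\0b"# 3#
```

 With the length passed along, the zero byte is ordinary data. Any code
 that still scans for the terminator would silently truncate such a
 string, which is the kind of subtle breakage mentioned above.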

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/5218#comment:41>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler