[GHC] #5218: Add unpackCStringLen# to create Strings from string literals

Wed Jul 26 15:38:13 UTC 2017

#5218: Add unpackCStringLen# to create Strings from string literals
-------------------------------------+-------------------------------------
        Reporter:  tibbe             |                Owner:  thoughtpolice
            Type:  feature request   |               Status:  patch
        Priority:  normal            |            Milestone:
       Component:  Compiler          |              Version:  7.0.3
      Resolution:                    |             Keywords:
Operating System:  Unknown/Multiple  |         Architecture:
 Type of failure:  Runtime           |  Unknown/Multiple
  performance bug                    |            Test Case:
      Blocked By:                    |             Blocking:
 Related Tickets:  #5877 #10064      |  Differential Rev(s):  Phab:D2443
  #11312                             |
       Wiki Page:                    |
-------------------------------------+-------------------------------------

Comment (by bgamari):

 Here is where we stand on this:

 This bug seeks to address the fact that we currently have few good ways of
 encoding literal strings verbatim (i.e. as raw, unchanged bytes) in object
 code. This is because we insist on encoding primitive strings as
 NUL-terminated modified UTF-8. This means that things like `bytestring` and
 `text` have a rather complicated and inefficient handling of these
 literals. This inefficiency has two causes:
  * One needs to look for and correctly handle the U+0000 codepoints
 (encoded as `0xc0 0x80`) in the primitive string
  * It's impossible to know what the length of the string is without
 walking it
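
 As a rough illustration of these two costs, here is a minimal model that
 works on a plain list of bytes rather than on an `Addr#`. The names
 `cstringLength`, `rawBytes` and `encoded` are made up for this sketch and
 are not taken from `bytestring`, `text` or GHC itself.

 {{{#!hs
 import Data.Word (Word8)

 -- Hypothetical model of a NUL-terminated, modified UTF-8 literal:
 -- the length is only known after walking to the terminating 0x00.
 cstringLength :: [Word8] -> Int
 cstringLength = length . takeWhile (/= 0x00)

 -- Recover the intended bytes: the two-byte sequence 0xC0 0x80 stands
 -- for an embedded U+0000 and has to be rewritten to a single 0x00,
 -- while a bare 0x00 marks the end of the literal.
 rawBytes :: [Word8] -> [Word8]
 rawBytes (0xC0 : 0x80 : rest) = 0x00 : rawBytes rest
 rawBytes (0x00 : _)           = []
 rawBytes (b : rest)           = b : rawBytes rest
 rawBytes []                   = []

 -- The literal "a\0b" as it would be laid out under this encoding.
 encoded :: [Word8]
 encoded = [0x61, 0xC0, 0x80, 0x62, 0x00]
 }}}

 Here `cstringLength encoded` is 4 while `rawBytes encoded` has only three
 elements, so a consumer can neither take the length cheaply nor copy the
 bytes without inspecting each one.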

 The solution here is to rework our desugaring of primitive strings such
 that,
 {{{#!hs
 "hello"#
 }}}
 will be desugared as,
 {{{#!hs
 let x = "hello"# :: Addr#
 in (# 5#, x #)
 }}}

 This means that we can encode the string contents in plain UTF-8 without a
 NUL terminator. The type of `unpackCString#` then becomes,
 {{{#!hs
 unpackCString# :: (# Int#, Addr# #) -> String
 }}}
 and the implementation gets a tiny bit simpler (since it decodes a fixed
 number of bytes instead of scanning for a NUL). Consequently, libraries
 can then provide rules matching on `unpackCString#` applications,
 replacing them with what is essentially `memcpy`.
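
 To make the length-driven decoding concrete, here is a minimal sketch
 under some loud assumptions: `unpackLatin1Len#` is a hypothetical name, not
 the function proposed in the patch, and it reads one 8-bit character per
 byte via `indexCharOffAddr#`, so it skips the UTF-8 decoding a real
 implementation would need. It also writes the length by hand, since
 current GHC does not produce the pair for us (and still appends a NUL to
 `"hello"#`, which the sketch simply never reads).

 {{{#!hs
 {-# LANGUAGE MagicHash, UnboxedTuples #-}

 import GHC.Exts
   ( Addr#, Int#, Char (C#), indexCharOffAddr#, isTrue#, (+#), (>=#) )

 -- Hypothetical length-aware unpacker: reads exactly n bytes starting
 -- at addr and never looks for a terminator.
 unpackLatin1Len# :: (# Int#, Addr# #) -> String
 unpackLatin1Len# (# n, addr #) = go 0#
   where
     go :: Int# -> String
     go i
       | isTrue# (i >=# n) = []
       | otherwise         = C# (indexCharOffAddr# addr i) : go (i +# 1#)

 -- The pair a desugared literal might carry, written out manually.
 example :: String
 example = unpackLatin1Len# (# 5#, "hello"# #)
 }}}

 Because the byte count is available up front, a library rule matching on
 such an application could size its buffer once and copy the bytes in bulk
 instead of walking them one at a time.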

 This is for the most part a simple change, with the exception being GHCi
 support due to the need for unboxed tuples. jscholl started implementing
 this nearly a year ago but stalled. I recently rebased his work
 (Phab:D2443) and addressed several of the issues that came up in review.
 Unfortunately, GHCi currently crashes with a segmentation fault, which
 will take some time to work out.

 Note that there are two related problems that this does not address,
  * pure ASCII literals (which might be used to, for instance, encode a
 binary representation of a static `Array`)
  * `ByteArray#` literals, as requested in ticket:11312

-- 
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/5218#comment:73>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler