[GHC] #8730: Invalid Unicode Codepoints in Char

GHC ghc-devs at haskell.org
Sun Nov 16 21:38:45 UTC 2014


#8730: Invalid Unicode Codepoints in Char
------------------------------------------+-------------------------------------
              Reporter:  mdmenzel         |            Owner:  ekmett
                  Type:  bug              |           Status:  new
              Priority:  low              |        Milestone:
             Component:  Core Libraries   |          Version:  7.6.3
            Resolution:                   |         Keywords:  unicode
      Operating System:  Unknown/Multiple |     Architecture:  Unknown/Multiple
       Type of failure:  None/Unknown     |       Difficulty:  Unknown
             Test Case:                   |       Blocked By:
              Blocking:                   |  Related Tickets:
Differential Revisions:                   |
------------------------------------------+-------------------------------------
Changes (by thomie):

 * cc: batterseapower, core-libraries-committee@… (added)
 * owner:   => ekmett
 * component:  Compiler => Core Libraries


Comment:

 Thank you for the report. I am just adding some references.

 {{{
 Prelude Data.Char> all ((==) Surrogate . generalCategory) ['\xdc80' .. '\xdfff']
 True
 }}}
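
 The GHCi check above starts at U+DC80. As a minimal sketch, the same test can be run over the whole surrogate block (U+D800 through U+DFFF) using only `Data.Char`. The names `surrogateRange` and `allSurrogates` are hypothetical helpers introduced for this illustration; they are not part of any library.

```haskell
import Data.Char (GeneralCategory (Surrogate), chr, generalCategory, ord)

-- Hypothetical helper: every code point in the UTF-16 surrogate block.
surrogateRange :: [Char]
surrogateRange = ['\xd800' .. '\xdfff']

-- True when every Char in the list is classified as a surrogate.
allSurrogates :: [Char] -> Bool
allSurrogates = all ((== Surrogate) . generalCategory)

main :: IO ()
main = do
  -- Char literals in the surrogate block are accepted by the compiler.
  print (allSurrogates surrogateRange)
  -- chr accepts them as well and round-trips through ord.
  print (ord (chr 0xD800) == 0xD800)
```

 Both checks should succeed, which illustrates the point of the ticket: `Char` can hold code points that are not valid Unicode scalar values.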

 * http://www.unicode.org/versions/Unicode7.0.0/ch23.pdf
 * http://tools.ietf.org/html/rfc3629
 * http://en.wikipedia.org/wiki/UTF-8#Invalid_code_points:
 >According to the UTF-8 definition (RFC 3629) the high and low surrogate
 halves used by UTF-16 (U+D800 through U+DFFF) are not legal Unicode
 values, and their UTF-8 encoding should be treated as an invalid byte
 sequence.
 >Whether an actual application should do this is debatable, as it makes it
 impossible to store invalid UTF-16 (that is, UTF-16 with unpaired
 surrogate halves) in a UTF-8 string. This is necessary to store unchecked
 UTF-16 such as Windows filenames as UTF-8. It is also incompatible with
 CESU encoding (described below).

 In commit dc58b7398910a433259a6c0f58a0d05a48555191:
 {{{
 Author: Max Bolingbroke <>
 Date:   Sat May 14 22:50:46 2011 +0100

     Big patch to improve Unicode support in GHC. Validated on OS X and Windows,
     this patch series fixes #5061, #1414, #3309, #3308, #3307, #4006 and #4855.
 }}}
 This commit adds checks such as
 `... if isSurrogate c then done InvalidSequence ir ow else do ...`
 to GHC/IO/Encoding/UTF{8|16|32}.hs.
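
 The shape of such a check can be sketched in isolation. This is not the code from the commit: `Progress`, `Accepted`, and `checkChar` are hypothetical names, and `isSurrogate` and `InvalidSequence` are defined locally here rather than imported from GHC's encoding modules. The sketch assumes only that a surrogate is any code point in U+D800 through U+DFFF.

```haskell
import Data.Char (ord)

-- Hypothetical stand-in for the result a codec step might report.
data Progress = Accepted Char | InvalidSequence
  deriving (Eq, Show)

-- Local definition for this sketch: is the code point in U+D800..U+DFFF?
isSurrogate :: Char -> Bool
isSurrogate c = 0xD800 <= x && x <= 0xDFFF
  where
    x = ord c

-- Reject surrogates and accept every other Char.
checkChar :: Char -> Progress
checkChar c
  | isSurrogate c = InvalidSequence
  | otherwise     = Accepted c

main :: IO ()
main = mapM_ (print . checkChar) ['a', '\xd7ff', '\xd800', '\xdfff', '\xe000']
```

 The characters passed in `main` sit on either side of the surrogate block, so the boundaries at U+D7FF/U+D800 and U+DFFF/U+E000 are both exercised.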

--
Ticket URL: <http://ghc.haskell.org/trac/ghc/ticket/8730#comment:1>
GHC <http://www.haskell.org/ghc/>
The Glasgow Haskell Compiler

