[commit: base] master: Improve documentation for mkTextEncoding (f0ca95f)
Max Bolingbroke
batterseapower at hotmail.com
Tue Apr 23 22:14:13 CEST 2013
Repository : ssh://darcs.haskell.org//srv/darcs/packages/base
On branch : master
https://github.com/ghc/packages-base/commit/f0ca95f7628b37733e343a1d9ce96ca367fe8001
>---------------------------------------------------------------
commit f0ca95f7628b37733e343a1d9ce96ca367fe8001
Author: Max Bolingbroke <batterseapower at hotmail.com>
Date: Tue Apr 23 19:15:40 2013 +0100
Improve documentation for mkTextEncoding
>---------------------------------------------------------------
GHC/IO/Encoding.hs | 28 ++++++++++++++++++++++++----
1 files changed, 24 insertions(+), 4 deletions(-)
diff --git a/GHC/IO/Encoding.hs b/GHC/IO/Encoding.hs
index bd54182..052955c 100644
--- a/GHC/IO/Encoding.hs
+++ b/GHC/IO/Encoding.hs
@@ -175,8 +175,8 @@ char8 = Latin1.latin1
--
-- * @UTF-32@, @UTF-32BE@, @UTF-32LE@
--
--- On systems using GNU iconv (e.g. Linux), there is additional
--- notation for specifying how illegal characters are handled:
+-- There is additional notation (borrowed from GNU iconv) for specifying
+-- how illegal characters are handled:
--
-- * a suffix of @\/\/IGNORE@, e.g. @UTF-8\/\/IGNORE@, will cause
-- all illegal sequences on input to be ignored, and on output
@@ -186,6 +186,28 @@ char8 = Latin1.latin1
-- * a suffix of @\/\/TRANSLIT@ will choose a replacement character
-- for illegal sequences or code points.
--
+-- * a suffix of @\/\/ROUNDTRIP@ will use a PEP383-style escape mechanism
+-- to represent any invalid bytes in the input as Unicode codepoints (specifically,
+-- as lone surrogates, which are normally invalid in UTF-32).
+-- Upon output, these special codepoints are detected and turned back into the
+-- corresponding original byte.
+--
+-- In theory, this mechanism allows arbitrary data to be roundtripped via
+-- a 'String' with no loss of data. In practice, there are two limitations
+-- to be aware of:
+--
+-- 1. This only stands a chance of working for an encoding which is an ASCII
+-- superset, as for security reasons we refuse to escape any bytes smaller
+-- than 128. Many encodings of interest are ASCII supersets (in particular,
+-- you can assume that the locale encoding is an ASCII superset) but many
+-- (such as UTF-16) are not.
+--
+-- 2. If the underlying encoding is not itself roundtrippable, this mechanism
+-- can fail. Roundtrippable encodings are those which have an injective mapping
+-- into Unicode. Almost all encodings meet this criteria, but some do not. Notably,
+-- Shift-JIS (CP932) and Big5 contain several different encodings of the same
+-- Unicode codepoint.
+--
-- On Windows, you can access supported code pages with the prefix
-- @CP@; for example, @\"CP1250\"@.
--
@@ -194,8 +216,6 @@ mkTextEncoding e = case mb_coding_failure_mode of
Nothing -> unknownEncodingErr e
Just cfm -> mkTextEncoding' cfm enc
where
- -- The only problem with actually documenting //IGNORE and //TRANSLIT as
- -- supported suffixes is that they are not necessarily supported with non-GNU iconv
(enc, suffix) = span (/= '/') e
mb_coding_failure_mode = case suffix of
"" -> Just ErrorOnCodingFailure
More information about the ghc-commits
mailing list