Text in Haskell: A PROPOSAL
Ashley Yakeley
ashley@semantic.org
Wed, 7 Aug 2002 23:05:28 -0700
At 2002-08-07 17:37, Sven Moritz Hallberg wrote:
>Hm, what about encodeCharUTF16? Would that return Word16s?
UTF-16 may represent a single Char as one or two Word16s.
encodeCharUTF16 :: Char -> [Word16];
or
encodeCharUTF16 :: Char -> (Word16,Maybe Word16);
or
encodeCharUTF16 :: Char -> Either Word16 (Word16,Word16);
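For illustration, here is a minimal sketch of the first of those signatures. It is only one possible implementation, not part of any existing library; it assumes the standard surrogate-pair arithmetic and makes no attempt to reject Chars that already lie in the surrogate range.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Sketch only: encode one Char as one or two UTF-16 code units.
-- Code points below 0x10000 fit in a single Word16; everything above
-- is split into a high and a low surrogate.
-- Assumption: a Char in 0xD800..0xDFFF is passed through unchanged.
encodeCharUTF16 :: Char -> [Word16]
encodeCharUTF16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [hi, lo]
  where
    n  = ord c
    n' = n - 0x10000                                -- 20 significant bits
    hi = fromIntegral (0xD800 + (n' `shiftR` 10))   -- top 10 bits
    lo = fromIntegral (0xDC00 + (n' .&. 0x3FF))     -- bottom 10 bits
```

The list form is the easiest to feed into concatMap; the tuple and Either forms say more in the type about the "one or two" constraint.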
>Hrm. But then, how to write that to a file?
Depends on what order you want the halves of each Word16.
Unicode 3.0 defines four character encoding forms/schemes: UTF-8, UTF-16,
UTF-16LE, and UTF-16BE. UTF-16 encodes as 16-bit units; the other three
encode as 8-bit units.
So you might have something like this:
encodeUTF8 :: String -> [Word8];
encodeUTF16 :: String -> [Word16];
encodeUTF16LE :: String -> [Word8];
encodeUTF16BE :: String -> [Word8];
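As a rough sketch of how the UTF-16 variants might relate to each other: the LE and BE forms can be built from the 16-bit form by splitting each Word16 into two bytes in the chosen order. The helpers bytesLE and bytesBE are hypothetical names introduced here, and encodeCharUTF16 is the list-returning variant, written out again so the example stands alone.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8, Word16)

-- Sketch only: one Char to one or two UTF-16 code units.
encodeCharUTF16 :: Char -> [Word16]
encodeCharUTF16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (n' `shiftR` 10))
                  , fromIntegral (0xDC00 + (n' .&. 0x3FF)) ]
  where
    n  = ord c
    n' = n - 0x10000

encodeUTF16 :: String -> [Word16]
encodeUTF16 = concatMap encodeCharUTF16

-- Hypothetical helpers: split a code unit into two bytes.
-- fromIntegral from Word16 to Word8 keeps only the low 8 bits.
bytesLE, bytesBE :: Word16 -> [Word8]
bytesLE w = [fromIntegral w, fromIntegral (w `shiftR` 8)]  -- low byte first
bytesBE w = [fromIntegral (w `shiftR` 8), fromIntegral w]  -- high byte first

encodeUTF16LE, encodeUTF16BE :: String -> [Word8]
encodeUTF16LE = concatMap bytesLE . encodeUTF16
encodeUTF16BE = concatMap bytesBE . encodeUTF16
```

Neither function emits a byte order mark here. The resulting [Word8] is what you would actually write to a file.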
The authority here is Unicode Technical Report 17, which is part of the
Unicode Standard.
<http://www.unicode.org/unicode/reports/tr17/>
But watch out... I've noticed a certain amount of incoherence in the
Unicode standards. For instance, Unicode 3.0 sec. 2.3 refers to UTF-16 as
one of four "Character Encoding Schemes", each of which is "an encoding
form plus byte serialization", even though UTF-16 by itself doesn't include
byte serialization.
--
Ashley Yakeley, Seattle WA