[Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a
haskell at list.mightyreason.com
Mon Feb 5 10:25:45 EST 2007
> Hello Twan,
> On Mon, 05 Feb 2007 08:46:35 +0900, Twan van Laarhoven <twanvl at gmail.com> wrote:
>> I would like to announce my attempt at making a Unicode version of
>> Data.ByteString. The library is named Data.CompactString to avoid
>> conflict with other (Fast)PackedString libraries.
> How about adding an abstraction layer?
> Spencer Janssen tried to provide an abstraction layer for a Unicode ByteString
> in last year's Summer of Code project.
> It has no Unicode support, but it supplied a good layer, the Stringable class.
>> The library uses a variable length encoding (1 to 3 bytes) of Chars into
>> Word8s, which are then stored in a ByteString. The structure is very
>> much based on Data.ByteString, most of the implementation is copied from
>> there. Hopefully this means that fusion rules could be copied as well.
> UTF-8 also uses 4 to 6 byte encodings now.
> CJK Unified Ideographs Extension B, Tai Xuan Jing Symbol and Music Symbol,
> etc ... use 4 byte encoding.
Looking at several sources, it seems you are incorrect.
Haskell Char values go up to Unicode code point 1114111 (decimal), or 0x10ffff (hexadecimal).
These are encoded by UTF-8 in 1, 2, 3, or 4 bytes.
CJK Unified Ideographs Extension B starts at 131072, or 0x20000.
Tai Xuan Jing Symbols start at 119552, or 0x1d300.
These are all within the official UTF-8 encoding scheme.
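To make the ranges concrete, here is a minimal sketch of how many bytes UTF-8 needs for a given Char. The name utf8Length is my own for illustration and is not taken from any of the libraries discussed here.

```haskell
import Data.Char (ord)

-- | Number of bytes UTF-8 needs for a code point.
-- Hypothetical helper for illustration; surrogates are not treated specially.
utf8Length :: Char -> Int
utf8Length c
  | n < 0x80    = 1  -- ASCII
  | n < 0x800   = 2
  | n < 0x10000 = 3  -- rest of the Basic Multilingual Plane
  | otherwise   = 4  -- up to maxBound :: Char, which is '\x10FFFF'
  where
    n = ord c
```

Both '\x20000' and '\x1d300' fall into the last case, so they need 4 bytes and nothing longer.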
> Many Haskell UTF-8 libraries don't support encodings longer than 3 bytes.
UTF-8 uses 1, 2, 3, or 4 bytes. Anything that does not support 4 bytes does not
support UTF-8.
> But Takusen's implementation supports it correctly.
Takusen does have unreachable dead code that serializes a Char, via (ord c :: Int),
of up to 31 bits into as many as 6 bytes. Its decoder, however, does accept up to
6 bytes, decodes them to 31 bits, and tries to "chr" the result from Int to Char.
Decoding that many bits is not consistent with the UTF-8 standard.
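As a rough sketch of the check a decoder would need before calling chr (this is not Takusen's code, and safeChr is a hypothetical name), the decoded Int can be validated against the Char range first. In GHC, chr raises an error for values above 0x10ffff, so a 31-bit value cannot simply be passed through.

```haskell
import Data.Char (chr, ord)

-- | Convert a decoded code point to a Char only if UTF-8 allows it.
-- Hypothetical helper for illustration.
safeChr :: Int -> Maybe Char
safeChr n
  | n < 0 || n > ord (maxBound :: Char) = Nothing  -- outside 0 .. 0x10ffff
  | n >= 0xD800 && n <= 0xDFFF          = Nothing  -- surrogates are not valid in UTF-8
  | otherwise                           = Just (chr n)
```

Anything a 5 or 6 byte sequence could add beyond the 4 byte range lands in the first guard and is rejected.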
> How about supporting 4 to 6 byte encodings?
UTF-8 is an encoding of at most 4 bytes per character. There is no valid 5 or 6
byte UTF-8 encoding.
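For illustration, a small encoder sketch shows that the 4 byte form already covers every Haskell Char. The name encodeUtf8Char is my own, and the sketch assumes the input is not a surrogate; it is not the encoder of Data.CompactString or Takusen.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode one Char as 1 to 4 UTF-8 bytes.
-- Hypothetical helper for illustration; surrogates are not rejected here.
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. lead 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. lead 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. lead 18, cont 12, cont 6, cont 0]
  where
    n = ord c

    -- Payload bits for the lead byte: whatever remains above the shift.
    lead :: Int -> Word8
    lead s = fromIntegral (n `shiftR` s)

    -- Continuation byte: the marker bits 10 followed by six payload bits.
    cont :: Int -> Word8
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)
```

The largest Char, '\x10ffff', comes out as the four bytes 0xF4 0x8F 0xBF 0xBF, so no longer form is needed.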
> Best Regards,