[Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

Mon Feb 5 10:25:45 EST 2007

shelarcy wrote:
> Hello Twan,
> 
> On Mon, 05 Feb 2007 08:46:35 +0900, Twan van Laarhoven <twanvl at gmail.com> wrote:
>> I would like to announce my attempt at making a Unicode version of
>> Data.ByteString. The library is named Data.CompactString to avoid
>> conflict with other (Fast)PackedString libraries.
> 
> How about adding an abstraction layer?
> 
> Spencer Janssen tried to provide an abstraction layer for a Unicode ByteString
> in last year's Summer of Code project.
> It has no Unicode support, but it supplies a good layer: the Stringable class.
> 
> http://code.google.com/soc/haskell/appinfo.html?csaid=B934AEBE95120AB2
> http://darcs.haskell.org/SoC/fps-soc/
> http://darcs.haskell.org/SoC/fps-soc-aug21/
> 
> 
>> The library uses a variable length encoding (1 to 3 bytes) of Chars into
>> Word8s, which are then stored in a ByteString. The structure is very
>> much based on Data.ByteString, most of the implementation is copied from
>> there. Hopefully this means that fusion rules could be copied as well.
> 
> UTF-8 also uses 4 to 6 byte encodings now.
> CJK Unified Ideographs Extension B, Tai Xuan Jing Symbol and Music Symbol,
> etc ... use 4 byte encoding.

Looking at several sources, it seems you are incorrect about the 5 and 6 byte
encodings.

Haskell Chars go up to Unicode code point 1114111 (decimal) or 0x10ffff (hexadecimal).
These are encoded by UTF-8 in 1, 2, 3, or 4 bytes.

CJK Unified Ideographs Extension B starts at 131072 or 0x20000
Tai Xuan Jing Symbols start at 119552 or 0x1d300

Both are above 0xffff and below 0x10ffff, so they need 4 bytes and are all
within the official UTF-8 encoding scheme.
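
To make the arithmetic concrete, here is a minimal sketch of the byte count per
code point range. The name utf8Length is my own for illustration and is not
taken from Data.CompactString or any other library, and it does not treat the
surrogate range specially.

```haskell
import Data.Char (ord)

-- | Number of bytes UTF-8 needs for one Char.
-- Hypothetical helper for illustration only.
utf8Length :: Char -> Int
utf8Length c
  | n < 0x80    = 1   -- ASCII
  | n < 0x800   = 2
  | n < 0x10000 = 3   -- rest of the Basic Multilingual Plane
  | otherwise   = 4   -- 0x10000 up to maxBound :: Char (0x10ffff)
  where
    n = ord c
```

With this, utf8Length '\x20000' (start of CJK Extension B) and
utf8Length '\x1d300' (start of Tai Xuan Jing Symbols) both give 4, as does
utf8Length maxBound.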

> 
> Many Haskell UTF-8 libraries don't support encodings longer than 3 bytes.

UTF-8 uses 1, 2, 3, or 4 bytes.  Anything that does not support 4 bytes does not
support UTF-8.
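
For reference, a rough sketch of what supporting all four lengths looks like
on the encoding side. encodeUtf8Char is a hypothetical name of mine, it returns
a plain list of Word8 rather than a ByteString, and it does no validation of
surrogates.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- | Encode one Char as 1 to 4 UTF-8 bytes.
-- Illustrative sketch, not code from any existing library.
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6,  cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6,  cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    hi, cont :: Int -> Word8
    -- payload bits that go into the lead byte
    hi s   = fromIntegral (n `shiftR` s)
    -- continuation byte: 10xxxxxx carrying 6 payload bits
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)
```

For example, encodeUtf8Char '\x20000' yields [0xF0, 0xA0, 0x80, 0x80] and
encodeUtf8Char maxBound yields [0xF4, 0x8F, 0xBF, 0xBF], so no Char ever needs
a fifth byte.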

> But Takusen's implementation support it correctly.

Takusen does have code to serialize a Char as (ord c :: Int) of up to 31 bits
into as many as 6 bytes, but that code is unreachable and dead, since a Char
never exceeds 0x10ffff.  On the decoding side, however, it does decode up to
6 bytes into 31 bits and tries to "chr" this from Int to Char.  Decoding that
many bits is not consistent with the UTF-8 standard.
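
To illustrate the problem with calling chr on a 31 bit value: a decoder can
check the code point first. The sketch below is my own and is not Takusen's
code; toCharChecked is a hypothetical name.

```haskell
import Data.Char (chr)

-- | Turn a decoded code point into a Char, or Nothing if it is not a
-- value UTF-8 is allowed to carry. Illustrative sketch only.
toCharChecked :: Int -> Maybe Char
toCharChecked n
  | n < 0 || n > 0x10FFFF      = Nothing  -- outside Unicode; chr would raise an error
  | n >= 0xD800 && n <= 0xDFFF = Nothing  -- surrogates are not valid in UTF-8
  | otherwise                  = Just (chr n)
```

Anything a 5 or 6 byte sequence could add beyond 4 bytes lands in the first
branch, so it can only ever produce Nothing.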

> 
> http://darcs.haskell.org/takusen/Foreign/C/UTF8.hs
> http://www.haskell.org/pipermail/libraries/2007-February/006841.html
> 
> How about support 4 to 6 byte encodings?

UTF-8 is at most a 4 byte encoding.  There is no valid 5 or 6 byte UTF-8 encoding.
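
A decoder can enforce this from the first byte alone. The following is a small
sketch based on the lead byte ranges in RFC 3629; sequenceLength is a
hypothetical name, and a real decoder would still have to validate the
continuation bytes.

```haskell
import Data.Word (Word8)

-- | Sequence length announced by a UTF-8 lead byte, or Nothing if the
-- byte cannot start a valid sequence. Illustrative sketch only.
sequenceLength :: Word8 -> Maybe Int
sequenceLength b
  | b <= 0x7F              = Just 1
  | b >= 0xC2 && b <= 0xDF = Just 2   -- 0xC0 and 0xC1 would be overlong
  | b >= 0xE0 && b <= 0xEF = Just 3
  | b >= 0xF0 && b <= 0xF4 = Just 4   -- 0xF5 and up would exceed 0x10ffff
  | otherwise              = Nothing  -- continuation bytes and 5/6 byte leads
```

The old 5 and 6 byte lead bytes (0xF8 to 0xFD) fall into the last branch and
are rejected.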

> 
> 
> Best Regards,
>