[Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a
alistair at abayley.org
Mon Feb 5 10:56:25 EST 2007
On 05/02/07, Chris Kuklewicz <haskell at list.mightyreason.com> wrote:
> shelarcy wrote:
> > Many Haskell UTF-8 libraries don't support encodings longer than 3 bytes.
> UTF-8 uses 1, 2, 3, or 4 bytes. Anything that does not support 4 bytes does not
> support UTF-8.
Well, some of them are probably a bit dated; they likely supported an
older version of the standard.
> > But Takusen's implementation support it correctly.
> Takusen does have unreachable dead code to serialize a Char (as ord c :: Int)
> of up to 31 bits into as many as 6 bytes. And it does decode up to 6 bytes to 31
> bits and try to "chr" the result from Int back to Char. Decoding that many bits is
> not consistent with the UTF-8 standard.
> UTF-8 is a 4 byte encoding. There is no valid UTF-8 5 or 6 byte encoding.
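The 4-byte ceiling Chris describes can be sketched in Haskell (the function names here are my own, not Takusen's or CompactString's): the four RFC 3629 ranges cover everything up to U+10FFFF, so a correct encoder never emits a fifth byte.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)

-- Encode one Char to its UTF-8 bytes (kept as Ints for readability).
-- The four guards are the four RFC 3629 ranges; since Char tops out
-- at U+10FFFF, the 4-byte case is the last one needed.
encodeUtf8 :: Char -> [Int]
encodeUtf8 c
  | n <= 0x7F   = [n]                                             -- U+0000..U+007F
  | n <= 0x7FF  = [0xC0 + n `shiftR` 6, cont 0]                   -- U+0080..U+07FF
  | n <= 0xFFFF = [0xE0 + n `shiftR` 12, cont 6, cont 0]          -- U+0800..U+FFFF
  | otherwise   = [0xF0 + n `shiftR` 18, cont 12, cont 6, cont 0] -- up to U+10FFFF
  where
    n = ord c
    cont k = 0x80 + (n `shiftR` k) .&. 0x3F  -- continuation byte: 10xxxxxx
```

For example, encodeUtf8 '\x20AC' (the euro sign) gives [0xE2,0x82,0xAC], three bytes as expected.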
Chris is right here: Takusen's decoder is incorrect w.r.t. the
standard, in that it allows up to 6 bytes to encode a single char. If it were
correct, it would reject 5- and 6-byte sequences. I copied the extended
conversion from HXT's code, which was the most correct UTF-8 library I
had seen so far (it just didn't marshal directly from a CString, which
was what I was after).
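A strict decoder's first step can be sketched like this (a hypothetical helper, not HXT's or darcs's actual code): classify the lead byte and reject anything that would start the withdrawn 5- and 6-byte forms.

```haskell
-- Expected total sequence length implied by a UTF-8 lead byte, per
-- RFC 3629. Lead bytes 0xF8..0xFD began the old 5- and 6-byte forms,
-- and 0xC0/0xC1 can only begin overlong 2-byte forms; a strict
-- decoder returns Nothing for all of them.
utf8SeqLen :: Int -> Maybe Int
utf8SeqLen b
  | b <= 0x7F              = Just 1  -- ASCII
  | b >= 0xC2 && b <= 0xDF = Just 2
  | b >= 0xE0 && b <= 0xEF = Just 3
  | b >= 0xF0 && b <= 0xF4 = Just 4  -- 0xF5..0xF7 would exceed U+10FFFF
  | otherwise              = Nothing -- continuation, overlong, or 5/6-byte lead
```

With this in front, the 5- and 6-byte decoding paths are simply never reached.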
Turns out darcs has the most accurate UTF-8 encoders and decoders:
There's nothing stopping the Unicode consortium from expanding the
range of codepoints, is there? Or have they said that'll never happen?