[Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

Alistair Bayley alistair at abayley.org
Mon Feb 5 10:56:25 EST 2007


On 05/02/07, Chris Kuklewicz <haskell at list.mightyreason.com> wrote:
> shelarcy wrote:

> > Many Haskell UTF-8 libraries don't support encodings over 3 bytes.
>
> UTF-8 uses 1, 2, 3, or 4 bytes.  Anything that does not support 4 bytes does
> not support UTF-8.

Well, some of them are probably a bit dated; they likely supported an
older version of the standard.


> > But Takusen's implementation supports it correctly.
>
> Takusen does have unreachable dead code to serialize Char as (ord c :: Int)
> up to 31 bits into as many as 6 bytes.  But it does decode up to 6 bytes to 31
> bits and try to "chr" this from Int to Char.  Decoding that many bits is not
> consistent with the UTF-8 standard.
> UTF-8 is a 4 byte encoding.  There is no valid UTF-8 5 or 6 byte encoding.

Chris is right here: Takusen's decoder is incorrect w.r.t. the
standard, because it allows up to 6 bytes to encode a single char. If
it were correct, it would reject 5- and 6-byte sequences. I copied the
extended conversion from HXT's code, which was the most correct UTF-8
library I had seen so far (it just didn't marshal directly from a
CString, which was what I was after).
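
To make the rejection rule concrete, here is a minimal sketch of a
strict decoder. It is not Takusen's, HXT's, or darcs's code: the names
decodeChar and decodeUtf8 are made up for illustration, and it assumes
the input is a plain [Word8] rather than a CString. It accepts 1 to 4
byte sequences only, and also rejects overlong forms, surrogates, and
anything above U+10FFFF.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- | Decode one code point from the front of a byte list.
-- Returns the character and the remaining bytes, or Nothing if the
-- input does not start with a well-formed UTF-8 sequence.
-- (Hypothetical helper, for illustration only.)
decodeChar :: [Word8] -> Maybe (Char, [Word8])
decodeChar [] = Nothing
decodeChar (b0 : rest)
  | b0 < 0x80 = Just (chr (fromIntegral b0), rest) -- 1 byte:  0xxxxxxx
  | b0 < 0xC0 = Nothing                            -- stray continuation byte
  | b0 < 0xE0 = multi 1 (b0 .&. 0x1F) 0x80         -- 2 bytes: 110xxxxx
  | b0 < 0xF0 = multi 2 (b0 .&. 0x0F) 0x800        -- 3 bytes: 1110xxxx
  | b0 < 0xF8 = multi 3 (b0 .&. 0x07) 0x10000      -- 4 bytes: 11110xxx
  | otherwise = Nothing                            -- 5- and 6-byte lead bytes
  where
    -- n continuation bytes, payload bits of the lead byte, and the
    -- smallest code point that legitimately needs this length.
    multi :: Int -> Word8 -> Int -> Maybe (Char, [Word8])
    multi n lead minCp
      | length conts /= n            = Nothing -- truncated sequence
      | not (all isCont conts)       = Nothing -- bad continuation byte
      | cp < minCp                   = Nothing -- overlong encoding
      | cp > 0x10FFFF                = Nothing -- beyond the Unicode range
      | cp >= 0xD800 && cp <= 0xDFFF = Nothing -- surrogate code point
      | otherwise                    = Just (chr cp, rest')
      where
        (conts, rest') = splitAt n rest
        cp = foldl step (fromIntegral lead) conts
        step acc b = (acc `shiftL` 6) .|. fromIntegral (b .&. 0x3F)

    -- Continuation bytes have the form 10xxxxxx.
    isCont b = b .&. 0xC0 == 0x80

-- | Decode a whole byte list, failing on the first malformed sequence.
decodeUtf8 :: [Word8] -> Maybe String
decodeUtf8 [] = Just []
decodeUtf8 bs = do
  (c, rest) <- decodeChar bs
  (c :) <$> decodeUtf8 rest
```

With this, decodeUtf8 [0xF8, 0x88, 0x80, 0x80, 0x80] gives Nothing,
since 0xF8 would be the lead byte of a 5-byte sequence. Returning Maybe
is just one choice; a real library might substitute U+FFFD or report
the byte offset instead.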

Turns out darcs has the most accurate UTF-8 encoders and decoders:
  http://abridgegame.org/cgi-bin/darcs.cgi/darcs/UTF8.lhs?c=annotate

There's nothing stopping the Unicode Consortium from expanding the
range of code points, is there? Or have they said that'll never happen?

Alistair

