UTF8 (was Re: Hexdump)
Simon Marlow
simonmarhaskell at gmail.com
Tue Mar 21 12:02:41 EST 2006
Malcolm Wallace wrote:
> Oops, I wrote:
>
> fromUTF8 (w:ws)
> | w < 0x80 {- 0xxxxxxx -} = toEnum (fromEnum w) : fromUTF8 ws
> | w >= 0xc0 {- 1111110x -} = bytes 5 (fromEnum (w`mask`0x01)) ws
> | w >= 0xe0 {- 111110xx -} = bytes 4 (fromEnum (w`mask`0x03)) ws
> | w >= 0xf0 {- 11110xxx -} = bytes 3 (fromEnum (w`mask`0x07)) ws
> | w >= 0xf8 {- 1110xxxx -} = bytes 2 (fromEnum (w`mask`0x0f)) ws
> | w >= 0xfc {- 110xxxxx -} = bytes 1 (fromEnum (w`mask`0x1f)) ws
>
> which should of course have been
>
> fromUTF8 (w:ws)
> | w < 0x80 {- 0xxxxxxx -} = toEnum (fromEnum w) : fromUTF8 ws
> | w >= 0xfc {- 1111110x -} = bytes 5 (fromEnum (w`mask`0x01)) ws
> | w >= 0xf8 {- 111110xx -} = bytes 4 (fromEnum (w`mask`0x03)) ws
> | w >= 0xf0 {- 11110xxx -} = bytes 3 (fromEnum (w`mask`0x07)) ws
> | w >= 0xe0 {- 1110xxxx -} = bytes 2 (fromEnum (w`mask`0x0f)) ws
> | w >= 0xc0 {- 110xxxxx -} = bytes 1 (fromEnum (w`mask`0x1f)) ws
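Getting a UTF-8 decoder right is non-trivial: besides the length prefixes,
it has to cope with malformed, truncated and overlong sequences. Take a
look at this decoder stress test: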
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
I made a half-hearted attempt to get most of this right in GHC's UTF-8
decoder, but by no means all of it is implemented. I do think it would
be nice if the Haskell implementation were correct, for some value of
correct, though.
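
To give an idea of the extra checks involved, here is a minimal sketch of
a stricter decoder. It is not GHC's decoder; the names decodeUtf8, multi
and replacement are made up for illustration. It assumes the RFC 3629
limits (at most 4 bytes per character, nothing above U+10FFFF, no
surrogates) and assumes that every malformed sequence should turn into a
single U+FFFD rather than an error. Other policies are equally defensible.

```haskell
import Data.Bits (shiftL, (.&.), (.|.))
import Data.Char (chr)
import Data.Word (Word8)

-- Character emitted in place of each malformed sequence (an assumed policy).
replacement :: Char
replacement = '\xFFFD'

-- Hypothetical strict decoder: dispatch on the lead byte, then validate.
decodeUtf8 :: [Word8] -> String
decodeUtf8 [] = []
decodeUtf8 (w:ws)
  | w < 0x80  = chr (fromIntegral w) : decodeUtf8 ws          -- 0xxxxxxx
  | w < 0xc0  = replacement : decodeUtf8 ws                   -- stray 10xxxxxx
  | w < 0xe0  = multi 1 0x80    (fromIntegral w .&. 0x1f) ws  -- 110xxxxx
  | w < 0xf0  = multi 2 0x800   (fromIntegral w .&. 0x0f) ws  -- 1110xxxx
  | w < 0xf8  = multi 3 0x10000 (fromIntegral w .&. 0x07) ws  -- 11110xxx
  | otherwise = replacement : decodeUtf8 ws                   -- 5/6-byte forms rejected

-- Read n continuation bytes; minCode is the smallest code point that
-- legitimately needs a sequence of this length.
multi :: Int -> Int -> Int -> [Word8] -> String
multi n minCode = go n
  where
    go 0 acc rest
      | acc < minCode                  = replacement : decodeUtf8 rest  -- overlong
      | acc >= 0xd800 && acc <= 0xdfff = replacement : decodeUtf8 rest  -- surrogate
      | acc > 0x10ffff                 = replacement : decodeUtf8 rest  -- out of range
      | otherwise                      = chr acc : decodeUtf8 rest
    go k acc (c:rest)
      | c .&. 0xc0 == 0x80 =
          go (k - 1) ((acc `shiftL` 6) .|. fromIntegral (c .&. 0x3f)) rest
    -- Truncated input or a non-continuation byte: emit one replacement and
    -- resume decoding at the offending byte.
    go _ _ rest = replacement : decodeUtf8 rest
```

For example, decodeUtf8 [0xc3, 0xa9] gives "\xe9", whereas the overlong
pair [0xc0, 0x80] gives a single U+FFFD instead of a NUL character.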
Cheers,
Simon