[Haskell-cafe] How does GHC read UNICODE.

Ketil Malde ketil at malde.org
Tue May 20 03:30:57 EDT 2008


Don Stewart <dons at galois.com> writes:

> You can use either bytestrings, which will ignore any encoding, 

Uh, I am hesitant to voice my protest here, but I think this bears
some elaboration:

Bytestrings are exactly that: strings of bytes.
There are basically two interfaces: one (Data.ByteString[.Lazy])
operates on raw bytes (and gives you Word8s), the other
(Data.ByteString[.Lazy].Char8) treats the contents as Chars.
The latter only handles Unicode code points 0..255 (i.e. ISO 8859-1,
Latin-1), and truncates higher code points to fit that range.
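
For illustration (this snippet is mine, not from the original
discussion), the truncation looks like this:

import qualified Data.ByteString.Char8 as C

main :: IO ()
main = do
  -- '\955' is U+03BB (Greek small letter lambda); Char8.pack keeps only
  -- the low 8 bits, so the character comes back as '\187' (0xBB).
  let bs = C.pack "\955"
  print (C.unpack bs)        -- prints "\187"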

Basically, bytestrings are the wrong tool for the job if you need more
than 8 bits per character.  I think the predecessor of bytestring
(FPS?) had support for other fixed-size encodings, that is, two-byte
and four-byte characters.  Perhaps writing a Data.Word16String
bytestring-alike using UCS-2 would be an option?
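
To make that idea a bit more concrete, here is a rough sketch of packing
a String as big-endian UCS-2 code units into an ordinary ByteString.
The names (packUCS2, unpackUCS2) are made up for illustration, and
anything outside the BMP is simply not handled:

import Data.Bits (shiftR, (.&.))
import Data.Char (chr, ord)
import qualified Data.ByteString as B

-- Two big-endian bytes per character; code points above 0xFFFF
-- (outside the BMP) are not handled here.
packUCS2 :: String -> B.ByteString
packUCS2 = B.pack . concatMap enc
  where
    enc c = let n = ord c
            in [fromIntegral (n `shiftR` 8), fromIntegral (n .&. 0xFF)]

unpackUCS2 :: B.ByteString -> String
unpackUCS2 = go . B.unpack
  where
    go (hi:lo:rest) = chr (fromIntegral hi * 256 + fromIntegral lo) : go rest
    go _            = []

This only illustrates the fixed-width encoding; a proper Data.Word16String
would presumably store Word16s directly rather than going through [Word8].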

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants

