[Haskell-cafe] How does GHC read UNICODE.

Tue May 20 07:03:27 EDT 2008

On Tue, 2008-05-20 at 09:30 +0200, Ketil Malde wrote:
> Don Stewart <dons at galois.com> writes:
> 
> > You can use either bytestrings, which will ignore any encoding, 
> 
> Uh, I am hesitant to voice my protest here, but I think this bears
> some elaboration:
> 
> Bytestrings are exactly that, strings of bytes.

Yes, we tried to make it explicit.

> Basically, bytestrings are the wrong tool for the job if you need more
> than 8 bits per character.

Right. It's not intended for text, except for those 8-bit mixed
binary/ASCII network protocols, file formats, etc.
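
To make the 8-bit limit concrete, here is a minimal sketch, assuming the
bytestring package that ships with GHC. Data.ByteString.Char8.pack keeps
only the low 8 bits of each Char, so anything above U+00FF is silently
truncated. The helper names `truncated` and `fitsInBytes` are made up for
this illustration and are not part of any library.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Char (ord)

-- Pack a String with Char8.pack and return the bytes actually stored.
-- Char8.pack truncates every Char to its low 8 bits.
truncated :: String -> [Int]
truncated = map fromIntegral . B.unpack . BC.pack

-- Hypothetical helper: True when every character fits in one byte
-- and therefore survives Char8.pack unchanged.
fitsInBytes :: String -> Bool
fitsInBytes = all ((< 256) . ord)
```

For instance, `truncated "\955"` (the letter lambda, U+03BB) gives `[187]`,
i.e. only the low byte 0xBB, while plain ASCII such as "abc" passes
through intact.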

> I think the predecessors of bytestring (FPS?) had support for other
> fixed-size encodings, that is, two-byte and four-byte characters.

I'm not sure about that, but there is the old Data.PackedString, which
uses UTF-32. There is no fixed-size two-byte encoding that covers all of
Unicode: UCS-2 only reaches the Basic Multilingual Plane, and UTF-16 is
variable width.
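
As a rough sketch of why UTF-16 is variable width, the function below
encodes a single Char into UTF-16 code units using only base. The name
`utf16Units` is hypothetical, and the sketch does not reject lone
surrogate code points (U+D800..U+DFFF), which a real encoder would have
to handle.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Encode one Char as UTF-16 code units.
-- Code points below U+10000 take one 16-bit unit; everything above
-- needs a surrogate pair, hence the variable width.
-- Assumption: the input is not itself a surrogate code point.
utf16Units :: Char -> [Word16]
utf16Units c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [hi, lo]
  where
    n  = ord c
    m  = n - 0x10000                                 -- 20-bit offset
    hi = fromIntegral (0xD800 + (m `shiftR` 10))     -- high surrogate
    lo = fromIntegral (0xDC00 + (m .&. 0x3FF))       -- low surrogate
```

So 'A' is one unit (0x0041), whereas U+1D11E (musical G clef) becomes the
pair 0xD834 0xDD1E. A fixed two-byte array type cannot represent the
latter without treating it as two "characters".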

> Perhaps writing a Data.Word16String bytestrings-alike using UCS-2
> would be an option?

I'm supervising a master's student who is working on a proper Unicode ADT
with a similar API and underlying implementation to that of ByteString.
Hopefully people will be able to start using that for an internal
representation of text instead of ByteString.

Duncan