[Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

Fri Feb 9 09:02:16 EST 2007

On Thu, 2007-02-08 at 17:01 -0800, John Meacham wrote:
> On Tue, Feb 06, 2007 at 03:16:17PM +0900, shelarcy wrote:
> > I'm afraid that its fantasy is broken again, as no surrogate
> > pair UCS-2 cover all language that is trusted before Europe
> > and America people.
> 
> UCS-2 is a disaster in every way. someone had to say it. :)
> 
> everything should be ascii, utf8 or ucs-4 or migrating to it.

Apparently UTF-16 (which is like UCS-2 but covers all code points by
using surrogate pairs) is a good internal format. It is more compact
than UTF-32 in almost all cases and a less complex encoding than UTF-8.
So it's reportedly faster than either UTF-32 (because of data density)
or UTF-8 (because of the encoding complexity). The downside compared to
UTF-32 is that UTF-16 is variable-width, so the code is harder to write
(but apparently this doesn't affect performance much because characters
outside the BMP are very rare).
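
To make the complexity and density trade-off concrete, here is a rough
sketch of UTF-16 encoding for a single character, plus a byte count for
each of the three encodings. The names encodeUtf16, utf8Len and
byteSizes are made up for illustration and are not from any existing
library.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Encode one character as UTF-16 code units (hypothetical helper).
-- BMP characters take one unit; everything else takes a surrogate pair.
-- Note: a Haskell Char can itself be a surrogate (U+D800..U+DFFF);
-- this sketch passes those through unchanged rather than rejecting them.
encodeUtf16 :: Char -> [Word16]
encodeUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [hi, lo]
  where
    n  = ord c
    m  = n - 0x10000
    hi = fromIntegral (0xD800 + (m `shiftR` 10))  -- top 10 bits
    lo = fromIntegral (0xDC00 + (m .&. 0x3FF))    -- bottom 10 bits

-- Number of bytes one character occupies in UTF-8.
utf8Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1
  | n < 0x800   = 2
  | n < 0x10000 = 3
  | otherwise   = 4
  where
    n = ord c

-- Bytes needed for a whole string as (UTF-8, UTF-16, UTF-32).
byteSizes :: String -> (Int, Int, Int)
byteSizes s =
  ( sum (map utf8Len s)
  , 2 * sum (map (length . encodeUtf16) s)
  , 4 * length s
  )
```

The only branch in encodeUtf16 is the BMP test, whereas utf8Len already
needs four cases just to find the length. byteSizes makes it easy to
compare the three encodings on sample text.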

The ICU lib uses UTF-16 internally, I believe, though at the moment I
can't find the bit on their website where they explain why they use
UTF-16 rather than -8 or -32.

http://icu.sourceforge.net/

Btw, when it comes to all these encoding names, I find it helpful to
maintain the fiction that there's no such thing (any more) as UCS-N,
there's only UTF-8, UTF-16 and UTF-32. This is also what the Unicode
consortium tries to encourage.

My view is that we should just provide all three:
Data.PackedString.UTF8
Data.PackedString.UTF16
Data.PackedString.UTF32

that all provide the same interface. This wouldn't actually be too much
code to write since most of it can re-use the streams code, so the only
difference is the single implementation per-encoding of:
stream   :: PackedString -> Stream Char
unstream :: Stream Char -> PackedString

and then we get fusion for free, of course.
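
As a rough sketch of what that could look like for the UTF-16 case:
the Step/Stream types below follow the usual stream fusion shape, but
the list-backed PackedString is only a stand-in (a real one would be a
packed array), and mapS, mapP and example are hypothetical names, not
an existing API.

```haskell
{-# LANGUAGE ExistentialQuantification #-}
import Data.Bits (shiftL, shiftR, (.&.))
import Data.Char (chr, ord, toUpper)
import Data.Word (Word16)

data Step s a = Done | Skip s | Yield a s
data Stream a = forall s. Stream (s -> Step s a) s

-- Assumption: a list of UTF-16 code units stands in for packed storage.
newtype PackedString = PackedString [Word16] deriving (Eq, Show)

-- Per-encoding part 1: decode code units into a stream of Chars.
stream :: PackedString -> Stream Char
stream (PackedString ws) = Stream next ws
  where
    next [] = Done
    next (hi : lo : rest)
      | hi >= 0xD800 && hi < 0xDC00 && lo >= 0xDC00 && lo < 0xE000
      = Yield (chr (0x10000
                    + ((fromIntegral hi - 0xD800) `shiftL` 10)
                    + (fromIntegral lo - 0xDC00))) rest
    -- Anything else (including a lone surrogate) is passed through as-is.
    next (w : rest) = Yield (chr (fromIntegral w)) rest

-- Per-encoding part 2: encode a stream of Chars back into code units.
unstream :: Stream Char -> PackedString
unstream (Stream next s0) = PackedString (go s0)
  where
    go s = case next s of
      Done       -> []
      Skip s'    -> go s'
      Yield c s' -> encode c ++ go s'
    encode c
      | n < 0x10000 = [fromIntegral n]
      | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                      , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
      where
        n = ord c
        m = n - 0x10000

-- Encoding-independent: written once against Stream, shared by all three.
mapS :: (a -> b) -> Stream a -> Stream b
mapS f (Stream next s0) = Stream next' s0
  where
    next' s = case next s of
      Done       -> Done
      Skip s'    -> Skip s'
      Yield a s' -> Yield (f a) s'

mapP :: (Char -> Char) -> PackedString -> PackedString
mapP f = unstream . mapS f . stream

-- Usage: upper-case "a" followed by U+1D11E (a non-BMP character).
example :: PackedString
example = mapP toUpper (PackedString [0x61, 0xD834, 0xDD1E])
```

Only stream and unstream know about UTF-16 here; mapS and mapP would be
identical for the UTF8 and UTF32 modules. The sketch leaves out the
rewrite rule that removes stream . unstream pairs, which is where the
actual fusion would come from.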

I have proposed this task as an MSc project in my department. Hopefully
we'll get a student to pick this up.

Duncan