[Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a Unicode ByteString

Deborah Goldsmith dgoldsmith at mac.com
Fri Feb 9 19:21:07 EST 2007


On Thu, 2007-02-08 at 17:01 -0800, John Meacham wrote:
> UCS-2 is a disaster in every way. someone had to say it. :)

UCS-2 has been deprecated for many years.

>
> everything should be ascii, utf8 or ucs-4 or migrating to it.

UCS-4 has also been deprecated for many years. The main forms of  
Unicode in use are UTF-16, UTF-8, and (less frequently) UTF-32.

On Feb 9, 2007, at 6:02 AM, Duncan Coutts wrote:
> Apparently UTF-16 (which is like UCS-2 but covers all code points) is a
> good internal format. It is more compact than UTF-32 in almost all cases
> and a less complex encoding than UTF-8. So it's faster than either
> UTF-32 (because of data-density) or UTF-8 (because of the encoding
> complexity). The downside compared to UTF-32 is that it is a more
> complex encoding so the code is harder to write (but apparently it
> doesn't affect performance much because characters outside the BMP are
> very rare).

UTF-16 is never less compact than UTF-32. The worst case of UTF-16 is  
that it is the same size as UTF-32. This only happens when a string  
consists entirely of characters from the supplementary planes.
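
To make the size comparison concrete, here is a minimal illustration
(assuming the encodeUtf16LE and encodeUtf32LE encoders from the text
package's Data.Text.Encoding are available; purely a sketch, not part of
any of the proposed libraries):

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Encoded sizes in bytes: UTF-16 uses 2 or 4 bytes per code point,
-- UTF-32 always uses 4.
sizes :: T.Text -> (Int, Int)
sizes t = ( B.length (TE.encodeUtf16LE t)
          , B.length (TE.encodeUtf32LE t) )

main :: IO ()
main = do
  print (sizes (T.pack "hello"))    -- (10,20): BMP text, UTF-16 is half the size
  print (sizes (T.pack "\x1D11E"))  -- (4,4):   supplementary plane, same size
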
>
> The ICU lib uses UTF-16 internally I believe, though I can't at the
> moment find on their website the bit where they explain why they use
> UTF-16 rather than -8 or -32.

http://icu.sourceforge.net/userguide/unicodeBasics.html

UTF-16 is the native Unicode encoding for ICU, Microsoft Windows, and  
Mac OS X.

> Btw, when it comes to all these encoding names, I find it helpful to
> maintain the fiction that there's no such thing (any more) as UCS-N,
> there's only UTF-8, 16 and 32. This is also what the Unicode consortium
> tries to encourage.

It's not a fiction. :-) UCS-2 and UCS-4 are *deprecated*, as a result of
the merger between Unicode and ISO 10646 that limited the code point
space to [0..0x10FFFF]. In addition to UTF-8, UTF-16, and UTF-32,
there's SCSU, a compressed form used in some applications. See:

http://www.unicode.org/reports/tr17/

> My view is that we should just provide all three:
> Data.PackedString.UTF8
> Data.PackedString.UTF16
> Data.PackedString.UTF32
>
> that all provide the same interface. This wouldn't actually be too much
> code to write since most of it can re-use the streams code, so the only
> difference is the single implementation per-encoding of:
> stream   :: PackedString -> Stream Char
> unstream :: Stream Char -> PackedString
>
> and then get fusion for free of course.

I agree that all three should be supported. UTF-16 is used in Windows  
and Mac OS X, and UTF-8 is widely used on Unix platforms (and at the  
BSD level of Mac OS X). UTF-32 matches the Char type in Haskell, and  
is used for wchar_t on some platforms. SCSU can be handled the same  
as a non-Unicode encoding (e.g., like GB2312 or Shift JIS).
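
As a rough sketch of the per-encoding stream/unstream pair described
above, here is what the UTF-16 case might look like. The Stream and
PackedString types below are deliberately simplified stand-ins (a plain
list and a list of code units); the real stream-fusion Stream type and a
real packed representation (an unboxed array) are more elaborate.

import Data.Word (Word16)
import Data.Char (chr, ord)

-- Stand-ins for illustration only.
newtype Stream a     = Stream [a]
newtype PackedString = PackedString [Word16]

-- Decode UTF-16 code units to Chars, combining surrogate pairs.
stream :: PackedString -> Stream Char
stream (PackedString us) = Stream (go us)
  where
    go [] = []
    go (hi:lo:rest)
      | 0xD800 <= hi, hi <= 0xDBFF, 0xDC00 <= lo, lo <= 0xDFFF
      = chr (0x10000 + (fromIntegral hi - 0xD800) * 0x400
                     + (fromIntegral lo - 0xDC00)) : go rest
    go (u:rest) = chr (fromIntegral u) : go rest

-- Encode Chars back to UTF-16 code units, emitting surrogate pairs
-- for characters outside the BMP.
unstream :: Stream Char -> PackedString
unstream (Stream cs) = PackedString (concatMap enc cs)
  where
    enc c
      | n < 0x10000 = [fromIntegral n]
      | otherwise   = [ fromIntegral (0xD800 + (n - 0x10000) `div` 0x400)
                      , fromIntegral (0xDC00 + (n - 0x10000) `mod` 0x400) ]
      where n = ord c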

Note that some Unicode algorithms require the ability to back up in a  
stream of code points, so that may be a consideration in the design  
(maybe they could be implemented in Haskell in a way that doesn't  
require that; I'm still learning, so I'm not sure yet). And regular  
expression processing requires essentially random access (same  
Haskell-fu considerations apply).
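
For example, backing up one code point in a UTF-16 buffer needs at most
one extra look-behind to check for a leading surrogate (in UTF-8 it can
take up to three, scanning back over continuation bytes). A hypothetical
prevChar, using Data.Array as a stand-in for the packed representation,
might look like this:

import Data.Word (Word16)
import Data.Char (chr)
import Data.Array

-- Decode the code point that ends just before code-unit index i,
-- returning it together with the index where it starts.
prevChar :: Array Int Word16 -> Int -> (Char, Int)
prevChar a i
  | isTrail u, i - 2 >= lo, isLead (a ! (i - 2))
  = ( chr (0x10000 + (fromIntegral (a ! (i - 2)) - 0xD800) * 0x400
                   + (fromIntegral u - 0xDC00))
    , i - 2 )
  | otherwise = (chr (fromIntegral u), i - 1)
  where
    u  = a ! (i - 1)
    lo = fst (bounds a)
    isLead  w = 0xD800 <= w && w <= 0xDBFF
    isTrail w = 0xDC00 <= w && w <= 0xDFFF
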
>
> I have proposed this task as an MSc project in my department. Hopefully
> we'll get a student to pick this up.

I hope so!

ICU has a BSD/MIT-style license, so feel free to steal whatever is  
appropriate from there. I would love to see Haskell support Unicode  
operations like locale-sensitive collation, text boundary analysis,  
and more on both [Char] and packed strings.

Deborah


