[Haskell] ANNOUNCE: Data.CompactString 0.1 - my attempt at a
Unicode ByteString
Deborah Goldsmith
dgoldsmith at mac.com
Fri Feb 9 19:21:07 EST 2007
On Thu, 2007-02-08 at 17:01 -0800, John Meacham wrote:
> UCS-2 is a disaster in every way. someone had to say it. :)
UCS-2 has been deprecated for many years.
>
> everything should be ascii, utf8 or ucs-4 or migrating to it.
UCS-4 has also been deprecated for many years. The main forms of
Unicode in use are UTF-16, UTF-8, and (less frequently) UTF-32.
On Feb 9, 2007, at 6:02 AM, Duncan Coutts wrote:
> Apparently UTF-16 (which is like UCS-2 but covers all code points) is a
> good internal format. It is more compact than UTF-32 in almost all cases
> and a less complex encoding than UTF-8. So it's faster than either
> UTF-32 (because of data-density) or UTF-8 (because of the encoding
> complexity). The downside compared to UTF-32 is that it is a more
> complex encoding so the code is harder to write (but apparently it
> doesn't affect performance much because characters outside the BMP are
> very rare).
UTF-16 is never less compact than UTF-32. The worst case of UTF-16 is
that it is the same size as UTF-32. This only happens when a string
consists entirely of characters from the supplementary planes.
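
To make the size comparison concrete, here is a minimal sketch that counts encoded bytes for a Haskell String. The function names are made up for illustration and are not from any existing library.

```haskell
import Data.Char (ord)

-- Number of 16-bit code units UTF-16 needs for one code point.
utf16Units :: Char -> Int
utf16Units c
  | ord c < 0x10000 = 1   -- BMP: a single code unit
  | otherwise       = 2   -- supplementary planes: a surrogate pair

-- Encoded size in bytes (hypothetical helpers, illustration only).
utf16Bytes :: String -> Int
utf16Bytes = sum . map ((* 2) . utf16Units)

utf32Bytes :: String -> Int
utf32Bytes s = 4 * length s
```

Each code point costs 2 or 4 bytes in UTF-16 and always 4 bytes in UTF-32, so utf16Bytes s <= utf32Bytes s for every s, with equality only when every character is outside the BMP.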
>
> The ICU lib uses UTF-16 internally I believe, though I can't at the
> moment find on their website the bit where they explain why they use
> UTF-16 rather than -8 or -32.
http://icu.sourceforge.net/userguide/unicodeBasics.html
UTF-16 is the native Unicode encoding for ICU, Microsoft Windows, and
Mac OS X.
> Btw, when it comes to all these encoding names, I find it helpful to
> maintain the fiction that there's no such thing (any more) as UCS-N,
> there's only UTF-8, 16 and 32. This is also what the Unicode consortium
> tries to encourage.
It's not a fiction. :-) UCS-2 and UCS-4 are *deprecated*, by the
merger between Unicode and ISO 10646 that limited the code point
space to [0..0x10FFFF]. In addition to UTF-8, UTF-16, and UTF-32,
there's SCSU, a compressed form used in some applications. See:
http://www.unicode.org/reports/tr17/
> My view is that we should just provide all three:
> Data.PackedString.UTF8
> Data.PackedString.UTF16
> Data.PackedString.UTF32
>
> that all provide the same interface. This wouldn't actually be too much
> code to write since most of it can re-use the streams code, so the only
> difference is the single implementation per-encoding of:
> stream :: PackedString -> Stream Char
> unstream :: Stream Char -> PackedString
>
> and then get fusion for free of course.
I agree that all three should be supported. UTF-16 is used in Windows
and Mac OS X, and UTF-8 is widely used on Unix platforms (and at the
BSD level of Mac OS X). UTF-32 matches the Char type in Haskell, and
is used for wchar_t on some platforms. SCSU can be handled the same
as a non-Unicode encoding (e.g., like GB2312 or Shift JIS).
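
As a rough sketch of what the per-encoding pair could look like for UTF-16, here is a version over plain lists. The lists stand in for the packed representation and the Stream type, which I am assuming rather than quoting from the streams code, and the names encodeUtf16 and decodeUtf16 are hypothetical.

```haskell
import Data.Bits (shiftL, shiftR, (.&.))
import Data.Char (chr, ord)
import Data.Word (Word16)

-- Hypothetical stand-in for 'unstream': code points to UTF-16 code units.
encodeUtf16 :: String -> [Word16]
encodeUtf16 = concatMap enc
  where
    enc c
      | n < 0x10000 = [fromIntegral n]                 -- BMP: one unit
      | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))   -- high surrogate
                      , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]   -- low surrogate
      where
        n = ord c
        m = n - 0x10000

-- Hypothetical stand-in for 'stream': UTF-16 code units to code points.
-- Unpaired surrogates are replaced with U+FFFD.
decodeUtf16 :: [Word16] -> String
decodeUtf16 (hi : lo : rest)
  | isHigh hi && isLow lo =
      chr (0x10000 + ((fromIntegral hi - 0xD800) `shiftL` 10)
                   + (fromIntegral lo - 0xDC00)) : decodeUtf16 rest
decodeUtf16 (u : rest)
  | isHigh u || isLow u = '\xFFFD' : decodeUtf16 rest
  | otherwise           = chr (fromIntegral u) : decodeUtf16 rest
decodeUtf16 [] = []

isHigh, isLow :: Word16 -> Bool
isHigh u = u >= 0xD800 && u <= 0xDBFF
isLow  u = u >= 0xDC00 && u <= 0xDFFF
```

For any String that contains no surrogate code points, decodeUtf16 (encodeUtf16 s) gives back s. A UTF-8 or UTF-32 module would supply its own pair with the same types apart from the code unit width.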
Note that some Unicode algorithms require the ability to back up in a
stream of code points, so that may be a consideration in the design
(maybe they could be implemented in Haskell in a way that doesn't
require that; I'm still learning, so I'm not sure yet). And regular
expression processing requires essentially random access (same
Haskell-fu considerations apply).
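
To illustrate what backing up means for UTF-16, here is a small sketch that finds where the previous code point starts. It assumes well-formed input, and prevOffset is a made-up name. It uses list indexing only to stay self-contained; a packed array would make each lookup constant time.

```haskell
import Data.Word (Word16)

-- Offset of the code point that ends just before offset i.
-- Assumes well-formed UTF-16 and 0 < i <= length units.
prevOffset :: [Word16] -> Int -> Int
prevOffset units i
  | i >= 2 && isLow (units !! (i - 1)) && isHigh (units !! (i - 2)) = i - 2
  | otherwise = i - 1
  where
    isHigh u = u >= 0xD800 && u <= 0xDBFF   -- first half of a surrogate pair
    isLow  u = u >= 0xDC00 && u <= 0xDFFF   -- second half of a surrogate pair
```

With UTF-32 the step back is always one unit, and with UTF-8 it means skipping backward over continuation bytes, so the operation is cheap in all three encodings as long as the representation allows indexing from the current position.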
>
> I have proposed this task as an MSc project in my department. Hopefully
> we'll get a student to pick this up.
I hope so!
ICU has a BSD/MIT-style license, so feel free to steal whatever is
appropriate from there. I would love to see Haskell support Unicode
operations like locale-sensitive collation, text boundary analysis,
and more on both [Char] and packed strings.
Deborah