[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
Deborah Goldsmith
dgoldsmith at mac.com
Tue Oct 2 18:04:29 EDT 2007
On Oct 2, 2007, at 8:44 AM, Jonathan Cast wrote:
> I would like to, again, strongly argue against sacrificing compatibility
> with Linux/BSD/etc. for the sake of compatibility with OS X or Windows.
> FFI bindings have to convert data formats in any case; Haskell shouldn't
> gratuitously break Linux support (or make life harder on Linux) just to
> support proprietary operating systems better.
>
> Now, if /independent of the details of MacOS X/, UTF-16 is better
> (objectively), it can be converted to anything by the FFI. But doing it
> the way Java or MacOS X or Win32 or anyone else does it, at the expense
> of Linux, I am strongly opposed to.
No one is advocating that. Any Unicode support library needs to
support exporting text as UTF-8 since it's so widely used. It's used
on Mac OS X, too, in exactly the same contexts it would be used on
Linux. However, UTF-8 is a poor choice for internal representation.
On Oct 2, 2007, at 2:32 PM, Stefan O'Rear wrote:
> UTF-8 supports CJK languages too. The only question is efficiency, and
> I believe CJK is still a relatively uncommon case compared to English
> and other Latin-alphabet languages. (That said, I live in a country all
> of whose dominant languages use the Latin alphabet)
First of all, countries that don't use the Latin alphabet already
represent a large fraction of computer usage and the computer market.
It is not at all "relatively uncommon." Japan alone is a huge market.
China is a huge market.
Second, it's not just CJK, but anything that's not mostly ASCII:
Russian, Greek, Thai, Arabic, Hebrew, and so on. UTF-8 is intended
for compatibility with existing software that expects multibyte
encodings. It doesn't work well as an internal representation. Again,
no one is saying a Unicode library shouldn't have full support for
input and output of UTF-8 (and other encodings).
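For a rough sense of the size side of this trade-off, here is a minimal Haskell sketch that counts how many bytes a string would occupy in UTF-8 and in UTF-16. The function names (utf8Len, utf16Len, utf8Bytes, utf16Bytes) are made up for illustration and are not part of any proposed library. The sketch assumes the input holds only valid Unicode scalar values (no lone surrogates) and ignores any byte-order mark.

```haskell
import Data.Char (ord)

-- | Bytes needed to encode one code point in UTF-8 (hypothetical helper).
utf8Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1   -- ASCII
  | n < 0x800   = 2   -- e.g. Greek, Cyrillic, Hebrew, Arabic
  | n < 0x10000 = 3   -- rest of the BMP, e.g. Thai, most CJK
  | otherwise   = 4   -- supplementary planes
  where
    n = ord c

-- | Bytes needed to encode one code point in UTF-16 (hypothetical helper).
-- BMP code points take one 16-bit unit; everything else takes a
-- surrogate pair.
utf16Len :: Char -> Int
utf16Len c
  | ord c < 0x10000 = 2
  | otherwise       = 4

-- | Total encoded size of a string in each encoding form.
utf8Bytes, utf16Bytes :: String -> Int
utf8Bytes  = sum . map utf8Len
utf16Bytes = sum . map utf16Len
```

Under these assumptions, a six-letter Russian word takes 12 bytes in either encoding, a Thai word of six code points takes 18 bytes in UTF-8 against 12 in UTF-16, and three CJK ideographs take 9 against 6. Plain ASCII is the case where UTF-8 comes out at half the size. Size is only one factor and says nothing by itself about processing cost.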
If you want to process ASCII text and squeeze out every last ounce of
performance, use byte strings. Unicode strings should be optimized for
representing and processing human language text, a large share of
which is not in the Latin alphabet.
Remember, speakers of English and other Latin-alphabet languages are a
minority in the world, though not in the computer-using world. Yet.
Deborah