[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.

Tue Oct 2 18:04:29 EDT 2007

On Oct 2, 2007, at 8:44 AM, Jonathan Cast wrote:
> I would like to, again, strongly argue against sacrificing compatibility
> with Linux/BSD/etc. for the sake of compatibility with OS X or Windows.
> FFI bindings have to convert data formats in any case; Haskell shouldn't
> gratuitously break Linux support (or make life harder on Linux) just to
> support proprietary operating systems better.
>
> Now, if /independent of the details of MacOS X/, UTF-16 is better
> (objectively), it can be converted to anything by the FFI.  But doing it
> the way Java or MacOS X or Win32 or anyone else does it, at the expense
> of Linux, I am strongly opposed to.

No one is advocating that. Any Unicode support library needs to  
support exporting text as UTF-8 since it's so widely used. It's used  
on Mac OS X, too, in exactly the same contexts it would be used on  
Linux. However, UTF-8 is a poor choice for internal representation.

On Oct 2, 2007, at 2:32 PM, Stefan O'Rear wrote:
> UTF-8 supports CJK languages too.  The only question is efficiency, and
> I believe CJK is still a relatively uncommon case compared to English
> and other Latin-alphabet languages.  (That said, I live in a country all
> of whose dominant languages use the Latin alphabet)

First of all, non-Latin countries already represent a large fraction  
of computer usage and the computer market. It is not at all  
"relatively uncommon." Japan alone is a huge market. China is a huge  
market.

Second, it's not just CJK, but anything that's not mostly ASCII:
Russian, Greek, Thai, Arabic, Hebrew, and so on. UTF-8 is intended
for compatibility with existing software that expects multibyte
encodings. It doesn't work well as an internal representation for such
text: every non-ASCII character takes two to four bytes, and most CJK
characters take three bytes where UTF-16 needs two. Again, no one is
saying a Unicode library shouldn't have full support for input and
output of UTF-8 (and other encodings).
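
As a rough illustration of the size side of this argument, here is a
small sketch in Python (not part of any proposed Haskell library). The
helper name encoded_sizes and the sample strings are made up for this
example. It only measures storage, which is one factor among several
when choosing an internal representation.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, with no byte-order mark."""
    utf8 = text.encode("utf-8")
    # "utf-16-le" fixes the byte order, so Python does not prepend a BOM.
    utf16 = text.encode("utf-16-le")
    return len(utf8), len(utf16)


# Hypothetical sample words, chosen only to cover three scripts.
samples = {
    "English": "hello",
    "Russian": "привет",
    "Japanese": "こんにちは",
}

for name, text in samples.items():
    u8, u16 = encoded_sizes(text)
    print(f"{name}: {len(text)} code points, "
          f"{u8} bytes as UTF-8, {u16} bytes as UTF-16")
```

For these samples, ASCII text is half the size in UTF-8, the Cyrillic
word is the same size in both encodings, and the Japanese word is one
and a half times larger in UTF-8 than in UTF-16.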

If you want to process ASCII text and squeeze out every last ounce of  
performance, use byte strings. Unicode strings should be optimized for  
representing and processing human language text, a large share of  
which is not in the Latin alphabet.

Remember, speakers of English and other Latin-alphabet languages are a  
minority in the world, though not in the computer-using world. Yet.

Deborah