[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.

Tue Oct 2 11:44:52 EDT 2007

On Tue, 2007-10-02 at 08:02 -0700, Deborah Goldsmith wrote:
> On Oct 2, 2007, at 5:11 AM, ChrisK wrote:
> > Deborah Goldsmith wrote:
> >
> >> UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon, and
> >> is what appears in the APIs for all of them. UTF-16 is also what's
> >> stored in the volume catalog on Mac disks. UTF-8 is only used in BSD
> >> APIs for backward compatibility. It's also used in plain text files (or
> >> XML or HTML), again for compatibility.
> >>
> >> Deborah
> >
> >
> > On OS X, Cocoa and Carbon use Core Foundation, whose API does not have a
> > one-true-encoding internally.  Follow the rather long URL for details:
> >
> > http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/index.html?http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/Articles/StringStorage.html#//apple_ref/doc/uid/20001179
> >
> > I would vote for an API that not just hides the internal store, but allows
> > different internal stores to be used in a mostly compatible way.
> >
> > However, there is a UniChar typedef on OS X which is the same unsigned
> > 16-bit integer as Java's JNI would use.
> 
> UTF-16 is the type used in all the APIs. Everything else is  
> considered an encoding conversion.
> 
> CoreFoundation uses UTF-16 internally except when the string fits  
> entirely in a single-byte legacy encoding like MacRoman or  
> MacCyrillic. If any kind of Unicode processing needs to be done to  
> the string, it is first coerced to UTF-16. If it weren't for  
> backwards compatibility issues, I think we'd use UTF-16 all the time  
> as the machinery for switching encodings adds complexity. I wouldn't  
> advise it for a new library.

I would like to, again, strongly argue against sacrificing compatibility
with Linux/BSD/etc. for the sake of compatibility with OS X or Windows.
FFI bindings have to convert data formats in any case; Haskell shouldn't
gratuitously break Linux support (or make life harder on Linux) just to
support proprietary operating systems better.

Now, if UTF-16 is objectively better /independent of the details of
MacOS X/, fine: the FFI can convert it to anything.  But I am strongly
opposed to doing it the way Java or MacOS X or Win32 or anyone else does
it at the expense of Linux.
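
To make the conversion point concrete, here is a minimal sketch of what
"converted by the FFI" amounts to at the data level. It is not taken from
the proposed library: every function name below is made up for
illustration, plain lists of Word16/Word8 stand in for whatever buffer
type a real binding would marshal, and replacing unpaired surrogates with
U+FFFD is just one possible error policy. It uses only base.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word16, Word8)

-- Hypothetical helpers: surrogate range checks for UTF-16 code units.
isHigh, isLow :: Word16 -> Bool
isHigh u = u >= 0xD800 && u <= 0xDBFF
isLow  u = u >= 0xDC00 && u <= 0xDFFF

-- Encode one Char as one UTF-16 code unit or a surrogate pair.
-- Note: a Char that is itself a surrogate code point is passed through.
encodeUtf16Char :: Char -> [Word16]
encodeUtf16Char c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]
  where
    n = ord c
    m = n - 0x10000

encodeUtf16 :: String -> [Word16]
encodeUtf16 = concatMap encodeUtf16Char

-- Decode UTF-16 code units; unpaired surrogates become U+FFFD
-- (an assumed policy, not the only reasonable one).
decodeUtf16 :: [Word16] -> String
decodeUtf16 (hi : lo : rest)
  | isHigh hi && isLow lo =
      chr (0x10000 + ((fromIntegral hi - 0xD800) `shiftL` 10)
                   + (fromIntegral lo - 0xDC00))
        : decodeUtf16 rest
decodeUtf16 (u : rest)
  | isHigh u || isLow u = '\xFFFD' : decodeUtf16 rest
  | otherwise           = chr (fromIntegral u) : decodeUtf16 rest
decodeUtf16 [] = []

-- Encode one Char as 1 to 4 UTF-8 bytes.
encodeUtf8Char :: Char -> [Word8]
encodeUtf8Char c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [lead 0xC0 6, cont 0]
  | n < 0x10000 = [lead 0xE0 12, cont 6, cont 0]
  | otherwise   = [lead 0xF0 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading byte: length prefix plus the top bits of the code point.
    lead :: Int -> Int -> Word8
    lead prefix s = fromIntegral (prefix .|. (n `shiftR` s))
    -- Continuation byte: 10xxxxxx carrying six bits starting at bit s.
    cont :: Int -> Word8
    cont s = fromIntegral (0x80 .|. ((n `shiftR` s) .&. 0x3F))

-- What a binding would do at the boundary: UTF-16 in, UTF-8 out.
utf16ToUtf8 :: [Word16] -> [Word8]
utf16ToUtf8 = concatMap encodeUtf8Char . decodeUtf16
```

For example, utf16ToUtf8 (encodeUtf16 "A") yields the single byte 0x41.
Whichever encoding is stored internally, the other one is a linear pass
like this away, so the choice of internal format need not be dictated by
any one platform's API.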

jcc