[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.

Deborah Goldsmith dgoldsmith at mac.com
Tue Oct 2 11:02:30 EDT 2007


On Oct 2, 2007, at 5:11 AM, ChrisK wrote:
> Deborah Goldsmith wrote:
>
>> UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon, and
>> is what appears in the APIs for all of them. UTF-16 is also what's
>> stored in the volume catalog on Mac disks. UTF-8 is only used in BSD
>> APIs for backward compatibility. It's also used in plain text files (or
>> XML or HTML), again for compatibility.
>>
>> Deborah
>
>
> On OS X, Cocoa and Carbon use Core Foundation, whose API does not have a
> one-true-encoding internally.  Follow the rather long URL for details:
>
> http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/index.html?http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/Articles/StringStorage.html#//apple_ref/doc/uid/20001179
>
> I would vote for an API that not just hides the internal store, but allows
> different internal stores to be used in a mostly compatible way.
>
> However, there is a UniChar typedef on OS X which is the same unsigned 16-bit
> integer as Java's JNI would use.

UTF-16 is the string representation used in all the APIs. Everything
else is considered an encoding conversion.

CoreFoundation uses UTF-16 internally except when the string fits
entirely in a single-byte legacy encoding like MacRoman or
MacCyrillic. If any kind of Unicode processing needs to be done to
the string, it is first coerced to UTF-16. If it weren't for
backward-compatibility issues, I think we'd use UTF-16 all the time,
as the machinery for switching encodings adds complexity. I wouldn't
advise that approach for a new library.
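
To make the trade-off concrete, here is a rough Haskell sketch of a
two-representation string of the kind described above. It is only an
illustration, not CoreFoundation's actual layout: the names
HybridString, mkHybrid, toUtf16 and lengthInCodeUnits are made up, and
Latin-1 stands in for a single-byte legacy encoding because it widens
to UTF-16 without a lookup table (MacRoman or MacCyrillic would need
one).

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word8, Word16)

-- Hypothetical string type with two internal stores.
data HybridString
  = SingleByte [Word8]  -- one byte per character (Latin-1 as a stand-in)
  | Utf16 [Word16]      -- UTF-16 code units
  deriving (Eq, Show)

-- Encode one code point as one or two UTF-16 code units.
encodeUtf16Char :: Char -> [Word16]
encodeUtf16Char c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10))  -- high surrogate
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ]  -- low surrogate
  where
    n = ord c
    m = n - 0x10000

-- Use the compact store only if every character fits in one byte.
mkHybrid :: String -> HybridString
mkHybrid s
  | all ((< 0x100) . ord) s = SingleByte (map (fromIntegral . ord) s)
  | otherwise               = Utf16 (concatMap encodeUtf16Char s)

-- Coerce to UTF-16 before doing any Unicode processing.
-- For Latin-1 this is a plain widening of each byte.
toUtf16 :: HybridString -> [Word16]
toUtf16 (SingleByte bs) = map fromIntegral bs
toUtf16 (Utf16 us)      = us

-- An operation that always goes through the UTF-16 view.
lengthInCodeUnits :: HybridString -> Int
lengthInCodeUnits = length . toUtf16
```

Every operation has to either handle both constructors or go through
toUtf16 first, which is the extra machinery I mean. A library that
stores only UTF-16 would need neither the case split nor the coercion.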

Deborah


