[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
wnoise at ofb.net
Thu Sep 27 15:25:45 EDT 2007
On 2007-09-27, Aaron Denney <wnoise at ofb.net> wrote:
> On 2007-09-27, Deborah Goldsmith <dgoldsmith at mac.com> wrote:
>> On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
>>>> UTF-16 has no advantage over UTF-8 in this respect, because of
>>>> pairs and combining characters.
>>> Good point.
>> Well, not so much. As Duncan mentioned, it's a matter of what the most
>> common case is. UTF-16 is effectively fixed-width for the majority of
>> text in the majority of languages. Combining sequences and surrogate
>> pairs are relatively infrequent.
> Infrequent, but they exist, which means you can't seek x/2 bytes ahead
> to seek x characters ahead. All such seeking must be linear for both
> UTF-16 *and* UTF-8.
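To make that concrete, here's a minimal sketch (names are mine, not from
any library) of advancing by n code points over UTF-8 bytes; each step
has to inspect bytes one at a time, so the seek is linear, and a UTF-16
version would be the same shape, skipping trailing surrogates instead of
continuation bytes:

```haskell
import Data.Word (Word8)
import Data.Bits ((.&.))

-- Advance past n code points in a UTF-8 byte sequence by skipping
-- continuation bytes (10xxxxxx). Every step must examine bytes, so
-- seeking n characters is O(bytes scanned), not O(1).
seekChars :: Int -> [Word8] -> [Word8]
seekChars 0 bs     = bs
seekChars _ []     = []
seekChars n (_:bs) = seekChars (n - 1) (dropWhile isCont bs)
  where isCont w = w .&. 0xC0 == 0x80
```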
>> Speaking as someone who has done a lot of Unicode implementation, I
>> would say UTF-16 represents the best time/space tradeoff for an
>> internal representation. As I mentioned, it's what's used in Windows,
>> Mac OS X, ICU, and Java.
I guess the reason I'm being something of a pain in the ass here is that
I want to draw on your Unicode implementation expertise to find out what
these time/space tradeoffs actually are.
Are there any algorithmic asymptotic complexity differences, or are
these all constant factors? The constant factors depend on projected
workload. And are these actually tradeoffs, except between UTF-32
(which uses native word sizes on 32-bit platforms) and the other two?
Smaller space means smaller cache footprint, which can dominate.
Simplicity of algorithms is also a concern. Validating a byte sequence
as UTF-8 is harder than validating a sequence of 16-bit values as UTF-16.
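For comparison, here's a rough sketch (my own code, not from any
library) of what UTF-16 validation amounts to: the only structural
constraint on a sequence of 16-bit code units is that surrogates pair
up correctly, whereas UTF-8 additionally has to reject overlong forms,
out-of-range values, and stray continuation bytes:

```haskell
import Data.Word (Word16)

-- Well-formed UTF-16: every high surrogate (0xD800-0xDBFF) is followed
-- by a low surrogate (0xDC00-0xDFFF), and no low surrogate stands alone.
validUTF16 :: [Word16] -> Bool
validUTF16 [] = True
validUTF16 (u:us)
  | u >= 0xD800 && u <= 0xDBFF =              -- high surrogate
      case us of
        (l:rest) | l >= 0xDC00 && l <= 0xDFFF -> validUTF16 rest
        _                                     -> False
  | u >= 0xDC00 && u <= 0xDFFF = False        -- unpaired low surrogate
  | otherwise = validUTF16 us                 -- ordinary BMP code unit
```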
(I'd also like to see a reference to the Mac OS X encoding. I know that
the filesystem interface is UTF-8 (decomposed a certain way). Is it
just that UTF-16 is a common application choice, or is there some
common framework or library that uses that?)