[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.

Thu Sep 27 03:45:41 EDT 2007

On Thu, Sep 27, 2007 at 06:39:24AM +0000, Aaron Denney wrote:
> On 2007-09-27, Deborah Goldsmith <dgoldsmith at mac.com> wrote:
> > Well, not so much. As Duncan mentioned, it's a matter of what the most
> > common case is. UTF-16 is effectively fixed-width for the majority of
> > text in the majority of languages. Combining sequences and surrogate
> > pairs are relatively infrequent.
> 
> Infrequent, but they exist, which means you can't seek 2x bytes ahead
> to seek x characters ahead.  All such seeking must be linear for both
> UTF-16 *and* UTF-8.

You could get rapid seeks by ignoring the UTFs and representing strings
as sequences of chunks, where each chunk is uniformly 8-bit, 16-bit or
32-bit as required to cover the characters it contains.  A seek then
skips whole chunks by their lengths and indexes directly into the chunk
that holds the target.  Hardly anyone would need 32-bit chunks (and some
of us would need only the 8-bit ones).
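
To make the chunk idea concrete, here is a rough sketch in Haskell.  It
is only an illustration under stated assumptions, not a proposed API:
every name in it (Chunk, ChunkedString, mkChunk, fromString, index) is
made up for this example, the chunk size is an arbitrary parameter, and
8-bit chunks are assumed to hold code points up to 0xFF directly rather
than UTF-8 bytes.  It indexes by code point, so combining sequences
still span several positions.

```haskell
import Data.Array.Unboxed (UArray, bounds, listArray, (!))
import Data.Char (chr, ord)
import Data.Ix (rangeSize)
import Data.Word (Word16, Word32, Word8)

-- One chunk: every code point in it is stored at the same width.
-- (Hypothetical type, for illustration only.)
data Chunk
  = C8  (UArray Int Word8)   -- all code points <= 0xFF
  | C16 (UArray Int Word16)  -- all code points <= 0xFFFF
  | C32 (UArray Int Word32)  -- anything wider

newtype ChunkedString = ChunkedString [Chunk]

-- Pick the narrowest width that covers every character in the chunk.
mkChunk :: String -> Chunk
mkChunk s
  | top <= 0xFF   = C8  (listArray (0, n - 1) (map fromIntegral codes))
  | top <= 0xFFFF = C16 (listArray (0, n - 1) (map fromIntegral codes))
  | otherwise     = C32 (listArray (0, n - 1) (map fromIntegral codes))
  where
    codes = map ord s
    n     = length codes
    top   = maximum (0 : codes)

-- Split a String into chunks of at most 'size' characters each.
fromString :: Int -> String -> ChunkedString
fromString size = ChunkedString . go
  where
    step  = max 1 size            -- guard against a non-positive size
    go [] = []
    go cs = let (h, t) = splitAt step cs in mkChunk h : go t

chunkLength :: Chunk -> Int
chunkLength (C8  a) = rangeSize (bounds a)
chunkLength (C16 a) = rangeSize (bounds a)
chunkLength (C32 a) = rangeSize (bounds a)

-- Bits per character in a chunk; handy for inspecting the layout.
chunkWidth :: Chunk -> Int
chunkWidth (C8  _) = 8
chunkWidth (C16 _) = 16
chunkWidth (C32 _) = 32

chunkWidths :: ChunkedString -> [Int]
chunkWidths (ChunkedString cs) = map chunkWidth cs

-- Constant-time lookup inside a single chunk (no bounds check here).
chunkAt :: Chunk -> Int -> Char
chunkAt (C8  a) k = chr (fromIntegral (a ! k))
chunkAt (C16 a) k = chr (fromIntegral (a ! k))
chunkAt (C32 a) k = chr (fromIntegral (a ! k))

-- Seek to the i-th code point: skip whole chunks by length, then
-- index directly into the chunk that contains position i.
index :: ChunkedString -> Int -> Maybe Char
index (ChunkedString chunks) i
  | i < 0     = Nothing
  | otherwise = go chunks i
  where
    go [] _ = Nothing
    go (c : cs) k
      | k < len   = Just (chunkAt c k)
      | otherwise = go cs (k - len)
      where
        len = chunkLength c
```

With a plain list of chunks the seek still walks the chunk list, so the
cost is proportional to the number of chunks rather than the number of
characters; keeping the chunks in a tree annotated with lengths would
cut that down further.  The chunk size trades seek cost against how
much text gets widened when a single wide character lands in a chunk.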