[Haskell-cafe] Re: PROPOSAL: New efficient Unicode string library.
dgoldsmith at mac.com
Mon Oct 1 22:50:39 EDT 2007
Sorry for the long delay, work has been really busy...
On Sep 27, 2007, at 12:25 PM, Aaron Denney wrote:
> On 2007-09-27, Aaron Denney <wnoise at ofb.net> wrote:
>>> Well, not so much. As Duncan mentioned, it's a matter of what the
>>> common case is. UTF-16 is effectively fixed-width for the majority of
>>> text in the majority of languages. Combining sequences and surrogate
>>> pairs are relatively infrequent.
>> Infrequent, but they exist, which means you can't simply jump 2*x bytes
>> ahead to move x characters ahead. All such seeking must be linear for
>> both UTF-16 *and* UTF-8.
>>> Speaking as someone who has done a lot of Unicode implementation, I
>>> would say UTF-16 represents the best time/space tradeoff for an
>>> internal representation. As I mentioned, it's what's used in
>>> Mac OS X, ICU, and Java.
> I guess why I'm being something of a pain-in-the-ass here, is that
> I want to use your Unicode implementation expertise to know what
> these time/space tradeoffs are.
> Are there any algorithmic asymptotic complexity differences, or are
> these all constant factors? The constant factors depend on projected
> workload. And are these actually tradeoffs, except between UTF-32
> (which uses native wordsizes on 32-bit platforms) and the other two?
> Smaller space means smaller cache footprint, which can dominate.
Yes, cache footprint is one reason to use UTF-16 rather than UTF-32.
Having no surrogate pairs (as with UTF-32) also doesn't save you anything,
because you still need to handle multi-code-point sequences anyway, such as
combining marks and grapheme clusters.
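
For illustration, here is a small sketch of my own (not from the original
thread) of why a fixed-width code-point encoding still leaves you with
variable-length units: it lumps a base character together with its trailing
combining marks, which is only a rough approximation of the real grapheme
cluster rules.

import Data.Char (GeneralCategory (..), generalCategory)

-- Rough sketch: a "cluster" is a base character followed by any combining
-- marks.  Even with fixed-width code points (UTF-32), user-visible
-- characters are still sequences, so traversal stays linear.
clusters :: String -> [String]
clusters []       = []
clusters (c : cs) = (c : marks) : clusters rest
  where
    (marks, rest) = span isMark cs
    isMark x = generalCategory x `elem`
                 [NonSpacingMark, SpacingCombiningMark, EnclosingMark]

For example, clusters "e\x0301x" keeps the combining acute accent attached
to the "e" rather than treating it as a separate character.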
The best reference for all of this is:
Which data type is best depends on what the purpose is. If the data
will primarily be ASCII with occasional non-ASCII characters, UTF-8
may be best. If the data is general Unicode text, UTF-16 is best. I
would think a Unicode string type would be intended for processing
natural language text, not just ASCII data.
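
As a rough illustration of that tradeoff (a sketch of my own, not from the
thread), the per-code-point sizes work out like this:

import Data.Char (ord)

-- Bytes needed to encode a single code point in each form.  ASCII favours
-- UTF-8 (1 byte vs. 2), while most other BMP text (e.g. CJK) costs 3 bytes
-- in UTF-8 but only 2 in UTF-16.
utf8Bytes :: Char -> Int
utf8Bytes c
  | n <= 0x7F   = 1
  | n <= 0x7FF  = 2
  | n <= 0xFFFF = 3
  | otherwise   = 4
  where n = ord c

utf16Bytes :: Char -> Int
utf16Bytes c
  | ord c <= 0xFFFF = 2  -- one 16-bit code unit
  | otherwise       = 4  -- surrogate pair

Summing these over a string (sum (map utf8Bytes s) versus
sum (map utf16Bytes s)) shows which encoding is smaller for a given
workload.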
> Simplicity of algorithms is also a concern. Validating a byte sequence
> as UTF-8 is harder than validating a sequence of 16-bit values as UTF-16.
> (I'd also like to see a reference to the Mac OS X encoding. I know
> the filesystem interface is UTF-8 (decomposed a certain way). Is it
> just that UTF-16 is a common application choice, or is there some
> common framework or library that uses that?)
UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon,
and is what appears in the APIs for all of them. UTF-16 is also what's
stored in the volume catalog on Mac disks. UTF-8 is only used in BSD
APIs for backward compatibility. It's also used in plain text files
(or XML or HTML), again for compatibility.
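
On the validation point quoted above, here is a minimal sketch of my own
(not from the original post) of checking well-formed UTF-16: the only
structural rule is correct surrogate pairing, whereas UTF-8 validation also
has to reject bad continuation bytes, overlong forms, and out-of-range
sequences.

import Data.Word (Word16)

-- A high surrogate (0xD800-0xDBFF) must be followed by a low surrogate
-- (0xDC00-0xDFFF), and a low surrogate must never stand alone.
validUtf16 :: [Word16] -> Bool
validUtf16 [] = True
validUtf16 (u : us)
  | isHigh u  = case us of
      (v : rest) | isLow v -> validUtf16 rest
      _                    -> False
  | isLow u   = False
  | otherwise = validUtf16 us
  where
    isHigh w = w >= 0xD800 && w <= 0xDBFF
    isLow  w = w >= 0xDC00 && w <= 0xDFFF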