[Haskell-cafe] PROPOSAL: New efficient Unicode string library.
Jonathan Cast
jcast at ou.edu
Wed Sep 26 13:54:16 EDT 2007
On Wed, 2007-09-26 at 18:46 +0100, Duncan Coutts wrote:
> In message <1190825044.9435.1.camel at jcchost> Jonathan Cast <jcast at ou.edu> writes:
> > On Wed, 2007-09-26 at 09:05 +0200, Johan Tibell wrote:
>
> > > If UTF-16 is what's used by everyone else (how about Java? Python?) I
> > > think that's a strong reason to use it. I don't know Unicode well
> > > enough to say otherwise.
> >
> > I disagree. I realize I'm a dissenter in this regard, but my position
> > is: excellent Unix support first, portability second, excellent support
> > for Win32/MacOS a distant third. That seems to be the opposite of every
> > language's position. Unix absolutely needs UTF-8 for backward
> > compatibility.
>
> I think you're talking about different things, internal vs external representations.
>
> Certainly we must support UTF-8 as an external representation. The choice of
> internal representation is independent of that. It could be [Char] or some
> memory-efficient packed format in a standard encoding like UTF-8, UTF-16 or UTF-32. The
> choice depends mostly on ease of implementation and performance. Some formats
> are easier/faster to process but there are also conversion costs so in some use
> cases there is a performance benefit to the internal representation being the
> same as the external representation.
>
> So, the obvious choices of internal representation are UTF-8 and UTF-16. UTF-8
> has the advantage of being the same as a common external representation so
> conversion is cheap (only need to validate rather than copy). UTF-8 is more
> compact for western languages but less compact for eastern languages compared to
> UTF-16. UTF-8 is a more complex encoding in the common cases than UTF-16. In the
> common case UTF-16 is effectively fixed width. According to the ICU implementors
> this has speed advantages (probably due to branch prediction and smaller code size).
>
> One solution is to do both and benchmark them.
OK, right.
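
For a rough sense of the size trade-off described above, here is a minimal
sketch that counts how many bytes a String would occupy in each encoding. The
function names (utf8Bytes, utf16Bytes, encodedSizes) are made up for
illustration and do not come from any existing library. It also assumes the
input contains no lone surrogate code points, which neither encoding can
represent. It says nothing about processing speed; that still needs the
benchmark.

```haskell
import Data.Char (ord)

-- Bytes needed to encode one code point in UTF-8.
-- Assumption: the Char is not a lone surrogate (U+D800..U+DFFF).
utf8Bytes :: Char -> Int
utf8Bytes c
  | n < 0x80    = 1   -- ASCII
  | n < 0x800   = 2   -- e.g. Latin-1 supplement, Greek, Cyrillic
  | n < 0x10000 = 3   -- rest of the Basic Multilingual Plane, incl. most CJK
  | otherwise   = 4   -- supplementary planes
  where
    n = ord c

-- Bytes needed to encode one code point in UTF-16.
-- Code points in the BMP take one 16-bit unit; the rest take a surrogate pair.
utf16Bytes :: Char -> Int
utf16Bytes c
  | ord c < 0x10000 = 2
  | otherwise       = 4

-- Total encoded size of a string as (UTF-8 bytes, UTF-16 bytes).
encodedSizes :: String -> (Int, Int)
encodedSizes s = (sum (map utf8Bytes s), sum (map utf16Bytes s))
```

With these definitions, encodedSizes "hello" gives (5,10), while a string of
three CJK characters such as "\26085\26412\35486" gives (9,6). That is the
"western vs eastern" difference in compactness in miniature.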
jcc