String != [Char]

Sat Mar 24 23:54:47 CET 2012

On Sat, Mar 24, 2012 at 3:45 PM, Isaac Dupree
<ml at isaac.cedarswampstudios.org> wrote:
> How is Text for small strings currently (e.g. one English word, if not one
> character)?  Can we reasonably recommend it for that?
> This recent question suggests it's still not great:
> http://stackoverflow.com/questions/9398572/memory-efficient-strings-in-haskell

It's definitely not as good as it could be: the common case costs
2 bytes per code point, plus some fixed overhead per string.
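
To make the "2 bytes per code point" figure concrete, here is a rough
sketch that counts payload bytes for a string under UTF-16 and UTF-8.
The function names are made up for illustration and are not part of
the text package. It only counts encoded payload, ignores the fixed
per-string overhead, and treats every Char as a valid code point.

```haskell
import Data.Char (ord)

-- Hypothetical helper: payload bytes if the string were stored as UTF-16.
-- Code points below U+10000 take one 16-bit unit, the rest take two.
utf16Bytes :: String -> Int
utf16Bytes = sum . map bytes
  where
    bytes c
      | ord c < 0x10000 = 2
      | otherwise       = 4

-- Hypothetical helper: payload bytes if the string were stored as UTF-8.
utf8Bytes :: String -> Int
utf8Bytes = sum . map bytes
  where
    bytes c
      | n < 0x80    = 1
      | n < 0x800   = 2
      | n < 0x10000 = 3
      | otherwise   = 4
      where n = ord c
```

For a short English word such as "word", utf16Bytes gives 8 and
utf8Bytes gives 4, so for ASCII-heavy data the payload doubles before
any fixed overhead is counted.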

The UTF-8 GSoC project last summer was an attempt to see if we could
do better, but unfortunately GHC does a worse job streaming out of a
byte array containing UTF-8 than out of a byte array containing UTF-16
(due to bad branch layout).

This resulted in a mix of performance gains and losses rather than a
clear win. As there are other engineering benefits in favor of UTF-16
(e.g. being able to use ICU efficiently), we opted not to switch the
internal encoding. If we can get GHC to the point where it compiles a
UTF-8 based Text really well, we could reconsider this decision.

There's also a design trade-off in Text that favors better asymptotic
complexity for some operations (e.g. taking substrings) at the cost of
2 extra words of overhead on every string.
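
As a sketch of that trade-off: to my understanding the two extra
words are an offset and a length into a shared buffer, which is what
lets a substring be taken without copying. The Slice type and its
functions below are hypothetical and much simplified (a boxed Array of
Char rather than an unboxed UTF-16 array); they are not the text
package's actual definitions.

```haskell
import Data.Array (Array, listArray, (!))

-- Hypothetical, simplified string-slice layout for illustration only.
data Slice = Slice
  { sliceBuf :: Array Int Char  -- shared, immutable payload
  , sliceOff :: !Int            -- index of this slice's first element
  , sliceLen :: !Int            -- number of elements in this slice
  }

fromString :: String -> Slice
fromString s = Slice (listArray (0, n - 1) s) 0 n
  where n = length s

-- O(1): only the length changes; the payload is shared, not copied.
takeS :: Int -> Slice -> Slice
takeS k (Slice buf off len) = Slice buf off (max 0 (min k len))

-- O(1): only the offset and length change.
dropS :: Int -> Slice -> Slice
dropS k (Slice buf off len) = Slice buf (off + k') (len - k')
  where k' = max 0 (min k len)

-- O(n): reads the elements covered by the slice back out.
toString :: Slice -> String
toString (Slice buf off len) = [buf ! i | i <- [off .. off + len - 1]]
```

With this layout, toString (takeS 3 (dropS 2 (fromString "haskell")))
yields "ske" while reusing the original buffer. The price is that even
a one-character string carries the offset and length fields, which is
where the per-string overhead for small strings comes from.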

-- Johan