String != [Char]

Sat Mar 24 19:50:10 CET 2012

Hi all,

On Sat, Mar 24, 2012 at 12:39 AM, Heinrich Apfelmus
<apfelmus at quantentunnel.de> wrote:
> Which brings me to the fundamental question behind this proposal: Why do we
> need Text at all? What are its virtues and how do they compare? What is the
> trade-off? (I'm not familiar enough with the Text library to answer these.)
>
> To put it very pointedly: is a %20 performance increase on the current
> generation of computers worth the cost in terms of ease-of-use, when the
> performance can equally be gained by buying a faster computer or more RAM?
> I'm not sure whether I even agree with this statement, but this is the
> trade-off we are deciding on.

Correctness
===========

Using list-based operations on Strings is almost always wrong as
soon as you move away from English text. You almost always have to
deal with Unicode strings as blobs, considering several code points at
once. For example,

    upcase :: String -> String
    upcase = map toUpper

is terse, beautiful, and wrong, as several languages map a single
lowercase character to two uppercase characters (as I'm sure you're
aware).
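
To make that concrete, here is a minimal sketch, assuming the text
package is installed. The names upcaseText and word are made up for
this illustration. It compares the per-Char version with Data.Text's
toUpper on a German word containing U+00DF (sharp s), which has no
single-character uppercase form:

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

-- The list-based version: exactly one Char out for every Char in.
upcase :: String -> String
upcase = map toUpper

-- Hypothetical wrapper around Data.Text's toUpper, which converts the
-- string as a whole, so the result may be longer than the input.
upcaseText :: T.Text -> T.Text
upcaseText = T.toUpper

main :: IO ()
main = do
  -- "\223" is U+00DF, the German sharp s.
  let word = "stra\223e"
  -- The sharp s passes through unchanged, so the length stays the same.
  print (length (upcase word))
  -- Text expands the sharp s to "SS", so the result is one longer.
  print (T.length (upcaseText (T.pack word)))
  print (upcaseText (T.pack word) == T.pack "STRASSE")
```

The point is only that a function of type Char -> Char cannot express
a one-to-many mapping, whatever it does internally.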

Perhaps this is OK to ignore when teaching students Haskell, but it
really hurts those who want to use Haskell as an engineering language.

Performance
===========

Depending on the benchmark, the difference can be much bigger than
20%. For example, here's a comparison of decoding UTF-8 byte data into
a String vs a Text value:

    benchmarking Pure/decode/Text
    mean: 50.22202 us, lb 50.08306 us, ub 50.37669 us, ci 0.950
    std dev: 751.1139 ns, lb 666.2243 ns, ub 865.8246 ns, ci 0.950
    variance introduced by outliers: 7.553%
    variance is slightly inflated by outliers

    benchmarking Pure/decode/String
    mean: 188.0507 us, lb 187.4970 us, ub 188.6955 us, ci 0.950
    std dev: 3.053076 us, lb 2.647318 us, ub 3.606262 us, ci 0.950
    variance introduced by outliers: 9.407%
    variance is slightly inflated by outliers

A difference of almost 4x (about 3.7x going by the two means).

Many of the Text vs String benchmarks measure the performance of
operations ignoring both decoding and encoding, while any real
application would have to do both.
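
As a rough illustration of what "both" means, here is a sketch of a
full round trip, assuming the text and bytestring packages. The
function name shout is invented for this example, and this is not the
code behind the benchmark above:

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Decode UTF-8 bytes, process them as Text, and encode the result
-- again. Note that decodeUtf8 throws an exception on invalid input;
-- decodeUtf8' returns an Either instead.
shout :: B.ByteString -> B.ByteString
shout = TE.encodeUtf8 . T.toUpper . TE.decodeUtf8

main :: IO ()
main = do
  -- The UTF-8 encoding of "stra\223e": the sharp s is the two bytes C3 9F.
  let input = B.pack [0x73, 0x74, 0x72, 0x61, 0xC3, 0x9F, 0x65]
  print (B.unpack (shout input))
```

A benchmark that times only the middle step leaves out the two outer
ones, which every program that reads or writes bytes has to pay for.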

On top of that, String is more or less as optimized as it can be;
its benchmarks are almost completely memory bound. Text, on the other
hand, still has potential for (large) improvements, as GHC doesn't
generate optimal code for tight loops over arrays. For example, we know
that GHC generates bad code for decodeUtf8 as used by Text's stream
fusion, hurting any code that uses fusion.

Furthermore, the memory overhead of Text is smaller, which means that
applications that hold on to many string values will use less heap and
thus experience shorter "freezes" due to major garbage collections,
whose cost is linear in the heap size.

Cheers,
Johan