[Haskell-cafe] Re: String vs ByteString

Tue Aug 17 03:08:27 EDT 2010

Benedikt Huber <benjovi at gmx.net> writes:

> Despite all this, I think the performance of the text
> package is very promising, and hope it will improve further!

I agree, Data.Text is great.  Unfortunately, its internal use of UTF-16
makes it inefficient for many purposes.
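
To put a rough number on that, here is a minimal sketch that compares
the encoded size of an ASCII-only record in UTF-8 and UTF-16.  The
sample line and the helper names (sampleLine, utf8Size, utf16Size) are
made up for illustration, and it measures encoded output via
Data.Text.Encoding rather than inspecting Data.Text's internals.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical ASCII-only record, standing in for text-formatted data.
sampleLine :: T.Text
sampleLine = T.pack "chr1\t12345\t+\n"

-- Number of bytes the text occupies when encoded as UTF-8.
utf8Size :: T.Text -> Int
utf8Size = B.length . TE.encodeUtf8

-- Number of bytes the same text occupies when encoded as UTF-16
-- (little-endian, no byte order mark).
utf16Size :: T.Text -> Int
utf16Size = B.length . TE.encodeUtf16LE

main :: IO ()
main = do
  putStrLn ("UTF-8 bytes:  " ++ show (utf8Size sampleLine))
  putStrLn ("UTF-16 bytes: " ++ show (utf16Size sampleLine))
```

For pure ASCII input, every character takes one byte in UTF-8 and two
in UTF-16, so the second figure is twice the first.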

A large fraction - probably most - of textual data isn't natural language
text, but data formatted in textual form, and these formats are
typically restricted to ASCII (except for a few text fields).

For instance, a typical project for me might be 10-100GB of data, mostly
in various text formats, with "real" text making up only a few percent of
this.  By comparison, the combined (all languages) Wikipedia is 2G words,
probably less than 20GB.

Being agnostic about string encoding - viz. treating it as bytes - works
okay, but it would be nice to allow Unicode in the bits that actually
are text, like string fields and labels and such.
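
A minimal sketch of that mixed approach, assuming a hypothetical
tab-separated record with an id, a count and a free-text label: the
structure is handled as plain bytes, and only the label is decoded as
UTF-8.  The Record type and parseRecord are illustrative names, not
part of any existing library.

```haskell
import Control.Monad (guard)
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical record: an ASCII id, an ASCII count, a free-text label.
data Record = Record
  { recId    :: B.ByteString  -- kept as raw bytes
  , recCount :: Int
  , recLabel :: T.Text        -- the only field decoded as Unicode
  } deriving (Eq, Show)

-- Split on tabs at the byte level, then decode just the label as UTF-8.
-- Byte-level splitting is safe here because the tab byte (0x09) never
-- occurs inside a multi-byte UTF-8 sequence.
parseRecord :: B.ByteString -> Maybe Record
parseRecord line =
  case BC.split '\t' line of
    [ident, countField, labelField] -> do
      (n, rest) <- BC.readInt countField
      guard (B.null rest)  -- reject trailing junk after the number
      label <- either (const Nothing) Just (TE.decodeUtf8' labelField)
      Just (Record ident n label)
    _ -> Nothing

-- Sample input: ASCII structure followed by a UTF-8 encoded label.
sampleRecord :: B.ByteString
sampleRecord =
  B.concat [BC.pack "g42\t17\t", TE.encodeUtf8 (T.pack "bl\229b\230r")]

main :: IO ()
main = print (parseRecord sampleRecord)
```

A label that is not valid UTF-8 makes parseRecord return Nothing
instead of throwing, since decodeUtf8' reports errors as a Left value.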

Due to the sizes involved, I think that in order to efficiently process
text-formatted data, UTF-8 is the no-brainer choice for encoding --
certainly in storage, but also for in-memory processing. Unfortunately,
there is no clear Data.Text-like effort here.  There's (at least):

    utf8-string - provides utf-8 encoded lazy and strict bytestrings as
                  well as some other data types (and a common class) and
                  System.Environment functionality.

    utf8-light  - provides encoding/decoding to/from (strict?) bytestrings

    regex-tdfa-utf8 - regular expressions on UTF-8 encoded lazy bytestrings

    utf8-env    - provides a UTF-8 aware System.Environment

    uhexdump    - hex dumps for UTF-8 (?)

    compact-string - support for many different string encodings
    compact-string-fix - indicates that the above is unmaintained
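
As a small illustration of working on UTF-8 directly in memory, as
argued above, here is a sketch that counts code points in a strict
ByteString without decoding it.  It assumes the input is already valid
UTF-8, and countCodePoints and isContinuation are illustrative names
rather than functions from any of the packages listed.

```haskell
import Data.Bits ((.&.))
import qualified Data.ByteString as B
import Data.Word (Word8)

-- A UTF-8 continuation byte has the bit pattern 10xxxxxx.
isContinuation :: Word8 -> Bool
isContinuation b = b .&. 0xC0 == 0x80

-- Count code points in (assumed valid) UTF-8 by counting the bytes
-- that start a character, i.e. every byte that is not a continuation.
countCodePoints :: B.ByteString -> Int
countCodePoints = B.foldl' step 0
  where
    step n b
      | isContinuation b = n
      | otherwise        = n + 1

-- The UTF-8 bytes of "blåbær": 8 bytes for 6 code points.
sampleBytes :: B.ByteString
sampleBytes = B.pack [0x62, 0x6C, 0xC3, 0xA5, 0x62, 0xC3, 0xA6, 0x72]

main :: IO ()
main = do
  putStrLn ("bytes:       " ++ show (B.length sampleBytes))
  putStrLn ("code points: " ++ show (countCodePoints sampleBytes))
```

For ASCII-only input the two counts coincide, so the common case costs
nothing extra in space.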

From a quick glance, it appears that utf8-string is the most complete
and well maintained of the crowd, but I could be wrong.  It'd be nice if
an effort similar to the one Data.Text has seen could be applied to
e.g. utf8-string, to produce a similarly efficient and effective library
and allow the deprecation of the others.  IMO, this could in time
replace .Char8 as the default ByteString string representation.
Hackathon, anyone?

-k
-- 
If I haven't seen further, it is by standing in the footprints of giants