[Haskell-cafe] Re: String vs ByteString
Ketil Malde
ketil at malde.org
Tue Aug 17 03:08:27 EDT 2010
Benedikt Huber <benjovi at gmx.net> writes:
> Despite all this, I think the performance of the text
> package is very promising, and hope it will improve further!
I agree, Data.Text is great. Unfortunately, its internal use of UTF-16
makes it inefficient for many purposes.
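To make the size argument concrete, here is a rough sketch that compares the encoded size of an ASCII-only record in UTF-8 and UTF-16. The sample record and the helper names (sampleRecord, utf8Size, utf16Size) are made up for illustration, and the code measures the encoded output of Data.Text.Encoding rather than Data.Text's internal buffer, so treat it as an approximation of the overhead, not a benchmark.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical sample: one line of ASCII-only tabular data.
sampleRecord :: T.Text
sampleRecord = T.pack "chr1\t11873\t12227\texon\t+"

-- Number of bytes needed to store the text as UTF-8.
utf8Size :: T.Text -> Int
utf8Size = B.length . TE.encodeUtf8

-- Number of bytes needed to store the text as UTF-16 (little endian, no BOM).
utf16Size :: T.Text -> Int
utf16Size = B.length . TE.encodeUtf16LE
```

For pure ASCII input every character costs one byte in UTF-8 and two in UTF-16, so utf16Size is exactly twice utf8Size for such records.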
A large fraction, probably most, of textual data isn't natural-language
text but data formatted in textual form, and these formats are
typically restricted to ASCII (except for a few text fields).
For instance, a typical project for me might be 10-100GB of data, mostly
in various text formats, "real" text only making up a few percent of
this. The combined (all languages) Wikipedia is 2G words, probably less
than 20GB.
Being agnostic about string encoding - viz. treating it as bytes - works
okay, but it would be nice to allow Unicode in the bits that actually
are text, like string fields and labels and such.
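A minimal sketch of that mixed approach, assuming a hypothetical two-column, tab-separated format: the record is split as raw bytes, and only the label field is decoded as UTF-8. The Record type and parseRecord are invented for this example; BC.split and decodeUtf8' are the existing bytestring and text functions.

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical record: the id stays as raw bytes, the label becomes Text.
data Record = Record
  { recId    :: BC.ByteString
  , recLabel :: T.Text
  }

-- Split on the tab byte without decoding anything, then decode only the
-- label. Splitting on bytes is safe because 0x09 never occurs inside a
-- multi-byte UTF-8 sequence. Returns Nothing for a wrong field count or
-- an invalid UTF-8 label.
parseRecord :: BC.ByteString -> Maybe Record
parseRecord line =
  case BC.split '\t' line of
    [i, lbl] -> case TE.decodeUtf8' lbl of
      Right t -> Just (Record i t)
      Left _  -> Nothing
    _ -> Nothing
```

The bulk of the data never pays for decoding; only the fields that really are text do.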
Due to the sizes involved, I think that in order to efficiently process
text-formatted data, UTF-8 is the no-brainer choice for encoding --
certainly in storage, but also for in-memory processing. Unfortunately,
there is no clear Data.Text-like effort here. There's (at least):
  utf8-string        - provides UTF-8 encoded lazy and strict bytestrings as
                       well as some other data types (and a common class) and
                       System.Environment functionality.
  utf8-light         - provides encoding/decoding to/from (strict?) bytestrings
  regex-tdfa-utf8    - regular expressions on UTF-8 encoded lazy bytestrings
  utf8-env           - provides a UTF-8-aware System.Environment
  uhexdump           - hex dumps for UTF-8 (?)
  compact-string     - support for many different string encodings
  compact-string-fix - indicates that the above is unmaintained
From a quick glance, it appears that utf8-string is the most complete
and well maintained of the crowd, but I could be wrong. It'd be nice if
an effort similar to the one Data.Text has seen could be applied to
e.g. utf8-string, to produce a similarly efficient and effective library
and allow the deprecation of the others. IMO, this could in time
replace .Char8 as the default ByteString string representation.
Hackathon, anyone?
-k
--
If I haven't seen further, it is by standing in the footprints of giants