[Haskell-cafe] Re: String vs ByteString

Bryan O'Sullivan bos at serpentine.com
Sun Aug 15 02:34:58 EDT 2010


On Sat, Aug 14, 2010 at 10:46 PM, Michael Snoyman <michael at snoyman.com> wrote:

> When I'm writing a web app, my code is sitting on a Linux system where the
> default encoding is UTF-8, communicating with a database speaking UTF-8,
> receiving request bodies in UTF-8 and sending response bodies in UTF-8. So
> converting all of that data to UTF-16, just to be converted right back to
> UTF-8, does seem strange for that purpose.

Bear in mind that much of the data you're working with can't be readily
trusted. UTF-8 coming from the filesystem, the network, and often the
database may not be valid. The cost of validating it isn't all that
different from the cost of converting it to UTF-16.
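As a concrete illustration (this is just a sketch, assuming the decodeUtf8'
function from Data.Text.Encoding is available in your version of the text
package; the surrounding wiring is hypothetical):

  import qualified Data.ByteString as B
  import qualified Data.Text as T
  import Data.Text.Encoding (decodeUtf8')

  -- Decode untrusted bytes, rejecting invalid UTF-8 instead of
  -- silently passing it through.
  readUntrusted :: B.ByteString -> Either String T.Text
  readUntrusted bs =
    case decodeUtf8' bs of
      Left err -> Left ("invalid UTF-8: " ++ show err)
      Right t  -> Right t

The validation work done there is essentially the same scan you would need
to perform anyway before trusting the bytes.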

And of course the internals of Data.Text are all fusion-based, so much of
the time you're not going to be allocating UTF-16 arrays at all, but instead
creating a pipeline of characters that are manipulated in a tight loop. This
eliminates a lot of the additional copying that bytestring has to do, for
instance.
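To sketch the kind of pipeline meant here (the functions below are ordinary
Data.Text operations; whether a particular composition actually fuses depends
on the library's rewrite rules, so treat this as illustrative):

  import qualified Data.Text as T

  -- Several passes composed into one: when the fusion rules fire, this
  -- runs as a single loop over the characters rather than materialising
  -- an intermediate UTF-16 array for each step.
  countNonSpaceUpper :: T.Text -> Int
  countNonSpaceUpper = T.length . T.toUpper . T.filter (/= ' ')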

To give you an idea of how competitive Data.Text can be with C code, this is
the system's wc command counting UTF-8 characters in a modestly large file:

$ time wc -m huge.txt
32443330
real 0.728s


This is Data.Text performing the same task:

$ time ./FileRead text huge.txt
32443330
real 0.697s
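
The FileRead source isn't included in this message; a minimal
character-counting program in the same spirit might look like this (the
argument handling simply mirrors the "./FileRead text huge.txt" invocation
above, and is an assumption on my part):

  import System.Environment (getArgs)
  import qualified Data.Text.Lazy as TL
  import qualified Data.Text.Lazy.IO as TLIO

  main :: IO ()
  main = do
    [_mode, path] <- getArgs        -- e.g. "text huge.txt"
    contents <- TLIO.readFile path
    print (TL.length contents)      -- counts Chars, like `wc -m`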