[Haskell-cafe] Re: String vs ByteString

Tue Aug 17 07:22:26 EDT 2010

On Tue, Aug 17, 2010 at 13:00, Michael Snoyman <michael at snoyman.com> wrote:

>
>
> On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale <gale at sefer.org> wrote:
>
>> Ketil Malde wrote:
>> > I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
>> > 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
>> > RAM, UTF-16 will be slower than UTF-8...
>>
>> I don't think the genome is typical text. And
>> I doubt that is true if that text is in a CJK language.
>>
>
> As far as space usage, you are correct that CJK data will take up more
> memory in UTF-8 than UTF-16. The question still remains whether the overall
> document size will be larger: I'd be interested in taking a random sampling
> of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I
> think simply talking about this in the vacuum of data is pointless. If
> anyone can recommend a CJK website which would be considered representative
> (or a few), I'll do the test myself.
>
>
Regardless of the outcome of that investigation (which in itself is
interesting), I have to agree with Yitzchak that the human genome (or any
other ASCII-based data that is not necessarily a representation of written
human language) is not a good fit for the Text package.

A package like this should IMHO be good at handling human languages, as many
of them as possible, and support the common operations as efficiently as
possible: sorting, upper/lowercase conversion (where those exist), finding
word boundaries, whatever.

Parsing some kind of file containing the human genome and the like would, I
think, be much better served by a package focusing on handling large
streams of bytes. No encodings to worry about, no parsing of the stream to
determine code points, no calculations to determine string lengths. If you
need to convert things to upper/lower case or do sorting you can just fall
back on simple ASCII processing, with no need to depend on a package
dedicated to human text processing.
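
To make the "simple ASCII processing" point concrete, here is a minimal
sketch using strict ByteString from the bytestring package. The function
names (toUpperAscii, countBase) are made up for illustration; they are not
part of any library.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import Data.Word (Word8)

-- Uppercase ASCII letters only; every other byte is passed through.
-- No decoding step is involved: one byte in, one byte out.
toUpperAscii :: B.ByteString -> B.ByteString
toUpperAscii = B.map upper
  where
    upper :: Word8 -> Word8
    upper w
      | w >= 0x61 && w <= 0x7A = w - 0x20  -- 'a'..'z' -> 'A'..'Z'
      | otherwise              = w

-- Count occurrences of one base (a single ASCII character) in a sequence.
countBase :: Char -> B.ByteString -> Int
countBase = BC.count
```

For example, toUpperAscii (BC.pack "acgtn") gives "ACGTN", and
countBase 'A' (BC.pack "GATTACA") gives 3. The length of the sequence is
simply B.length, with no code points to count.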

I do think that in-memory processing of Unicode is better served by UTF-16
than UTF-8 because, except in very rare circumstances (characters outside
the Basic Multilingual Plane, which need a surrogate pair), you can just
treat the text as an array of Char. You can't do that for UTF-8, so the
efficiency of the algorithms would suffer.
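
The size side of this trade-off is easy to check for any given text. The
following is a rough sketch, assuming the text and bytestring packages; the
helper names (utf8Bytes, utf16Bytes, utf16Units) are hypothetical.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Size in bytes of the UTF-8 encoding of a Text value.
utf8Bytes :: T.Text -> Int
utf8Bytes = B.length . TE.encodeUtf8

-- Size in bytes of the UTF-16 encoding (little endian, no BOM).
utf16Bytes :: T.Text -> Int
utf16Bytes = B.length . TE.encodeUtf16LE

-- Number of 16-bit code units in the UTF-16 encoding.
utf16Units :: T.Text -> Int
utf16Units t = utf16Bytes t `div` 2
```

For the three CJK characters T.pack "\x65E5\x672C\x8A9E" this gives 9 bytes
in UTF-8 and 6 bytes in UTF-16, while T.pack "ACGT" gives 4 and 8. A single
character outside the BMP, such as T.pack "\x1D11E", takes 4 bytes in both
encodings and 2 UTF-16 code units, which is the rare case where the
"array of Char" view breaks down.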

I also think that the memory problem is much easier to work around (for
example by dividing the problem into smaller parts) than sub-optimal string
processing caused by increased complexity.
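
As one possible illustration of "dividing the problem into smaller parts",
a lazy ByteString can be consumed chunk by chunk, so only a small part of
the input needs to be in memory at once. This is only a sketch, and
countInChunks is a hypothetical name.

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.ByteString.Lazy as BL

-- Count occurrences of a character by folding over the strict chunks of a
-- lazy ByteString. Each chunk can be discarded once it has been counted.
countInChunks :: Char -> BL.ByteString -> Int
countInChunks c = BL.foldlChunks (\acc chunk -> acc + BC.count c chunk) 0
```

Applied to the result of BL.readFile, this reads the file incrementally
rather than loading all of it first. On a small input such as
BL.fromChunks [BC.pack "GATT", BC.pack "ACA"] it counts 3 occurrences
of 'A'.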

-Tako