[Haskell-cafe] Re: String vs ByteString

Gábor Lehel illissius at gmail.com
Tue Aug 17 09:50:00 EDT 2010


Someone mentioned earlier that IHHO all of this messing around with
encodings and conversions should be handled transparently, and I guess
you could do something like have the internal representation be along
the lines of Either UTF8 UTF16 (or perhaps even more encodings), and
then implement every function in the API equivalently for each
representation (with only the performance characteristics differing),
with input/output functions being specialized for each encoding, and
then only do a conversion when necessary or explicitly requested. But
I assume that would have other problems (like the implicit conversions
causing hard-to-track-down performance bugs when they're triggered
unintentionally).
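
To make that a bit more concrete, here is a rough sketch of the idea, not
a proposal for the real Data.Text API. Everything in it is hypothetical:
the MultiText type and the fromUtf8Bytes, fromUtf16Bytes, lengthMT, toUtf8,
toUtf16 and appendMT names are made up for illustration. I used a dedicated
sum type rather than a literal Either, assumed UTF-16 means little-endian,
and assumed the input bytes are already valid (nothing is checked on the
way in). Only the encode/decode functions from Data.Text.Encoding are real.

```haskell
import           Data.Bits          ((.&.))
import qualified Data.ByteString    as B
import qualified Data.Text.Encoding as TE

-- Hypothetical type: the bytes plus a tag saying how they are encoded.
data MultiText
  = Utf8  B.ByteString   -- assumed to hold valid UTF-8
  | Utf16 B.ByteString   -- assumed to hold valid UTF-16LE

-- Input specialised per encoding: no conversion and no validation.
fromUtf8Bytes, fromUtf16Bytes :: B.ByteString -> MultiText
fromUtf8Bytes  = Utf8
fromUtf16Bytes = Utf16

-- Same result (number of code points) for both representations,
-- computed directly on the bytes without converting.
lengthMT :: MultiText -> Int
lengthMT (Utf8 bs) =
  -- Count every byte that is not a 10xxxxxx continuation byte.
  B.foldl' (\n b -> if b .&. 0xC0 == 0x80 then n else n + 1) 0 bs
lengthMT (Utf16 bs) =
  -- Look at the high byte of each 16-bit unit (odd offsets in LE) and
  -- skip low surrogates, so a surrogate pair counts once.
  length [ () | i <- [1, 3 .. B.length bs - 1]
              , let hi = B.index bs i
              , hi < 0xDC || hi > 0xDF ]

-- Explicitly requested conversions (these throw on invalid input).
toUtf8 :: MultiText -> B.ByteString
toUtf8 (Utf8 bs)  = bs
toUtf8 (Utf16 bs) = TE.encodeUtf8 (TE.decodeUtf16LE bs)

toUtf16 :: MultiText -> B.ByteString
toUtf16 (Utf16 bs) = bs
toUtf16 (Utf8 bs)  = TE.encodeUtf16LE (TE.decodeUtf8 bs)

-- Appending keeps the left operand's encoding. When the encodings
-- differ, the right operand is silently re-encoded: this is the kind
-- of implicit conversion that could hide a performance bug.
appendMT :: MultiText -> MultiText -> MultiText
appendMT (Utf8 a)  b = Utf8  (B.append a (toUtf8 b))
appendMT (Utf16 a) b = Utf16 (B.append a (toUtf16 b))
```

Note that appendMT has the same type whether or not it converts, so
nothing at the call site tells you which case you hit.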

On Tue, Aug 17, 2010 at 3:21 PM, Daniel Peebles <pumpkingod at gmail.com> wrote:
> Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and
> UTF-16 "segments" in its list of strict text elements :) Then big chunks of
> western text will be encoded efficiently, and same with CJK! Not sure what
> to do about strict Data.Text though :)
>
> On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde <ketil at malde.org> wrote:
>>
>> Michael Snoyman <michael at snoyman.com> writes:
>>
>> > As far as space usage, you are correct that CJK data will take up more
>> > memory in UTF-8 than UTF-16.
>>
>> With the danger of sounding ... alphabetist? as well as belaboring a
>> point I agree is irrelevant (the storage format):
>>
>> I'd point out that it seems at least as unfair to optimize for CJK at
>> the cost of Western languages.  UTF-16 uses two bytes for (most) CJK
>> ideograms, and (all, I think) characters in Western and other phonetic
>> scripts.  UTF-8 uses one to two bytes for a lot of Western alphabets,
>> but three for CJK ideograms.
>>
>> Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram,
>> while an ASCII letter is about six bits.  Thus, the information density
>> of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to
>> 15/16 vs 6/16 for UTF-16.  In other words a given document translated
>> between Chinese and English should occupy roughly the same space in
>> UTF-8, but be 2.5 times longer in English for UTF-16.
>>
>> -k
>> --
>> If I haven't seen further, it is by standing in the footprints of giants
>> _______________________________________________
>> Haskell-Cafe mailing list
>> Haskell-Cafe at haskell.org
>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
>



-- 
Work is punishment for failing to procrastinate effectively.

