[Haskell-cafe] Re: String vs ByteString

Gábor Lehel illissius at gmail.com
Tue Aug 17 10:50:48 EDT 2010


(Actually, this seems more like a job for a type class.)
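
To make that a little more concrete, here is a minimal sketch of what such a class could look like, using only base. Everything named here (Utf8, Utf16, Encoded, pack, unpack, byteSize, charCount, convert) is hypothetical and is not the API of Data.Text or any other existing library. The decoders assume well-formed input and do no validation.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (chr, ord)
import Data.Word (Word16, Word8)

-- Hypothetical representations: a plain list of code units per encoding.
newtype Utf8  = Utf8  [Word8]  deriving (Eq, Show)
newtype Utf16 = Utf16 [Word16] deriving (Eq, Show)

-- Hypothetical class: the same API for every representation.
class Encoded t where
  pack     :: String -> t   -- encode a String
  unpack   :: t -> String   -- decode (assumes well-formed input)
  byteSize :: t -> Int      -- storage used by the code units, in bytes

instance Encoded Utf8 where
  pack               = Utf8 . concatMap encodeChar8
  unpack (Utf8 bs)   = decode8 bs
  byteSize (Utf8 bs) = length bs

instance Encoded Utf16 where
  pack                = Utf16 . concatMap encodeChar16
  unpack (Utf16 ws)   = decode16 ws
  byteSize (Utf16 ws) = 2 * length ws

-- Written once against the class, usable with either representation.
charCount :: Encoded t => t -> Int
charCount = length . unpack

-- The only place a conversion happens, and it has to be asked for.
convert :: (Encoded a, Encoded b) => a -> b
convert = pack . unpack

-- UTF-8: one to four bytes per code point.
encodeChar8 :: Char -> [Word8]
encodeChar8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. top 6, cont 0]
  | n < 0x10000 = [0xE0 .|. top 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. top 18, cont 12, cont 6, cont 0]
  where
    n      = ord c
    top s  = fromIntegral (n `shiftR` s)
    cont s = 0x80 .|. fromIntegral ((n `shiftR` s) .&. 0x3F)

decode8 :: [Word8] -> String
decode8 [] = []
decode8 (b:bs)
  | b < 0x80  = chr (fromIntegral b) : decode8 bs
  | b < 0xE0  = go 1 (b .&. 0x1F)
  | b < 0xF0  = go 2 (b .&. 0x0F)
  | otherwise = go 3 (b .&. 0x07)
  where
    -- Fold k continuation bytes into the bits taken from the lead byte.
    go k lead =
      let (cs, rest) = splitAt k bs
          step acc x = (acc `shiftL` 6) .|. fromIntegral (x .&. 0x3F)
      in chr (foldl step (fromIntegral lead) cs) : decode8 rest

-- UTF-16: one code unit, or a surrogate pair above U+FFFF.
encodeChar16 :: Char -> [Word16]
encodeChar16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [ 0xD800 + fromIntegral (m `shiftR` 10)
                  , 0xDC00 + fromIntegral (m .&. 0x3FF) ]
  where
    n = ord c
    m = n - 0x10000

decode16 :: [Word16] -> String
decode16 (h:l:ws)
  | h >= 0xD800 && h < 0xDC00 =
      chr (0x10000 + (fromIntegral (h - 0xD800) `shiftL` 10)
                   + fromIntegral (l - 0xDC00)) : decode16 ws
decode16 (w:ws) = chr (fromIntegral w) : decode16 ws
decode16 []     = []
```

With this shape, code like charCount never cares which encoding it was handed, and a conversion only happens where convert is written out, which would at least make the accidental-conversion problem visible in the source.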

2010/8/17 Gábor Lehel <illissius at gmail.com>:
> Someone mentioned earlier that IMHO all of this messing around with
> encodings and conversions should be handled transparently, and I guess
> you could do something like have the internal representation be along
> the lines of Either UTF8 UTF16 (or perhaps even more encodings), and
> then implement every function in the API equivalently for each
> representation (with only the performance characteristics differing),
> with input/output functions being specialized for each encoding, and
> then only do a conversion when necessary or explicitly requested. But
> I assume that would have other problems (like the implicit conversions
> causing hard-to-track-down performance bugs when they're triggered
> unintentionally).
>
> On Tue, Aug 17, 2010 at 3:21 PM, Daniel Peebles <pumpkingod at gmail.com> wrote:
>> Sounds to me like we need a lazy Data.Text variation that allows UTF-8 and
>> UTF-16 "segments" in its list of strict text elements :) Then big chunks of
>> western text will be encoded efficiently, and same with CJK! Not sure what
>> to do about strict Data.Text though :)
>>
>> On Tue, Aug 17, 2010 at 1:40 PM, Ketil Malde <ketil at malde.org> wrote:
>>>
>>> Michael Snoyman <michael at snoyman.com> writes:
>>>
>>> > As far as space usage, you are correct that CJK data will take up more
>>> > memory in UTF-8 than UTF-16.
>>>
>>> With the danger of sounding ... alphabetist? as well as belaboring a
>>> point I agree is irrelevant (the storage format):
>>>
>>> I'd point out that it seems at least as unfair to optimize for CJK at
>>> the cost of Western languages.  UTF-16 uses two bytes for (most) CJK
>>> ideograms, and (all, I think) characters in Western and other phonetic
>>> scripts.  UTF-8 uses one to two bytes for a lot of Western alphabets,
>>> but three for CJK ideograms.
>>>
>>> Now, CJK has about 20K ideograms, which is almost 15 bits per ideogram,
>>> while an ASCII letter is about six bits.  Thus, the information density
>>> of CJK and ASCII is about equal for UTF-8, 5/8 vs 6/8 - compared to
>>> 15/16 vs 6/16 for UTF-16.  In other words a given document translated
>>> between Chinese and English should occupy roughly the same space in
>>> UTF-8, but be 2.5 times longer in English for UTF-16.
>>>
>>> -k
>>> --
>>> If I haven't seen further, it is by standing in the footprints of giants
>>> _______________________________________________
>>> Haskell-Cafe mailing list
>>> Haskell-Cafe at haskell.org
>>> http://www.haskell.org/mailman/listinfo/haskell-cafe
>>
>
>
>
> --
> Work is punishment for failing to procrastinate effectively.
>
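
As a rough check of the density arithmetic in Ketil's message quoted above, here is a small sketch that redoes the calculation. It takes his rounded figures (15 bits per ideogram, 6 bits per ASCII letter) as given; the function names are made up for illustration.

```haskell
-- Fraction of the stored bits that carry information, given the
-- information content of one symbol (in bits) and the bytes used to store it.
density :: Double -> Int -> Double
density infoBits storedBytes = infoBits / fromIntegral (8 * storedBytes)

-- Figures from the quoted message (rounded there, assumed here).
cjkBits, asciiBits :: Double
cjkBits   = 15  -- "almost 15 bits" for about 20K ideograms
asciiBits = 6   -- "about six bits" per ASCII letter

-- Unrounded bits per ideogram if all 20K were equally likely (about 14.29).
idealCjkBits :: Double
idealCjkBits = logBase 2 20000

-- Size of the English text relative to the Chinese one for the same
-- information, given the bytes per symbol in an encoding.
englishOverChinese :: Int -> Int -> Double
englishOverChinese cjkBytes asciiBytes =
  density cjkBits cjkBytes / density asciiBits asciiBytes

utf8Ratio, utf16Ratio :: Double
utf8Ratio  = englishOverChinese 3 1  -- (15/24) / (6/8)  = 0.8333...
utf16Ratio = englishOverChinese 2 2  -- (15/16) / (6/16) = 2.5
```

So under these assumptions the English text comes out about 0.83 times the size of the Chinese one in UTF-8 and 2.5 times the size in UTF-16, which matches the figures in the message.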



-- 
Work is punishment for failing to procrastinate effectively.

