[Haskell-cafe] Re: String vs ByteString
wren ng thornton
wren at freegeek.org
Tue Aug 17 21:25:09 EDT 2010
Bulat Ziganshin wrote:
> Johan wrote:
>> So it's not clear to me that using UTF-16 makes the program
>> noticeably slower or use more memory on a real program.
>
> it's clear misunderstanding. of course, not every program holds much
> text data in memory. but some does, and here you will double memory
> usage

I write programs that hold onto quite a good deal of natural language
text; a few million words at least. Getting efficient Unicode for that
is a high priority. However, all of that text is in Japanese, Chinese,
Arabic, Hindi, Urdu,... That's the reason I want Unicode. I'm pretty
sure UTF-16 isn't going to be causing any special problems here.

For NLP work, any language with a vaguely ASCII format isn't a problem.
We've been shoving English and western European languages into a subset
of ASCII for years (heck, we don't even allow real parentheses!).

For the mostly English files on my hard drive, UTF-8 is a clear win. But
when it comes to programming, I'm not so sure. I'd like to see some good
benchmarks and a clear explanation of where the costs are. Relying on
intuitions is notoriously bad for these kinds of encoding issues.
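
The raw size side of the question can at least be made concrete. The
following is a minimal sketch, not a benchmark: it counts the bytes each
encoding form needs per code point, using only base. The names utf8Len,
utf16Len and encodedSizes are hypothetical helpers made up for this
illustration, and the sketch says nothing about how any particular
library stores text or about decoding and traversal costs.

```haskell
import Data.Char (ord)

-- Bytes needed to encode one code point in UTF-8.
utf8Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1   -- ASCII
  | n < 0x800   = 2   -- e.g. Latin supplements, Greek, Cyrillic, Arabic
  | n < 0x10000 = 3   -- rest of the BMP, e.g. Devanagari, CJK
  | otherwise   = 4   -- supplementary planes
  where
    n = ord c

-- Bytes needed to encode one code point in UTF-16:
-- one 16-bit unit inside the BMP, a surrogate pair outside it.
utf16Len :: Char -> Int
utf16Len c
  | ord c < 0x10000 = 2
  | otherwise       = 4

-- (UTF-8 bytes, UTF-16 bytes) for a whole string, ignoring any BOM.
encodedSizes :: String -> (Int, Int)
encodedSizes s = (sum (map utf8Len s), sum (map utf16Len s))
```

In GHCi, encodedSizes "hello" gives (5,10), while encodedSizes "日本語"
gives (9,6). Arabic letters take two bytes in either form. Real corpora
mix in ASCII spaces, digits, punctuation and markup, so the ratio for
actual data has to be measured rather than guessed.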
--
Live well,
~wren