[Haskell-cafe] Re: String vs ByteString

Sun Aug 15 01:54:32 EDT 2010

On Sat, Aug 14, 2010 at 22:39, Edward Z. Yang <ezyang at mit.edu> wrote:
> Excerpts from John Millikin's message of Sun Aug 15 01:32:51 -0400 2010:
>> Also, despite the name, ByteString and Text are for separate purposes.
>> ByteString is an efficient [Word8], Text is an efficient [Char] -- use
>> ByteString for binary data, and Text for...text. Most mature languages
>> have both types, though the choice of UTF-16 for Text is unusual.
>
> Given that both Python, .NET, Java and Windows use UTF-16 for their Unicode
> text representations, I cannot really agree with "unusual". :-)

Python doesn't use UTF-16; on UNIX systems it uses UCS-4, and on
Windows it uses UCS-2. The difference is important because:

Python (UCS-2 build): len("\U0001dd1e") == 2
Haskell (Data.Text): length (pack "\x0001dd1e") == 1
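
To make the comparison concrete, here is a minimal sketch, assuming
the "text" and "bytestring" packages are installed. The helper names
codePoints and utf16Units are made up for illustration; the library
calls are T.length, T.pack, TE.encodeUtf16LE and B.length.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Number of Unicode code points, which is what Data.Text's length reports.
codePoints :: T.Text -> Int
codePoints = T.length

-- Number of 16-bit code units the same text needs when encoded as UTF-16.
-- encodeUtf16LE emits no byte-order mark, so bytes / 2 is the unit count.
utf16Units :: T.Text -> Int
utf16Units t = B.length (TE.encodeUtf16LE t) `div` 2

main :: IO ()
main = do
  let t = T.pack "\x0001dd1e"
  print (codePoints t)  -- code points
  print (utf16Units t)  -- UTF-16 code units
```

The second count is the one a UCS-2 API reports as the length, since
it treats every 16-bit unit as a character.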

Java, .NET, Windows, JavaScript, and some other languages use UTF-16
because when Unicode support was added to these systems, the astral
characters (code points above U+FFFF) had not been defined yet, and
16 bits were enough for the entire Unicode character set. They
originally used UCS-2, but then moved to UTF-16 to minimize
incompatibilities.
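
For reference, this is roughly how UTF-16 extends UCS-2: code points
up to U+FFFF keep their single 16-bit unit, and anything above is
split into a surrogate pair. A small sketch using only "base"; the
function name toUtf16 is hypothetical, and it assumes the input Char
is not itself a surrogate (U+D800..U+DFFF).

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- Encode one code point as a list of UTF-16 code units.
-- Assumption: the Char is not in the surrogate range U+D800..U+DFFF.
toUtf16 :: Char -> [Word16]
toUtf16 c
  | n < 0x10000 = [fromIntegral n]   -- BMP: same single unit as UCS-2
  | otherwise   = [hi, lo]           -- astral: high and low surrogate
  where
    n  = ord c
    n' = n - 0x10000                 -- 20-bit offset into the astral planes
    hi = fromIntegral (0xD800 + (n' `shiftR` 10))  -- top 10 bits
    lo = fromIntegral (0xDC00 + (n' .&. 0x3FF))    -- bottom 10 bits

main :: IO ()
main = do
  print (toUtf16 'A')
  print (toUtf16 '\x1dd1e')
```

For U+1DD1E the two units are 0xD837 and 0xDD1E, which is why code
that still assumes UCS-2 sees two "characters" there.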

Anything based on UNIX generally uses UTF-8, because Unicode support
was added later after the problems of UCS-2/UTF-16 had been
discovered. C libraries written by UNIX users use UTF-8 almost
exclusively -- this includes most language bindings available on
Hackage.

I don't mean that UTF-16 is itself unusual, but it's a legacy encoding
-- there's no reason to use it in new projects. If "text" had been
started 15 years ago, I could understand, but since it's still in
active development the use of UTF-16 simply adds baggage.