[Haskell-cafe] Motion to unify all the string data types

Sat Nov 10 04:00:04 CET 2012

Hi Andrew,

On Fri, Nov 9, 2012 at 6:15 PM, Andrew Pennebaker <
andrew.pennebaker at gmail.com> wrote:

> Frequently when I'm coding in Haskell, the crux of my problem is
> converting between all the stupid string formats.
>
> You've got String, ByteString, Lazy ByteString, Text, [Word], and on and
> on... I have to constantly look up how to convert between them, and the
> overloaded strings GHC directive doesn't work, and sometimes
> ByteString.unpack doesn't work, because it expects [Word8], not [Char].
> AAAAAAAAAAAAAAAAAAAH!!!
>
> Haskell is a wonderful playground for experimentation. I've started to
> notice that many Hackage libraries are simply instances of typeclasses
> designed a while ago, and their underlying implementations are free to play
> around with various optimizations... But they ideally all expose the same
> interface through typeclasses.
>
> Can we do the same with String? Can we pick a good compromise of lazy vs
> strict, flexible vs fast, and all use the same data structure? My vote is
> for type String = [Char], but I'm willing to switch to another data
> structure, just as long as it's consistently used.
>

tl;dr: Use strict Text and ByteStrings.

We need at least two string types, one for byte strings and one for Unicode
strings, as these are two semantically different concepts. You see that
most modern languages use two types (e.g. str and unicode in Python). For
Unicode strings, String is not a good candidate; it's slow, uses a lot of
memory, doesn't hide its representation [1], and finally, it encourages
people to do the wrong thing from a Unicode perspective [2].

As a community we should primarily use strict ByteStrings and Texts. There
are uses for the lazy variants (i.e. they are sometimes more efficient),
but in general the strict versions should be preferred. Choosing to use
these two types can sometimes be a bit frustrating, as lots of code (e.g.
the base package) uses Strings. But if we don't start using them the pain
will never end. One of the main pain points is that the I/O layer uses
Strings, which is both inconvenient and wrong (e.g. a socket returns bytes,
not Unicode code points, yet the recv function returns a String). We really
need to create a more sane I/O layer.

If you use ByteString and Text, you shouldn't see calls to pack/unpack in
your code (except if you want to interact with legacy code), as the correct
way to go between the two is via the encode and decode functions in the
text package (e.g. encodeUtf8 and decodeUtf8 in Data.Text.Encoding).
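
To make the encode/decode point concrete, here is a minimal sketch. It
assumes UTF-8 as the wire encoding (the message above does not name one),
and the helper names toBytes and fromBytes are made up for illustration.
Only encodeUtf8 and decodeUtf8' come from the text package.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (UnicodeException)

-- Hypothetical helper: Text -> bytes, with the encoding chosen explicitly.
-- UTF-8 is an assumption made for this sketch.
toBytes :: T.Text -> B.ByteString
toBytes = TE.encodeUtf8

-- Hypothetical helper: bytes -> Text. Arbitrary bytes need not be valid
-- UTF-8, so the failure case is kept visible in the type.
fromBytes :: B.ByteString -> Either UnicodeException T.Text
fromBytes = TE.decodeUtf8'

main :: IO ()
main = do
  let t     = "sm\246rg\229s" :: T.Text   -- "smörgås", written with escapes
      bytes = toBytes t
  print (T.length t)      -- 7 code points
  print (B.length bytes)  -- 9 bytes: ö and å take two bytes each in UTF-8
  -- Round trip: decoding what we just encoded gives the original Text back.
  print (either (const False) (== t) (fromBytes bytes))
  -- A lone 0xFF byte is not valid UTF-8, so decoding reports an error.
  case fromBytes (B.singleton 0xFF) of
    Left err -> putStrLn ("decode failed: " ++ show err)
    Right ok -> print ok
```

The difference between the two lengths is the reason the conversion has to
go through an encoding rather than through pack/unpack.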

As for type classes, I don't think we use them enough. Perhaps because
Haskell wasn't developed as an engineering language, some good software
engineering principles (code against an interface, not a concrete
implementation) aren't used in our base libraries. One specific example is
the lack of a sequence abstraction/type class, that all the string, list,
and vector types could implement. Right now all these types try to
implement a compatible interface (i.e. the traditional list interface),
without a language mechanism to express that this is what they do.
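
As a rough illustration of what such an abstraction could look like, here
is a sketch of a sequence type class. The class Sequence, its associated
type Elem, and the function count are all hypothetical: nothing like this
exists in base, and a real design would need many more operations. The
instances only delegate to functions that already exist in Data.Text and
Data.ByteString.

```haskell
{-# LANGUAGE TypeFamilies #-}
module Main where

import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Word (Word8)
import Prelude hiding (length)
import qualified Prelude as P

-- Hypothetical interface: a container with a known element type.
class Sequence s where
  type Elem s
  empty  :: s
  cons   :: Elem s -> s -> s
  uncons :: s -> Maybe (Elem s, s)
  length :: s -> Int

instance Sequence [a] where
  type Elem [a] = a
  empty = []
  cons = (:)
  uncons []       = Nothing
  uncons (x : xs) = Just (x, xs)
  length = P.length

instance Sequence T.Text where
  type Elem T.Text = Char
  empty  = T.empty
  cons   = T.cons
  uncons = T.uncons
  length = T.length

instance Sequence B.ByteString where
  type Elem B.ByteString = Word8
  empty  = B.empty
  cons   = B.cons
  uncons = B.uncons
  length = B.length

-- Written once against the interface, usable with every instance.
count :: Sequence s => (Elem s -> Bool) -> s -> Int
count p = go 0
  where
    go n s = case uncons s of
      Nothing        -> n
      Just (x, rest) -> go (if p x then n + 1 else n) rest

main :: IO ()
main = do
  print (count (== 'a') "banana")               -- String
  print (count (== 'a') (T.pack "banana"))      -- Text
  print (count (== 97) (B.pack [98, 97, 110]))  -- ByteString; 97 is 'a'
```

The T.pack and B.pack calls are there only to build sample values from
literals. The point is that count needs no per-type variants.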

1. If String had been designed as an abstract type, we could simply have
switched its implementation for a more efficient one and we would not have
had to create a new Text type.

2. By having the primary interface of a Unicode data type be a sequence, we
encourage users to work on strings element-wise, which can lead to errors
as Unicode code points don't correspond well to the human concept of a
character (for example, the Swedish ä character can be represented using
either one or two code points). A sequence view is sometimes useful, if
you're implementing more high-level transformations, but often you should
use functions that operate on the whole string, such as toUpper :: Text ->
Text.
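
The two claims in footnote 2 can be checked with a small sketch. The
sample strings are chosen here for illustration and are not from the
original message. The first part shows the two encodings of ä, the second
contrasts an element-wise upper-casing with the whole-string toUpper from
Data.Text, using the German ß as the example.

```haskell
module Main where

import qualified Data.Char as C
import qualified Data.Text as T

-- ä as a single precomposed code point (U+00E4).
precomposed :: T.Text
precomposed = T.pack "\228"

-- ä as 'a' followed by a combining diaeresis (U+0308).
decomposed :: T.Text
decomposed = T.pack "a\776"

main :: IO ()
main = do
  -- Both render as one character but differ in code point count.
  print (T.length precomposed, T.length decomposed)  -- (1,2)
  -- Plain equality compares code points, so the two are not equal.
  print (precomposed == decomposed)                  -- False
  let street = T.pack "stra\223e"  -- "straße"
  -- Element-wise: each Char maps to exactly one Char, so the length
  -- cannot change and ß cannot become "SS".
  print (T.length (T.map C.toUpper street))
  -- Whole-string: the result may be longer than the input.
  print (T.toUpper street == T.pack "STRASSE")
```

Treating the two spellings of ä as equal would need Unicode normalization,
which is not shown here.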

Cheers,
  Johan