Haskell Platform Proposal: add the 'text' library

Tue Sep 7 18:21:19 EDT 2010

I'll answer a few of Ian's questions about the design of the text package:

On 7 September 2010 22:50, Ian Lynagh <igloo at earth.li> wrote:

>> see the 'text-icu' package
>
> Would be nice for this to link to the hackage page.
>
>> a much larger variety of encoding functions
>
> Why not bundle these in the text package, or also put this package in
> the platform? hackage doesn't have the haddocks as I write this, but I
> assume they are text-specific.

It would depend on the ICU C library. Similarly if we added a
conversion lib based on iconv. The ones in the text package now are
pure Haskell.

> Are there cases when Data.Text is significantly faster than
> Data.Text.Lazy? Do we need both? (Presumably .Lazy is built on top of
> Data.Text, but do we need the user to have a complete interface for
> both?)

Mm, this is a fair question. In the case of bytestring we need both
because sometimes for dealing with foreign code or IO you need the
representation to be a contiguous block of memory. For text the
representation is more abstract so that need does not arise. One
might argue that if it is simply to control strictness then one could
use the lazy version and provide a deepseq instance.
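To make the "lazy version plus deepseq" idea concrete, here is a rough
sketch. It assumes a release of text that ships an NFData instance for
Data.Text.Lazy and that the deepseq package is available; the names
chunked, forced and strict are just for illustration.

```haskell
import Control.DeepSeq (force)
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- A lazy Text is a lazy list of strict chunks.
chunked :: TL.Text
chunked = TL.fromChunks (map T.pack ["Hello, ", "lazy ", "world"])

-- 'force' evaluates every chunk once the result itself is demanded,
-- which gives strictness control without a separate strict type.
forced :: TL.Text
forced = force chunked

-- Converting to the strict type copies the chunks into one value.
strict :: T.Text
strict = TL.toStrict chunked
```

Whether this is good enough in practice is exactly the open question;
the sketch only shows that the strictness part can be had either way.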

Here's an alternative argument: suppose we change the representation
of strict text to be a tree of chunks (e.g. finger tree). We could
achieve efficient concatenation. This representation would be
impossible while preserving semantics of a lazy tail. A tree impl that
has any kind of balance needs to know the overall length so cannot
have a lazy tail.
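A toy version of that argument, as a sketch only: the Rope type below
is hypothetical and is not how text is implemented. It is not even
balanced; it just caches the total length in each node, which is the
information any balancing scheme would need.

```haskell
import qualified Data.Text as T

-- Hypothetical rope: each Node caches the total number of Chars below it.
data Rope
  = Leaf !T.Text
  | Node !Int Rope Rope

size :: Rope -> Int
size (Leaf t)     = T.length t
size (Node n _ _) = n

-- Constant-time append, but the strict Int field means building the Node
-- computes the size of the right-hand side, so that side must already
-- have a known length and cannot be an unevaluated tail.
append :: Rope -> Rope -> Rope
append l r = Node (size l + size r) l r

toText :: Rope -> T.Text
toText (Leaf t)     = t
toText (Node _ l r) = T.append (toText l) (toText r)
```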

> Did you consider keeping the number of characters in the Text directly?
> Is there a reason it couldn't be done?

There's little point. Knowing the length does not usually help you
save any other O(n) operations. It'd also only help for strict text,
not lazy. Just like lists, asking for the length is usually not a good
idea.
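For example, when the real question is "is this longer than n?", there
is no need for the full length at all. A small sketch, assuming a
release of text that exports compareLength (the two function names
below are made up for the example):

```haskell
import qualified Data.Text as T

-- Conceptually counts every Char before comparing.
longerThanSlow :: Int -> T.Text -> Bool
longerThanSlow n t = T.length t > n

-- Can stop as soon as more than n Chars have been seen.
longerThanFast :: Int -> T.Text -> Bool
longerThanFast n t = T.compareLength t n == GT
```

Both give the same answer; the second simply avoids asking for a
number it does not need.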

>> unicode-unaware case conversion (map toUpper is an unsafe case
>> conversion)
>
> Surely this is something that should be added to Data.Char, irrespective
> of whether text is added to the HP?

No, not to Data.Char. Case folding is not a per-Char operation, it
only works for [Char] / String / Text. It could be added to
Data.String or something.
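The German sharp s is the usual example. A sketch, assuming a release
of text whose toUpper applies the full Unicode case mappings (perChar
and wholeText are names I made up here):

```haskell
import qualified Data.Char as C
import qualified Data.Text as T

-- Per-Char conversion: one Char in, one Char out, so the result can
-- never be longer than the input.
perChar :: T.Text -> T.Text
perChar = T.map C.toUpper

-- Whole-text conversion: free to replace one Char with several.
wholeText :: T.Text -> T.Text
wholeText = T.toUpper

-- "straße", with U+00DF written as a decimal escape.
strasse :: T.Text
strasse = T.pack "stra\223e"
```

Here perChar leaves the U+00DF untouched, while wholeText expands it
to "SS", so the result is one Char longer than the input. No function
of type Char -> Char can do that.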

>> based on unboxed Word16 arrays
>
> Why Word16?

It doesn't actually matter. It's an implementation detail. It was
originally chosen based on benchmarks. It could be changed again based
on new benchmarks without affecting the public API.

> I compared the API of Data.Text and Data.ByteString.Char8 and found a
> number of differences:

Many of these are deliberate and sensible. The thing with text as
opposed to lists/arrays is that almost all operations you want to do
are substring based and not element based. A Unicode code point (a
Char) is sadly only roughly related to the human concept of a
character. In particular there are combining characters. So even if
you want to search or split on a particular "character" that may mean
searching for a short sequence of Chars / code points.

So where the ByteString API followed the List API by being byte
oriented, the Text API is substring oriented.
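A small illustration of why splitting on a single Char can go wrong.
This is a sketch that assumes a later release of text, where the
substring version is called splitOn and the predicate version is
called split (the names differ from the ones quoted in this thread):

```haskell
import qualified Data.Text as T

-- "é" written as 'e' followed by U+0301 COMBINING ACUTE ACCENT:
-- one visible character, two Chars.
eAcute :: T.Text
eAcute = T.pack "e\x0301"

sample :: T.Text
sample = T.pack "cafe\x0301 au lait"

-- Substring oriented: the whole two-Char sequence is the separator.
bySubstring :: [T.Text]
bySubstring = T.splitOn eAcute sample

-- Char oriented: splits on 'e' alone and strands the combining accent
-- at the front of the second piece.
byChar :: [T.Text]
byChar = T.split (== 'e') sample
```

bySubstring is ["caf", " au lait"], whereas byChar leaves the second
piece starting with a bare U+0301.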

> BS:   break :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
>      breakEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
>      breakSubstring :: ByteString -> ByteString -> (ByteString, ByteString)
> Text: break :: Text -> Text -> (Text, Text)
>      breakEnd :: Text -> Text -> (Text, Text)
>      breakBy :: (Char -> Bool) -> Text -> (Text, Text)
>
> BS:   count :: Char -> ByteString -> Int
> Text: count :: Text -> Text -> Int
>
> BS:   find :: (Char -> Bool) -> ByteString -> Maybe Char
> Text: find :: Text -> Text -> [(Text, Text)]
>      findBy :: (Char -> Bool) -> Text -> Maybe Char
>
> BS:   replicate :: Int -> Char -> ByteString
> Text: replicate :: Int -> Text -> Text
>
> BS:   split :: Char -> ByteString -> [ByteString]
> Text: split :: Text -> Text -> [Text]
>
> BS:   span :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
>      spanEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
> Text: spanBy :: (Char -> Bool) -> Text -> (Text, Text)
>
> BS:   splitWith :: (Char -> Bool) -> ByteString -> [ByteString]
> Text: splitBy :: (Char -> Bool) -> Text -> [Text]
>
> BS:   unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> (ByteString, Maybe a)
> Text: unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> Text
>
> BS:   zipWith :: (Char -> Char -> a) -> ByteString -> ByteString -> [a]
> Text: zipWith :: (Char -> Char -> Char) -> Text -> Text -> Text
>
> I think the two APIs ought to be brought into agreement.

Perhaps. If so, then it is ByteString.Char8 that ought to be
brought into agreement with Text, not the other way around. I think
Text is right in this area. On the other hand, perhaps it makes sense
for ByteString.Char8 to remain like the ByteString byte interface
which is byte oriented (and probably rightly so). I hope the
significance and use of ByteString.Char8 will decrease as Text becomes
more popular. ByteString.Char8 is really just for the cases where
you're handling ASCII-like protocols.

> There are a number of other differences which probably want to be tidied
> up (mostly functions which are in one package but not the other,

What are you thinking of specifically?

> ByteString has IO functions mixed in with the non-IO functions,

Which I don't think was a good idea. I would prefer to split them up.

> but those seemed to be the most significant ones. Also,

>    prefixed :: Text -> Text -> Maybe Text
> is analogous to
>    stripPrefix :: Eq a => [a] -> [a] -> Maybe [a]
> in Data.List

Ah, that one probably does make sense to change to match Data.List.
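For comparison, this is what the two would look like side by side once
the names agree. It is a sketch that assumes a release of text in
which the function is exported as stripPrefix:

```haskell
import qualified Data.List as L
import qualified Data.Text as T

-- Data.List: drop the prefix if present, Nothing otherwise.
listVersion :: Maybe String
listVersion = L.stripPrefix "foo" "foobar"

-- Same name and same shape of result for Text.
textVersion :: Maybe T.Text
textVersion = T.stripPrefix (T.pack "foo") (T.pack "foobar")

-- A prefix that does not match gives Nothing in both APIs.
noMatch :: Maybe T.Text
noMatch = T.stripPrefix (T.pack "bar") (T.pack "foobar")
```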

Duncan