Haskell Platform Proposal: add the 'text' library

Tue Sep 7 17:50:18 EDT 2010

On Tue, Sep 07, 2010 at 08:26:36AM -0700, Donald Bruce Stewart wrote:
> 
> = Proposal: Add Data.Text to the Haskell Platform =

I feel silly saying this, but as this will probably serve as an example
of the policy I'll say it anyway: I think this should be:
    Proposal: Add 'text' to the Haskell Platform

> Proposal Author: Don Stewart
> Maintainer: Bryan O'Sullivan (submitted with his approval) 

> Credits
> Proposal author and package maintainer: Bryan O'Sullivan, originally by
> Tom Harper, based on ByteString and Vector (fusion) packages.
>
> The following individuals contributed to the review process: Don
> Stewart, Johan Tibell 

These two sections appear to contradict each other.

Also, the hackage page says
    Maintainer  Bryan O'Sullivan <bos at serpentine.com>
                Tom Harper <rrtomharper at googlemail.com>
                Duncan Coutts <duncan at haskell.org>

> This is a proposal for the 'text' package

Should mention the version number, and link to the hackage page.

> This package provides text processing capabilities that are optimized
> for performance critical use, both in terms of large data quantities and
> high speed. 

Are there other uses it is less suitable for, or are you just saying
that the code has been optimised?

If performance is important for the proposal, do you have evidence that
it performs well, or a way to check that performance has not regressed
in future releases?

> using several standard encodings

Just ASCII and UTF*, right?

Incidentally, I've just noticed some broken haddock markup for:
    I/O libraries /do not support locale-sensitive I\O
in
    http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-IO.html

> see the 'text-icu' package

Would be nice for this to link to the hackage page.

> a much larger variety of encoding functions

Why not bundle these in the text package, or also put this package in
the platform? hackage doesn't have the haddocks as I write this, but I
assume they are text-specific.

> http://hackage.haskell.org/package/text

Should link to the version-specific page.

This item of "Proposal content" on AddingPackages doesn't seem to be
covered:
    For library packages, an example of how the API is intended to be
    used should be given.
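
For illustration, a minimal sketch of the kind of example I have in mind
might look like the following. The helper names wordCount and shout are
made up here and are not taken from the proposal or the package; only
the Data.Text and Data.Text.Encoding functions are real.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical helper: decode UTF-8 bytes and count whitespace-separated words.
wordCount :: B.ByteString -> Int
wordCount = length . T.words . TE.decodeUtf8

-- Hypothetical helper: trim surrounding whitespace, then upper-case the text.
shout :: T.Text -> T.Text
shout = T.toUpper . T.strip
```

Loading this in GHCi and applying wordCount to a UTF-8 encoded file's
contents would be enough to show the intended decode/process pattern.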

This is really a comment on the process rather than your proposal, but
    After a proposal is accepted (or conditionally accepted) the
    proposal must remain on the wiki.
and
    An explicit checklist of the package requirements below is not
    required. The proposal should state however that all the
    requirements are met
seem incompatible to me, as your
    All package requirements are met.
comment will become out of date as the requirement list evolves.

On
    http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html
a number of haddocks say
    Subject to fusion.
but I can't see an explanation for the new user of what this means or
why they should care. Also, would it not be better to say
    Warning: Not subject to fusion.
for the handful that aren't? Currently it's hard to notice.
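
As I understand it (and this is my reading, not something the haddocks
state), "subject to fusion" roughly means that a chain of such functions
can be compiled into a single loop without building the intermediate
Text values. Whether that actually happens presumably depends on the
library version and optimisation flags. A sketch of the sort of pipeline
a new user would want explained, with countLetters being a made-up name:

```haskell
import Data.Char (isAlpha, toUpper)
import qualified Data.Text as T

-- Hypothetical helper: upper-case the text, keep only letters, count them.
-- Read naively this allocates two intermediate Text values; where fusion
-- applies, the three traversals may be combined into one pass.
countLetters :: T.Text -> Int
countLetters = T.length . T.filter isAlpha . T.map toUpper
```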

In
    http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-Encoding-Error.html
I would expect lenientDecode etc to use the On{En,De}codeError type
synonyms defined above.
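
To show why the synonyms matter to a user, here is a small sketch of a
handler written against OnDecodeError. The handler dropInvalid and the
sample bytes are invented for illustration, and I am assuming
decodeUtf8With is the intended way to plug a handler in.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (OnDecodeError, lenientDecode)

-- Hypothetical handler: skip every invalid byte instead of substituting
-- a replacement character.
dropInvalid :: OnDecodeError
dropInvalid _description _badByte = Nothing

-- The bytes for "hi" followed by 0xFF, which is never valid in UTF-8.
sample :: B.ByteString
sample = B.pack [0x68, 0x69, 0xFF]

-- lenientDecode substitutes U+FFFD; dropInvalid discards the bad byte.
lenient, dropped :: T.Text
lenient = decodeUtf8With lenientDecode sample
dropped = decodeUtf8With dropInvalid sample
```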

In
    http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-Lazy.html
the choice 'B' seems odd:
    import qualified Data.Text.Lazy as B

I would have expected
    http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html
to mention the existence of .Lazy in its description, and an explanation
of when I should use it.

Are there cases when Data.Text is significantly faster than
Data.Text.Lazy? Do we need both? (Presumably .Lazy is built on top of
Data.Text, but do we need the user to have a complete interface for
both?)
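
For reference, a sketch of how I would expect the two types to relate,
assuming a lazy Text is essentially a sequence of strict chunks (I have
not checked the implementation). It uses TL rather than B as the
qualifier; the value names are made up.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- Three strict pieces that together spell a module name.
chunks :: [T.Text]
chunks = [T.pack "Data.", T.pack "Text", T.pack ".Lazy"]

-- Build a lazy Text from the strict chunks.
lazyText :: TL.Text
lazyText = TL.fromChunks chunks

-- Collapse the chunks back into a single strict value.
strictText :: T.Text
strictText = T.concat (TL.toChunks lazyText)
```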

In
    http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html
isInfixOf's docs say:
    O(n+m) The isInfixOf function takes two Texts and returns True iff the
    first is contained, wholly and intact, anywhere within the second.
    In (unlikely) bad cases, this function's time complexity degrades
    towards O(n*m). 
I think the complexity at the start, in the same place as all the other
complexities, ought to be O(n*m), with the common case given afterwards.

And replace's docs just say
    O(m+n) Replace every occurrence of one substring with another.
but should presumably be O(n*m). It's also not necessarily clear what m
and n refer to.
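
A short example in the docs would at least make the argument order
clear, and so which text each of m and n is likely to measure. A sketch,
with my own value names and my guess that one variable is the needle
length and the other the haystack length:

```haskell
import qualified Data.Text as T

needle, haystack :: T.Text
needle = T.pack "a"
haystack = T.pack "banana"

-- The needle comes first, the text being searched comes second.
found :: Bool
found = needle `T.isInfixOf` haystack

-- replace takes the needle, then the replacement, then the haystack.
replaced :: T.Text
replaced = T.replace needle (T.pack "bb") haystack
```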

> length :: Text -> Int
> O(n) Returns the number of characters in a Text. Subject to fusion.

Did you consider keeping the number of characters in the Text directly?
Is there a reason it couldn't be done?
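
For context, here is a sketch of why the length is not simply the size
of the underlying array, assuming the UTF-16 storage in Word16 arrays
that the proposal describes: a character outside the Basic Multilingual
Plane needs two 16-bit code units. The value names are mine.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf16LE)

-- 'a' followed by U+1D11E (MUSICAL SYMBOL G CLEF), which is outside the BMP.
sample :: T.Text
sample = T.pack "a\x1D11E"

-- Number of characters, as reported by length.
charCount :: Int
charCount = T.length sample

-- Number of 16-bit code units in the UTF-16 encoding of the same text.
utf16Units :: Int
utf16Units = B.length (encodeUtf16LE sample) `div` 2
```

Here charCount is 2 while utf16Units is 3, so the count would have to be
stored separately rather than derived from the array size.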

> prevent is general use

"prevent its general use"

> a number of way:

"a number of ways:"

> unicode-unaware case conversion (map toUpper is an unsafe case
> conversion) 

Surely this is something that should be added to Data.Char, irrespective
of whether text is added to the HP?
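
The standard illustration, as a sketch with value names of my own: the
German sharp s has no single-character upper-case form, so a
per-character map cannot get it right, while a whole-string conversion
can grow the text.

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

-- Per-character conversion: Data.Char.toUpper leaves the sharp s unchanged.
perChar :: String
perChar = map toUpper "straße"

-- Whole-string conversion: the sharp s becomes "SS", so the text grows by one.
wholeString :: T.Text
wholeString = T.toUpper (T.pack "straße")
```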

> the data structure is element-level lazy, whereas a number of
> applications require either some level of additional strictness

This sentence looks like it has been mis-edited?

And by "a number of applications" I think you mean "high performance
applications"?

> support whole-string case conversion (thus, type correct unicode transformations)

I don't really get what you mean by "type correct" here.

> based on unboxed Word16 arrays

Why Word16?

> As of Q2 2010, 'text' is ranked 27/2200 libraries (top 1% most
> popular), in particular, in web programming.

I can't work out what you mean here. Ranked 27 by what metric? Why web
programming in particular?

> A large testsuite, with coverage data, is provided.

It would be nice if this was on the text package's page, rather than in
~dons.

> RecordWildCards

I'm not a fan, but I fear I may be in the minority.

> propposal

"proposal"

> to expose only 5 modules

9, no?

> The public modules expose none of these (?).

None of what?

I compared the API of Data.Text and Data.ByteString.Char8 and found a
number of differences:

BS:   break :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
      breakEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
      breakSubstring :: ByteString -> ByteString -> (ByteString, ByteString)
Text: break :: Text -> Text -> (Text, Text)
      breakEnd :: Text -> Text -> (Text, Text)
      breakBy :: (Char -> Bool) -> Text -> (Text, Text)

BS:   count :: Char -> ByteString -> Int
Text: count :: Text -> Text -> Int

BS:   find :: (Char -> Bool) -> ByteString -> Maybe Char
Text: find :: Text -> Text -> [(Text, Text)]
      findBy :: (Char -> Bool) -> Text -> Maybe Char

BS:   replicate :: Int -> Char -> ByteString
Text: replicate :: Int -> Text -> Text

BS:   split :: Char -> ByteString -> [ByteString]
Text: split :: Text -> Text -> [Text]

BS:   span :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
      spanEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
Text: spanBy :: (Char -> Bool) -> Text -> (Text, Text)

BS:   splitWith :: (Char -> Bool) -> ByteString -> [ByteString]
Text: splitBy :: (Char -> Bool) -> Text -> [Text]

BS:   unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> (ByteString, Maybe a)
Text: unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> Text

BS:   zipWith :: (Char -> Char -> a) -> ByteString -> ByteString -> [a]
Text: zipWith :: (Char -> Char -> Char) -> Text -> Text -> Text

I think the two APIs ought to be brought into agreement.
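
To make the practical effect concrete, here is a sketch using three of
the functions above whose names match but whose types differ. The value
names are made up, and I have only picked functions whose signatures are
as listed in the comparison.

```haskell
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T

-- count: a Char needle for ByteString, a Text needle for Text.
bsCount, textCount :: Int
bsCount = BC.count 'a' (BC.pack "banana")
textCount = T.count (T.pack "an") (T.pack "banana")

-- replicate: repeats a single Char versus a whole Text.
bsRep :: BC.ByteString
bsRep = BC.replicate 3 'x'

textRep :: T.Text
textRep = T.replicate 2 (T.pack "ab")

-- zipWith: ByteString yields a list, Text yields another Text.
bsZip :: [Bool]
bsZip = BC.zipWith (==) (BC.pack "abc") (BC.pack "abd")

textZip :: T.Text
textZip = T.zipWith max (T.pack "abc") (T.pack "bca")
```

Code written against one module therefore cannot be ported to the other
just by changing the import.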

There are a number of other differences which probably want to be tidied
up (mostly functions which are in one package but not the other, and
ByteString has IO functions mixed in with the non-IO functions), but
those seemed to be the most significant ones. Also,
    prefixed :: Text -> Text -> Maybe Text
is analogous to
    stripPrefix :: Eq a => [a] -> [a] -> Maybe [a]
in Data.List.

This also made me notice that Text haddocks tend to use 'b' as a type
variable rather than 'a', e.g.
    foldl :: (b -> Char -> b) -> b -> Text -> b

Thanks
Ian