Haskell Platform Proposal: add the 'text' library
Ian Lynagh
igloo at earth.li
Tue Sep 7 17:50:18 EDT 2010
On Tue, Sep 07, 2010 at 08:26:36AM -0700, Donald Bruce Stewart wrote:
>
> = Proposal: Add Data.Text to the Haskell Platform =
I feel silly saying this, but as this will probably serve as an example
of the policy I'll say it anyway: I think this should be:
Proposal: Add 'text' to the Haskell Platform
> Proposal Author: Don Stewart
> Maintainer: Bryan O'Sullivan (submitted with his approval)
> Credits
> Proposal author and package maintainer: Bryan O'Sullivan, originally by
> Tom Harper, based on ByteString and Vector (fusion) packages.
>
> The following individuals contributed to the review process: Don
> Stewart, Johan Tibell
These two sections appear to contradict each other.
Also, the hackage page says
Maintainer Bryan O'Sullivan <bos at serpentine.com>
Tom Harper <rrtomharper at googlemail.com>
Duncan Coutts <duncan at haskell.org>
> This is a proposal for the 'text' package
Should mention the version number, and link to the hackage page.
> This package provides text processing capabilities that are optimized
> for performance critical use, both in terms of large data quantities and
> high speed.
Are there other uses it is less suitable for, or are you just saying
that the code has been optimised?
If performance is important for the proposal, do you have evidence that
it performs well, or a way to check that performance has not regressed
in future releases?
> using several standard encodings
Just ASCII and UTF*, right?
Incidentally, I've just noticed some broken haddock markup for:
I/O libraries /do not support locale-sensitive I\O
in
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-IO.html
> see the 'text-icu' package
Would be nice for this to link to the hackage page.
> a much larger variety of encoding functions
Why not bundle these in the text package, or also put this package in
the platform? hackage doesn't have the haddocks as I write this, but I
assume they are text-specific.
> http://hackage.haskell.org/package/text
Should link to the version-specific page.
This item of "Proposal content" on AddingPackages doesn't seem to be
covered:
For library packages, an example of how the API is intended to be
used should be given.
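
Something as small as the following would do. This is only a sketch of mine, assuming Data.Text exports pack, toLower, words and unwords with the obvious types; the name normalise is made up for illustration and is not part of the package.

```haskell
import qualified Data.Text as T

-- | Lower-case the input and collapse runs of whitespace into single
-- spaces. 'normalise' is a hypothetical helper, not a library function.
normalise :: T.Text -> T.Text
normalise = T.unwords . T.words . T.toLower

-- | The same thing for callers that still hold a String.
normaliseString :: String -> String
normaliseString = T.unpack . normalise . T.pack
```

The point of such an example is mainly to show the intended import style (qualified) and the pack/unpack boundary with String.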
This is really a comment on the process rather than your proposal, but
    After a proposal is accepted (or conditionally accepted) the
    proposal must remain on the wiki.
and
    An explicit checklist of the package requirements below is not
    required. The proposal should state however that all the
    requirements are met
seem incompatible to me, as your
    All package requirements are met.
comment will become out of date as the requirement list evolves.
On
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html
a number of haddocks say
Subject to fusion.
but I can't see an explanation for the new user of what this means or
why they should care. Also, would it not be better to say
Warning: Not subject to fusion.
for the handful that aren't? Currently it's hard to notice.
In
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-Encoding-Error.html
I would expect lenientDecode etc to use the On{En,De}codeError type
synonyms defined above.
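
To illustrate what I mean, here is a sketch of a handler written against the synonym. It assumes OnDecodeError is String -> Maybe Word8 -> Maybe Char and that Data.Text.Encoding exports decodeUtf8With; questionMark is a hypothetical handler of my own, not something in the package.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (OnDecodeError, lenientDecode)

-- | Hypothetical handler: substitute '?' for every undecodable byte.
-- The first argument is a description of the error, the second the
-- offending byte (if any).
questionMark :: OnDecodeError
questionMark _description _badByte = Just '?'

-- | Decode UTF-8 using the library's lenient handler.
decodeLenient :: BS.ByteString -> T.Text
decodeLenient = decodeUtf8With lenientDecode

-- | Decode UTF-8 using the hypothetical handler above.
decodeQuestion :: BS.ByteString -> T.Text
decodeQuestion = decodeUtf8With questionMark
```

With the synonym in the signature, the reader can see at a glance that the library's handlers and a user's own are interchangeable.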
In
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text-Lazy.html
the choice 'B' seems odd:
import qualified Data.Text.Lazy as B
I would have expected
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html
to mention the existence of .Lazy in its description, and an explanation
of when I should use it.
Are there cases when Data.Text is significantly faster than
Data.Text.Lazy? Do we need both? (Presumably .Lazy is built on top of
Data.Text, but do we need the user to have a complete interface for
both?)
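
For reference, this is roughly how I understand the relationship between the two, as a sketch only. It assumes Data.Text.Lazy exports fromChunks and toChunks; fromPieces and flatten are names I have made up.

```haskell
import qualified Data.Text as T
import qualified Data.Text.Lazy as TL

-- | Build a lazy Text from a list of strict chunks.
fromPieces :: [T.Text] -> TL.Text
fromPieces = TL.fromChunks

-- | Collapse a lazy Text into one strict Text. This has to copy
-- every chunk, so it is O(n) in the total length.
flatten :: TL.Text -> T.Text
flatten = T.concat . TL.toChunks
```

If that is all there is to it, the module description could say so and tell the user which side of the conversion they want to be on.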
In
http://hackage.haskell.org/packages/archive/text/0.8.0.0/doc/html/Data-Text.html
isInfixOf's docs say:
O(n+m) The isInfixOf function takes two Texts and returns True iff the
first is contained, wholly and intact, anywhere within the second.
In (unlikely) bad cases, this function's time complexity degrades
towards O(n*m).
I think the complexity at the start, in the same place as all the other
complexities, ought to be O(n*m), with the common case given afterwards.
And replace's docs just say
O(m+n) Replace every occurrence of one substring with another.
but should presumably be O(n*m). It's also not necessarily clear what m
and n refer to.
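
To make n and m concrete, here is the naive search on Strings, with m the length of the needle and n the length of the haystack. It is an illustration of where an O(n*m) worst case comes from and nothing more; I am not claiming this is the algorithm the text package uses.

```haskell
import Data.List (isPrefixOf, tails)

-- | Naive substring test: try the needle at every suffix of the
-- haystack. There are n+1 suffixes and each prefix comparison costs
-- up to O(m), hence O(n*m) in the worst case.
naiveIsInfixOf :: String -> String -> Bool
naiveIsInfixOf needle haystack = any (needle `isPrefixOf`) (tails haystack)
```

A needle like "aaab" against a haystack of many 'a's is the kind of input that hits the bad case, since every comparison fails only on its last character.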
> length :: Text -> Int
> O(n) Returns the number of characters in a Text. Subject to fusion.
Did you consider keeping the number of characters in the Text directly?
Is there a reason it couldn't be done?
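
What I have in mind is something like the wrapper below. It is purely hypothetical (Counted, mkCounted and appendCounted are my names), and it says nothing about what the cost would be inside the library itself, e.g. for fusion.

```haskell
import qualified Data.Text as T

-- | Hypothetical wrapper pairing a Text with its character count.
data Counted = Counted
  { countedLength :: !Int     -- ^ number of characters, computed once
  , countedText   :: !T.Text  -- ^ the underlying text
  }

-- | Pay the O(n) length computation once, at construction time.
mkCounted :: T.Text -> Counted
mkCounted t = Counted (T.length t) t

-- | Appending keeps the cached count up to date without rescanning.
appendCounted :: Counted -> Counted -> Counted
appendCounted (Counted n a) (Counted m b) = Counted (n + m) (T.append a b)
```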
> prevent is general use
"prevent its general use"
> a number of way:
"a number of ways:"
> unicode-unaware case conversion (map toUpper is an unsafe case
> conversion)
Surely this is something that should be added to Data.Char, irrespective
of whether text is added to the HP?
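
For anyone following along, the usual example of the problem is the German sharp s, whose upper-case form is two characters. A small sketch, assuming Data.Text.toUpper does full case conversion as its documentation describes; upperByChar, upperWhole and sample are names I have made up.

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

-- | Per-character upper-casing. A Char -> Char function can never
-- change the length of the string.
upperByChar :: String -> String
upperByChar = map toUpper

-- | Whole-string upper-casing via the text package.
upperWhole :: String -> String
upperWhole = T.unpack . T.toUpper . T.pack

-- | "strasse" spelt with a sharp s; '\223' is U+00DF.
sample :: String
sample = "stra\223e"
```

upperByChar leaves the sharp s alone, while upperWhole can expand it, which is exactly what a Char -> Char function in Data.Char could not express without a new String-level API.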
> the data structure is element-level lazy, whereas a number of
> applications require either some level of additional strictness
This sentence looks like it has been mis-edited?
And by "a number of applications" I think you mean "high performance
applications"?
> support whole-string case conversion (thus, type correct unicode transformations)
I don't really get what you mean by "type correct" here.
> based on unboxed Word16 arrays
Why Word16?
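
My guess is that the answer is UTF-16, but that is an assumption on my part as the proposal does not say. If so, it is worth spelling out that characters outside the BMP take two Word16s. A sketch, assuming Data.Text.Encoding exports encodeUtf16LE; utf16Units is a hypothetical helper.

```haskell
import qualified Data.ByteString as BS
import qualified Data.Text as T
import Data.Text.Encoding (encodeUtf16LE)

-- | Number of 16-bit code units needed to encode a Text as UTF-16.
-- Each code unit is two bytes in the encoded ByteString.
utf16Units :: T.Text -> Int
utf16Units t = BS.length (encodeUtf16LE t) `div` 2
```

So 'a' needs one unit, but a character such as U+1D11E (musical G clef) needs a surrogate pair, i.e. two.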
> As of Q2 2010, 'text' is ranked 27/2200 libraries (top 1% most
> popular), in particular, in web programming.
I can't work out what you mean here. Ranked 27 by what metric? Why web
programming in particular?
> A large testsuite, with coverage data, is provided.
It would be nice if this was on the text package's page, rather than in
~dons.
> RecordWildCards
I'm not a fan, but I fear I may be in the minority.
> propposal
"proposal"
> to expose only 5 modules
9, no?
> The public modules expose none of these (?).
None of what?
I compared the API of Data.Text and Data.ByteString.Char8 and found a
number of differences:
BS:   break :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
      breakEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
      breakSubstring :: ByteString -> ByteString -> (ByteString, ByteString)
Text: break :: Text -> Text -> (Text, Text)
      breakEnd :: Text -> Text -> (Text, Text)
      breakBy :: (Char -> Bool) -> Text -> (Text, Text)
BS:   count :: Char -> ByteString -> Int
Text: count :: Text -> Text -> Int
BS:   find :: (Char -> Bool) -> ByteString -> Maybe Char
Text: find :: Text -> Text -> [(Text, Text)]
      findBy :: (Char -> Bool) -> Text -> Maybe Char
BS:   replicate :: Int -> Char -> ByteString
Text: replicate :: Int -> Text -> Text
BS:   split :: Char -> ByteString -> [ByteString]
Text: split :: Text -> Text -> [Text]
BS:   span :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
      spanEnd :: (Char -> Bool) -> ByteString -> (ByteString, ByteString)
Text: spanBy :: (Char -> Bool) -> Text -> (Text, Text)
BS:   splitWith :: (Char -> Bool) -> ByteString -> [ByteString]
Text: splitBy :: (Char -> Bool) -> Text -> [Text]
BS:   unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> (ByteString, Maybe a)
Text: unfoldrN :: Int -> (a -> Maybe (Char, a)) -> a -> Text
BS:   zipWith :: (Char -> Char -> a) -> ByteString -> ByteString -> [a]
Text: zipWith :: (Char -> Char -> Char) -> Text -> Text -> Text
I think the two APIs ought to be brought into agreement.
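
To show the convention on the ByteString side, where the plain name takes a predicate and the substring variant gets a longer name, here is a small sketch using only Data.ByteString.Char8. atComma and atArrow are names I have made up.

```haskell
import qualified Data.ByteString.Char8 as B

-- | Split at the first character satisfying a predicate.
atComma :: B.ByteString -> (B.ByteString, B.ByteString)
atComma = B.break (== ',')

-- | Split at the first occurrence of a whole substring; the match
-- itself stays at the front of the second component.
atArrow :: B.ByteString -> (B.ByteString, B.ByteString)
atArrow = B.breakSubstring (B.pack "->")
```

Someone moving code from ByteString to Text would expect break to mean the first of these, not the second.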
There are a number of other differences which probably want to be tidied
up (mostly functions which are in one package but not the other, and
ByteString has IO functions mixed in with the non-IO functions), but
those seemed to be the most significant ones. Also,
prefixed :: Text -> Text -> Maybe Text
is analogous to
stripPrefix :: Eq a => [a] -> [a] -> Maybe [a]
in Data.List
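
For comparison, the Data.List function in use; dropScheme is just an example name of mine.

```haskell
import Data.List (stripPrefix)

-- | Drop a leading "http://" if present, Nothing otherwise.
dropScheme :: String -> Maybe String
dropScheme = stripPrefix "http://"
```

Using the same name in Text would let users carry their intuition over from lists.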
This also made me notice that Text haddocks tend to use 'b' as a type
variable rather than 'a', e.g.
foldl :: (b -> Char -> b) -> b -> Text -> b
Thanks
Ian