Summary and call for discussion on text proposal
Isaac Dupree
ml at isaac.cedarswampstudios.org
Mon Nov 8 03:21:44 EST 2010
On 11/07/10 20:00, wren ng thornton wrote:
> ...The
> work on ByteString was an immense step forward and has been widely
> embraced and blessed, but its Word8-based organization means that it is
> not a complete solution to the problem of properly handling textual
> information. The text library offers such a solution and a high-quality
> solution at that; it is certainly on par with ByteString, IMO. I am of
> the opinion that adding text to the HP and encouraging it to be widely
> adopted is the only sensible solution to this larger issue of correct
> and efficient handling of textual information.
I'll use this as a jumping-off point for a doubt I have (speaking
outside of my steering-committee role; Duncan and Thomas did a fine job
of steering less than 24 hours ago).
==========Intro=========
Perhaps the API in Data.Text
http://hackage.haskell.org/packages/archive/text/0.10.0.0/doc/html/Data-Text.html
is actually still too list-like and un-Unicode-ish. Functions like
justifyRight/justifyLeft/center really only make sense for strings where
one Char = one grapheme, rendered in a monospace font (at least given
the current implementation of those functions). And there are more like
them (see the "Opinions" section below, which is the most important
section of this email -- or at least the section I need responses to).
My feeling is that these functions should be included, but not in the
base 'Data.Text' module -- perhaps in 'Data.Text.Char' or some such --
so that we don't regret it later.
Sort of like how we might regret 'lines'/'words' standing out as the
only element-type-specific functions in Data.List. Or how people use
the String-based, unsafeInterleaveIO-ish System.IO.readFile because it's
in a standard place, without thinking about why they might not want
those properties. With 'text', people will try character-based things
like 'Data.Text.map toUpper' or 'Data.Text.length' by analogy with
their list experience, without even stopping to look at the Text
documentation*, and never know those might be bad choices for Unicode
text. Yes, it's no worse than doing those things on a String a.k.a.
[Char], but we should do better than that. *Go look at the Data.Text
"Case conversion" haddocks right now, and choose Data.Text.toUpper!
(After having explored this doubt for myself, I decided I didn't have
any of the other doubts about including 'text' in the Platform anymore,
see section "I approve the rest of 'text'".)
=========Unicode musing===========
As someone (Ian?) pointed out, even some substring operations can
produce peculiar results when they split in the middle of a logical unit
(including the case of combining characters: 'a' plus a combining grave
accent vs. the precomposed à -- and keep in mind that not all languages
have an NFC precomposed form for everything they use, and some languages
have logical groupings that go beyond combining characters).
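A tiny illustration of that splitting hazard (my own sketch, relying on
the fact that the API works Char by Char, i.e. code point by code
point):

    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO

    main :: IO ()
    main = do
      -- 'a' followed by U+0300 COMBINING GRAVE ACCENT renders as one
      -- grapheme (à), but it is two Chars as far as Data.Text knows:
      let s = T.pack "a\x0300"
      TIO.putStrLn (T.take 1 s)  -- just "a"; the accent has been dropped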
The Unicode technical report on regexes is pretty representative of how
complex it is. http://www.unicode.org/reports/tr18/
Ideally we would annotate each function with references to the relevant
Unicode reports, documenting the behavioural tradeoffs we made between
implementation/interface simplicity and textual meaningfulness.
That regex report, for example, has three levels of conformance; a
couple of the Level 2 principles, 2.1 Canonical Equivalents and 2.2
Extended Grapheme Clusters, are relevant even for literal string search
("break"/"find"). I think, from a cursory inspection of the code
(Data.Text.Search.indices), that we don't meet 2.1 or 2.2. 2.1 could be
met e.g. by the search function itself doing the normalization, or by
clearly documenting the need to normalize beforehand. 2.2 would require
a separate search mode/function saying you only want matches at complete
"extended grapheme cluster" boundaries, plus a more complicated
implementation (it is quite plausible that the core 'text' library would
not provide this, but ideally the docs would point to a library that
does -- perhaps 'text-icu', also maintained by Bryan O'Sullivan, which
provides bindings to the ICU C library:
http://hackage.haskell.org/package/text-icu ).
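For the 2.1 point, here is the kind of miss I mean (my own sketch,
assuming the search really does compare code points directly, as the
code appears to):

    import qualified Data.Text as T

    main :: IO ()
    main = do
      let needle   = T.pack "\x00e9"      -- é, precomposed (NFC)
          haystack = T.pack "cafe\x0301"  -- 'e' + combining acute (NFD)
      -- Canonically equivalent text, but a code-point comparison
      -- doesn't see it:
      print (needle `T.isInfixOf` haystack)  -- False
      -- Normalizing both sides to the same form first (e.g. via
      -- text-icu) would make the match succeed.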
It's also worth noting in the docs that the Ord instance is purely
lexicographic on the code points, and is not a collation algorithm
suitable for ordering text for human consumption. (It may be obvious if
you think about it. We just ought to remind people to think about it.)
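The obvious-once-you-think-about-it example (again mine, not from the
docs):

    import qualified Data.Text as T

    main :: IO ()
    main = do
      -- Code-point order: 'Z' (U+005A) sorts before 'a' (U+0061), and
      -- 'é' (U+00E9) sorts after 'z' (U+007A) -- neither of which is
      -- what a dictionary or phone book would do.
      print (compare (T.pack "Zebra") (T.pack "apple"))  -- LT
      print (compare (T.pack "étude") (T.pack "zoo"))    -- GT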
==========Opinions on the Data.Text functions=========
"Yes", "no", and "maybe" sections in that order. Disagree if you disagree.
===="Yes" -- Text-based functions====
In my opinion the following definitely make sense for Data.Text:
pack, unpack, empty, append, null, intercalate, replace, toCaseFold,
toLower, toUpper, concat, strip, stripStart, stripEnd, break(aka
breakSubstring), breakEnd, group(?), split, lines, words, unlines,
unwords, isPrefixOf, isSuffixOf, isInfixOf, stripPrefix, stripSuffix,
find(aka breaks/breakSubstrings), count
===="No" -- Splitting-by-codepoint functions====
In my opinion, because they take apart a piece of text code-point by
code-point (a.k.a. Char by Char) or similar, the following should go in
their own module:
uncons, (unsnoc (except it doesn't exist)), head, last, tail, init,
length, compareLength, map, intersperse, transpose, reverse,
justifyRight, justifyLeft, center, fold*, concatMap, maximum, minimum,
scan*, mapAccumL/R, take, drop, splitAt, inits, tails, chunksOf, zip,
zipWith; (possibly) index and findIndex.
(In fact some piece-by-piece code even ought to encode to UTF-something
and analyze byte by byte. And on the flip side, of course some code will
need to analyze by larger logical units than Chars, for which these
functions also are not suitable. I'm guessing these code-point-based
functions are mainly useful either for implementing higher-level Text
functions, or when you know something that limits the possible text you
could be dealing with, e.g. if you're writing an ASCII game like Angband
-- though beware still of future developments -- some modern Angbanders
are probably coding 'á' for ant-with-a-fedora already, etc.!)
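As a concrete caution (my own sketch, following the code-point semantics
described above), 'length' and 'reverse' are exactly the kind of
functions that look innocuous but answer code-point questions rather
than text questions:

    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO

    main :: IO ()
    main = do
      -- 'e' + U+0301 COMBINING ACUTE ACCENT: one grapheme, two Chars.
      let s = T.pack "e\x0301"
      print (T.length s)          -- 2, not the 1 a reader would count
      -- reversing detaches the accent from its base character:
      TIO.putStrLn (T.reverse s)  -- the accent now precedes the 'e'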
===="Maybe" -- Somewhat codepoint based functions====
And I'm not sure about these:
These create a new Text from Chars, so they're structurally sound:
singleton, cons, snoc, unfoldr, unfoldrN (results = Text)
These merely search in a Char-based way (somewhat subjectively separated
from the splitting-by-codepoint functions -- I guess I think these are
more likely to be useful / less likely to be abused):
any, all (results = Bool)
takeWhile, dropWhile, dropWhileEnd, dropAround, spanBy(aka span),
breakBy, splitBy, findBy, partitionBy (results = Texts)
groupBy(??), filter(?) (hmm)
So perhaps (assuming they're commonly used enough to warrant remaining)
segregate them into a different section of the Data.Text documentation.
Or find some way to mark them. Or, more likely, we don't even need to do
that -- just mention it (a small caution) in the Data.Text module
header, since you can easily tell by the presence of Char in the type --
in fact, whether the 'Char' is in a contravariant position
(f :: Char -> x) or a covariant position (f :: (Char -> x) -> y) tells
you whether it is creation or searching, respectively (if the
splitting-by-codepoint functions remain, it's a bit more complicated).
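To illustrate the variance point with a couple of signatures (as I
recall them from the haddocks, so treat the exact types as approximate):

    -- Char in contravariant (plain argument) position: building Text
    -- out of Chars.
    cons   :: Char -> Text -> Text
    snoc   :: Text -> Char -> Text
    -- Char in covariant position (inside a function argument):
    -- searching or filtering with a Char-based predicate.
    any    :: (Char -> Bool) -> Text -> Bool
    filter :: (Char -> Bool) -> Text -> Text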
====================I approve the rest of 'text'===============
generally: Everything else I see about the API (meaning all modules
'text' exports) looks like the state of the art. (I.e. we might have
something better 3-5 years down the road with an improved
compiler/language and a billion other things, but that's life. For
example, the type-level difference between the strict-Text world and the
lazy-Text world might not be the most fun thing sometimes, but anything
else we Haskellers have come up with in the past few years has more
problems / less wondrousness.)
documentation: I don't think we need to perfect the documentation before
it is accepted into the Platform, because we can do that later (heck, I
volunteer to do the work if no one else wants to). And by 'perfect', I
largely mean 'warn the user how it can go wrong, and give corresponding
advice' (perhaps with examples). The docs are already quite high
quality.
list/bytestring/text parity: I've been convinced by now not to let the
List/ByteString/Text parity issues hold us up. Having a text lib is more
important than achieving perfection now, and here are the reasons I
think this particular perfection should be put off for a while (and if
that means indefinitely, then alright):
* There are already lots of users of Text -- it's not cost-free to break
the API now either.
* We're not ready to do a super-thought-out renaming. First I would want
the subsequence-based (as contrasted with element-based) functions to be
proposed and accepted into Data.List / Data.ByteString as appropriate,
or at least to have very concrete proposals. If we were very much in the
mood to do it, we could, but judging from both Bryan and the community
at this juncture, we're (IMHO) not. (Neither the beautiful path
forwards, nor the decision to take it if it fully materializes, is
crystal clear.)
* Incidentally, I suspect that separating functions into two modules, as
I advocate in this email, is actually an easier API breakage for clients
to fix than a renaming of several functions would be.
================ Conclusion ==============
So please comment on my hare-brained idea of separating some of the
Data.Text functions into a separate module.
-Isaac