Summary and call for discussion on text proposal
Isaac Dupree
ml at isaac.cedarswampstudios.org
Mon Nov 8 03:21:44 EST 2010
On 11/07/10 20:00, wren ng thornton wrote:
> ...The
> work on ByteString was an immense step forward and has been widely
> embraced and blessed, but its Word8-based organization means that it is
> not a complete solution to the problem of properly handling textual
> information. The text library offers such a solution and a high-quality
> solution at that; it is certainly on par with ByteString, IMO. I am of
> the opinion that adding text to the HP and encouraging it to be widely
> adopted is the only sensible solution to this larger issue of correct
> and efficient handling of textual information.
I'll use this as a jumping-off point for a doubt I have (speaking
outside of my steering-committee role; Duncan and Thomas did a fine job
of steering less than 24 hours ago).
==========Intro=========
Perhaps the API in Data.Text
http://hackage.haskell.org/packages/archive/text/0.10.0.0/doc/html/Data-Text.html
is actually still too list-like and un-Unicode-ish. Functions like
justifyRight/justifyLeft/center really only make sense for strings where
one Char = one grapheme, rendered in a monospace font (at least given
the current implementation of those functions). And there are more like
them (see the "Opinions" section below, which is the most important
section of this email -- or at least the section I need responses to).
My feeling is that these functions should be included, but not in the
base 'Data.Text' module -- perhaps in 'Data.Text.Char' or some such --
so that we don't regret it later.
Sort of like how we might regret 'lines'/'words' standing out as the
only element-type-specific functions in Data.List. Or how people use
the String-based, unsafeInterleaveIO-ish System.IO.readFile because it's
in a standard place, without thinking about why they might not want
those properties. With 'text', people will try character-based things
like 'Data.Text.map toUpper' or 'Data.Text.length' by analogy with
their list experience, without even stopping to look at the Text
documentation*, and never know those might be bad choices for Unicode
text. Yes, it's no worse than doing those things on a String a.k.a.
[Char], but we should do better than that. *Go look at the Data.Text
"Case conversion" haddocks right now, and choose Data.Text.toUpper!
(After having explored this doubt for myself, I decided I didn't have
any of the other doubts about including 'text' in the Platform anymore,
see section "I approve the rest of 'text'".)
=========Unicode musing===========
As someone (Ian?) pointed out, even some substring operations can
produce peculiar results when they split in the middle of a logical unit
(including the case of combining characters: 'a' plus a combining grave
accent vs. the precomposed à -- and keep in mind that not all languages
have an NFC precomposed form for everything they use, and some languages
have logical groupings that go beyond combining characters).
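A tiny illustration of that splitting hazard (my own sketch, relying on
the fact that the API works Char by Char, i.e. code point by code
point):

    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO

    main :: IO ()
    main = do
      -- 'a' followed by U+0300 COMBINING GRAVE ACCENT renders as one
      -- grapheme (à), but it is two Chars as far as Data.Text knows:
      let s = T.pack "a\x0300"
      TIO.putStrLn (T.take 1 s)  -- just "a"; the accent has been dropped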
The Unicode technical report on regexes is pretty representative of how
complex it is. http://www.unicode.org/reports/tr18/
Ideally we would annotate each function with references to the relevant
Unicode reports, documenting the behavioural tradeoffs we made between
implementation/interface simplicity and textual meaningfulness.
That regex report, for example, has three levels of conformance; a
couple of the Level 2 principles, 2.1 Canonical Equivalents and 2.2
Extended Grapheme Clusters, are relevant even for literal string search
("break"/"find"). I think, from a cursory inspection of the code
(Data.Text.Search.indices), that we don't meet 2.1 or 2.2. 2.1 could be
met e.g. by the search function itself doing the normalization, or by
clearly documenting the need to normalize beforehand. 2.2 would require
a separate search mode/function saying you only want matches at complete
"extended grapheme cluster" boundaries, plus a more complicated
implementation (it is quite plausible that the core 'text' library would
not provide this, but ideally the docs would point to a library that
does -- perhaps 'text-icu', also maintained by Bryan O'Sullivan, which
provides bindings to the ICU C library:
http://hackage.haskell.org/package/text-icu ).
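For the 2.1 point, here is the kind of miss I mean (my own sketch,
assuming the search really does compare code points directly, as the
code appears to):

    import qualified Data.Text as T

    main :: IO ()
    main = do
      let needle   = T.pack "\x00e9"      -- é, precomposed (NFC)
          haystack = T.pack "cafe\x0301"  -- 'e' + combining acute (NFD)
      -- Canonically equivalent text, but a code-point comparison
      -- doesn't see it:
      print (needle `T.isInfixOf` haystack)  -- False
      -- Normalizing both sides to the same form first (e.g. via
      -- text-icu) would make the match succeed.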
It's also worth noting in the docs that the Ord instance is purely
lexicographic on the code points, and is not a collation algorithm
suitable for ordering text for human consumption. (It may be obvious if
you think about it. We just ought to remind people to think about it.)
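The obvious-once-you-think-about-it example (again mine, not from the
docs):

    import qualified Data.Text as T

    main :: IO ()
    main = do
      -- Code-point order: 'Z' (U+005A) sorts before 'a' (U+0061), and
      -- 'é' (U+00E9) sorts after 'z' (U+007A) -- neither of which is
      -- what a dictionary or phone book would do.
      print (compare (T.pack "Zebra") (T.pack "apple"))  -- LT
      print (compare (T.pack "étude") (T.pack "zoo"))    -- GT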
==========Opinions on the Data.Text functions=========
"Yes", "no", and "maybe" sections in that order. Disagree if you disagree.
===="Yes" -- Text-based functions====
In my opinion the following definitely make sense for Data.Text:
pack, unpack, empty, append, null, intercalate, replace, toCaseFold,
toLower, toUpper, concat, strip, stripStart, stripEnd, break(aka
breakSubstring), breakEnd, group(?), split, lines, words, unlines,
unwords, isPrefixOf, isSuffixOf, isInfixOf, stripPrefix, stripSuffix,
find(aka breaks/breakSubstrings), count
===="No" -- Splitting-by-codepoint functions====
In my opinion, because they take apart a piece of text code-point by
code-point (a.k.a. Char by Char) or similar, the following should go in
their own module:
uncons, (unsnoc (except it doesn't exist)), head, last, tail, init,
length, compareLength, map, intersperse, transpose, reverse,
justifyRight, justifyLeft, center, fold*, concatMap, maximum, minimum,
scan*, mapAccumL/R, take, drop, splitAt, inits, tails, chunksOf, zip,
zipWith; (possibly) index and findIndex.
(In fact some piece-by-piece code even ought to encode to UTF-something
and analyze byte by byte. And on the flip side, of course some code will
need to analyze by larger logical units than Chars, for which these
functions also are not suitable. I'm guessing these code-point-based
functions are mainly useful either for implementing higher-level Text
functions, or when you know something that limits the possible text you
could be dealing with, e.g. if you're writing an ASCII game like Angband
-- though beware still of future developments -- some modern Angbanders
are probably coding 'á' for ant-with-a-fedora already, etc.!)
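As a concrete caution (my own sketch, following the code-point semantics
described above), 'length' and 'reverse' are exactly the kind of
functions that look innocuous but answer code-point questions rather
than text questions:

    import qualified Data.Text as T
    import qualified Data.Text.IO as TIO

    main :: IO ()
    main = do
      -- 'e' + U+0301 COMBINING ACUTE ACCENT: one grapheme, two Chars.
      let s = T.pack "e\x0301"
      print (T.length s)          -- 2, not the 1 a reader would count
      -- reversing detaches the accent from its base character:
      TIO.putStrLn (T.reverse s)  -- the accent now precedes the 'e'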
===="Maybe" -- Somewhat codepoint based functions====
And I'm not sure about these:
These create a new Text from Chars, so they're structurally sound:
singleton, cons, snoc, unfoldr, unfoldrN (results = Text)
These merely search in a Char-based way (somewhat subjectively separated
from the splitting-by-codepoint functions -- I guess I think these are
more likely to be useful / less likely to be abused):
any, all (results = Bool)
takeWhile, dropWhile, dropWhileEnd, dropAround, spanBy(aka span),
breakBy, splitBy, findBy, partitionBy (results = Texts)
groupBy(??), filter(?) (hmm)
So perhaps (assuming they're commonly used enough to warrant remaining)
segregate them into a different section of the Data.Text documentation.
Or find some way to mark them. Or, more likely, we don't even need to do
that -- just mention it (a small caution) in the Data.Text module
header, since you can easily tell by the presence of Char in the type --
in fact, whether the 'Char' is in a contravariant position
(f :: Char -> x) or a covariant position (f :: (Char -> x) -> y) tells
you whether it is creation or searching, respectively (if the
splitting-by-codepoint functions remain, it's a bit more complicated).
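To illustrate the variance point with a couple of signatures (as I
recall them from the haddocks, so treat the exact types as approximate):

    -- Char in contravariant (plain argument) position: building Text
    -- out of Chars.
    cons   :: Char -> Text -> Text
    snoc   :: Text -> Char -> Text
    -- Char in covariant position (inside a function argument):
    -- searching or filtering with a Char-based predicate.
    any    :: (Char -> Bool) -> Text -> Bool
    filter :: (Char -> Bool) -> Text -> Text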
====================I approve the rest of 'text'===============
generally: Everything else I see about the API (meaning all modules
'text' exports) looks like the state of the art. (I.e. we might have
something better 3-5 years down the road with an improved
compiler/language and a billion other things, but that's life. For
example, the type-level difference between the strict-Text world and the
lazy-Text world might not be the most fun thing sometimes, but anything
else we Haskellers have come up with in the past few years has more
problems / less wondrousness.)
documentation: I don't think we need to perfect the documentation before
it is accepted into the Platform, because we can do that later (heck, I
volunteer to do the work if no one else wants to). And by 'perfect', I
largely mean 'warn the user how it can go wrong, and give corresponding
advice' (perhaps with examples). The docs are already quite high
quality.
list/bytestring/text parity: I've been convinced by now not to let the
List/ByteString/Text parity issues hold us up. Having a text lib is more
important than achieving perfection now, and here are the reasons I
think this particular perfection should be put off for a while (and if
that means indefinitely, then alright):
* There are already lots of users of Text -- it's not cost-free to break
the API now either.
* We're not ready to do a super-thought-out renaming. First I would want
the subsequence-based (as contrasted with element-based) functions to be
proposed and accepted into Data.List / Data.ByteString as appropriate,
or at least to have very concrete proposals. If we were very much in the
mood to do it, we could, but judging from both Bryan and the community
at this juncture, we're (IMHO) not. (Neither the beautiful path
forwards, nor the decision to take it if it fully materializes, is
crystal clear.)
* Incidentally, I suspect that separating functions into two modules, as
I advocate in this email, is actually an easier API breakage for clients
to fix than a renaming of several functions would be.
================ Conclusion ==============
So please comment on my hare-brained idea of separating some of the
Data.Text functions into a separate module.
-Isaac